Predicting movie ratings and recommender systems
Arkadiusz Paterek
Jun 19, 2012
Contents

Abstract
1 Introduction
6 Applications
6.1 SVD-based recommender system
6.2 Using distance between items
6.2.1 Clustering
6.2.2 2D visualization
6.2.3 2D recommender system
6.3 Beyond recommendations
7 Summary
Bibliography
Abstract
This monograph describes the author's extensive experimental work on one machine learning task: prediction of movie ratings in the Netflix Prize dataset. The main objective of the experiments was to obtain maximally accurate prediction, as evaluated by hold-out RMSE, but the perspective of applying the developed methods in recommender systems was also important. The publication has two goals: summarizing the understanding of the subject that emerged from the published work of many people on the same task, and presenting some novel insights. Reaching a good understanding of one task and one dataset gives hope of generalizing to other prediction tasks, as similar challenges recur in the analysis of any dataset.
The idea of collaborative filtering is to make use of relations between tasks (users in our data), and between task attributes (items in our data). Collaborative filtering methods are used in recommender systems to calculate personalized recommendations, in other words, to identify items preferred by a particular user. To realize that goal, a good intermediate task is prediction of user ratings, and the most accurate models for this task are based on dimensionality reduction: each item is described by a small number of variables, which can be seen as automatically learned analogues of movie genres, and each user's taste is described by a similarly small number of variables. One of the most accurate models, regularized SVD, was analyzed more closely, and the assumptions of that model, such as the single-variable output, combining hidden variables by multiplication, and using Gaussian priors, were critically examined. In addition, an interpretation of the learned features by naming new movie genres has been proposed.
To learn the parameters of the developed models, the best predictive accuracy was obtained by using different degrees of approximation of the Bayesian approach, from MCMC and Variational Bayes to neural-network-like simplifications. When identifying the model, that is, while approaching the unknown probabilistic model that generated the data, good engineering practice was to maintain a blend of an ensemble of many accurate but varied methods. Blends of large ensembles also gave the best accuracy reached, indicating that, despite the large combined effort of many people, the process of model identification for the analyzed data remained largely unfinished, which is probably an unavoidable situation in the analysis of real-life datasets.
The work is complemented by heuristics adapting rating prediction to generate lists of recommendations, heuristics for cold-start situations, and descriptions of two SVD-based recommender systems.
Acknowledgements
To John Tomfohr, for sharing details about his very accurate Bayesian model. The knowledge of this model influenced my experiments and this book the most, and it helped me better understand the fundamentals of what is important in predictive modelling.
To Piotr Pokarowski. A large part of the methodology of statistical data analysis I use, I learned in his course.
To Andriy Mnih for answering my questions.
To the Morgan Stanley PDT group, where John and I gave a talk about our Netflix Prize solutions, for the hint about “Bayesing users” more heavily.
To all the people who helped me on my journey to understand machine learning and science in general.
To friends, without whose support this book would never have been finished.
You know exactly what we have to do -
To give them something with a kick in it,
So hurry up and make a decent brew.
Don’t leave it till tomorrow, stick at it -
Today will pass you by before you know.
You’ve got to grab your chance, or else it’s gone,
It doesn’t come round twice, so don’t be slow,
And once you’ve taken it, don’t let it go;
That’s the only way to get things done.
Faust, J.W. Goethe (transl. by John R. Williams)
1 Introduction
One of the biggest needs of our times is to make use of data, of which an enormous amount is gathered in digital form. It would be good if the level of development and understanding of the science, craft and art of data analysis matched the importance of that domain. It is no exaggeration to say that the collection, understanding, and use of data are a large part of every area of science (for the applied sciences they are the foundation, but they are also needed in the mathematical, abstract sciences), which is why even small developments in the field of data analysis can have a large impact and be useful across many domains.
A large fraction of the emerging tasks of data analysis, which need to be precisely defined and solved, are prediction tasks: reasoning about the unknown, unobserved part of reality on the basis of the observed part, in the form of gathered data. This work is devoted to one chosen prediction task, with a single dataset. It should be emphasised that there is no known optimal approach to the analysis of real-life datasets. The laws of physics discovered from observed data (both passively gathered and from controlled experiments) describe only small parts of reality, and do so imprecisely. Similarly, in prediction tasks like the one described in this work, one has to be prepared that the theory and practice developed so far do not accurately describe all phenomena encountered, and do not show right away the best approaches to prediction for the analyzed real-life data. We expect that it is necessary to take an approach that is to a large extent phenomenological, guided by experiments and discovery.
Because prediction is so prevalent and useful, there is a need to understand this field thoroughly. The overall concept for the research described in this work is to examine one prediction task really well (typically, little time is devoted to most prediction tasks, and their examination is shallow). The conditions needed to perform a thorough and useful examination are: a large, good-quality dataset, a fixed prediction task and evaluation criterion for solutions, and the involvement of many experts for a long time. Such an occasion was the Netflix Prize contest, where a dataset of over 100 million ratings was released, and a task of predicting unknown ratings was proposed, with accuracy evaluated by RMSE (root mean squared error). In the several years after the data was made available, many professionals in various fields faced the same task (the contest website reported that over 5000 teams submitted solutions), and a large part of the developed methods
was published. This work, besides presenting novel material, attempts to summarize the published efforts that many people put into investigating the Netflix task. The work describes the most accurate methods found, and the methods resulting in the best combined accuracy of the ensemble; also important for us will be the context of using the methods in recommender systems, and opportunities to generalize to other prediction tasks. We are also interested in optimizations at the meta-level: minimizing the analyst's time and effort, improving the general engineering process of approaching the most accurate prediction possible, model identification, quickly identifying all important effects, and finding convenient ways to simplify and automate.
Besides the potential to develop the field of prediction, a large advantage of the chosen task of rating prediction is the usefulness of accurate solutions. Ratings are one of the best-known indicators of a user's preferences for items of any kind. Ratings have the advantage that they can indicate both positive and negative user preference for an item, and that they are explicit feedback, where the collected preferences are close to the real preferences, to what the user really thinks. The alternatives are implicit (passively gathered) types of feedback, like clickstreams, which are easier to collect in large amounts, but their common drawbacks are: their meaning is not always clear, they introduce privacy concerns, and, when used in search and recommender systems, they are often prone to feedback-loop effects, reinforcing regions of lower data quality. In short, explicit feedback data is easier to use correctly. Prediction of user preferences is a foundation of recommender systems, which have been gaining much popularity lately. Recommender systems are becoming, next to search, a basic methodology of information retrieval – a domain whose overall objective is to save the time a user needs to reach information of interest. Recommender systems realize the possibility of presenting relevant items without the user specifying a query, based instead on the user's past activity. It is likely that in the future the most efficient information retrieval systems will be hybrids of personalized recommendations and search (two views of essentially the same thing are personalized search, and recommendations filtered by search). To calculate rating-based personalized recommendations, a good intermediate task is prediction of unknown user ratings for items. In this work the prediction accuracy is evaluated by RMSE (root mean squared error) on a held-out test set, and we focus on optimizing this criterion.
Let's briefly introduce the methodology of prediction used in this work. Various approaches to capturing uncertainty in data have been proposed in the past, but the most appropriate, most fundamental approach is based on probability theory and Bayes' law, where the collected data is used to calculate the posterior probability of model parameters:

p(θ|D) ∝ p(D|θ)p(θ)
where D is the data, and θ are the parameters of the model. We can note that (apart from trivial models) no matter how much data is collected, there is always uncertainty left about the exact form of the real model, the one about which we can say that it generated the data. A large part of the difficulty of prediction tasks is to identify the unknown model accurately enough. The identified and learned model is used in the so-called decision-theoretic approach to prediction: the predicted value is chosen so as to minimize the expected loss with respect to the posterior distribution of model parameters. This approach can be simplified by attempting to minimize the expected posterior loss directly, skipping the step of defining a probabilistic model. Approximations of Bayesian inference and of the decision-theoretic approach create a range of prediction methods, which are the foundation of a family of related disciplines: machine learning, statistical data analysis, predictive modelling, predictive analytics, data mining, knowledge discovery in databases, pattern recognition, artificial intelligence, and others. Efficient simplifications of the Bayesian approach are sometimes inspired by biology, as in the methods of neural networks, or by physics, as in the model of Boltzmann machines, the method of Gibbs sampling, or optimization by simulated annealing. Prediction is an art based to a large degree on experience; it makes use of insights recurring in many different tasks and datasets, like the fact that the distribution most often observed in nature is the Gaussian distribution. Identification of the probabilistic model, and the choice of assumptions about the distributions, outliers, measurement error, and missing values, are sufficiently difficult for directly observed data. It is all the more difficult to choose all the assumptions for models with layers of hidden (latent) variables, such as the models at the core of this work, containing hidden variables representing user preferences and item features.
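To make this concrete, below is a minimal numerical sketch in Python; the Gaussian toy model and all constants are assumptions for illustration, not part of the Netflix setup. It computes the posterior p(θ|D) ∝ p(D|θ)p(θ) on a grid, and the decision-theoretic point prediction under quadratic loss:

import numpy as np

# Toy illustration of p(theta|D) ∝ p(D|theta) p(theta) on a parameter grid.
# Assumed model: observations y_i ~ N(theta, 1) with prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
y = rng.normal(1.3, 1.0, size=20)                  # the observed data D

grid = np.linspace(-4, 4, 2001)                    # candidate values of theta
log_prior = -0.5 * grid**2                         # log N(0, 1), up to a constant
log_lik = -0.5 * ((y[:, None] - grid)**2).sum(0)   # log p(D|theta), up to a constant
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                                 # normalized posterior on the grid

# Under quadratic loss the decision-theoretic prediction is the posterior
# predictive mean, which here equals the posterior mean of theta.
print("posterior mean:", (grid * post).sum())
print("closed form:   ", y.sum() / (len(y) + 1))   # exact answer for this model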
The methods solving the Netflix task of rating prediction belong to the family of collaborative filtering methods. The concept of collaborative filtering evolved from the initial idea of filtering mailing lists by multiple moderators, hand-chosen by every user, to the current query-less, model-based approaches, where all gathered ratings made by all users for all items influence in some way every predicted rating (and each user's list of recommendations). The prediction problem chosen here can be seen as predicting missing values in a sparse matrix. There is also a multitask learning view, where each of the related tasks is to predict the ratings of one user.
Collaborative filtering is not only about predicting user preferences in recommender systems – there are applications in other domains – but we focus here on the most common application of predicting ratings. The idea behind collaborative filtering algorithms is that when users with similar taste rate similar items, the ratings will be close together. It is intuitively obvious (we can say that the pattern detection capabilities of the human brain manifest themselves here) that movies can be classified into genres, and that there are groups of users who like or dislike a given genre. It turned out that this intuition of classifying movies into genres is reflected in the most accurate methods for the task of movie rating prediction, which include dimensionality reduction. For example, variations of matrix factorization [Fun06] automatically describe each item by a small number of parameters (30-200), called item features (for movies, you can see them as automatically learned genres), and by a similarly small number of parameters defining user preferences for the features. In addition to including a layer of dimensionality reduction, which explained most of the variability of ratings, also important was modelling on multiple scales [Bel07a]: adding a layer with few parameters, called global effects, and a highly parameterized layer that models direct item-item relationships.
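As an illustration, here is a minimal sketch of such a matrix factorization in the spirit of [Fun06], learned by stochastic gradient descent; the hyperparameter values and the toy ratings below are illustrative assumptions, not settings from the experiments:

import numpy as np

# Minimal regularized matrix factorization: predict r_ui ≈ p_u · q_i,
# learned by stochastic gradient descent on the observed ratings.
def factorize(ratings, n_users, n_items, k=40, lrate=0.01, reg=0.05, epochs=200):
    rng = np.random.default_rng(0)
    P = rng.normal(0, 0.1, (n_users, k))   # user features (learned "tastes")
    Q = rng.normal(0, 0.1, (n_items, k))   # item features (learned "genres")
    for _ in range(epochs):
        for u, i, r in ratings:            # (user, item, rating) triples
            err = r - P[u] @ Q[i]
            pu = P[u].copy()
            P[u] += lrate * (err * Q[i] - reg * P[u])
            Q[i] += lrate * (err * pu - reg * Q[i])
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
P, Q = factorize(ratings, n_users=2, n_items=3)
print(P[1] @ Q[1])   # predicted (unobserved) rating of user 1 for item 1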
In prediction, the aspect of making forecasts and the variability of the model over time is often important. How important modelling time variability is in a recommender system depends on the type of data. Because of the chosen task of predicting preferences for movies, here we are mostly interested in the static aspects of prediction. For this type of data, user preferences typically do not change much over the years (it is different for some other types of items, for example, news articles, which rapidly cease to be interesting to users). Nevertheless, capturing a certain degree of the time variability of the model improved the accuracy of rating prediction. Particularly significant was modelling a single-day user bias.
To sum up the developments for the Netflix task: as the result of the efforts of many people working on the task, methods were developed with accuracy presumably close to the best possible. Accurate prediction of ratings is (with some adaptations) a good intermediate goal for calculating lists of personalized recommendations. In addition, a side result of accurate rating prediction algorithms is good-quality item-item and user-user similarities, which are useful in various applications.
In this work I attempt to thoroughly review the topic of obtaining accurate prediction of movie ratings with collaborative filtering methods. We can think of the selection of presented methods as being filtered by what worked well on the Netflix task, and what did not. The presentation of the topic is based on the large body of published work of many authors working independently on the same task, and is supported by my own large experimental work, which also resulted in several improvements of methods and novel insights.
I begin by introducing the domain of prediction in chapter 2. The basic approach of probabilistic modelling is outlined: the use of data for inference through Bayes' rule, and the understanding of prediction in the framework of decision theory. I describe linear regression, the most popular method of prediction, I discuss the model selection problem, and I reflect on when accurate prediction is really needed.
Chapter 3 introduces the chosen task and dataset. As the perspective of actual application to calculate personalized recommendations is important for us, I begin by summarizing the major issues to solve when developing recommender systems, and how rating prediction relates to generating lists of recommendations. Next I discuss the concept of collaborative filtering methods as an approach to rating prediction, and more generally to filtering items for personalized recommendations. Then the dataset is shortly described, along with the evaluation criteria used throughout the work, and a first summarization of the data by plots and tables.
Chapter 4 contains the main part of the work, a comprehensive overview of rating prediction methods. The description of methods is accompanied by results of the author's implementations, and by a listing of the experimental results published by other authors. I begin with a discussion of the simplest models: only the global mean, only biases, and one-feature regularized singular value decomposition with biases. Simple models illustrate the emerging issues, common simplifications, and the choices to be made in all methods in the general approach to prediction taken in this work. Experiments with nonparametric modelling are presented, which to some degree justify the form of the regularized SVD model, and to some degree dispute it. I discuss how to model the output variable, and how to use the predictions to calculate recommendations. Next, in section 4.3, variants of regularized SVD with multiple features are extensively described. Different ways to learn the models are discussed, with approximate Bayesian methods and also with neural-network-like approaches based on cost function minimization. The resulting SVD features are visualized for exemplary movies. SVD features define certain notions that one can try to name. I attempted to interpret the first six features, which explain most of the variance in the data, as a set of six pairs of new, experimental genres, much more informative from the viewpoint of rating prediction than the standard set of genres. Next discussed are: improving the estimation of user features by using implicit information, different ways of using time data, and the matrix norm regularization view. The following sections 4.4, 4.5, 4.6 discuss efficient approaches to collaborative filtering other than matrix factorization: restricted Boltzmann machines, KNN-related methods, kernel methods, and other methods. In section 4.7 I discuss the use of external item metadata. Metadata did not improve the accuracy of methods on the Netflix Prize dataset, but it may be useful in recommender systems to overcome the cold-start problem by content-based prediction of parameters. Section 4.8 overviews different ways of combining models: stacking methods on residuals of one another, building integrated models, and blending independently learned methods. The best predictive accuracy was obtained by blending many accurate but different methods. Maintaining a linearly blended ensemble was also a convenient framework that allowed evaluating the gradually implemented methods, and was helpful in exploring the space of possible models.
Chapter 5 summarizes the experimental results and presents the final ensemble of methods in a sequence of feature selection, which allows assessing the importance of individual methods. The first few most contributing methods are individual methods and combinations of the following: matrix factorization, RBM, kernel methods, K-NN, combined with global effects, variability of models in time, and using the structure of missing data.
Chapter 6 describes the construction of SVD-based recommender systems in two (private) side projects, and also the use of SVD results for clustering and visualizations, used in applications that help in discovering similar items.
Finally, chapter 7 summarizes the work with conclusions about collaborative filtering prediction and with general insights about prediction, drawn from the extensive analysis of the large Netflix dataset.
The aim of a scientific work is to introduce innovations. The previous publication of the author [Pat07] introduced techniques such as adding biases to regularized SVD (proposed also in [Tak07a, Tak07b]) and postprocessing SVD with kernel ridge regression, and introduced the methods NSVD1 and NSVD2. In this work the novel material is:
• a simple modification, largely improving accuracy, of the covariance matrix in KRR by a time-dependent term (and a similar term improving item-item similarities in K-NN methods),
• a directed version of RBM with two sets of weights,
• several other minor accuracy improvements,
• experiments with nonparametric modelling, verifying the form of the priors for regularized SVD and verifying the choice of multiplication out of the possible two-parameter functions – the main conclusion here was that the common choice of Gaussian priors for the parameters of matrix factorization models is inaccurate.
In addition, there are insights for recommender systems and related applications: how to adapt rating prediction to create lists of personalized recommendations, how to heuristically adjust the rating prediction for the prediction variance and for the missing data structure, and how to ensure diversity on a recommendation list. Heuristics are proposed for cold-start situations in recommender systems: content-based prediction of item features and user preferences, and a heuristic of adding artificial users who like a given group of items, without modifying the recommendation algorithm.
And don't try to understand a thing:
'To have' does not derive from 'to be able'.
Postmodernizm, J. Kaczmarski
by comparing hold-out error) are more vague, and there is rarely a point in striving for the best possible accuracy (as in the methods presented in this work) when the right, maximally useful prediction task is not exactly known.
When attempting to use data for prediction, one needs to keep in mind that the gathered data may come from a different distribution than the data needed to perform and evaluate predictions. In such cases we are inferring about a whole function based on observing its values in only part of its domain – for example, inferring what is on a whole photo on the basis of observing only a small part of it. For some datasets and tasks useful predictions can be obtained this way. For other datasets, using the gathered data to obtain useful predictions may be difficult. To decide what the right way to use the data is, and whether the acquired dataset is good enough (or whether we need to gather a different kind of data), one has to use domain knowledge and common sense.
Two larger subtasks in prediction are model identification and estimation of the parameters of the model on the basis of observed data. Prediction is generally a more rigorous area than exploration. Especially the part of inferring parameters within a known probabilistic model is relatively well understood (see the standard references on machine learning grounded in Bayesian statistics [Bis06, Mac03]).
This work focuses on a prediction task, but there will also be elements of exploration. For example, the overall methodology of this study is to draw conclusions from various summaries and visualizations, to iteratively improve models with the discovered effects, patterns, and corrected assumptions. Thus the overall approach is closer to the way of exploratory data analysis than to the strictly Bayesian way of prediction, where the entire prior distribution is specified once, before looking at the data, and the data is not reused.
The chosen task motivates a focus on the static aspect of prediction, that is, using observed data to infer about unobserved data, without variability of the model over time. For modelling movie ratings from a recommender system, the dynamic effects, that is, those changing in time, turn out to have much less importance than the static ones. To give an idea of the approach to prediction that proved effective in the Netflix task, let's list the most useful methods and concepts, which will also be described in more detail in the next chapters. Looking at the description of experiments and results, it was empirically found that the best accuracy was obtained on the basis of regression methods, dimensionality reduction, an appropriate choice of regularization, integrating different models with each other, approximating the Bayesian approach in different ways, and also using several methods distant from probabilistic modelling, such as K-nearest neighbors. Important was taking a time-effective engineering approach to model identification, maintaining an ensemble of predictions, and looking for and using various effects and interactions in the data, in parts of models, or in the side-results of methods.
As for the scope of dynamic modelling, time series methods were used only to a limited extent. Mainly, one-day user effects were modelled. Besides that, the following were also used: corrections of distance-based methods for time information, occasional use of exponential smoothing, and the simple method of binning, mostly for global biases. Some subdomains of time-dependent modelling that are commonly used in other applications, like stochastic differential equations, causal modelling, or stochastic control, were not used here for the Netflix task.
In this chapter I first describe selected general issues in prediction, discussing in section 2.1 the decision-theoretic understanding of the task of prediction. In a static situation the decision-theoretic approach reduces to choosing a loss function and a family of functions approximating the predicted variable, and then minimizing the posterior expected loss. There is depth in the simply put task of inference from data to obtain the best prediction. Attempts to solve these kinds of tasks have led to the development of a wide variety of prediction methods.
Section 2.2 discusses the most popular method of prediction, linear regression. Using the example of linear regression, we present various concepts that recur in all other models: model identification, compliance of the data with the assumptions of the model, transformations of variables, statistical significance, feature selection, and outlier removal. Non-linear variations of regression, such as logistic or Poisson regression, can be realized by iterating linear regression with weighted observations.
Section 2.3 discusses the important issue of regularization. Linear regression can be understood as maximum likelihood estimation (maximum a-posteriori estimation for a flat a-priori distribution) of a simple linear model. Experience shows that, instead of maximum likelihood estimation, it is usually much better to use Bayesian methods with a prior distribution that is not flat, but concentrated around a certain value. We will describe regularized variations of linear regression: ridge regression, and a more complex variant of Bayesian linear regression with a prior distribution on parameters and a known error covariance matrix. We will also consider the case of uncertainty in the predictors.
In section 2.4 I examine the model selection problem when choosing between two specified models, or when little is known about the real model that generated the data. I will justify, but also look critically at, the holdout validation method, which is used extensively throughout the work.
In section 2.5 I discuss when small accuracy improvements are important. A commonly encountered setting is when similar prediction decisions are made multiple times. With thousands or millions of repetitions, the importance of prediction accuracy is greatly amplified.
In the parametric approximation we choose the approximating function from a fixed family of functions ŷ(x, α). We select the parametrization α that minimizes the expected posterior loss:

argmin_α ∫ l(y, ŷ(x, α)) dp(x, y|D)    (1)

The loss function l(y₁, y₂) is chosen depending on the application. The choice of the loss function is a non-obvious task in itself, another source of uncertainty next to the identification of the probabilistic model. A quadratic loss function is often used here. It is worth examining it more closely, because the task of the Netflix Prize, on which we focus here, is to minimize the RMSE (root mean squared error) on the test set, which is close to minimizing the MSE (mean squared error). MSE on a random held-out test set is an unbiased estimator of the expected quadratic loss, if we treat the test data as not being part of the training data. Minimizing the expected posterior quadratic loss leads to:

argmin_ŷ ∫ (y − ŷ(x))² dp(x, y|D) = argmin_ŷ ∫∫ (y − ŷ(x))² p(x|D) p(y|x, D) dx dy = argmin_ŷ ∫ F(x, ŷ(x)) dx    (2)

where

F(x, ŷ(x)) = ∫ (y − ŷ(x))² p(x|D) p(y|x, D) dy
So the resulting optimal point prediction is the expected value of the output variable.
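A quick numerical check of this conclusion (the skewed predictive distribution below is an arbitrary assumption): the expected quadratic loss E[(y − c)²] over candidate point predictions c is minimized at c = E[y]:

import numpy as np

# Sample from an assumed predictive distribution p(y|x, D) and scan
# candidate point predictions c; E[(y - c)^2] = E[y^2] - 2c E[y] + c^2.
rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=1.5, size=200_000)

candidates = np.linspace(0.0, 10.0, 1001)
expected_loss = (y**2).mean() - 2 * candidates * y.mean() + candidates**2

print("argmin of expected loss:", candidates[expected_loss.argmin()])
print("E[y]:", y.mean())   # the two agree up to grid and Monte Carlo error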
To have a broader insight into the different possible situations in prediction, let's consider a loss function other than the quadratic one, namely the example of binary classification – when Y takes one of two possible values. We will look at a largely simplified example of medical diagnostic tests (ignoring domain adaptation issues). Each test is described by two probabilities: the probability of a correct guess when a person is ill, and the probability of a correct guess when a person is healthy. Our decision problem is to select the best test among multiple available ones. To select the test that will accurately detect a few sick people in a large population of healthy people, we should choose an asymmetric loss function: the value of the loss function for a type I error, that is, misclassifying a sick patient as healthy, should be greater than for a type II error, that is, misclassifying a healthy patient as sick.
We can conclude that in prediction it is not enough to have a good probabilistic model of the data. We need to choose the right loss function, and to do that we need to know the context of the data analysis. Expert knowledge of the area the data comes from is often necessary.
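A minimal sketch of such a decision problem follows; all the numbers (sensitivities, specificities, prevalence, and loss values) are made up for illustration:

# Selecting a diagnostic test by expected loss under an asymmetric loss function.
tests = {                      # (P(correct | ill), P(correct | healthy))
    "test_A": (0.99, 0.90),
    "test_B": (0.90, 0.99),
}
prevalence = 0.01              # assumed P(ill) in the screened population
l_type1 = 1000.0               # loss for classifying a sick patient as healthy
l_type2 = 1.0                  # loss for classifying a healthy patient as sick

for name, (sens, spec) in tests.items():
    expected_loss = (prevalence * (1 - sens) * l_type1
                     + (1 - prevalence) * (1 - spec) * l_type2)
    print(name, expected_loss)
# Under this asymmetric loss test_A wins; under a symmetric 0-1 loss
# test_B would be preferred, because errors on the many healthy people dominate.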
After establishing the loss function, to make predictions by formula (1) we need the distribution p(x, y|D) = ∫ p(x, y|θ) dp(θ|D). In practice we do not know the exact process that generated the data, that is, we do not know the model p(x, y, θ), and usually the biggest challenge in prediction is to identify the model accurately enough (note that some aspects of the model can be irrelevant to the chosen loss function, and hence do not need to be identified accurately). Similarly difficult are attempts to model the expected posterior loss (1) directly, bypassing the explicit construction of the full generative model. Even if we identify the model fairly accurately, calculating posterior probabilities often has high computational complexity, which forces us to use approximations, introducing further inaccuracies.
Let's say a bit more about the two aforementioned approaches to prediction. Prediction methods can be divided into two basic types: an indirect approach, called generative, and a direct approach, which in the case of classification tasks (when Y is a categorical variable) is called discriminative. The generative approach is to calculate p(x, y|D) (the posterior probabilities after taking into account the training data) using Bayes' rule, and then to choose the approximation of y|x that minimizes (1).
Because our loss function is focused on approximating the output variable y, a fully generative model p(x, y|D) of all observed variables is often not really needed. It may often be enough to discover the general relationship p(y|x, D) without exactly modelling the distribution of the predictor variables p(x|D) (we can also model the expected posterior loss directly, as suggested by formula (2), without explicitly computing the posterior probabilities as an intermediate step). This direct approach (in classification tasks called discriminative), because it does not need to model the part of the data unrelated to the chosen prediction task, results in methods with fewer parameters to be learned than the corresponding generative methods. An example is binary classification of a variable drawn from one of two multidimensional Gaussian distributions. The generative approach leads to methods like LDA or QDA with O(k²) parameters, where k is the dimension of the Gaussians (the number of predictor variables). The direct approach leads to logistic regression methods with only O(k) parameters.
If we knew the probabilistic model exactly, the generative approach would be equivalent to the direct (discriminative) approach. In practice, we usually have a limited capability of identifying the part p(x|D), and trying to model that part takes effort and can introduce inaccuracies. Sometimes it is useful to consider generative models, but I almost always tend to use the direct approach.
A whole variety of methods realize the direct and the generative approach. For different structures of hidden variables and parameters (which can be seen as hidden variables), as well as for various types of uncertainty about the form of the model that generated the data, different approximations can be proposed. If we need very accurate prediction, we can use MCMC or a Variational Bayesian approximation. Many methods can be regarded as high-level approximations of the Bayesian approach, for example, methods inspired by biology, such as neural networks, or methods inspired by statistical physics, such as Boltzmann machines. There are also effective methods for which the connection to probability is not obvious, such as K-nearest neighbors.
The next section describes the most popular and best-examined prediction method, linear regression.
prior distributions, the inference method, the computational complexity, or whether the
optimization problem has local minima to avoid.
I will briefly discuss linear regression, which is the most widely used method of prediction, while being one of the simplest and most thoroughly studied. Linear regression owes its popularity to the prevalence of the normal distribution in nature, which is a consequence of the fact that sums of independent variables converge, as the number of variables increases (apart from a few exceptions), to a normal distribution.
The first known use of multivariate linear regression was by Gauss, who used this method to estimate the path of the dwarf planet Ceres using a small number of observations [Gauss1809, Ste95]. To solve the resulting system of linear equations he used the algorithm known today under the name of Gaussian elimination.
In linear regression we assume a linear relationship between a vector of p constants xᵢ, called explanatory variables, covariates or predictors, and a variable yᵢ, called the response variable, predicted variable, or output variable. The linear regression model has the form:

y = Xβ + ε

The maximum likelihood estimate of the parameters and its variance are:

Eβ = (XᵀX)⁻¹Xᵀy
Var β = (XᵀX)⁻¹σ²
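A minimal numerical sketch of these formulas on assumed synthetic data:

import numpy as np

# Ordinary least squares: Eβ = (XᵀX)⁻¹Xᵀy, Var β = (XᵀX)⁻¹σ².
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + predictors
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically stable least squares
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])     # unbiased estimate of σ²
var_beta = np.linalg.inv(X.T @ X) * sigma2_hat
print(beta_hat)
print(np.sqrt(np.diag(var_beta)))                 # standard errors of coefficients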
When applying regression to real-life data coming from an unknown distribution, regression diagnostics is necessary [Far04, Far06], that is, making sure that the model assumptions are sufficiently met by the data. Residuals should be examined to test whether the errors are normally distributed, independent, and homoscedastic (have constant variance). Standard plots used for this purpose are: residuals vs. fitted values, quantile-quantile plots of residuals vs. the normal distribution, the square root of standardized residuals vs. fitted values, or autocorrelation plots for time dependence. We should check whether the relationship between the predictors and the response is really linear, and whether transformations of the response and predictors should be performed (the Box-Cox procedure can suggest transformations of the response, and generalized additive models can suggest transformations of the predictors). Sometimes it is useful to standardize (normalize) the predictor variables, that is, to subtract the estimated mean and divide by the estimated standard deviation. Another thing to do is removing or otherwise dealing with outliers, that is, unusual observations which visibly do not come from the assumed model (often outliers result from errors in the process of creating the dataset). Special care should be taken to examine influential observations – those whose removal considerably influences the regression coefficients. Plots usually used to look for outliers are: residuals vs. leverage, Cook's distance vs. leverage, and one-dimensional boxplots.
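Some of these diagnostics have simple numeric counterparts; a sketch on assumed synthetic data (the cut-off values used are common rules of thumb, not fixed rules):

import numpy as np

# Leverage, standardized residuals and Cook's distance for a linear model.
rng = np.random.default_rng(0)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)
y[:3] += 5.0                                   # inject a few artificial outliers

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix: fitted values = H y
lev = np.diag(H)                               # leverage of each observation
s2 = resid @ resid / (n - p)
std_resid = resid / np.sqrt(s2 * (1 - lev))    # standardized residuals
cooks = std_resid**2 * lev / (p * (1 - lev))   # Cook's distance

print("suspected outliers: ", np.where(np.abs(std_resid) > 3)[0])
print("influential points: ", np.where(cooks > 4 / n)[0])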
In ordinary, non-regularized linear regression, feature selection usually improves generalization performance. Commonly used criteria for selecting or rejecting predictors are:
• statistical significance (paying attention to multicollinearity),
• comparing the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) of models; BIC is p log n − 2 log L, where L is the data likelihood, p is the number of parameters, and n is the number of observations; AIC is 2p − 2 log L,
• the change in the residual sum of squares (RSS).
A criterion that is sometimes used, but should be avoided (it often gives misleading information about the dependence between two variables), is the simple Pearson correlation coefficient between the response and a predictor.
To roughly assess goodness of fit, the proportion of the explained variance to the total variance can be used:

R² = (TSS − RSS)/TSS = 1 − Σᵢ(ŷᵢ − yᵢ)² / Σᵢ(yᵢ − ȳ)² = Σᵢ(ŷᵢ − ȳ)² / Σᵢ(yᵢ − ȳ)²

where ŷᵢ are the predictions, yᵢ are the observed values, and ȳ is the average value of yᵢ.
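A short sketch computing R², AIC and BIC for a Gaussian linear model on assumed synthetic data; for a Gaussian likelihood, −2 log L = n log(2πσ̂²) + n at the maximum likelihood estimate σ̂² = RSS/n:

import numpy as np

# Goodness of fit and information criteria for a Gaussian linear model.
rng = np.random.default_rng(0)
n, p = 300, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ rng.normal(size=p) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
rss = ((y - X @ beta)**2).sum()
tss = ((y - y.mean())**2).sum()
r2 = 1 - rss / tss                       # R² = (TSS - RSS)/TSS

neg2loglik = n * np.log(2 * np.pi * rss / n) + n
aic = 2 * p + neg2loglik                 # AIC = 2p - 2 log L
bic = p * np.log(n) + neg2loglik         # BIC = p log n - 2 log L
print(f"R2={r2:.3f}  AIC={aic:.1f}  BIC={bic:.1f}")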
In a situation of non-constant variance of errors (heteroscedasticity), weighted linear regression can be used:

β̂ = (XᵀWX)⁻¹XᵀWy

where W is a diagonal matrix of weights wᵢᵢ = 1/σᵢ². Regression with a non-Gaussian response variable is called generalized linear regression, and is realized by iteratively reweighted least squares – running several iterations of weighted linear regression with weights wᵢᵢ dependent on the current estimates ŷᵢ. For a binary response it is logistic regression, with wᵢᵢ = ŷᵢ(1 − ŷᵢ). For a Poisson response it is Poisson regression, with wᵢᵢ = exp(ŷᵢ). We could also use an additional regression component to model the varying variance of errors, with the same predictors that are used to model the mean, or with a different set of predictors.
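A minimal sketch of logistic regression fitted by iteratively reweighted least squares, with weights wᵢᵢ = ŷᵢ(1 − ŷᵢ) as described above; the synthetic data is an assumption:

import numpy as np

# Logistic regression via IRLS: each step is a weighted linear regression.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.5, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

beta = np.zeros(X.shape[1])
for _ in range(25):
    eta = X @ beta
    p_hat = 1 / (1 + np.exp(-eta))                 # current estimates ŷ
    w = np.clip(p_hat * (1 - p_hat), 1e-6, None)   # IRLS weights, kept positive
    z = eta + (y - p_hat) / w                      # working response
    XtW = X.T * w                                  # weighted step: β = (XᵀWX)⁻¹XᵀWz
    beta = np.linalg.solve(XtW @ X, XtW @ z)
print(beta)                                        # close to beta_true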
Ordinary linear regression and generalized linear regression are results of the maximum likelihood method. Experience shows that maximum likelihood estimation often results in overfitting the training data, meaning that the method fits the training data closely, but does not generalize well to unobserved data (as seen in an increased error on held-out test data). Instead of using a flat prior distribution on the parameters β, it is usually better to use a prior centered around some value, leading to regularized estimates.
2.3 Regularization
Using a wrong model or wrong prior distributions for parameters usually leads to suboptimal accuracy (it may be visible as overfitting or underfitting the data). The last section introduced the method of linear regression, which is a result of maximum likelihood estimation, that is, MAP (maximum a-posteriori) estimation with an assumed flat distribution on the parameters. We can list many drawbacks of flat priors: they do not appear in nature (they are improper priors – not probability distributions), and using them usually leads to overfitting the data and to inferior accuracy in comparison to priors concentrated around some value. The most common distribution in nature is the normal distribution, and this prior distribution seems to be the most natural choice for the parameters β in regression (we should note that a normal distribution with a large variance will have a similar effect to using a flat prior).

Empirically, we usually observe that, in comparison with the maximum likelihood method, a technique called regularization improves predictive accuracy. Regularization, sometimes called weight decay or parameter shrinkage, adds to the likelihood function a term penalizing large weights. The most common form of regularized linear regression, called ridge regression, uses an L2 penalty on the weights (also known as Tikhonov regularization), resulting in the following formula:

β = (XᵀX + λI)⁻¹Xᵀy
The L2 regularization comes from MAP estimation in a linear regression model with Gaussian priors on the parameters β, and with a known, fixed Gaussian error term. Other types of regularization are less common, for example, L1 regularization, which gives the method called lasso regression.
We can select the optimal λ, for example, by cross-validation, that is, by randomly choosing a subset of observations as a held-out test set, and learning the parameters β on the training set containing the remaining observations. The procedure is repeated multiple times and the resulting accuracy is averaged, which gives an estimate of the predictive accuracy on unknown data (see also the next section for a discussion of cross-validation). When λ is lower than the optimum, we typically observe the phenomenon of overfitting: with a reduction of λ the training error decreases, but the error on the held-out test set increases. When λ is larger than the optimal value, we observe underfitting: the estimated parameters are closer to zero, and both the training error and the test error increase.
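A sketch combining the ridge formula with λ selection by repeated random holdout, on assumed synthetic data:

import numpy as np

# Ridge regression, β = (XᵀX + λI)⁻¹Xᵀy, with λ chosen by holdout validation.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X @ rng.normal(scale=0.3, size=p) + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def holdout_rmse(lam, repeats=20, test_frac=0.2):
    errs = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        n_test = int(test_frac * n)
        test, train = idx[:n_test], idx[n_test:]
        b = ridge(X[train], y[train], lam)
        errs.append(np.sqrt(((y[test] - X[test] @ b)**2).mean()))
    return np.mean(errs)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [holdout_rmse(lam) for lam in lambdas]
print("selected lambda:", lambdas[int(np.argmin(scores))])
# λ below the optimum overfits (test error rises), above it underfits.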
Ridge regression is based on the a-priori belief that values close to zero are more likely than values distant from zero. Ridge regression is the result of Bayesian inference in the following probabilistic model, where y depends on X linearly through the unknown parameters β, plus white noise with a fixed, point-estimated variance σ²:

y ∼ N(Xβ, σ²I)    (3)
β ∼ N(0, τ²I)    (4)

Using Bayes' rule, p(β|X, y) ∝ p(y|X, β)p(β), we calculate the posterior distribution of β after taking into account the observed data X, y:

β ∼ N(Eβ, Var β)
Eβ = (XᵀX + (σ²/τ²)I)⁻¹Xᵀy
Var β = ((1/σ²)XᵀX + (1/τ²)I)⁻¹

We see that the obtained PME (posterior mean) or MAP estimate of β is equivalent to ridge regression with λ = σ²/τ².
If there are more features (columns of X) than observations (rows of X), the dual form of ridge regression – kernel ridge regression, the PME of the corresponding Gaussian process – is faster to calculate:

β = Xᵀ(XXᵀ + λI)⁻¹y
The above simple probabilistic linear model (ridge regression) performs well enough for a wide choice of data and prediction tasks, even when the underlying assumptions are not exactly met, but often a more flexible model is needed. Common modifications are:
• another prior distribution of the parameters, with a mean other than the zero vector, and with the parameters not independent a-priori, but with a dependence structure defined by a covariance matrix S:

β ∼ N(β₀, S)    (5)
• heteroscedastic errors (varied variance – which can be seen as weighting of observations), not independent, with the dependence structure defined by a covariance matrix Ω:

y ∼ N(Xβ, Ω)    (6)

In the simplest setting S, Ω are constant matrices, and X also remains constant. By W = Ω⁻¹ we denote the precision matrix of the errors. Then:

β ∼ N(Eβ, Var β)
Eβ = (XᵀWX + S⁻¹)⁻¹(XᵀWy + S⁻¹β₀)
Var β = (XᵀWX + S⁻¹)⁻¹
The above model is very close to the Black-Litterman model used in portfolio optimization [Bla92, Pal08]. Again, the predictions in the dual version are the posterior mean of the corresponding Gaussian process:

ŷ = Xβ₀ + XSXᵀ(XSXᵀ + Ω)⁻¹(y − Xβ₀)
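A minimal sketch of the posterior formulas above on assumed synthetic data; the prior mean β₀, prior covariance S, and error covariance Ω are arbitrary illustrative choices:

import numpy as np

# Bayesian linear regression with a general Gaussian prior and error covariance:
# Eβ = (XᵀWX + S⁻¹)⁻¹ (XᵀWy + S⁻¹β₀), Var β = (XᵀWX + S⁻¹)⁻¹, W = Ω⁻¹.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.5, size=n)

beta0 = np.zeros(p)              # prior mean β₀
S = 0.5 * np.eye(p)              # prior covariance of the parameters
Omega = 0.25 * np.eye(n)         # error covariance (homoscedastic here)
W = np.linalg.inv(Omega)         # precision matrix of the errors
S_inv = np.linalg.inv(S)

A = X.T @ W @ X + S_inv
E_beta = np.linalg.solve(A, X.T @ W @ y + S_inv @ beta0)
Var_beta = np.linalg.inv(A)
print(E_beta)
print(np.diag(Var_beta))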
The parameters S, Ω in (5), (6) were constant matrices, and τ² and σ² were constants in the model (3), (4). Instead of using fixed constants or point estimators, these parameters can be treated as random variables with defined priors (for S and τ² called hyperpriors). Those additional priors could be estimated, for example, from similar tasks in a multitask learning setting.
All of the above is not enough to obtain the best accuracy in matrix factorizations, which can be seen as performing simultaneous ridge regressions: one regression for each user, and one regression for each movie. The matrix of predictors X is not constant in those regressions, but consists of random variables with different amounts of uncertainty. It was noticed experimentally [Fun06] that better than a constant regularization term is an amount of regularization growing linearly with the number of observations. Variational Bayesian matrix factorization [Lim07, Rai07] explains this phenomenon to some extent – the influences of the uncertainties from each observation sum up to a term that behaves as additional regularization. A well-performing modification of ridge regression for the case of non-constant X entries contains both the linear and the constant regularization term:

β = (XᵀX + (λ₁ + λ₂n)I)⁻¹Xᵀy
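A sketch of this count-dependent regularization as a per-user ridge regression; λ₁, λ₂ and the toy data are illustrative assumptions:

import numpy as np

# Per-user ridge regression whose penalty grows with the number of the
# user's ratings: β = (XᵀX + (λ₁ + λ₂ n)I)⁻¹Xᵀy.
def user_features(X_u, y_u, lam1=0.5, lam2=0.02):
    # X_u: (n_u, k) features of the items rated by one user; y_u: the ratings.
    n_u, k = X_u.shape
    lam = lam1 + lam2 * n_u            # regularization grows linearly with n_u
    return np.linalg.solve(X_u.T @ X_u + lam * np.eye(k), X_u.T @ y_u)

rng = np.random.default_rng(0)
Q = rng.normal(size=(20, 5))           # item features (e.g., from an SVD model)
rated = rng.choice(20, size=7, replace=False)
ratings = rng.normal(3.6, 1.0, size=7)
print(user_features(Q[rated], ratings))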
of the common decisions to make when using a linear model: the choice of input variables (predictors) and of interactions between input variables, transformations of input and output variables, weighting of observations, and the choice of the prior distribution of the parameters. A large challenge is that domain knowledge is often expressed intuitively (or is even unconscious) and is difficult to translate into a probabilistic model.

A practical approach to prediction for real-life data is using the data repeatedly, iteratively improving the model. Limited reuse of the data should not cause noticeable overfitting (choosing a model with worse performance on unseen data). Having more data allows choosing between more models; that is why larger datasets are interesting for researchers, allowing a wider selection of techniques than when working with small datasets, and allowing a better understanding of the underlying process that generated the data.
The process of searching for the best model can be treated, in simplification, as a series of model selection tasks; that is, we are faced multiple times with a choice between two or more models (sometimes infinitely many). The choice of models for comparison may have different origins: it may result from domain knowledge, from hypothesizing about possible effects in the data, or from summarization or visual examination of the data – noticed patterns, dependencies, probability distributions, incompliances with the model assumed so far, etc.
In the model selection process, usually smaller and simpler models are preferred (a heuristic rule called Occam's razor, sometimes formalized as the minimum description length principle). An a-priori assumption that simpler models are more probable than complex ones would be disputable. Nature sometimes favors simplicity – for example, isolated physical systems approach thermodynamic equilibrium – and sometimes nature favors complexity – for example, simple rules can lead to complicated fractal structures [Man82, Wol02]. Independently of whether we a-priori expect a simple or a complex model, if we consider concrete situations of model selection between similar models, with a certain amount of observed data coming from one of the models, then either calculating the evidence of each model (and the Bayes Factor), or considering a chosen measure of prediction error, leads to model selection criteria like AIC or BIC, which penalize larger models. Besides better predictive accuracy, there are many engineering reasons to prefer simpler models – they are easier to work with, are more likely to have a tractable algorithm for parameter inference, need less code, are less time-consuming to implement, and are less prone to implementation errors. On the other hand, we can mention some advantages of larger models – they are useful for model identification, and can help to spot important effects which smaller models are not capable of capturing. We should remember that a choice of one of many models is a simplification, valid when the remaining models are very implausible. Sometimes proper Bayesian averaging of models is needed, instead of selecting one model.
I will examine a situation of model selection between two fixed, simple models M₁, M₂. Assume we have N samples of data D, generated iid (independent, identically distributed) from one of the two models (distributions), with the a-priori probabilities of the models p(M₁) and p(M₂) = 1 − p(M₁). The posterior distribution is then the following:

p(M₁|D) = p(D|M₁)p(M₁) / (p(D|M₁)p(M₁) + p(D|M₂)p(M₂)) = 1 / (1 + p(D|M₂)p(M₂) / (p(D|M₁)p(M₁)))
and each of the two possible actually drawn models M_real. The optimal decision is given by minimizing the expected posterior loss:

d(D) = argmin_{M ∈ {M₁, M₂}} Σᵢ l(M, Mᵢ) p(Mᵢ|D)    (7)

The optimal decision function d(D) depends on the choice of the loss function l, which states the cost of the errors of the first and second type, that is, the cost of selecting the wrong model in our meta-task. Let's denote the values of the loss function for the four possible arguments by l₁₁, l₁₂, l₂₁, l₂₂, where the first subscript indicates the model selected, and the second denotes the actual model which generated the data D. A reasonable assumption about our loss function is that correct recognition of the model is better than incorrect, that is, l₁₁ < l₂₁ and l₂₂ < l₁₂. Minimizing the expected posterior loss (7) gives a decision rule selecting the model M₁ iff p(M₂|D)/p(M₁|D) < (l₂₁ − l₁₁)/(l₁₂ − l₂₂). We see that the decision-theoretic approach to the meta-task of a choice between two known models is equivalent to using the Bayes Factor with a cut-off threshold dependent on the chosen loss function.
The loss function in the meta-task can be directly related to the loss function used to evaluate the generalization error. For example, in the Netflix Prize task we are interested in the expected RMSE on unseen data (from the distribution of the observed data), so if we want to compare two models and choose one of them, it seems reasonable to use the expected RMSE also as the loss function in the meta-task of model selection. Because for real-life data we do not know the probabilistic model that generated the data – especially at the beginning of the analysis of a given dataset – universal criteria, the same for all models, are useful. In practice, popular universal model selection criteria are methods comparing the validation error (here the average loss function) on held-out validation data (a test set). Widely used is k-fold cross-validation, where the data is divided into k equal parts, each of which serves once as the test set (the remaining parts forming the training set), and the final result is the average validation error over the k learned models. An advantage of methods based on the hold-out validation error is their simplicity and versatility. Such methods are the same for different datasets, even for a completely unknown model.
Having different possible criteria of model selection, one can wonder when one criterion is better than another, by how much, and what "better" means. We will compare, for a choice between two fixed simple models, two groups of model selection criteria: one group uses the Bayes Factor with different thresholds (equivalent to choosing a loss function in the meta-task of deciding on one model), and the second group are universal, model-independent criteria measuring the validation error. For those model selection criteria we will compare the frequencies of type I and type II errors, that is, how often the given criterion identifies the correct model. In this toy study we evaluate model selection criteria according to how well they select the more probable model; instead we could, for example, evaluate model selection criteria according to how well they select the model that has better predictive accuracy (for a given measure of predictive accuracy).
Let's look closer at a case of model selection between two known simple models. The accuracy of different model selection criteria will be judged by the frequency of making type I and type II errors, that is, how often a given criterion is wrong and chooses M₂ when the data comes from M₁, and how often it chooses M₁ when the data comes from M₂. The data is generated as follows: first, with probability 1/2 we choose the model M₁ or M₂ (we assume that the models M₁ and M₂ are a-priori equally likely), then we generate the parameter µ ∼ N(0, 1) for the model M₁, or two independent parameters µ₁ ∼ N(0, 1), µ₂ ∼ N(0, 1) for the model M₂, and generate N data samples from the selected model. In the model M₁, N iid samples are drawn from the distribution N(µ, 1). In the model M₂, N/2 iid samples are drawn from the distribution N(µ₁, 1), and N/2 iid samples are drawn from the distribution N(µ₂, 1).
The evidence function for the model M₁ is the following:

p(D|M₁) = ∫ p(D|µ, M₁) p(µ|M₁) dµ = (2π)^(−N/2) / √(N+1) · exp( S²/(2(N+1)) − ½ Σᵢ yᵢ² )

where S = Σᵢ yᵢ and the sums run over i = 1, …, N.
We assumed that the models are a-priori equally probable: p(M₁) = p(M₂) = 1/2. Then the Bayes Factor is the ratio of the M₂ evidence and the M₁ evidence:

p(D|M₂) / p(D|M₁) = ( √(N+1) / (N/2 + 1) ) · exp( ½ ( (S₁² + S₂²)/(N/2 + 1) − S²/(N+1) ) )

where S₁ is the sum of the first N/2 samples, S₂ the sum of the remaining N/2 samples, and S = S₁ + S₂.
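A sketch of this toy study: drawing data from M₁ or M₂ and evaluating the closed-form log Bayes Factor derived above; the numbers of draws are illustrative:

import numpy as np

# log BF = log p(D|M2) - log p(D|M1), computed from the two evidences above.
def log_bayes_factor(y):
    N = len(y)
    S, S1, S2 = y.sum(), y[:N // 2].sum(), y[N // 2:].sum()
    return (0.5 * np.log(N + 1) - np.log(N / 2 + 1)
            + 0.5 * ((S1**2 + S2**2) / (N / 2 + 1) - S**2 / (N + 1)))

rng = np.random.default_rng(0)
N, draws, correct = 100, 10_000, 0
for _ in range(draws):
    if rng.random() < 0.5:                      # M1: one shared mean
        y = rng.normal(rng.normal(), 1.0, N)
        correct += log_bayes_factor(y) < 0      # negative log BF favors M1
    else:                                       # M2: two means, N/2 samples each
        y = np.concatenate([rng.normal(rng.normal(), 1.0, N // 2),
                            rng.normal(rng.normal(), 1.0, N // 2)])
        correct += log_bayes_factor(y) > 0
print("accuracy of the BF threshold-0 rule:", correct / draws)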
Figures 1 and 2 show the precision-recall relation for predicting the model M₁ using different model selection criteria. The Bayes Factor with a varying cut-off threshold (BF on the plots denotes log(p(D|M₁)/p(D|M₂))) is compared with several holdout validation methods:
• VAL20% – a random 20% of the data as a validation set, with the remaining 80% used as the training data,
• VAL50% – a random 50% of the data as a validation set,
• VAL20%×5 – the averaged result of 5 draws of VAL20%,
• VAL50%×5 – the averaged result of 5 draws of VAL50%,
• CV5FOLD – 5-fold cross-validation,
• CV10FOLD – 10-fold cross-validation.
Figure 1 compares precision-recall for N = 100 data samples, averaged over 50,000 draws of models and data. Figure 2 compares precision-recall for N = 2000 data samples, averaged over 100,000 draws of models and data to evaluate the Bayes Factor, and 10,000 draws to evaluate the holdout validation methods.
We observe visibly worse accuracy of the methods based on the hold-out validation error, in comparison to the optimal decisions (for a desired recall or precision level) made by the Bayes Factor. We should remark that for real-life data we rarely know the exact form of the models we want to compare. Methods like cross-validation are more convenient because they are identical for any data, even when the underlying model is completely unknown.
In this work the following methods were used for model selection:
• for the purpose of comparing methods, RMSE15 was used, that is, using 15% of the probe set (about 0.2% of all available ratings) as a held-out validation set; this validation set was also used for automatic and manual tuning of parameters, which can be seen as model selection among a large number of similar models,
• for the purpose of feature selection: calculating statistical significance in linear regression (or ridge regression) and comparing the change in the sum of squared errors (SSE) (see chapter 5, "Experimental results"),
• occasionally, in linear regression, criteria like AIC and BIC [Far04] were used (also their greedy, automated versions, the stepAIC and dropterm functions in R), that is, SSE (or negative double log-likelihood in general) corrected by adding the term Cp, where p is the number of parameters of the model, and C is a chosen constant (C = 2 in AIC, and C = log n in BIC).
Figure 1: Precision-recall: Cross-validation vs. Bayes Factor, N=100. [Plot of precision vs. recall: the Bayes Factor curve, with marked thresholds BF > −1, BF > 0, BF > 1, lies above the points for VAL20%, VAL50%, VAL20%×5, VAL50%×5, CV5FOLD and CV10FOLD.]
A validation set of size 0.2% of the training set may seem small, even if it is about 210,000 data points for a training set with 100 million data points, but in my judgement, model assessment using this single validation set was good enough for the goals of the experiments. A validation set of this size allowed for extensive model selection, feature selection, and parameter tuning, without introducing large overfitting.
There is a difference between model selection to find the most plausible model, and model selection optimizing the selected criterion of expected loss. RMSE on a validation set has the advantage of being a good estimator of the generalization RMSE (our expected loss). It is far from being the most effective model selection criterion, but it worked well enough for us in the Netflix task.
Because the amount of available data is always limited, smaller models are preferred, and this reinforces the need for model selection. We can see the limitations of methods, such as those based on decision trees, that perform a large number of model selection decisions, usually based on small amounts of data. Large-scale, automatic feature selection with millions of candidate features does not work well, even with very large datasets, unless we can use domain knowledge to limit the set of features considered.
Figure 2: Precision-recall: Cross-validation vs. Bayes Factor, N=2000. [Plot: precision (y-axis) vs. recall (x-axis); a Bayes Factor curve with cut-off levels BF > −1, BF > 0, BF > 1 marked, and points for VAL20%, VAL50%, VAL20%×5, VAL50%×5, CV5FOLD and CV10FOLD.]
methods (adjustment for missing data), which has yet to be tested, and the best way of adjustment has yet to be determined.
Because the task of proposing lists of recommendations is not precisely defined, and because we are usually limited to tuning accuracy on ratings coming from the training set distribution (only items chosen by the user are rated), it can be disputed whether there is a point in striving for maximum accuracy on an inexact task with an imperfect dataset. This work is focused on obtaining the best accuracy for the chosen task of rating prediction evaluated by RMSE on the data distribution, treating the task as if it were fully relevant. Such an extensive analysis coming from the Netflix Prize competition, besides helping to understand the domain of recommendations, may contribute to a better understanding of other prediction or data mining tasks.
Because a wider scope interests us, not only recommendations and prediction of ratings, let us digress on a commonly appearing general case in prediction: repeated bets. Accuracy of prediction typically matters more the more often a given prediction situation repeats (an extreme case I know of is an evaluation function in computer chess, which needs to predict the likelihood of winning in a given position very accurately, and is executed 10^9 to 10^12 times during one game). Many problems can be considered as series of similar bets. We will look at fractional bets, where as the stake we can pick any fraction of the current bankroll. Fixed-size bets with 0-1 decisions on whether to enter the bet can be seen as a special case of fractional bets. The subject of repeated bets is loosely related to recommendations (recommendations can be seen as bets on the user's satisfaction and the user's time, though we can hardly speak of a bankroll there). Repeated bets are a clear, common situation in finance (including personal finance) or gambling, where the notion of a bankroll can usually be highlighted and situations of making bets of a certain value can be precisely distinguished.
By repeated betting we understand the following situation: a fractional bet repeated N times. Each bet is associated with a certain event (a Bernoulli trial), which happens with probability p, and we can bet a fraction r ∈ [0, 1] of the current bankroll y on whether the event happens. If the event happens, we win αry, which is added to the bankroll, otherwise we lose ry. To simplify, we assume that all events are independent (while remembering that in reality they usually are not), meaning that the series of events is a Bernoulli process. Each sequence of N decisions ri leads to a certain distribution of the final bankroll, which has a certain value for the decision maker. One could try scoring the final distributions with a chosen utility function, individual for each decision maker – a person may be risk-averse or risk-seeking. A common simplification is to use the expected logarithm of the final bankroll as the utility function for the decision maker. Then in each round of betting the optimal decision is the same, ri = r. Our utility score U is E log(∆b), the average growth rate in a single turn of betting:
U = E log(∆b) = p log(1 + αr) + (1 − p) log(1 − r)    (8)
where r is the fraction of the bankroll bet, α is the reward, and p is the probability of a win. Maximizing U with respect to r we get the result called the Kelly criterion [Kel56] – the fraction of the bankroll should be chosen according to the rule:
r̂ = max(0, (1 + 1/α)p − 1/α)    (9)
In practice, we do not know the true probability of a win, and in its place we can use a point estimator. Here comes the importance of prediction accuracy – we can wonder what would happen if our estimate of p differed from the optimum by a certain value ε. By how much will U differ from using the optimal prediction? We assume that the estimator p′ known to us is subject to error, p′ = p + ε, and that p′ and α are large enough that r̂ > 0.
After inserting into (8) the r̂ chosen according to (9) we get:
U = p log((α + 1)(p + ε)) + (1 − p) log((α + 1)(1 − p − ε)/α),
so the utility lost due to the estimation error ε is
U(0) − U(ε) = p log(p/(p + ε)) + (1 − p) log((1 − p)/(1 − p − ε)) ≈ ε²/(2p(1 − p)).
The above case was greatly simplified. In a real case we should choose a better utility function, develop an accurate predictive model for the events, including the dependence between events, and the predictive model should make use of new data arriving during the process of betting.
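To make the sensitivity to ε concrete, here is a minimal sketch of mine (not part of the original experiments) that evaluates (8) at the Kelly fraction (9) computed from an erroneous estimate p + ε; the values α = 1 and p = 0.55 are illustrative:

#include <algorithm>
#include <cmath>
#include <cstdio>

// Expected log-growth per bet, equation (8).
double growth(double r, double alpha, double p) {
    return p * std::log(1.0 + alpha * r) + (1.0 - p) * std::log(1.0 - r);
}

// Kelly fraction, equation (9), clipped at zero.
double kelly(double alpha, double p) {
    return std::max(0.0, (1.0 + 1.0 / alpha) * p - 1.0 / alpha);
}

int main() {
    const double alpha = 1.0, p = 0.55;     // even-money bet, true win probability 0.55
    for (double eps : {0.0, 0.02, 0.05}) {  // error of our estimate of p
        double r = kelly(alpha, p + eps);   // bet using the erroneous estimate
        std::printf("eps=%.2f  r=%.3f  U=%.5f\n", eps, r, growth(r, alpha, p));
    }
}

For these values, an error of ε = 0.05 already drives the growth rate U to about zero – the entire edge of the bet is consumed by the inaccuracy of the prediction.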
– So what’s the difference between you and me?
– Because you don’t have to survive.
You’re weak. You have emotions.
You play little games with your mind.
You chase your tales.
For a user, employing a recommender system is one way to reach the information that interests him. Discovering or rediscovering information is all the more difficult now that we face an information explosion – the sudden appearance of large amounts of data, mainly human-generated content on the Internet. A user wants to find relevant, good quality information in these large datasets, and filter out irrelevant information, all while saving his time and effort. For this purpose standard techniques of information retrieval are usually used, like various types of search in conjunction with indexing content descriptions, metadata or tags. Recommender systems complement the information retrieval techniques in situations when the user does not know exactly what he is looking for, or when the content is not described accurately, or not popular enough to appear high in search results. Recommender systems gather information about the likes and dislikes of each user, and utilize the mechanism of personalization. The gathered information about preferences and tastes that is expressed directly, for example by ratings, often turns out to be more informative and more important than content-based information.
To be sure that we are solving the right problem in a recommender system, we have to look broadly at the human-computer interaction perspective – what is recommended to a user, how and why. The statistical perspective is important: which data is collected and how, what the data means, how to overcome various cold start problems, and what the right prediction task to solve is. The software engineering perspective is also important: simplicity of the design of the recommender system, reliability, speed, and the amount of resources used.
Many deployed recommender systems have been described in the literature [Gol92, Res94, Hil95, Kon97, Gol01, Sch02b, Mil03, Lin03, Ali04, Pra11, Ama12]. The article [Mon03] overviews 37 early recommender systems. At that time the most popular technique employed was recommendation based on user-user similarity. Currently, the use of matrix factorizations is increasingly frequent. Other articles summarizing the subject of recommender systems are [Sch99, Ric10, Jan10].
Let's list the issues that usually need to be considered when developing a recommender system (the list largely follows the presentations in [Her00a, Bur02, Her04, Sch07]):
• User's perspective.
Let's look closer at one of the perspectives mentioned, the user's viewpoint. Here the design of human-computer interactions is especially important: what does the user interface look like? Is the interface simple and easy to use? How is the recommender integrated with the larger service? What is recommended to the user and how? Which additional features are implemented?
Important parts of the design are the navigation methods in the recommendation service, and the interaction of recommendations with other techniques of information retrieval. Recommender systems are often complemented with features like: search, categorization, tags, specialized query interfaces [Gol92, Sch02b], faceted search, browsing similar items, social navigation, finding similar-minded users. The work [Sch02b] proposes meta-recommenders, which give the user extensive possibilities to customize a movie recommender system, allowing movies to be filtered by genre, MPAA rating, film length, objectionable content, critic's rating, or distance to a theater, and allowing the user to indicate the importance of each filter. [Mey12] lists four core functions of a recommender system: decide, compare, discover and explore.
Users may particularly like a service that minimizes their effort, for example, one that allows whole groups of items to be rated (like movie genres), and presents to the user a varied set of items (or groups of items) to rate, but a set of items known to the user. They may like a service that minimizes the number of ratings needed to obtain quality recommendations, or allows ratings, or otherwise expressed preferences, to be imported and exported between different services. Besides using ratings, a different idea is asking the user questions about his actual need, to let the user express his preferences and get relevant, personalized recommendations quickly. Such a decision-tree-based approach, conceived as a decision-support process, was tried out by the Hunch.com website, where users answer a series of questions on a selected topic, and at the end they are provided with a recommendation.
Apart from presenting a list of recommendations or predicting user preference for a given item, one of the desired features is explaining recommendations to the user. Several different realizations of explaining recommendations were analyzed in a user study in [Her00b]. In that study the best accepted form of explanation turned out to be a three-bar histogram of ratings given to the movie by similar users. Other well performing explanations were similarity to other movies, and a favourite actor or actress.
An alternative is a black box approach, producing recommendations without further explanations. In fact, explaining to users the exact result of the best matrix factorization algorithms would be a difficult task ([Pil09c, Pil09d] is an attempt). Fully explaining recommendations is a lot easier for item-item or user-user K-NN approaches, when the K parameter is small.
Different users can expect different properties of recommendations, such as diversity, novelty, or the surprise factor.
One should also consider how recommender systems are used; for example, sharing a single account among multiple people is common in commercial services. Another example – accurate recommendations seem to be more important for DVD movie rental than for movie streaming, where a user can sample a few movies before deciding on the one to watch [Ama12].
• Data collection.
Important questions to answer are: which data to collect and how? What does the data mean? How close is the meaning of the data to the desired meaning? Information about user preferences can be gathered in different ways, leading to datasets with different properties. Some kinds of data are more useful for recommendations, or easier to use properly, than others. We want the recommendation algorithm to learn what the user thinks about items he knows, and avoid learning the mechanism that presents the data to the user.
Preferences about items can be collected with active or passive user participation (also called explicit vs. implicit feedback). Explicitly (directly) expressed user preferences, such as item ratings, are more robust to the actual missing data mechanism than passively gathered user preferences.
Active participation is when the user expresses his preferences by conscious choice, using a specialized interface. The user can give a rating, usually on a 1–5 or 1–10 scale, or can click like or dislike buttons. Other types of explicit feedback can be: ordering items according to personal preference, choosing one item of two or more presented, adding an item to the list of favorites or to the queue to watch, or indicating that he does not know the item (has not seen the movie), or is not interested in the item. Collected feedback can be positive and negative, or only positive (“like” buttons). Having only positive feedback, it is desirable to augment it with information about which items were exposed (displayed, etc.) to each user. Some types of collected data indicate that a user likes an item, others indicate that a user needs an item; for example, purchase data [Pra11] or search queries indicate a need. More complex preference gathering processes are sometimes used, like conjoint analysis [Cha05] in marketing research, where a user chooses one product among a few in a specially designed survey. The work [Cha11] describes various experimental schemes of qualitative and quantitative preference elicitation. Users can indicate their tastes in ways other than evaluating items; for example, it may be easier for a user to express that he prefers one item over another, or to indicate that he likes a certain actor, director or writer, or to answer a supporting question about what he is looking for. Recommendations can also be based on data gathered from personality quizzes [Hu10].
The second option is passive data collection (implicit feedback), where the preferences are not expressed by the user, but are inferred from user behavior. Passively gathered data are, for example: click data, time spent on a website, social network information. Usually passive data collection does not indicate accurately whether the user likes an item. Some types of implicit feedback are more accurate, for example the time spent watching a movie or video.
Sometimes the influence of loopback effects on the operation of a recommender system needs to be considered. For example, if the recommender system is based on click data, many clicks cause an item to be recommended more often, and more frequent recommendations cause even more clicks. The loopback effect may negatively affect the quality of recommendations, especially for passively gathered data.
Which way of collecting preferences is best may differ depending on what kind of items are being recommended. The most common types of recommended content are: movies, music, books, games, places, websites.
In addition to data collected from users, a recommender system can make use of fixed data, such as metadata about items, tags, taxonomies, descriptions. For movies this may be the genre, actors, writers, etc. We can also gather additional metadata about users: age, gender, location, language, etc. A recommender system that relies on metadata is called content-based. Often information about content is gathered from users in the form of tagging (also called social annotations, or folksonomy). [Bur02] categorizes recommender systems into: collaborative, content-based, demographic, utility-based and knowledge-based.
The data analyzed in this work are movie ratings. It turned out that ratings carry an exceptionally large amount of information about the items, and for the purpose of recommendations it is usually better to have a small number of ratings than a large amount of metadata [Pil09b]. Content-based recommendations gain importance in cold start situations, when only a few ratings are available (for an item or for a user), for new users and new items, and for items from the “long tail”.
One can reflect on the psychological perspective: why does the user give one rating and not another? The distribution of ratings varies between different recommender systems. On Netflix movies are rated on a 1–5 scale. The most frequent rating is 4, which constitutes about 32% of the Netflix Prize dataset. On Youtube, on the same 1–5 scale, the rating 5 appeared over 90% of the time, and the second most frequent was the rating 1 (Youtube data from 2009, before the switch from ratings to binary like vs. dislike assessments). Probably the main reason for the observed difference is that on Netflix users rate movies to express their preferences and obtain better personalized recommendations. On Youtube the personalized recommendations module is not as exposed as on Netflix – it is not visible on the screen where the video is rated. We can speculate that the main incentive behind giving ratings on Youtube was the intention to influence the visible average rating, to help the community in evaluating the video, hence extreme evaluations are prevalent there (incidentally, users were much more likely to give high ratings than low ratings). Displaying the predicted rating can bias the rating given by the user, as shown by a user study in [Ado11], which supported the anchoring hypothesis: “Users receiving a recommendation biased to be higher will provide higher ratings than users receiving a recommendation biased to be lower”. Besides the listed motivations to give ratings (improving the received personalized recommendations, influencing the displayed average vote, helping the community), other motivations may be: the fun factor, storing ratings as an aid to memory or to import them into other services, showing your taste to friends, and self-expression [Her04].
It may be useful to separate the preference expression options in a user interface according to whether the user has watched the movie or not (whether he knows or does not know the item). In the first case, the user can be presented with options such as: give a rating or like vs. dislike, add to favourites, add to blacklist. If the user has not yet watched the movie, he needs a different selection of options, to indicate whether he: wants to buy the movie now (or watch it in a theater), wants to buy it in the future, wants to watch it for free (for example, on TV), is undecided (wants to be reminded about the item later), is not interested in the item, or is not interested and does not want to receive similar recommendations.
• Measure of accuracy.
To develop a recommendation algorithm we decide on a measure that evaluates the quality of recommendations, and then we try to choose the most accurate algorithm possible according to this measure. Multiple issues influence the perceived quality of recommendations. We can mention economic or psychological factors: the user's satisfaction, ease of use, simplicity of the recommender system, opportunities to save the user's time, ease of finding quality content. Guessing the current goal of the user may be important. Is it better to give a general recommendation according to the guessed user taste, or is it better to recommend an item similar to the one currently viewed, or to recommend items according to the last actions, like searches on the website and the last viewed items? We can consider many other factors coming from each of the three perspectives mentioned earlier: the user's, the site owner's (like increasing sales) and the item owner's (like coverage – whether
the item has a chance to be recommended, even when it is new or niche, from the long
tail). The factors listed above are difficult to measure. To gain more insight one could try comparing several versions of a recommender system using user surveys. A broad examination of multi-factor evaluation criteria with user surveys was performed in [Mee09, New10]. To simplify the problem and the resulting algorithms, we should narrow the accuracy criterion, and measure only the most important characteristics.
Selecting items for a top-K recommendation list is a classification task, and we should definitely examine the precision-recall balance typically considered in classification tasks. Precision is the proportion of relevant items on the top-K list, and recall is the proportion of all relevant items retrieved to the top-K list (relevant items are the items that the user would rate highly). An obstacle to measuring precision and recall on the Netflix dataset is that we only have ratings for user-selected items, and we do not have a good way to estimate which of all items are relevant for a user. The subject of precision and recall, and its relation to RMSE evaluation, missing data structure, and uncertainties in predictions, is expanded in section 3.4 “Evaluation”.
Typically used ranking-based criteria defined on the observed data (such as NDCG – normalized discounted cumulative gain, mean average precision, half-life utility [Her04], or probabilistic ranking cost functions [Bur05]) have the same disadvantage as the mentioned top-K precision or recall – they ignore missing data effects, and hence do not accurately evaluate the error made on all items.
Ranking-based criteria are difficult to optimize directly. A well working simplification is to use an indirect, two-step method: first predict user preferences (ratings) for unknown or unrated items, and then sort items according to the chosen way of sorting predictions (to optimize a chosen ranking-based measure). Netflix proposed evaluating rating prediction by RMSE on a held-out set of the newest ratings, and this criterion is used to evaluate methods in this work. In place of RMSE, MAE (mean absolute error) is sometimes used in the literature. A disadvantage of using RMSE this way is that if we learn our algorithm on the training data distribution, predictions for missing data will be biased – if we asked the user, his rating would typically be lower than the calculated prediction. Also, RMSE allows us to assess whether we predict the expected rating well (on the observed data distribution), but to calculate recommendations, our algorithm should also estimate the uncertainty of the predictions made (more about this in sections 3.4 and 4.2.4).
Overall, in this work I stay with developing methods that minimize RMSE, but to produce recommendations I correct the output of the methods (the expected rating) by a rough estimate of the standard deviation of predictions, and by an adjustment for the missing data structure (see section 3.4, and chapters 6 and 7). Users expect diverse recommendations, and rewarding diversity can be built into a ranking-based accuracy measure, or the calculated lists of recommendations can be postprocessed to provide more diversity.
• Diversity.
Redundancy on a recommendation list is undesirable, as is the list being too monotonous (a term sometimes used is serendipity – the ability to pleasantly surprise the user). Recommendations serve partially as a discovery tool to help satisfy the user's curiosity. Diverse recommendations increase perceived coverage and give an opportunity to explore the space of items. Diversity makes recommendations less limited to the group of items for which the user expressed his preferences earlier. If a recommender system makes a mistake by proposing an item not liked by the user, it is undesirable to have recommended similar items, for which we have likely made the same mistake. In other words, we do not want the errors for items on a recommendation list to be correlated.
If several very similar movies have a high predicted score, for example, when there are several versions of the same movie in the system, or unintended duplicates in the database (a frequent case in practice), then it is better not to occupy the recommendation list with all of those similar movies, but to choose one or two of them. The usual sorting by expected rating corrected for uncertainty does not ensure diversity. Different ways are possible to increase diversity at the cost of the probability of making a correct recommendation (accuracy) [Kwo11]. The sorting can be improved by removing similar items from the recommendation list. A solution described in section 4.5 was to recommend whole clusters of similar movies. Different approaches to ensuring diversity can be compared, for example, using measures such as intra-list similarity [Zie05], “personalization” [Zho10a], “surprisal” [Zho10a], or unexpectedness [Ada11] (details in the respective papers). An untested criterion rewarding diversity worth trying out is to maximize the probability that at least one item on the top-K recommendation list will be rated higher than a given, sufficiently low threshold (evaluating whole lists, taking into account the dependencies between ratings of different items).
A problem related to ensuring diversity is detecting similar or equivalent items and removing them from the recommendation list. It is obvious that if the user has, e.g., rated (implying he probably viewed) the movie, read the book, or bought the product, then the same or equivalent items should not be recommended to him, like another version of the same movie, another edition of the book, or an identical product from another producer. Additional rules are, for example, that a movie sequel should not be recommended before watching the first movie, different seasons of the same TV series should be recommended in sequence, etc. Detecting equivalent items is a domain-dependent issue. It is a separate machine learning task and the automatic solutions may not be obvious.
It often happens that a user does not know what to do with a recommended item entirely unknown to him, or is undecided about an item. A partial solution is ensuring temporal diversity [Lat10], that is, besides ensuring diversity inside a single recommendation list, changing the recommendations over time.
application described in section 6.2.3, which recommends only the 2000 most frequent movies from the Netflix dataset. Usually we need to recommend the long tail to some extent; for example, new items with few ratings need to be recommended, so that they have a chance to eventually become popular. We can single out the notion of coverage of a recommender system – whether all items have a chance to be recommended, and how often they are recommended. Similarly, there is a need to balance between familiar content, which the user will like with high certainty, and content unfamiliar to the user, for which there is a larger risk of making a mistake in the recommendation. One way to balance the recommendations is to tune the amount of penalty for the variance of a rating prediction. Decreasing the penalty causes more rarely rated content with uncertain predicted ratings to be recommended: new items, or relatively unknown or niche items. Users can have different individual preferences for taking risks (or preferences, for example, for the surprise factor), and there may be a need to adjust the balance individually for each user. This is similar to the risk vs. reward relationship in finance.
New items go through a phase of having few ratings. Similarly, new users initially do not have any ratings, and the items recommended first may be completely unknown to them. These cases are called cold start problems for users [Sch02a] or items. We can also speak of a global cold start problem for the whole recommender system. One idea for overcoming cold start problems is to use content-based recommendations when there are few ratings. The content used can be metadata about items; for example, for movies it may be the year of production, actors, writer, director, or genre. For users the metadata can be age, gender, location, etc. A recommender system may encourage users to rate rare items (it would be best to recognize which rare items have a higher chance of being known by the user). Content-based recommendations were used for items with no ratings in the application described in section 6.2.3, and in section 6.1 we propose heuristic solutions for cold start situations.
In overcoming the cold start problem for users, a specialized interface can help (as is done in the Netflix service), encouraging new users to quickly rate a number of popular items, and allowing preferences to be expressed about large groups of items, like whole genres.
• Choice of algorithm.
After deciding on a criterion for evaluating recommendations, there remains the choice of the best performing algorithm on the given criterion, for the available data.
The choice of hold-out RMSE for evaluation by Netflix made developing machine learning algorithms convenient, because it skips the usually non-trivial issue of domain adaptation in real-life tasks, where the gathered data has a different distribution than the data needed at the moment of usage. Another advantage is that using MSE loss and approximating ratings with Gaussian distributions allows us to employ standard linear models, like regularized linear regression. Improving the accuracy of rating prediction translates well into better realization of other goals, such as better recommendations, better item-item similarities, clustering, visualization, and better user-user similarities.
To obtain good accuracy, we need to process a large collection of data, which is why we also have to pay attention to computational aspects of the method, such as time complexity and the resources used.
We can observe that no matter what kind of data about user preferences we have gathered, the most accurate algorithms for recommendations seem to always contain dimensionality reduction. For prediction of movie ratings on the Netflix dataset, as evaluated by hold-out RMSE, the most accurate were variants of sparse matrix factorization (regularized SVD) using appropriate regularization. Techniques for the prediction of movie ratings are briefly summarized in section 3.2 “Collaborative filtering”, and thoroughly discussed in chapter 4. Chapter 6 describes an example use of regularized SVD in two recommender systems.
• Temporal effects.
The basis of why collaborative filtering methods work is the assumption of item persistence and user taste persistence, meaning that the properties of items and the user preferences for items do not change much, even over a longer period of time. One might wonder how exactly the model of the data changes in time, for example, how much a rating made several years ago says about the present preferences of a user. The evolution in time may differ for different types of items. For items like movies, books, and music, the model will likely change more slowly than for items that are altered over time, like websites, or for items with short-term usefulness, like news.
Modelling the variability over time can significantly improve prediction accuracy. In the Netflix dataset of movie ratings, temporal effects improved the RMSE accuracy of all groups of methods [Bel07c, Kor09a, Kor09b, Tos09, Pio09]: matrix factorizations, RBM, K-NN, kernel methods. The effect that improved accuracy the most was the single day user bias [Tom07, Pot08, Bel08] (the large single-day effect may be partially a result of single Netflix accounts being used by multiple persons). Other useful effects were short-term and long-term variations in user biases, user preferences and movie biases, and modelling the so-called frequency effects [Pio09]. As for computational complexity, matrix factorization models with user preferences changing in time have many times more parameters than regular matrix factorizations, but the resulting models can be efficiently learned with gradient descent [Bel08, Kor09a, Kor09b]. It is easier to enhance distance-based methods, such as item-item K-NN, by correcting the distances between items for the difference of rating dates [Tos08b, Tos09] (see also section 4.5.1).
Single day bias improves predictive accuracy measured by RMSE, but in applications in recommender systems it may not be that important, because the bias is equal for all items at the time of calculating recommendations. The day effect on user bias does not directly influence the ordering of items, but its presence in the model can alter other model parameters and thus slightly affect the quality of recommendations. More important are short-term changes in user preferences. Ratings or likes and dislikes are not a good way to indicate those short-term changes, and we should take into account the current context of use of a service. To learn about the user's temporary preferences it is better to rely on data such as the recent search query, a clicked tag, a visited item page, or to give the user an opportunity to indicate his current mood (for example, a preferred genre, or a genre to avoid).
To explain the nature of short-term and long-term temporal effects a psychological perspective may be needed – why the user gives one rating over another. Some psychological aspects of movie evaluation were considered in [Pot08].
Another issue is the need for temporal diversity of recommendations, already mentioned when we discussed diversity. We can also verify whether recommendations are stable [Ado10], that is, whether predictions do not change much over time when the user gives ratings agreeing with the previous predictions made for him by the algorithm (note that the predictions made for all items should depend on the missing data structure).
• Computational perspective.
A working recommender system can potentially serve millions of users, recommending possibly hundreds of thousands of items, and processing billions of ratings. At this scale of data processing, special attention must be paid to how many resources the recommender system uses, such as: the number of computers realizing the recommender service, the number of running processes, the amount of memory used, in RAM and on hard drives, and the amount of network bandwidth occupied. We may wonder how the amount of resources used scales with a growing number of items, and with a growing number of users.
The perceived speed can be important for users. This leads to questions about the recommender system: are the recommendations instant – is the new list of recommendations available immediately after rating the next item? How many lists of recommendations can the system generate per second? How many items are on one computed list of recommendations?
For new items it would sometimes be useful to use new ratings quickly to accurately recommend the new item. We could inspect the latency of updating information about items – information such as the movie variables in the matrix factorization, or the distances between items in the K-NN algorithm.
Different versions of algorithms have different speed and resource usage. Algorithms based on matrix factorization are usually faster than those based on K-NN. Additionally, specialized caching may be used to avoid calculating the same recommendation lists multiple times.
• Security.
Like any software, recommender systems need to be examined with regard to security. Besides securing against common threats such as data leaks or denial of service attacks, typical risks for recommender systems are shilling attacks [Lam04, Meh09], and risks to users' privacy.
A shilling attack means artificially influencing an item's position on recommendation lists by adding votes with synthetic profiles. Shilling attacks may be performed by users who like or dislike an item, or by content owners who want to increase the exposure of their content or to put down competing content. Such rating manipulations are undesirable, because when ratings do not reflect the real preferences of users, the quality of the served recommendations can decrease (some call resistance to shilling attacks the “trust” [Sch07] of a recommender system, but the notion of trust is also used to describe a social type of recommendation using the opinions of friends or selected people whose good taste we trust [Ric10]). Shilling attacks can to some extent be prevented by identifying malicious profiles through examining IP addresses or browser fingerprints, or a larger weight can be put in the collaborative filtering
algorithm on the votes of committed users, who gave many votes, made a purchase, wrote reviews, registered on the website earlier, etc. Also, machine learning techniques can be employed to identify atypical patterns in the data.
Ratings in a recommender system can be publicly available or anonymous. If they are anonymous, the issue of privacy emerges – determining whether there is a possibility of disclosing the ratings, and what the consequences of disclosure may be. There is probably no need to assure the maximum possible privacy in recommender systems, as there is in the case of, for example, storing credit card data, or ensuring the security of bank accounts. If disclosing the ratings of a very small fraction of users is possible, it is not by itself a factor disqualifying a recommender system. Some consider algorithms ensuring differential privacy [McS09], meaning a setting where adding a new rating to a recommender system does not allow information about other ratings in the system to be inferred. The feature of differential privacy was not considered in the methods developed in this work.
There was also discussion of the privacy risks of the published Netflix Prize dataset. Users from the anonymized Netflix dataset can be correlated with publicly available ratings in other services; as an example, [Nar08] correlated two users from the Netflix Prize data with authors of rated IMDb reviews. It is a good case for thinking about privacy issues in recommender systems, and what happens when it is possible to obtain ratings for some subset of users. In rare cases a user can have reasons to publicly rate a fraction of the movies he watched, but hide his ratings for some other movies. To talk about a privacy breach or de-anonymization, many conditions have to be met. First, a large number of users with publicly available ratings in another database would have to be identified in the Netflix database. Matching users is inaccurate for many reasons: there is inherent noise in the data (a user's taste and mood change over time, and re-rating experiments show [Ama09] that a user often gives different ratings to the same movie), the Netflix Prize data was additionally perturbed, websites often use different rating scales (1–5 on Netflix, 1–10 on IMDb), and the distribution of ratings differs between websites. Because matching two large databases of users is a multiple comparison task, much larger confidence is needed to determine whether two users are the same than in the case of comparing only two users. Even if a user is identified in the Netflix database with certainty, because the user has already shown part of his ratings publicly, and because of all the uncertainty from rating perturbation, ratings changing after re-rating, tastes changing over time, and the possibility of giving a rating by mistake, it seems unlikely that a large fraction of users would object to publishing their Netflix ratings. The selection of movies in the Netflix Prize data rather does not contain highly controversial titles, for which users would have a really good reason to hide ratings. Of course, as we will see in section 4.3.4, ratings tell a lot about the psychological profile of a user, so privacy concerns about releasing any amount of ratings are justified.
The recently released Million Songs Dataset [McF12] contains song playcounts for each anonymized user, but does not contain the song timestamps – the creators of the dataset call the associated privacy risk “limited and reasonable”.
I have listed some of the most frequent issues considered when designing and devel-
oping recommender systems. The rest of the work is focused mainly on rating prediction
evaluated by hold-out RMSE. Chapter 6 describes the adaptation and deployment of the
developed prediction methods in two recommender systems.
3.2 Collaborative filtering
What is collaborative filtering? A short definition from Wikipedia is: “Collaborative filtering (CF) is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.”. The most common use is predicting users' preferences for items in recommender systems, which is why collaborative filtering is often defined in the context of its application in a recommender system: “The collaborative filtering technique matches people with similar interests and then makes recommendations on this basis” [Mon03].
The idea behind collaborative filtering methods is intuitively obvious: there are groups of users with similar tastes, meaning that they rate items similarly, so when we want to recommend items to a user, items can be filtered through the tastes of similar users. The concept of collaborative filtering evolved during its short history around the initial idea of user-user similarities. One of the first applications was filtering a mailing list by multiple moderators [Gol92], explicitly chosen by the user. Then appeared applications based on gathering preferences, like ratings [Gol01], where users with similar taste were found automatically and the content liked by the similar users was recommended. Most of the early recommender systems were based on user-user K-nearest neighbors prediction [Mon03]. User-user similarities have the additional advantage of being intuitively understood; for example, statistics summarizing how similar users rate an item are a well understood explanation of why the item was recommended [Her00b].
Scrutinizing the collaborative filtering idea further, the gathered user preference data can be used not only to identify similar users, but also to identify similar items. On the Netflix Prize task, item-item K-NN methods were more accurate than user-user K-NN. Item-item similarities also allow the creation of local recommendations of the type “if you liked this, you will probably like that”, related to the item the user is currently interested in. Because similar users tend to give similar ratings to groups of similar items, we can predict an unexpressed user preference for an item even when similar users have not rated the given item.
The present understanding of the crux of the task is that collaborative filtering boils down to predicting missing data in a sparse matrix. In this sense, collaborative filtering is not just about recommendations and filtering content; one can also speak of applying collaborative filtering in any domain where there is a need for matrix completion (partial or full); for example, collaborative filtering has been used for medical datasets [Has10] and educational data [Tos10].
For the prediction of item ratings, the most accurate collaborative filtering methods found, such as matrix factorization methods, are based on dimensionality reduction. Intuitively, dimensionality reduction methods can be understood as automatically learning hidden genres of items and learning users' preferences for those genres. [Ski07] lists the following interpretations of matrix decompositions: a factor interpretation as signals from hidden sources, a geometric interpretation as hidden clusters, a component interpretation as underlying processes, and a graph interpretation as hidden connections.
For the data and task in this work, the best matrix factorization methods found were more accurate, with similar computational requirements, than the best nearest-neighbor methods. In contrast to K-NN based methods, in methods like matrix factorization we can say that the whole rating matrix is used to predict one particular rating for one item and one user – both similar users and dissimilar ones, both the movies he likes and the movies he does not like, and all movies liked or disliked by all other users.
Due to the non-intuitiveness of the automatically learned genres, matrix factorization based recommender systems are usually of the black-box type, without explanation (or pseudo-white-box, with partial explanation), while K-NN methods allow an intuitive, white-box explanation of prediction values when the K parameter is small.
A useful framework encompassing the most accurate families of collaborative filtering methods is multitask learning [Yu03, Bak03, Car97], where the tasks are users (rows of a sparse matrix), and the task attributes are items (matrix columns). Similarities between tasks and similarities between task attributes are captured using a hidden structure of parameters, usually a set of parameters individual to each task (user) and a set of parameters describing task attributes, shared by all tasks.
The central theme of this work is accurate methods of collaborative filtering. The content of the work and the approaches taken were determined by the chosen task: prediction of held-out movie ratings in the Netflix Prize dataset, which contains over 100 million ratings on a 1–5 scale, made by 480,189 users who rated 17,770 movies. An advantage of the chosen dataset and the task of minimizing RMSE was its popularity, and in this work I hope to summarize the present understanding of the problem and the developments that resulted from the efforts of many people. As we will see further on, for a real-life task, relatively many problems of a statistical or mathematical nature appeared along the way.
It is difficult to identify precisely the model the data come from, which is why the best approaches were ensembles of many accurate, yet differing methods [Tak07a, Wu07, Pat07, Bel07c, Bel08, Kor09b, Tos09, Pio09]. It should be noted, though, that in a practical application it should be sufficient to use one method to achieve satisfactory predictive accuracy.
The most accurate single methods calculated a low rank matrix factorization with proper regularization [Fun06, Har07, Sal07a, Bel07c, Tak07a, Lim07, Tom07, Sal08], usually using 30 to 200 hidden item and user features. Also performing well were appropriately tuned variants of the K-NN method [Tos08b], kernel methods [Pat07, Yu09a], and Restricted Boltzmann Machines [Sal07a]. Accuracy was improved by enhancing the models with implicit information [Sal07a, Pat07, Sal07b, Bel07e] and time information [Tom07, Bel08, Kor09a, Kor09b, Tos09, Pio09, Xia09], and by integrating the models with K-NN [Bel08, Kor09b, Pio09].
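For concreteness, the following is a minimal sketch of mine (with illustrative hyperparameters, not the exact settings used in the experiments) of such a regularized matrix factorization trained by stochastic gradient descent, predicting a rating as the dot product of K-dimensional user and item feature vectors:

#include <cstdlib>
#include <vector>

struct Triple { int user, item; float rating; };
struct Model { int K; std::vector<float> U, V; };  // user and item features

Model train(const std::vector<Triple>& data, int n_users, int n_items,
            int K, int epochs, float lrate, float lambda) {
    Model m{K, std::vector<float>((size_t)n_users * K),
               std::vector<float>((size_t)n_items * K)};
    // Small random initialization of both feature matrices.
    for (float& x : m.U) x = 0.1f * (std::rand() / (float)RAND_MAX - 0.5f);
    for (float& x : m.V) x = 0.1f * (std::rand() / (float)RAND_MAX - 0.5f);
    for (int ep = 0; ep < epochs; ++ep)
        for (const Triple& t : data) {
            float* u = &m.U[(size_t)t.user * K];
            float* v = &m.V[(size_t)t.item * K];
            float pred = 0.0f;
            for (int k = 0; k < K; ++k) pred += u[k] * v[k];
            const float err = t.rating - pred;  // residual of the squared-error loss
            for (int k = 0; k < K; ++k) {       // L2-regularized SGD update
                const float uk = u[k];
                u[k] += lrate * (err * v[k] - lambda * uk);
                v[k] += lrate * (err * uk - lambda * v[k]);
            }
        }
    return m;
}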
test set was the final criterion for evaluating solutions after the end of the competition. Submitted predictions were compared with the results of the reference algorithm, Netflix Cinematch, which had RMSEquiz = 0.9514 and RMSEtest = 0.9525. The Netflix Prize task was to develop an algorithm more than 10% better than Netflix Cinematch, that is, to reach RMSEtest < 0.8572 = 90% · 0.9525. In the next section, “Evaluation”, more is said about the choice of RMSE as an evaluation measure, and about the different RMSEs for the different held-out test sets in the Netflix Prize dataset.
The Netflix Prize contest was launched in October 2006 and continued until June 2009, when two teams crossed the 10% threshold. It was won by the 7-person team Bellkor's Pragmatic Chaos, and in accordance with the rules of the contest they described their solution in a set of reports [Kor09b, Tos09, Pio09]. Parts of the final solution are described in the earlier Progress Prize reports [Bel07c, Bel08, Tos08b]. Over 5,000 teams submitted solutions to the contest. The solution of the author of this work finished in 43rd place, with a score of RMSEtest = 0.8717, that is, 8.48% better than Netflix Cinematch.
In addition to the training, probe and qualifying sets, the Netflix Prize data includes the file movie_titles.txt, containing the movie titles and years of release. There were attempts to use those data for prediction [Tak07a, Tos08b], but there was no accuracy improvement, or the improvement was minimal. [Tos08b] used the release year, and [Tak07a] tried using similarities between titles. There were also many attempts to use additional metadata about items besides the release year, using features such as genres, actors, writer, and director, gathered from datasets like IMDb [Pil09b] or Wikipedia [Lee08], but it turned out that ratings contain enough information, and additional information about movies does not improve prediction [Pil09b]. I also unsuccessfully tried to improve prediction accuracy in the Netflix task by using metadata, though I did use IMDb metadata to predict movie features for new movies (see sections 4.7 and 6.2.3).
The intention of focusing on one dataset and task is to understand the dataset well and develop the most effective prediction algorithms possible. A plus of choosing the Netflix Prize dataset and task is the large interest among professionals, which ensured the good quality of the developed solutions. Another advantage is that it is a large collection of data: at the time of writing this work it is the second largest publicly released set of item ratings (the largest being the KDD Cup'11 Yahoo! dataset, with over 300M ratings for over 600K items). Other popular datasets of movie ratings are EachMovie, containing 2,811,983 ratings, and MovieLens, currently containing three sets: 100K, 1M and 10M ratings.
Regarding engineering matters, doing calculations on data of the size of 100 million ratings is on the boundary of the capability of present-day personal computers, and an adequate representation of the data was important. Using a traditional SQL database would be too slow, which is why the role of a database was fulfilled by appropriate C++ data structures. I decided on a simple structure: an array of 100 million 3-byte records – one byte for the rating, two bytes for a movie ID – or, in a second version, records with an additional two bytes for the day the rating was made (a sketch of such records is given below).
The array is indexed by additional arrays pointing to where each user's ratings are placed. Users and movies were sorted according to the number of their ratings. To avoid parsing the data each time an algorithm was started, the array was allocated with a mechanism mapping a file to memory (the mmap function of the Linux operating system). This data structure was good enough for the experiments. The memory occupancy was about 300MB in the first version and about 500MB in the second version containing the date information. The algorithms described in this work most often took 5-15 hours
on an average PC for learning parameters and computing predictions. Because the majority of algorithms were iterative and had to repeatedly iterate over the data, some teams experimented with compressing the data structure [Tak07a], which allowed the algorithms to be accelerated further.
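The original listings of the record layout are not reproduced here; the following is my reconstruction of how such packed records and the mmap-based loading could look (struct and field names are illustrative, not the originals):

#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#pragma pack(push, 1)
struct Rating {        // 3 bytes: 100M records occupy about 300MB
    uint8_t  rating;   // 1..5
    uint16_t movie_id; // 0..17769
};
struct DatedRating {   // 5 bytes: about 500MB, second version
    uint8_t  rating;
    uint16_t movie_id;
    uint16_t day;      // day the rating was made
};
#pragma pack(pop)

// Map the pre-parsed binary file into memory, so no parsing is needed when
// an algorithm starts; separate index arrays (one offset per user) point to
// where each user's ratings begin.
const Rating* map_ratings(const char* path, size_t* count) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after closing the descriptor
    if (p == MAP_FAILED) return nullptr;
    *count = (size_t)st.st_size / sizeof(Rating);
    return static_cast<const Rating*>(p);
}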
A few times in this work I used a small, denser subset of the Netflix Prize data, suitable for testing algorithms that use missing data imputation. For the dense dataset the 500 most popular movies were selected, along with 10059 users who rated at least 250 of the 500 most popular movies, but rated at most 285 of the remaining movies. This criterion was chosen to filter out users who rate too many movies, supposedly rating many movies they had not seen, and to counter the overrepresentation of users with mass taste (users who like popular movies). The resulting dataset is a 10059 × 500 matrix of ratings, which is over 56% dense, in comparison to the 1.1% dense whole Netflix Prize dataset. Methods were compared using the following evaluation: 304,500 ratings were randomly selected for the test set, so that the remaining training set contained exactly 10059 ∗ 500/2 ratings, being exactly 50% sparse. RMSE calculated on a random draw of the test set (the same test set for all evaluated algorithms) is called RMSEsmall (all RMSE versions used in this work are listed in the next section, 3.4 “Evaluation”). Evaluation by RMSEsmall could be improved, if needed, by a cross-validation-like procedure averaging over several draws of the test set.
3.4 Evaluation
The previous section introduced the Netflix Prize dataset, containing the files training.txt, probe.txt, qualifying.txt and movie_titles.txt. As the training set in my experiments, the data from training.txt was used with 15% of probe.txt excluded. Each implemented method was trained once on this training set. The excluded portion of probe.txt was used as a test set to evaluate methods, tune the parameters of the methods, blend predictions, and observe how well different kinds of methods combine with each other (and to draw conclusions from that).
The task proposed by Netflix was to predict ratings from the test set distribution (probe.txt and qualifying.txt), and to evaluate predictions by the root mean squared error (RMSE):

RMSE(r̂) = √( (1/|Te|) Σ_{ij∈Te} (r_ij − r̂_ij)² )
where Te is a chosen hold-out set, that is, a subset removed from the data during training (the different hold-out sets used are listed later). The probe and qualifying sets consisted of up to the 10 most recent ratings of each user (for most users exactly 10 ratings), so they have a different distribution than the training set, but in the experiments there was close to no difference between using data from the probe set or from the training set for hold-out evaluation.
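In code, the measure is straightforward (a minimal sketch):

#include <cmath>
#include <vector>

// RMSE over a hold-out set, given true ratings r and predictions rhat
// of equal length.
double rmse(const std::vector<double>& r, const std::vector<double>& rhat) {
    double sse = 0.0;
    for (size_t i = 0; i < r.size(); ++i) {
        const double e = r[i] - rhat[i];
        sse += e * e;
    }
    return std::sqrt(sse / r.size());
}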
I will discuss the advantages and imprecisions of the choice of RMSE minimization on a hold-out set, from the perspective of using the resulting rating predictions in a recommender system. Optimizing RMSE usually results in the simplest and fastest algorithms, in comparison to other possible criteria evaluating the prediction of ratings, or criteria evaluating personalized rankings of items directly [Wei07]. For example, if we want to approximate a set of values with one number, minimizing RMSE gives the average, and minimizing MAE (mean absolute error) gives the median, which is more complicated to compute.
The task of calculating recommendations is different from predicting ratings, but, as argued later in this work (sections 4.2.4, 6.1 and 6.2.3), the methods predicting ratings, developed to minimize RMSE on the held-out observed data, can be adapted to obtain good quality recommendations (for example, it is certain that to generate recommendations we need to adjust the predicted expected ratings by the uncertainty of the predictions).
Let's look closer at how rating prediction on the observed data distribution is related to the task of calculating recommendations, which can be seen as a classification task – identifying relevant items. For example, accurate estimation of the user bias (or user mean) is important when we predict ratings, but when we calculate recommendations, the user bias is the same for all compared items; on the other hand, the item bias is very important for calculating recommendations (and is also prone to missing data effects). Perhaps we can find a better criterion to optimize methods than RMSE, and better criteria to evaluate recommendations than the typically used ranking-based criteria calculated on ratings from the observed data distribution, such as top-K precision, NDCG (normalized discounted cumulative gain), etc.
Let's assume that we have a method that, for a given user and item, outputs one value, the predicted rating. Let's assume that the user gives a rating of 5 to items that are relevant for him, and we want to identify the unrated relevant items to put them on the top-K list of recommendations for the user. We can define a loss function that weights the different kinds of errors made by our prediction method. The following table lists my estimation of how large the cost of the error should be for several possible cases:
If the algorithm inaccurately predicts 5 when the real rating is 1, it will cause us to make a bad recommendation, likely leading to a negative experience for the user, hence the cost of such a case should be large. If we inaccurately predict 1 when the real rating is 5, we are just skipping a relevant movie, without a negative experience for the user (but some users may accept more risk if it leads to a better chance of discovering a relevant movie). We can say, in an intuitive sense, that precision is much more important than recall. Perhaps, to simplify, we could skip evaluating recall entirely, and evaluate only precision (if we only consider relevant vs. irrelevant classification, precision and recall will be equivalent for a fixed K). Looking at the table, it seems that evaluating the calculated top-K lists of recommendations by (1/K) Σ_{i∈top-K} (5 − r_i)² could work well – let's call this measure MSE@top-K. In practice we want diverse recommendations over time, and we can vary the K parameter and select items for recommendation from a larger set, increasing the weight on recall. Typically, K is small, and in consequence we feel that if we skip a relevant item on the recommendation list, it can always be replaced by a slightly less relevant one; but in reality we should also reward recall – recommendations should be diverse, and the interface of the recommender system should provide options like search, categorization, metadata navigation, rating groups of items, and tagging, to give the user opportunities to visit and rate items outside of the preferences declared so far.
Above we considered methods predicting the expected rating, but we can do better with methods that can additionally assess the uncertainty of the predictions made. Generally, for movies with more ratings we get more accurate rating predictions than for movies with few ratings. This fact can be used to improve recommendations. When predicting ratings, the local posterior distributions of ratings that appear can be approximated well by Gaussians N(µ, σ²), clipped to the range [1, 5], and, if necessary, rounded (see section 4.2.2). Having the estimated posteriors N(µi, σi²) for each movie i, in order to minimize the expected MSE@top-K we should choose for the top-K list the movies with the best score function s(µi, σi) = ∫₁⁵ (5 − x)² N(x; µi, σi²) dx + Φ((1 − µi)/σi). Because we already made several large simplifications in the above considerations, and the whole reasoning is inaccurate, it is all
43
right to further simplify the above expression. In this work I decided to score items for
recommendations by s(µi , σi ) = µi − Cσi . Because selecting K items from a larger set is
a situation of multiple comparisons, the constant C increases with the number of items
considered. With a typical “long tail” distribution of item ratings there are few items with
low σi and many items with high σi .
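A minimal sketch of this scoring rule (the value of C and the candidate posteriors are illustrative assumptions, not values from the text):

```python
import numpy as np

def recommend_top_k(mu, sigma, k, C=1.0):
    """Score each candidate item by s = mu - C*sigma and return the
    indices of the K highest-scoring items."""
    scores = mu - C * sigma
    return np.argsort(scores)[::-1][:k]

# Toy posteriors for 5 candidate movies: well-estimated popular items
# have small sigma; rarely rated items have large sigma.
mu    = np.array([4.6, 4.7, 4.2, 3.9, 4.5])
sigma = np.array([0.3, 1.2, 0.4, 0.2, 0.9])
print(recommend_top_k(mu, sigma, k=2, C=1.0))  # uncertain items are penalized
```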
We know more or less how to produce recommendations from predictions for all items.
The catch is that the methods learned on the training set produce accurate posterior
predictions for the training set distribution (items likely to be rated by the user), but predictions
for all items (items selected for rating uniformly at random) are typically largely biased.
Obtaining accurate posterior distributions for all items is a domain adaptation task, for
which we need additional data where users are forced to rate randomly selected movies
(ideally with an additional option to indicate that the user has not watched the movie,
instead of giving a rating). The Netflix Prize dataset does not contain additional non-user-
selected data, as provided, for example, in the Yahoo! Music R3 dataset, so I rely
here on heuristics to correct the methods trained on ratings for user-selected movies
only. The proposed heuristics (section 6.2.3) penalize distance (dissimilarity) from the set
of movies rated highly by the user. Logically, as this distance increases, the mean µi should
decrease and σi should increase, but the precise relationship and the best distance
to use would need to be determined on additional data (ratings of randomly selected movies). Note
that the structure of missing data may originate from various causes unrelated to user
preferences; for example, it can depend on which items are exposed by the website naviga-
tion or by the interface of the recommendation service. In consequence, the relationships
between the probability of a rating being missing and the parameters µi and σi can
differ between groups of items.
To further illustrate the problem of using the data for recommendations, let’s look at a
simplified case. Assume that the set of items for a typical user in a typical recommender
system can be divided into five parts: R5, R1, U5a, U5b, U1. By R5 we denote the set of
items rated 5 by the user (say, about 100 ratings). By R1 we denote the set of items
rated 1, which typically is smaller than R5 (say, about 10 ratings). By U1 we denote
missing data which in reality would be rated 1 or marked as not known to the user (about
10,000 ratings). By U5a we denote missing items that would be rated 5 and are similar
to the set of items rated by the user (about 100 ratings). By U5b we denote missing items
that would be rated 5 (after watching, in the case of movies), and are not similar to the set
of items rated by the user, but come from regions (of the universe of items) unknown to the user
(about 1,000 ratings). Now, with this division, the recommendation task can be defined as
predicting, within the missing data set (U5a, U5b, U1), which items belong to U5a (if the user
expects “safe” recommendations, which he can evaluate immediately), or which belong to
U5a+U5b (if the user is also interested in content unfamiliar to him, which he needs to
examine before evaluating – watch the movie, etc.). Note that if users are forced to rate
unfamiliar content, as is the case in the Yahoo! Webscope Music dataset, the gathered
ratings will be lowered compared with how users would rate the items after familiarizing
themselves with them. We can expect that the algorithms described in this work (minimizing RMSE
for held-out ratings from the data distribution) will predict a low rating anyway for the
items in the U1 group, so items from U1 are unlikely to enter the calculated top-K lists
of items. Experiments on datasets with MCAR data (missing completely at random; non-
user-selected) are needed to confirm or refute this guess. If the missing data influence has
the form of a systematic negative bias, which affects all missing ratings from the U1
group in a similar way, taking it into account will not impact recommendations much. If,
however, the negative bias differs largely between groups of items, it should be modelled
to improve the quality of recommendations. Similarly, as mentioned earlier, we should
also estimate changes in the uncertainty of predictions. Another question is how important
distinguishing U5a from U5b is for the quality of recommendations.
In [Mar05] the models CPT-v and LOGIT-vd were proposed, which correct dimension-
ality reduction methods by additional global biases that adjust the models to predict
ratings that do not come from the data distribution. These methods require for learning
an additional small amount of ratings of non-user-selected items. [Mar08] also proposed a
method, cRBM/E-v, which modifies the Conditional RBM [Sal07a] by adding five global bi-
ases active for non-user-selected items, one for each rating value. As experiments
show [Mar08, Mar09], the biases used in the methods cRBM/E-v, CPT-v and LOGIT-vd to a
large degree correct the underlying algorithm trained on user-selected data, so that it has
good prediction accuracy also on non-user-selected data (such as uniformly missing items).
The correction by simple global biases should not influence the relative ordering of
items much, but it is possible that a more complex and accurate model correcting for the
situation of uniformly missing items would have an impact on recommendations. An obvious
possible extension of the idea of modified biases is explaining the users’ tendency to give
ratings to movies they like (and rate highly). We can deduce that the reverse relationship
will be analogous: the lower the rating predicted by our algorithm, the larger the
additional negative bias – modelling that negative bias should not influence
recommendations much. Paradoxically, the artificial example given in table 1 in [Ste10b] is close
to MAR (missing at random – p(Selected|Data) = p(Selected|Data_observed)), illustrat-
ing that in some cases a lot about the missing data structure can be deduced from the
observed ratings. It would be difficult to prove that real-life data in a collaborative filter-
ing setting is NMAR (not missing at random), because there always remains the possibility
that an undiscovered relationship between the observed and the missing data will explain
completely the probability of missing data, rendering the unobserved data useless. It is
reasonable to suspect, though, that in the Netflix dataset part of the structure of the missing
data is unexplainable by the observed data.
The evaluation measure suggested by Netflix, RMSE on the set of most recent rat-
ings, would probably do a good job evaluating a deployed recommender system, balancing
between evaluating precision (by measuring accuracy for movies proposed by the rec-
ommender system) and recall (by measuring accuracy for movies found by the user, for
example, through search). After deployment, the training set changes as new ratings
are added over time, and the distribution of ratings in the whole set (together with the dis-
tribution of which items are selected for rating) becomes similar over time to the set of
most recent ratings. For a recommender system before deployment, RMSE on the most
recent ratings from the database does not necessarily answer the questions of
whether the system avoids recommending dubious items (does it have high precision?)
and whether it identifies the relevant items among all items (does it have high recall?).
Other commonly used evaluation measures (including the ranking-based ones) calculated on
gathered user-selected ratings have similar disadvantages. The evaluated prediction algo-
rithm can produce any number of false positives on the missing data, and if we measure
error on observed data, without accounting for the missing data structure, the real error
can be unrelated and arbitrarily high. Attempts to define a measure evaluating person-
alized rankings of items on the data distribution have not been satisfying so far; examples
are the algorithm CoFiRank [Wei07] (which does not take into account the missing data structure,
and whose optimization is complicated), Bayesian Personalized Ranking [Ren09], and
Track 2 of the KDD Cup 2011 [Dro11, McK11, Lai11, Jah11b, Mni11], which evaluates
how well the algorithm distinguishes high ratings from missing data sampled according
to the frequency of items (an evaluation criterion rewarding both predicting high rat-
ings and predicting the missing data structure). Skipping the information contained in low
and average ratings is controversial, as is highly rewarding accurate prediction of
the missing data structure, which can depend on the mechanism of website navigation,
website layout, previously given recommendations, and other causes that do not necessarily
indicate user preferences (for example, currently the search option on YouTube weights
the number of views highly, instead of giving more weight to likes and dislikes, and, in ef-
fect, videos with a misleading title, tags and thumbnail can gather millions of views from
search traffic). Another approach [Ste10b] is treating missing data as additional
ratings ≈ 2, weighted about 20 times less in regression than the regular, observed ratings
– the observed improvement in ATOP (area under the top-K recall curve) accuracy of this
method should be attributed to penalizing unpopular items, and hence rather to
accounting for the variance of predictions than to accounting for the missing data.
All that said, because I do not have additional data, in this work I stay with evalu-
ating methods on the Netflix Prize criterion of RMSE on the test set, hence obtaining
methods that predict the expected rating on the observed data distribution. To produce
accurate recommendations, the predictions of these methods need to be adjusted to account
for the uncertainty of predictions and for the missing data structure. The adjustments are
heuristic, and would likely need to be tuned on additional data. Additional data could also be
used to develop algorithms with domain adaptation that directly optimize the proper evaluation
criterion (which has yet to be identified). I anticipate that such algorithms with
domain adaptation would be close to the algorithms developed in this work.
For different choices of the hold-out set, RMSE will differ. I will list all the RMSE vari-
ants appearing in this work, together with RMSE variants that often appeared in other works
dedicated to the Netflix Prize task:
• RMSEqual - one half of qualifying.txt (which has the same distribution as probe.txt)
is used as the hold-out set. This evaluation score, also called the quiz score, was
reported by the automatic system evaluating the Netflix Prize submissions, and is
the most common kind of RMSE score appearing in papers describing solutions to
the Netflix Prize task. We should emphasize that including probe.txt in the training
set largely improves the RMSEqual score.
• RMSEtest - the other half of the qualifying.txt set (the rest of the 2,817,131 ratings after
excluding the quiz set). RMSEtest was used as the final score in the Netflix Prize
competition. RMSEqual was lower than RMSEtest by 0.0006-0.0014, with the largest dif-
ferences for those top teams who combined their solutions using the quiz scores
(RMSEqual) of individual methods, thereby causing overfitting.
• RMSEprobe - probe.txt, containing 1,408,395 ratings, is used as the hold-out set. The
probe and qualifying sets come from the same distribution and contain at most
the 9 latest ratings of each user (randomly split in a 1:2 ratio between the probe and
qualifying sets, with the qualifying set further split into the quiz and test sets).
Some teams used probe.txt as a hold-out set for the initial training of methods and
for computing blending parameters, then retrained each method on the
whole training.txt set, and combined the resulting predictions using the blending
parameters calculated earlier on the probe set. There are differences in distribution
between probe.txt (and qualifying.txt) and the rest of training.txt without probe.txt.
The most visible differences are in the distribution of user support (see section 3.5
“A closer look at the data”), the distribution of rating dates, and the difference
between the averages of the training and probe sets (section 4.2.1), but, with minor
exceptions, no way was found to use these differences to improve the accuracy of more
complex methods.
• RMSE15 - the RMSE score appearing most frequently in this work. Using the
whole probe.txt as a hold-out set was an inefficient use of data. A better idea was
to include most of probe.txt in the training set. The chosen proportion of 15% (211,365
ratings) was a balance between using the largest possible hold-out set to blend many
methods, and using the largest possible training set to obtain good predictive accu-
racy. The same RMSE15 score was used in the earlier paper [Pat07]. An advantage
of using RMSE15 is that methods are trained only once, while not losing too much
accuracy from removing 15% of the probe set from training (it was much better than
removing 100%). RMSE15 of individual methods was lower by 0.0025-0.0035 than
RMSEqual. The difference was larger for the blends, because blending introduced addi-
tional overfitting.
• RMSE10 - a convenient idea of the Gravity team was to draw the 10% of probe.txt
in such a way that RMSE10 ≈ RMSEqual [Tak08a].
• RMSEtrain - one can also speak of measuring RMSE on the training set, but this
value does not mean much, because of overfitting to the data. Because
the degree of fitting the data varies between methods (it is especially apparent in
kernel methods, which fitted the data very closely), RMSEtrain cannot be used to
compare methods or to tune the parameters of a method. In this work training RMSE
will not be reported.
• RMSEsmall - calculated on a subset of the Netflix data, described in the previous
section 3.3. The subset contains 10,059 users and 500 movies, and is 56% dense. Of
it, 10059 · 500/2 ratings are used as the training set, and the remaining ratings
become the test set, on which the RMSE is computed.
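For concreteness, a minimal sketch of the hold-out RMSE computation underlying all of the variants above (the array names are illustrative):

```python
import numpy as np

def rmse(predictions, ratings):
    """Root mean squared error on a hold-out set."""
    return np.sqrt(np.mean((predictions - ratings) ** 2))

# predictions and ratings are aligned vectors over the hold-out
# (user, movie) pairs, e.g. the 15% probe subset used for RMSE15.
predictions = np.array([3.8, 4.1, 2.9])
ratings     = np.array([4, 4, 3])
print(rmse(predictions, ratings))
```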
We should note that an RMSE value is data-specific: of course there is no point in
comparing RMSE values computed on different datasets, or in reporting the RMSE of a single method on a
rare or unpublished dataset (such a value tells the reader little about the accuracy).
RMSE computed on a fixed hold-out set serves only to compare different methods on
the same dataset. One advantage of focusing on prediction for the Netflix Prize task is,
because of its popularity, the possibility of comparing results with many methods developed
by others.
In general, instead of using one hold-out set, it is better to use cross-validation, that
is, to repeat learning for several hold-out sets and average the results. Because of the size
of the Netflix Prize dataset, most methods took a long time (5-15 h) to learn, and it was
more convenient to run the methods only once on a chosen single hold-out set.
The BellKor team noticed that it is possible to blend methods learned on all
available data, without an additional held-out test set [Tos09]. The trick is to use the RMSEqual
of each method, returned by the automatic Netflix Prize evaluation system. The remaining
statistics needed to calculate the linear blend can be computed on the training set.
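A sketch of the idea, under my own assumptions: plain least squares (no ridge term), and an estimate of y^T y obtained separately, e.g. from the quiz RMSE of a simple reference predictor. The variable names are mine, not from [Tos09].

```python
import numpy as np

def blend_weights(X, rmse_qual, yTy):
    """Linear blend weights without rating labels on the blend set.

    X         : n x m matrix of m methods' predictions on the quiz set
    rmse_qual : quiz RMSE reported for each method by the evaluation system
    yTy       : estimate of sum(y^2) over the quiz set
    """
    n = X.shape[0]
    # ||p_i - y||^2 = n * RMSE_i^2  =>  p_i^T y = (p_i^T p_i + y^T y - n * RMSE_i^2) / 2
    XTy = (np.sum(X ** 2, axis=0) + yTy - n * np.asarray(rmse_qual) ** 2) / 2.0
    XTX = X.T @ X
    return np.linalg.solve(XTX, XTy)
```

The point is that X^T X uses only the (label-free) predictions, while each p_i^T y is recovered from the reported quiz RMSE.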
Imprecision in the choice of the task (and the awareness that the task is not 100% relevant
to the goal of building a good recommender system) is not a large obstacle, because
the main goal of this work is developing and systematizing general principles, insights and
methodologies for solving a real-life prediction task. The task of minimizing RMSE on a
hold-out set is, in a way, elegant in its minimalism and simplicity. It is “clean”
for a task concerning real-life data. Throughout the work we will focus on the aspect of
predicting held-out data, and we will assume that RMSE on held-out data is a good enough
criterion for evaluating accuracy. If it turned out (which is unlikely) that
modelling the missing data structure is of large importance for the quality of recommendations,
it would not matter much for the conclusions of this work, because the center of our interest
is closer to searching for the most accurate prediction methods for a fixed task (to better
understand the prediction domain, generalize, and develop general approaches for different
prediction tasks) than to obtaining the best possible usability in one fixed real-world application.
Summarizing shortly, we will minimize RMSE on held-out observed data. This cri-
terion will allow us to sufficiently explain the structure of the data. To use the developed
methods in a recommender system it is necessary to correct the predicted expected rating
by penalizing a large variance of prediction (for example, by subtracting a multiple of the
estimated standard deviation of the prediction, see sections 4.2.4, 6.1, 6.2.3). In my experience,
using an accurate method optimizing RMSE, like the regularized SVD, with a correction
for the variance of predictions is good enough to create a reasonably accurate recommender
system. Additionally, penalizing likely missing data (for example with the adjustments
described in section 6.2.3) can possibly improve recommendations.
[Figure: comparison of the training (Tr) and probe/qualifying (Pr/Qu) distributions; vertical axes: frequency.]
The distribution of the gathered user preference data can differ considerably between
datasets. First, different types of user preferences can be gathered. Most popular
are ratings, possibly with an additional “not interested” option. Other popular interfaces
are “like” buttons, or a binary like vs. don’t-like choice. Ratings may be collected on dif-
ferent scales: 1-5 or 1-10, on an integer scale, or with 0.5 steps. Even if ratings are collected
on the same scale, the distribution can differ depending on the type of items, the way the
data are collected, the user interface, the choice of items presented to a user for rating, or
on whether they are ratings from a recommender system, ratings from voting
for top items, or ratings given to help determine a collective opinion about an item.
The article [Mar09] compares the distributions of ratings in several datasets: the Yahoo!
dataset of music ratings, and the EachMovie, MovieLens and Netflix datasets containing
movie ratings. The Yahoo! dataset contained both user-selected items and items selected
randomly. In the user-selected part of the Yahoo! dataset the most frequent ratings were 1
and 5, on a 1-5 scale, but for randomly selected items the distribution of ratings was
very different: more than 50% of the ratings were ones, and less than 5% were fives.
Clearly, this is a large difference from a scenario of data missing completely at random –
there is a large bias towards selecting for rating items liked by the user (and rated highly).
In a real recommender system predictions are usually made for all items, so predictions for
a missing completely at random (MCAR) setting seem more appropriate than predictions
for user-selected data (on the other hand, it may be possible to sufficiently explain the
missing data mechanism on the basis of the observed data, because users tend to rate
movies they like). The Netflix Prize data provides only user-selected test data, so we
cannot observe distributions and evaluate the trained methods on randomly missing data,
but the phenomenon of selection bias can be observed indirectly, as seen later in figure 15,
which shows the average user rating decreasing with an increasing number of movies rated. The
conclusion from figure 15 is that if a user were forced to rate all movies, his average
rating would be very low. It requires further analysis to say whether recommendations in a real
recommender system should be corrected by decreasing the predicted rating for items
unlikely to have been seen. Such a correction was used in the recommender system described in section 6.2.3.
Let’s look at the distribution of the number of ratings per user. In the training set
there are few users who rated many movies and many users who rated few movies.
The histogram of the logarithm of user support in figure 4 resembles a normal distribution.
Figure 5 shows the number of users with a given support in the combined training, probe and
qualifying sets. Figure 6 is a zoomed version of figure 5 for users with support < 100.
Figure 5: Count of users with given support (support < 1000). Figure 6: Count of users with given support (support < 100).
Table 1 shows the number of ratings per user in the Pr+Qu set. Most users have exactly 9
ratings in the Pr+Qu set, so we can say that, in our task of minimizing RMSE on the test
set, the accuracy of rating prediction for each user is equally important. In contrast, in
RMSEtrain, calculated on the training set, users who rated more movies have a larger
share in the mean squared error.

Table 1: User support in Pr+Qu.

Ratings in Pr+Qu:  1     2     3     4     5     6     7     8     9       Sum
Number of users:   3411  2572  2230  2237  2346  2509  2940  3527  458417  480109
Now let’s look at the distribution of the number of ratings for each of the 17,770 movies.
We can distinguish popular, mainstream movies and movies from the “long tail”. Table 2
shows the 10 most popular movies. Figure 7 is a histogram of the logarithm of movie
support in the training set, and figure 8 is the same histogram for the Pr+Qu set. Unlike
for users, for movies the two sets have a similar distribution. In the task of minimizing
RMSE on the qualifying set, accurate prediction for a popular movie is more important
than accurate prediction for a rarely rated movie.
[Figures 7 and 8: histograms of movie support; horizontal axes: movie support (log scale), vertical axes: frequency.]
The observed shape in figures 7 and 8 suggests that movie support follows a power
law distribution, but the log-log plot (figure 9) shows a relationship different from linear
between the logarithm of the rank of movie support and the logarithm of movie support.
In figure 10 a relationship close to linear was obtained by plotting the square root of
movie rank against the logarithm of movie support. Hence the law that explains the
behavior of the long tail in the Netflix Prize data is described by a distribution with pdf
approximately p(n) ∝ e^(−a√n), where n is the movie frequency rank and p(n) is the movie
frequency in the dataset – a different distribution from the commonly observed power law
distribution p(n) ∝ n^(−a). It should be noted here that Netflix added new movies to their
offer gradually. The Netflix collection contained about 17,770 DVDs available to rent and
rate in 2005, compared with 4,470 movies in 2000 [Tan09] (we can speculate that
it was not a uniformly random subset of the 17,770 movies, but was biased towards
more popular movies – popular movies were available for a longer time, further amplifying
their popularity).
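A minimal sketch of the diagnostic described above; the data-loading step is a placeholder (movie_support is assumed to be a vector of per-movie rating counts):

```python
import numpy as np
import matplotlib.pyplot as plt

# movie_support: number of ratings per movie, assumed already computed
movie_support = np.loadtxt("movie_support.txt")   # hypothetical file
support = np.sort(movie_support)[::-1]            # sort by decreasing popularity
rank = np.arange(1, len(support) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(np.log(rank), np.log(support))           # a power law would be linear here
ax1.set(xlabel="log(movie rank)", ylabel="log(movie support)")
ax2.plot(np.sqrt(rank), np.log(support))          # close to linear for p(n) ∝ exp(-a*sqrt(n))
ax2.set(xlabel="sqrt(movie rank)", ylabel="log(movie support)")
plt.show()
```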
Figure 9: log(movie rank) vs. log(movie support). Figure 10: sqrt(movie rank) vs. log(movie support).
Table 3 shows the distribution of users and movies in the training set. Users were split
according to their support (the number of ratings) into 10 equinumerous groups; similarly,
movies were split into 10 groups according to their support. The already mentioned long
tail situation is visible: in the training set the 10% most frequent users gave 44.04% of all
ratings, and the 10% most frequent movies received 76.66% of all ratings.
Table 4 shows an analogous distribution of users and movies in the Pr+Qu set. Users
and movies were split into groups according to their support in the training set.
The next plots explore the date coordinate. Figure 11 shows the count of ratings on each day
in the training set. Figure 12 is the same, but for the Pr+Qu set. Figure 13 shows a
histogram of the logarithm of the number of ratings on each day in the training set.
Figure 14 shows the number of ratings on each weekday. Most ratings were given
on Tuesdays, and the fewest on Saturdays and Sundays. This contrasts with the traffic
statistics of the Netflix.com website, which show that the traffic is highest
on Saturdays and Sundays (this can be explained by the fact that in the Netflix system,
after a user returns a DVD, he gets an e-mail encouraging him to rate the movie).

Figure 13: Histogram of log(ratings per day) in the training set. Figure 14: Rating count per weekday.
Table 4: Distribution of users and movies in the probe and qualifying sets (rows: user groups, columns: movie groups, both ordered by decreasing support in the training set).

        1       2       3      4      5      6      7      8      9      10     Sum
 1    5.87%   2.13%   0.87%  0.53%  0.34%  0.18%  0.14%  0.09%  0.06%  0.06%  10.26%
 2    6.89%   1.69%   0.66%  0.37%  0.23%  0.14%  0.10%  0.06%  0.05%  0.05%  10.24%
 3    7.26%   1.47%   0.58%  0.35%  0.22%  0.13%  0.09%  0.07%  0.05%  0.05%  10.27%
 4    7.47%   1.32%   0.56%  0.32%  0.21%  0.13%  0.10%  0.07%  0.05%  0.05%  10.27%
 5    7.59%   1.24%   0.53%  0.33%  0.21%  0.13%  0.10%  0.07%  0.05%  0.05%  10.30%
 6    7.63%   1.21%   0.54%  0.33%  0.21%  0.13%  0.09%  0.06%  0.05%  0.05%  10.30%
 7    7.90%   1.11%   0.50%  0.30%  0.19%  0.12%  0.09%  0.06%  0.04%  0.04%  10.35%
 8    8.15%   0.99%   0.45%  0.26%  0.17%  0.11%  0.07%  0.05%  0.04%  0.03%  10.34%
 9    8.21%   0.92%   0.42%  0.26%  0.18%  0.11%  0.07%  0.05%  0.04%  0.03%  10.30%
10    5.57%   0.75%   0.38%  0.23%  0.15%  0.10%  0.07%  0.05%  0.04%  0.03%   7.38%
Sum  72.54%  12.84%   5.47%  3.30%  2.10%  1.28%  0.92%  0.64%  0.48%  0.43%  100.00%
Figure 11: Count of ratings in time – training set. Figure 12: Count of ratings in time – probe and qualifying set.
Table 5 shows the percentage of ratings given in each year of the training set, for each of
10 equinumerous groups of users, split according to the decreasing number of ratings of a
user. Ratings from 2005 dominate the dataset, making up between 43% and 72% of all ratings
in the different user groups. Table 6 is the same, but for the Pr+Qu data.

Table 5: Distribution of the rating year in the training set, for users grouped by support (10 groups).

Table 6: Distribution of the rating year in the probe and qualifying sets, for users grouped by support in the training set.
Because this paper mainly addresses rating prediction, we will next look at how ratings
change along with the remaining three variables: users, movies and date.
Figure 15 shows the average rating changing with user support. We see that the
average decreases with an increasing number of ratings given. A fitted linear relationship is
marked on the plot (note that it may be inaccurate on the right side of the plot). Assuming
that the average decreases approximately linearly with increasing user support, we
can try extrapolating the average to a situation where a user has rated all 17,770 movies. In
this case it seems that the mean user rating would be less than 2 on average. A similar
pattern, with the average user rating decreasing with an increasing number of ratings, was also
observed in [Dro11] in a dataset with music ratings. The drop in the average can be explained
by a strong tendency of users to rate items they like: if a user rates more movies, he
is forced to evaluate movies he does not like, and his rating average decreases. The article
[Mar09] describes the case of the Yahoo! Music data, where, with uniformly random selection
of an item for rating, the most frequent rating was the minimal rating 1.
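A sketch of the extrapolation described above; the slope and intercept are made-up placeholders for illustration, not coefficients estimated in this work:

```python
# Hypothetical linear fit of mean user rating vs. user support,
# mean_rating ≈ intercept + slope * support (figure 15 suggests a
# negative slope; these coefficients are illustrative only).
intercept, slope = 3.8, -1.1e-4

support_all_movies = 17770
print(intercept + slope * support_all_movies)   # extrapolated mean, below 2
```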
[Figure 15: user support vs. mean of user means; horizontal axis: user support (0-5000), with group sizes n = 357k and n = 13k marked.]

Figure 16 shows that the average movie rating increases with increasing movie support. The
relationship between the average rating and the logarithm of movie support is approxi-
mately linear.

[Figure 16: log movie support vs. movie rating mean; horizontal axis: movie support (64 to 262k, log scale), vertical axis: mean of movie means.]
Figure 17 shows the average rating over time. We notice a jump in the average rating in early
2004, which becomes explained once the movie means are modelled. A probable explanation for
the jump is the addition of highly rated TV series and movies to Netflix’s selection in early 2004
(I have not verified this presumption). Another possible explanation is a change in
Netflix’s rating interface (also not verified).

[Figure 17: average rating over time; vertical axis: average rating (about 3.1-3.5).]
I have listed the first, basic visualizations and summaries of the dataset. As will be
seen in later sections, only some of these plots are relevant to the task of maximizing
predictive accuracy, which we are primarily interested in. Nevertheless, presenting and com-
menting on the basic effects, most easily observed in data consisting of just four variables –
ratings, users, movies and date – can be helpful.
The simply put task of maximizing predictive accuracy by modelling
E(rating|user, movie, time) or E(rating|user, movie) runs deeper. As we shall see later in
this work, most important for accuracy is modelling a large number of hidden variables,
capturing such properties of the data as similarity between items and similarity between users,
mainly through dimensionality reduction. The different ways of explaining the hidden structure
of matrix- or tensor-type data of this kind constitute the field called collaborative filtering.
“Free your mind Luke Skywalker”
vzn
Described will be mainly (with a few exceptions) the methods that improved the accuracy
of the ensemble – and these were fewer than 10% of all implemented variants of methods.
While developing the methods I did not have the didactic perspective in mind, only
the ensemble performance.
First, the multidimensional regularized SVD model (also called matrix factorization)
will be thoroughly described. In the regularized SVD the hidden per-movie variables rep-
resent automatically learned analogues of movie genres, and the hidden per-user variables
state the user’s preferences for the automatically identified genres. SVD results can be
postprocessed by other methods to improve accuracy, and SVD features also have appli-
cations beyond prediction: finding similar movies and similar users, clustering movies or users,
visualization, suggesting new genres, etc. (see section 4.3.4 and chapter 6). I will discuss exploit-
ing the missing data structure, and using time effects to improve accuracy. I will briefly
summarize the matrix norm regularization form of the regularized SVD.
Next I move on to a short description of nonlinear methods, among which the most efficient
were Restricted Boltzmann Machines. Distance-based methods will be described, such as
K-nearest neighbors, and kernel methods like Gaussian Processes [Ras03, Ras06], as well as a
few other methods, for example, K-means.
There were attempts to augment the Netflix Prize dataset with external item metadata,
but it turned out that the ratings in the Netflix dataset carry a sufficiently large amount of
information that additional item metadata did not improve prediction. I will describe
my attempts at using IMDb and Wikipedia metadata.
Finally, I will review different ways of combining prediction methods: preprocessing and
postprocessing one method with another, integrating methods, and blending separately
learned methods. The final solution, with accuracy 8.48% better than the baseline Netflix
algorithm, was a blend of 69 methods. In the practice of building a recommender system, one
method should suffice to obtain satisfying accuracy, and, from the point of view of software
engineering and quality assurance, it is best to use only one selected prediction method.
To make it easier to reproduce the methods and verify their accuracy, in the first part, “Simple
models” (section 4.2), the accuracy will be measured by RMSEprobe (the entire probe set serving as
the hold-out set). In the remaining parts of the chapter RMSE15 is used for evaluation (see
section 3.4). The main advantage of using RMSE15 was that it allowed methods to be trained
only once, at the cost of a small decrease in accuracy in comparison with retraining the methods
on the entire training set.
The methods listed were developed to predict movie ratings, but they seem to work
equally well when applied to predicting ratings for any type of items – see, for example,
the experiments with predicting music ratings [Che11a, Jah11a, Jah11b].
4.1 Notation
The following notational conventions are used throughout the work:
R - the sparse matrix of ratings 1 − 5
S - the matrix of binary indicators, which movies were rated by which users
rij - the rating given by user i for movie j
r̂ij - the predicted rating
yij = rij − r̂ij - residuals of a predictor
ŷij - predictor of a residual (usually a rating with global effects subtracted)
N = 480189 - the number of users
M = 17770 - the number of movies
Ji - the set of movies rated by user i
Ij - the set of users who rated movie j
Mi = |Ji | - the number of ratings given by user i
Nj = |Ij | - the number of ratings for movie j
Jit - the set of movies rated by user i on day t
Ij2j - the set of users who rated both movies j and j2
ci - the bias of user i
dj - the bias of movie j
uik - the preference of user i for feature k
vjk - the value of feature k of movie j
ui - the vector of preferences of user i
vj - the vector of features of movie j
i - index for users
j, j2 - indices for movies
k - index for hidden features
Vectors are always vertical n × 1, as in the book [Has09] or in the R language envi-
ronment. For example, when we have a matrix X of dimensionality n × p, the vector xi
denoting the i-th row is vertical p × 1, not horizontal 1 × p. Vectors are denoted by bold
letters.
Real values are rounded to between one and six digits, most often to four digits.
The word “empirical” denotes estimates observed in the training set, e.g. empirical
mean, empirical probabilities.
The word “support” denotes frequency, count of data in the dataset, e.g. user support
is the number of ratings given by the user.
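As an illustration of this notation in code, a minimal sketch of how the rating data might be held in memory (a hypothetical layout, not the one used in the experiments):

```python
import numpy as np
from scipy.sparse import csr_matrix

N, M = 480189, 17770                        # users, movies

# (i, j, r) triples: user index, movie index, rating 1-5 (toy data here)
i = np.array([0, 0, 1, 2])
j = np.array([10, 42, 42, 7])
r = np.array([5, 3, 4, 1], dtype=np.int8)

R = csr_matrix((r, (i, j)), shape=(N, M))   # sparse rating matrix R
S = R.copy(); S.data[:] = 1                 # binary indicator matrix S

M_i = np.diff(R.indptr)                     # M_i = |J_i|: ratings per user
N_j = np.bincount(j, minlength=M)           # N_j = |I_j|: ratings per movie
```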
natural complement connecting the prediction methods described in this work with the
context of applications in recommender systems. To properly position an item on a list,
the uncertainty of prediction should be taken into account. We will also consider the influence
of the missing data structure.
In section 4.2.5 I proceed to the analysis of the regularized SVD with biases in its
simplest version, with only one feature. The regularized SVD and other kinds of
matrix factorization turned out to be the most efficient approaches to the Netflix Prize
task, so it is useful to examine these kinds of methods more closely, to justify the form of the
SVD model and the choices made while modelling and learning the parameters. I will ask
myself these questions: why use a structure that multiplies the hidden variables? Which
prior distributions should the hidden variables have? I will consider different possible
ways of learning the parameters: different approximations of the Bayesian approach, and
different kinds of regularization in the neural-networks-like approach.
To evaluate the simple methods in the following sections, the entire probe set will be
used as the hold-out set (accuracy is measured by RMSEprobe instead of the RMSE15 used
in the rest of the work).
(MAP) estimate. The Bayesian approach requires specifying prior beliefs. We have some
prior expectation about where the expected value lies, e.g. because we know that the
ratings are in the range 1-5, but here we use a simple flat prior, which leads to maximum
likelihood (ML) point estimation. It is one of the very few places in this work where a
method from classical statistics is good enough. Usually it is visibly better to use a non-
flat prior, centered around some value (usually around zero), which results in obtaining a
MAP, regularized estimate. Also, point estimation is often too large a simplification –
sometimes it is necessary to use a Bayesian or approximately Bayesian approach instead.
The resulting ML estimate of the mean value µ on the data distribution is simply
the average over the combined ratings from the training.txt and probe.txt sets:
µ̂ = R̄ = (1/|Tr+Pr|) Σ_{(i,j)∈Tr+Pr} rij = 3.6043.
There is more to the story. The two datasets training.txt and probe.txt have a dif-
ferent structure of missing data. We already observed the difference between the empirical
probabilities of each rating 1-5 (figure 3 in section 4.2.2).
The average rating in the training set is R̄1 = (1/N1) Σ_{(i,j)∈Tr} rij = 3.6033, with
N1 = |Tr| = 99,072,112, and the average rating in the probe set is R̄2 = (1/N2) Σ_{(i,j)∈Pr} rij = 3.6736,
with N2 = |Pr| = 1,408,395.
One might wonder whether the difference in averages is a result of a random fluc-
tuation or of a difference in distributions. To check this, a standard statistical significance
test from classical statistics [Kor06] will be good enough (proper Bayesian hypothe-
sis testing [Mad03] may be more accurate). We make the same assumptions as earlier
about independence and normality, and assume that the two datasets come from normal distri-
butions with equal variance but possibly different means. We want to check the hy-
pothesis that the means of the two datasets are equal, µ1 = µ2, against the alternative hy-
pothesis µ1 ≠ µ2. The sample standard deviation on the combined training and probe sets (Tr+Pr)
is S = sqrt((1/(N1+N2)) Σ_{(i,j)∈Tr+Pr} (rij − R̄)²) = 1.0852, where R̄ = 3.6043 is the average on
Tr+Pr. The statistic T = (R̄1 − R̄2)/(S · sqrt(1/N1 + 1/N2)) under the null hypothesis has a
Student’s t distribution with N1 + N2 − 2 degrees of freedom. With so many degrees of
freedom, the Student’s t distribution is very close to the normal distribution N(0, 1). For
our data we get a very large value, |T| = 76.33, which tells us that the null hypothesis of
equal means in the two datasets is almost impossible. We made a few simplify-
ing assumptions in this test, but the conclusion about the significant difference of means in
the datasets would be the same if the test were done more correctly, assuming a discrete
distribution and using the Bayesian approach (we should note that the whole idea of
running a test to select one model is non-Bayesian).
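The same computation as a minimal sketch; it needs only the summary statistics quoted above, not the raw data:

```python
import math

N1, N2 = 99_072_112, 1_408_395
R1, R2 = 3.6033, 3.6736
S = 1.0852                      # pooled sample standard deviation on Tr+Pr

T = (R1 - R2) / (S * math.sqrt(1 / N1 + 1 / N2))
print(T)   # about -76.3; |T| this large makes the equal-means hypothesis untenable
```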
Knowing that the means of the two datasets differ, we can wonder whether domain adapta-
tion could be used to learn a more precise mean for the probe set, and, more generally, whether the
algorithms trained on the training set could be adapted to give more accurate predictions on
the probe set. The general experience of practitioners was that attempts at domain adap-
tation did not improve prediction accuracy on the Netflix Prize dataset. This is because
the developed models more complex than the single global mean have enough structure to
explain the differences between the training and Pr+Qu sets. The only methods used that can
be seen as domain adaptation were blending, on the test set, the predictors learned on the
training set; there were also small improvements in accuracy from using dates to adapt
methods to the test set [Bel08].
[Dau07] points out two extreme situations in domain adaptation: 1) the training data is
from the same distribution as the test data, or 2) the training data distribution is completely
unrelated to the test data distribution. Usually the situation with analyzed data is somewhere
in between 1) and 2). The Netflix Prize dataset is very close to case 1). For our toy
task of estimating the mean of the probe set distribution there certainly is some “right” way
of using the training set, which is over 60 times larger than the probe set, but I have not
performed any experiments with it (for example, the training data could be used to help define
a prior for the mean of the test data). [Dau07] lists and compares several possible simple
methods of domain adaptation: modelling the source (training) data only; modelling the
target (test) data only; modelling all data from the source and target combined; the source
and target combined with different weights; using methods learned on the source data as
features for the target data (blending on the test set, used extensively in the Netflix Prize,
is a case of this); linear interpolation of a model learned on the source data and one learned on the
target data; and using the source data to determine the prior distribution for the target data.
Another approach [Dau06] is a full probabilistic model for the source and target data that
treats both datasets symmetrically. There are also approaches weighting the training set
samples by the predicted probability that an observation belongs to the training or to the test
set [Bic07, Bic09], but we should note that the training-test relationship in the Netflix data
is not a typical situation of covariate shift, because here the data contain only outputs,
without fixed predictors as in a typical case of linear regression. The training and test sets
in the Netflix data differ in the amount of observed data per user, and also in the date
variable, because the test set contains the most recent ratings.
To summarize, the issues met while analyzing the simplest model, the global mean, were:
the data is not missing completely at random; we made simplifying assumptions about
the normality and independence of the data samples; and we observed a difference between the
training distribution and the test distribution, which can be ignored, or one can try domain
adaptation. There are also methodological questions: which general modelling approach
should be used? Which simplifications of the Bayesian approach should be made? Is calculating
the entire posterior distribution necessary, or is point estimation good enough?
All of the above questions and issues will recur in a more complex form when analyzing
more complex models. Now I will look closer at approximating ratings with a normal
distribution, which was done in most of the later models.
like methods based on cost function minimization, which can be understood as
modelling the expected rating directly, without precisely specifying the posterior distri-
bution of the output (that is, without estimating the output variance). The expected value of the
Gaussian approximation was usually clipped to the range 1-5 [Fun06], or sometimes
transformed through a sigmoidal link function to limit the range of the output variable
[Pio09]. The matrix factorizations typically used as the hidden structure were sometimes modi-
fied by additional user-specific parameters, for example a per-user scaling parameter
[Kor09b, Pio09] or a transformation through a third-degree polynomial [Pio09].
In this section I examine the issue more closely: what is the right way to generate the output
variable (ratings)? Determining the proper way of modelling the output is one step on the way
to discovering which process plausibly generated the data. Creating a model for generating
ratings is an intermediate goal for the task of minimizing expected MSE (I do not need
to model accurately those aspects of the data that are unrelated to the chosen prediction
task). The goal I set myself is to build models capable of generating ratings with a
distribution possibly similar to the observed ratings, but using as few parameters as possible,
because using too many parameters complicates the methods, and often introduces
overfitting, which can deteriorate predictive accuracy. Looking at the different
possible kinds of outputs in efficient methods applied to the Netflix Prize task, we have
several possibilities to verify: using a clipped and rounded normal variable with
one hidden parameter (the mean) or with two hidden parameters (mean and variance);
modelling each of the probabilities of ratings 1-5 separately, with 4-5 variables per
rating (close to the latter approach is treating the rating as an ordinal variable); or
modelling the output as a binomial variable, with one or two hidden parameters
(mean and possibly range).
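For illustration, a minimal sketch of the first of these output models – the probabilities of ratings 1-5 under a Gaussian N(µ, σ²) rounded to the nearest integer and clipped to 1-5 (the function name and the example parameters are mine):

```python
import numpy as np
from scipy.stats import norm

def rating_probs(mu, sigma):
    """P(rating = 1..5) for a N(mu, sigma^2) variable rounded to the
    nearest integer and clipped to the range 1-5."""
    # Rounding gives cut points at 1.5, 2.5, 3.5, 4.5; clipping sends
    # all mass below 1.5 to rating 1 and all mass above 4.5 to rating 5.
    cuts = np.array([-np.inf, 1.5, 2.5, 3.5, 4.5, np.inf])
    cdf = norm.cdf(cuts, loc=mu, scale=sigma)
    return np.diff(cdf)

print(rating_probs(mu=3.6, sigma=1.1))   # five probabilities, summing to 1
```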
In this section I will visually examine the empirical probability distributions for users and
for movies. Then I will present experimental results on nonparametric models that approx-
imately generate the data. In those models the output is a normal distribution, rounded and
clipped to the range 1-5. The empirical distributions in the generated data will be compared with
the observed data. I will conclude whether the clipped normal distribution is a good way to
model the output, and whether it is enough to model the hidden mean, or whether modelling the
variance is necessary. Further simplification of the methods is possible by directly modelling
the conditional expected value E(rating|user, movie, time).
We can wonder whether the situation for the simple models analyzed in this section is similar
for the more complex models that model the ratings more accurately. I have not verified it
– I assumed that for more complex models, with MSE smaller by 10-20%
(10-20% more variance explained), the conclusions about the output variable would be
similar. Of course, it is possible that with the larger pattern-modelling capability that comes
from using more parameters in the hidden structure, using more parameters for the output variable
could give an accuracy improvement unnoticed in the simpler models.
Let’s look first at the observed rating distributions for a few movies and users. Plot 18
shows the empirical rating probabilities of a few example movies chosen among the 1000
most frequently rated movies in the Netflix Prize dataset. “Empirical rating probabilities”
denotes the estimated distribution of ratings in the observed data, calculated here as max-
imum likelihood (ML) point estimates of the probability of each rating for a given movie.
The examples chosen are extreme: distributions with the largest “jumps” (largest differences
between adjacent ratings), and distributions with the largest variance. The examples were
chosen among the 1000 most frequent movies to make sure that the observed distributions
are close to the real underlying distributions, and are not a result of random fluctuations.
The bottom of plot 18 shows the distributions of the eight most frequently rated movies
for comparison.
Figure 18: Movies (1000 most frequent, min. support = 25773): example empirical rating probabilities. [Panels: distributions with the largest jumps; distributions with the largest variance (Fahrenheit 9/11, Napoleon Dynamite, I Heart Huckabees, Moulin Rouge); the most frequently rated movies.]

We conclude from plot 18 that a Gaussian distribution with an appropriately
chosen mean and variance, clipped to the range 1-5, seems to be a good tool for modelling
the rating distribution of a movie. The shown extreme distributions do not contain movies
that visibly could not be modelled this way. A binomial distribution, or a mixture of
binomial distributions, also seems to be a good choice.
Let’s look at an analogous plot of the empirical probabilities for the 1000 most frequent
users in the dataset. Figure 19 shows the distributions of chosen users from that group:
distributions with the largest jumps, and the most frequent users for comparison. Users are num-
bered according to their support. The smallest support (number of ratings) in
the group of the 1000 most frequent users is 2082. This value is large enough to say that the
visible characteristics of the extreme distributions are not a result of randomness in the data,
but reflect the real shape of the underlying user-specific distribution.
[Figure 19: example empirical rating probabilities for users; panels include distributions with the largest jumps, the largest variance (users 146, 667, 62, 808), and the most frequent users (users 1-4).]
Users who give the most ratings may not be representative of the whole set of users – they
may have a different rating distribution than users who give fewer ratings. For example, it
is likely that many of the users with many ratings give a fixed rating to movies they have never
seen. Figure 20 shows the extreme empirical distributions in the group of users with
frequency ranks 50,000-51,000. The smallest number of rated movies in this group is 517.
We see that the extreme distributions in this group look similar to those in the previous group
of most frequent users. The differences are small; in particular, there are no distributions
with a large probability of rating 2.
[Figure 20: extreme empirical rating probabilities in the group of users with frequency ranks 50,000-51,000; largest-variance panels: users 50951, 50412, 50843, 50991.]
The plotted extreme empirical distributions for users (figures 19 and 20) differ consid-
erably from the extreme distributions for movies (figure 18). The binomial distribution on
the range 1-5 fitted the movie distributions fairly well, but it cannot be used to model the
user distributions. A clipped and rounded normal distribution could approximately generate
the observed user distributions with an appropriate choice of variance: very small
or very large. The overall conclusion from the plotted example distributions is that we
should expect many user-specific patterns that visibly differ from typical one-parameter
or two-parameter distributions. The capability of modelling such patterns may improve
accuracy. This is a probable reason for the efficiency of the RBM method, which models
each rating separately. Variants of RBM significantly improved the accuracy of the ensemble
of many efficient methods, so it is likely that they learn something that is not learned by
other methods (most of the other methods included in the accurate ensembles used fewer
parameters than RBM to model the output distribution).
Let’s look at plots visualizing much larger groups of users and movies. Figure 21
shows scatter plots of the empirical probabilities for a subset of 20,000 users randomly
drawn from the 100,000 users with the most ratings, and analogous scatter plots for all 17,770
movies. 2 × 10 scatter plots are presented, visualizing the dependencies between the empirical
probabilities of each pair of ratings: p̂i and p̂j, i, j ∈ {1, 2, 3, 4, 5}.

Figure 21: Empirical rating probabilities in the observed data – scatter plots.

We approach the question: how best to model the output? Which probabilistic model
could generate similar data? Of course, we could model the probability of each rating
separately, using 5 parameters and normalizing the resulting probabilities to make them sum
to 1, or using 4 parameters, where the probability of each rating is dependent on the
four remaining ratings. But we see in figure 21 and in the plots of extreme distributions
(figures 18, 19, 20) that there is a large degree of dependence between the rating probabilities
in every possible pair of ratings. The observed distributions are very distant from, e.g., a
Dirichlet distribution, where pairs of probabilities are close to independent. The
observed dependencies suggest that it is possible to model the output variable with fewer
than 4-5 parameters per rating. We want to reduce the number of parameters because, of
two models with a similar capability of explaining the effects and dependencies in the data, usually
the smaller model has better accuracy. The plots suggest that a large part of the variability
of the observed distributions can be explained by one location variable (for example, the
mean of a normal distribution).
To verify to what extent the mentioned dimensionality reduction is possible, I devel-
oped four simple models for rating generation in a multitask learning setting with joint
nonparametric priors on the hidden parameters. The models are: hidden user mean, hidden
movie mean, hidden user mean and variance, and hidden movie mean and variance. I will
describe the user mean model in more detail; the remaining three are analogous.
The user mean model learns one parameter (the mean) µi per user, and generates all of a user's
ratings using just this one parameter. It is a hierarchical model with a multitask learning
structure, with a common prior η for each user’s hidden mean µi. Good practice in prob-
abilistic modelling is to use a method with more parameters first, and then, after noticing a
pattern, regularity or known distribution, attempt to reduce the model to a smaller or sim-
pler one. We will use a nonparametric prior η in the form of a Dirichlet process, using a
uniform distribution on a fixed grid of values as the base distribution (the only parame-
ter of the Dirichlet process). “Nonparametric” means a model capable of increasing the
number of its parameters with an increasing number of observations (unlike parametric models,
like the Gaussian distribution, which have a fixed number of parameters – mean and variance). In
our setting this method essentially works like a histogram, but a classical histogram is cal-
culated on observed data, and here we calculate it on unobserved variables, using the set
of drawn samples as a density estimator.
The parameters of the model are learned iteratively, with an empirical-Bayes-type ap-
proach. The method is analogous to the nonparametric methods used later in sections 4.2.3
and 4.2.5. In iteration K the prior η̂_{K−1} learned in the previous phase is used to draw
samples from the posterior distribution of each µi, using the Metropolis-Hastings algo-
rithm. We assume that the ratings are generated from a Gaussian distribution N(µi, σ²),
rounded to the nearest integer and clipped to the range 1-5, and this assumption is used
to calculate the likelihood of observing the data; for example, the probability of user i giving
the rating 1 in this model is p(Di = 1|µi) = (1/(σ√2π)) ∫_{−∞}^{1.5} exp(−(x − µi)²/(2σ²)) dx. The
parameter σ² was arbitrarily fixed to 1.1 in the user mean model and to 1.5 in the movie
mean model. Then, taking the empirical Bayes Monte Carlo approach, one sample from
each user’s posterior distribution is used for (inexact) maximum likelihood estimation of
the common prior η̂_K. On average 1/e ≈ 36.8% of the samples would not be drawn when sam-
pling with replacement; because I did not want generated values to disappear from
the prior, I sampled the set without replacement, which results in biased sampling.
The algorithm starts with an initial prior η̂_0 of N = 100,000 real values drawn from the
uniform distribution on the range [0.5, 7.0]. In each subsequent iteration K the prior
η̂_{K−1} is simply the set of N values drawn from the posterior distributions of µi in
iteration K−1, one value per user. The method was run for 200 iterations.
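A condensed sketch of one iteration of this empirical Bayes loop for the user mean model (σ² fixed to 1.1 as in the text); the toy data, the shrunken sizes, and the independence-type Metropolis-Hastings proposal drawn from the prior sample set are my simplifications of the procedure described above:

```python
import numpy as np
from scipy.stats import norm

SIGMA = np.sqrt(1.1)
CUTS = np.array([-np.inf, 1.5, 2.5, 3.5, 4.5, np.inf])

def log_likelihood(mu, ratings):
    # Ratings are N(mu, sigma^2) rounded and clipped to 1-5, so
    # P(rating = r) is the Gaussian mass between adjacent cut points.
    probs = np.diff(norm.cdf(CUTS, loc=mu, scale=SIGMA))
    return np.sum(np.log(probs[ratings - 1]))

def mh_sample(mu, ratings, prior, rng, steps=20):
    # Independence Metropolis-Hastings: proposals come from the prior
    # sample set itself, so the acceptance ratio reduces to the
    # likelihood ratio of the proposed and current means.
    for _ in range(steps):
        prop = rng.choice(prior)
        log_a = log_likelihood(prop, ratings) - log_likelihood(mu, ratings)
        if np.log(rng.random()) < log_a:
            mu = prop
    return mu

rng = np.random.default_rng(0)
prior = rng.uniform(0.5, 7.0, size=1000)                    # eta_0 (shrunk for the demo)
users = [rng.integers(1, 6, size=50) for _ in range(1000)]  # toy rating data

# One empirical Bayes iteration: the new prior is one posterior sample per user.
prior = np.array([mh_sample(rng.choice(prior), r, prior, rng) for r in users])
```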
The model for a hidden movie mean is analogous, but it is more difficult to learn the
nonparametric prior in it, because there are only 17,770 tasks (in the user model there
were 100,000 tasks).
The algorithm worked well enough for the goals of the experiment (we want to discover
the optimal shape of the prior, and observe how the model with this prior generates
ratings), but it should be noted that the form of the algorithm used has its drawbacks.
If some value disappears from the prior η_K in some iteration K, it will not appear
again in later iterations. As a result, the tails of the resulting prior distributions are far
from precise. A remedy for disappearing values could be drawing additional values
in each iteration, for example from a uniform prior (but this way we arbitrarily affect
the shape of the learned prior). Another way is to increase the number of samples taken from
the posterior distributions of µi, which form the prior η_K (in the methods used, only one
sample per user was taken). Other, more “smooth” nonparametric priors could also be used, in
which nearby values are more likely to have a similar pdf value (compared with the commonly
used nonparametric methods, our method was close to calculating a histogram, and we
could use kernel density estimators instead).
Figure 22 shows scatter plots of the empirical probabilities (black dots) for the hidden mean model. The blue lines mark the probabilities with which the Mi ratings were drawn for each user i. The left plots show 20,000 users randomly chosen from the 100,000 users with the most ratings, for which the model was built. The right plots show all 17,770 movies in the hidden movie mean model. It is apparent that the empirical probabilities in the data generated by the learned Mean model have shapes similar to the empirical probabilities for the original data, shown in the earlier scatter plots (figure 21), but there are differences between these two groups of empirical distributions. The approximation by the Mean models is inaccurate, because having only one hidden parameter allows changing the probabilities only along a fixed line (marked with the blue line on the plots).
A model with two hidden parameters per task, a mean µi and a variance σi², with a nonparametric prior S on the variance, turned out to approximate the observed data better. The prior distributions are learned as in the previous model, with empirical Bayes, using the Metropolis-Hastings algorithm to generate posterior samples of µi and σi². Figure 23 shows the empirical probabilities for data generated by the Mean+Variance model for users (left) and by the Mean+Variance model for movies (right). Visually, the plots for the Mean+Variance model (figure 23) are much more similar to the original data (figure 21) than the data generated from the Mean model (figure 22) was.

Figure 22: Empirical rating probabilities in the data generated from Mean models – scatter plots
Let's examine the plots visualizing the learned nonparametric priors η and S for the Mean+Variance model for users. Figure 24 visualizes the priors for the mean and the variance with histograms. Next are displayed a QQ-plot comparing the learned prior for the mean with quantiles of the Gaussian distribution, and a plot showing how the learned variance decreases with an increasing number of ratings given by the user. Figure 25 displays an analogous set of plots for the Mean+Variance model for movies. The learned priors for the mean in both models, shown in figures 24 and 25, are similar to Gaussian distributions, and the priors for the variance are similar to inverse Gamma distributions, but they are not exactly distributions from those families – visible differences remain. It should be noted that the shape of the priors is more credible (likely closer to the optimal priors) in the areas of large probability mass.
Summarizing the subject of output modelling in the Netflix task: the most accurate models described later in this work use only one parameter to model the expected output, so it can be suspected that the number of parameters needed to generate ratings can be reduced below the 4 needed to model the probability of each rating separately. This section explored the idea of using a clipped and rounded Gaussian distribution to generate ratings. Two models were proposed, the Mean model and the Mean+Variance model, each in two versions: for users and for movies. We plotted the empirical probabilities of data generated from the four learned models. Compared with the empirical probabilities of the original data, the data generated from the Mean model visually differs from the original data – the generated data has a similar shape, but concentrates only in a certain region. The Mean model is not capable of fully modelling the original data. In turn, the data generated from the Mean+Variance model is visually very similar to the original data. I conclude that one hidden parameter is not enough to generate output similar to the original ratings, but two hidden parameters allow for a good approximation, hence we can reduce the dimension of the output variable.
Figure 23: Empirical rating probabilities in the data generated from Mean+Variance models – scatter plots

Instead of using a Gaussian distribution, it is possible to use a binomial distribution [Wu09], but in that case a way of increasing the variance would be useful, for example, treating the location parameter as a random variable with uncertainty. A binomial distribution with increased variance would be good enough for modelling the movie distributions, but it is not capable of fully modelling the user distributions. For some users a decreased variance is needed; for example, restricting the range of the binomial distribution to an interval narrower than 1–5 could be used. For modelling users with a large variance, e.g. users who give many ratings of 1 and 5, a mixture of binomial distributions could be used.
Of course, in the Netflix Prize task we are mainly interested in optimizing predictive accuracy according to the MSE loss, so we do not need an accurate model for generating ratings here; it is enough to model well the expected output E(rating|user, movie, time). Modelling the output variance parameter σ²(user, movie) [Tom07] can be useful for our prediction task – it influences the weighting of observations and can improve predictive accuracy (see section 4.3.3). Approximating the posterior variance (note that this is different from assessing the σ² parameter in the model) or the full posterior distribution of ratings is needed to calculate recommendations (see section 3.4).
Another possibility, not tried here, is treating ratings as ordinal variables [Ste09, Paq10, Kor11], which allows obtaining better RMSE accuracy than modelling just the expected value of a Gaussian variable [Paq10, Kor11]. Still, it would be useful to determine whether, for the Netflix data, a four-parameter ordinal variable is better than a two-parameter truncated Gaussian variable.
Figure 24: Learned prior distributions in the Mean+Variance model for users: histograms of the prior for mean and the prior for variance, QQ plot of mean vs. Gaussian, variance vs. log user support
In many datasets, simple models containing biases give predictions close to optimal (for example, for the two tasks of modelling match outcomes described in section 6.3). An advantage of bias models is that simple cost function optimization methods are often sufficiently accurate for parameter inference.
In this section I will examine some of the simplest models for the Netflix Prize task: models containing only movie biases, user biases and the global mean. The bias models will be a good example illustrating the methodology of prediction used in this work. While developing different predictive models we face similar questions, and the answers are clearer for simple models. I will compare different approaches to prediction and different ways of learning the parameters. Simultaneous learning of biases with regularization turned out to be a good preprocessing step or a good component of other methods, such as regularized SVD with biases.
The output variable will be approximated with a Gaussian distribution, as was done in most methods used for the Netflix task. This decision was to some extent justified in the previous section.
The analyzed models with biases have the form:
r̂ij = µ + ci + dj
where ci is the bias of user i, dj is the bias of movie j, and µ = 3.6033 is a constant –
global mean of the training set.
Accuracy is measured by RMSE_probe, that is, RMSE on the whole probe set. The predictions of the model were clipped to the range 1–5 before calculating RMSE. The RMSE of the global mean is included in the table for comparison, as a baseline showing how much of the variability in the data was explained by the bias models. The subject of predicting with a single global mean variable was discussed in section 4.2.1, along with the arguable use of the average rating and the difference between the average rating on the training set and on the probe set.

Figure 25: Learned prior distributions in the Mean+Variance model for movies: histograms of the prior for mean and the prior for variance, QQ plot of mean vs. Gaussian, variance vs. log movie support
Table 7: Models with biases only. Comparison of experimental results.
The global mean, a component of all models in this section, will be treated as a fixed constant µ = 3.6033. Treating the global mean as an additional variable with proper Bayesian inference would have no significant influence on prediction accuracy.

For simplicity, the user bias model is written as µ + ci for user i. This notation is ambiguous, as it does not specify the precise meaning of the parameter ci. Model structures specified this way will be treated either in a probabilistic way, where ci is a random variable with a prior distribution, or in a cost function minimization approach, where ci is a single-valued parameter found by minimizing a regularized cost function. For the simple bias models in this section, the Bayesian approach with a proper choice of prior distributions results in predictive accuracy similar to cost function minimization with a proper choice of regularization.
In the Bayesian treatment of the user bias model all parameters ci (one for each user) are random variables, for which we assume equal prior distributions N(0, τc²) (equal, because we assume exchangeability of users). For the ratings rij I assume, as a simplification, that they were sampled from a normal distribution N(µ + ci, σ²). I assume here that τc² and σ² are chosen constants. We are minimizing the expected MSE, and the resulting optimal prediction is the expected value of the posterior rating distribution: Erij = µ + Eci. The values Eci have the form of shrinkage estimators with a regularization parameter λc = σ²/τc²:

Eci = ( Σ_{j∈Ji} (rij − µ) ) / (|Ji| + λc)
The constant λc was set to the integer value that results in the minimal RMSE on the test set. The chosen best integer value was λc = 8. It is also possible to learn λc jointly from all tasks (learning 480,189 parameters can be treated as a set of similar tasks) – this approach of multitask learning (or empirical Bayes [Car00, Nor07]) will be used in some other models in this section.
The movie bias model µ + dj is similar. Predictions are made by Erij = µ + Edj, where dj is a movie bias. The expected values Edj have an analogous form of shrinkage estimators:

Edj = ( Σ_{i∈Ij} (rij − µ) ) / (|Ij| + λd)

As before, we set λd to the integer value minimizing RMSE on the test set (the probe set). For the movie bias model it is λd = 25.
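As a concrete illustration, below is a minimal numpy sketch of these shrinkage estimators. The data layout is assumed: user_idx, movie_idx and r are hypothetical parallel arrays over the training ratings.

import numpy as np

MU = 3.6033  # global mean of the training set

def shrunk_bias(idx, residuals, n_entities, lam):
    """E b = sum of residuals / (count + lambda), per user or per movie."""
    sums = np.bincount(idx, weights=residuals, minlength=n_entities)
    counts = np.bincount(idx, minlength=n_entities)
    return sums / (counts + lam)

# user bias model:  Ec = shrunk_bias(user_idx,  r - MU, n_users,  lam=8.0)
# movie bias model: Ed = shrunk_bias(movie_idx, r - MU, n_movies, lam=25.0)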
Let's move to the models with both biases, the user bias ci and the movie bias dj. All of these models have the form µ + ci + dj, where µ = 3.6033 is a constant, ci are 480,189 variables, one for each user, and dj are 17,770 variables, one for each movie. The simplest idea for learning the parameters, which worked well as preprocessing for more complex methods [Fun06, Bel07a], is sequential learning: first calculate the movie biases, then calculate the user biases on the resulting residuals (the method labelled movie-user in table 7):

d̂j = ( Σ_{i∈Ij} (rij − µ) ) / (|Ij| + λd)        ĉi = ( Σ_{j∈Ji} (rij − µ − d̂j) ) / (|Ji| + λc)
The reversed sequence of learning leads to worse accuracy. In the model listed in table 7 as user-movie, first the user biases are calculated, then the movie biases on the residuals:

ĉi = ( Σ_{j∈Ji} (rij − µ) ) / (|Ji| + λc)        d̂j = ( Σ_{i∈Ij} (rij − µ − ĉi) ) / (|Ij| + λd)
Better predictive accuracy than sequential learning is obtained by repeating several times the learning of user and movie biases on each other's residuals – a method labelled "MAP simultaneous" in table 7:

ĉi = ( Σ_{j∈Ji} (rij − µ − d̂j) ) / (|Ji| + λc)        d̂j = ( Σ_{i∈Ij} (rij − µ − ĉi) ) / (|Ij| + λd)
Three different sets of parameters were used for simultaneous learning. The first version, without regularization (λc = λd = 0), gives much worse results than the regularized versions. The second set of regularization parameters was optimal for the models "only user bias" and "only movie bias": λc = 8, λd = 25. The third set, λc = 7, λd = −3, was learned by an automatic optimizer minimizing RMSE on the test set (the probe set). The optimizer used was the Praxis routine from the Netlib Fortran library, which was also used for several methods described further in this work. The parameters after optimization were rounded to the nearest integer values. Interestingly, the optimal parameter λd turned out to be negative. Automatic tuning resulted in RMSE_probe = 0.9826, the lowest of the learning methods examined in this section.
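A minimal sketch of the alternating "MAP simultaneous" updates follows, with the same assumed data layout as in the previous sketch. The defaults below use the single-model values; the automatically tuned pair (λc = 7, λd = −3) can be substituted, but a negative λd requires |Ij| + λd > 0 for every movie.

import numpy as np

MU = 3.6033

def map_simultaneous(user_idx, movie_idx, r, n_users, n_movies,
                     lam_c=8.0, lam_d=25.0, n_iter=10):
    """Repeatedly refit user and movie biases on each other's residuals."""
    c, d = np.zeros(n_users), np.zeros(n_movies)
    cnt_u = np.bincount(user_idx, minlength=n_users)
    cnt_m = np.bincount(movie_idx, minlength=n_movies)
    for _ in range(n_iter):
        s = np.bincount(user_idx, weights=r - MU - d[movie_idx], minlength=n_users)
        c = s / (cnt_u + lam_c)
        s = np.bincount(movie_idx, weights=r - MU - c[user_idx], minlength=n_movies)
        d = s / (cnt_m + lam_d)
    return c, d  # predict with clip(MU + c[i] + d[j], 1, 5)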
I gave several examples of how the regularization parameters λc, λd can be chosen by minimizing the validation error. If we specify a full probabilistic model with shared hyperpriors, we can estimate these regularization parameters from the training data in an empirical Bayes approach. If the probabilistic model is correct (close to the unknown model that generated the data) and if we have not oversimplified the parameter inference, accuracy on the test set should be close to the results of automatic parameter tuning.
We have decided on the basic form of the model: predictions will be made by the sum µ + ci + dj. If ci and dj are treated as random variables, different approaches can be used to infer their posterior distributions. One possibility is to use MCMC methods based on Gibbs sampling, where the parameters θi are cyclically sampled from their posterior distributions conditioned on the previously drawn values of all remaining parameters, p(θi|D, θ−i). This way samples from the joint distribution p(θ|D) are obtained, although the subsequent samples are not independent. Another method is the Variational Bayes approximation, where the parameters are split into groups that are assumed to be independent in the joint posterior distribution. With this constraint, minimizing the Kullback-Leibler divergence between the approximation and the true posterior using variational methods leads to approximating the posterior distribution of θi with: log p̂(θi|D) = C + E_{p̂(θ−i|D)} log p(θ|D). An advantage of the VB method over MCMC is that it needs many fewer iterations (because the end result of VB consists of probability distributions, unlike MCMC, which needs to average over generated samples), and it often converges faster.
Yet another possibility is to use alternating MAP estimation of the parameters. That approach worked well for the simple model of two biases (the already mentioned "MAP simultaneous" method), but, as we shall see later in this work, it has worse accuracy than MCMC and VB in more complex models (although, e.g. for SVD models, accuracy is largely improved by specially chosen priors that are not equal across all users or all movies).
I will examine the Variational Bayesian method and MCMC based on Gibbs sampling for our model of biases. The VB method reduces here to alternately estimating ci and dj by their expected values, while learning the joint priors in the hierarchical model. I assume a joint prior distribution of the ci in the form N(c, τc²), and an analogous prior distribution of the dj in the form N(d, τd²). To simplify, I will use maximum likelihood estimation for the parameters τc², τd², σ², that is, MAP estimation (maximum a posteriori) with flat priors (the parameters τc² and τd² can be called hyperparameters, and their prior distributions hyperpriors).
The user and movie parameters depend on each other, and learning them requires multiple iterations. We are interested in the distribution p(ci|D, d, τc², σ²), where the dj are either drawn from their current posterior distributions in the MCMC approach with Gibbs sampling, or fixed to the expected values Edj of the distributions p(dj|D, c, τd², σ²) in the VB approach. I assume here that τc², τd² and σ² are known constants, calculated from the approximated distributions or samples of ci and dj in previous iterations. The conditional posterior distribution p(ci|D, d) is Gaussian:

p(ci|D, d) ∝ p(D|ci, d) p(ci) = f(ci) = exp( C1 − Σ_{j∈Ji} (rij − µ − dj − ci)²/(2σ²) − (ci − c)²/(2τc²) )

= exp( C2 + ci · ( Σ_{j∈Ji} (rij − µ − dj)/σ² + c/τc² ) − (ci²/2) · ( |Ji|/σ² + 1/τc² ) )
With fixed d (dj sampled in MCMC, or dj fixed to Edj in VB) the posterior distribution p(ci|D, d) is N(Eci, Var ci) with the following parameters:

p(ci|D, d) = 1/√(2π Var ci) · exp( −(ci − Eci)²/(2 Var ci) )

Var ci = 1 / ( |Ji|/σ² + 1/τc² )    (10)

Eci = ( Σ_{j∈Ji} (rij − µ − dj)/σ² + c/τc² ) · Var ci    (11)
I assume an equal prior distribution N(c, τc²) for all ci. The parameters c, τc² of the prior distribution are point estimated with the empirical Bayes method, on the basis of the posterior distributions of the ci from the previous iteration of the algorithm:

c = (1/N) Σ_{i=1}^{N} Eci    (12)

τc² = (1/N) Σ_{i=1}^{N} ((Eci)² + Var ci) − c²    (13)
Var dj = 1 / ( |Ij|/σ² + 1/τd² )    (14)

Edj = ( Σ_{i∈Ij} (rij − µ − ci)/σ² + d/τd² ) · Var dj    (15)
The prior distribution of the dj was assumed in the form N(d, τd²). The parameters d, τd² are calculated as point estimates:

d = (1/M) Σ_{j=1}^{M} Edj    (16)

τd² = (1/M) Σ_{j=1}^{M} ((Edj)² + Var dj) − d²    (17)
The parameter σ² can be estimated with maximum likelihood, using the sampled ci and dj:

σ² = (1/|Tr|) Σ_{ij∈Tr} (rij − µ − ci − dj)²    (18)
The MCMC method with Gibbs sampling is conceptually simpler than Variational Bayes and has similar predictive accuracy, but it needs many more iterations of the algorithm. The MCMC method iterates between sampling ci according to (10),(11), sampling dj according to (14),(15), and re-estimating the hyperparameters c, d, τc², τd², σ² according to (12),(13),(16),(17),(18). Table 7 compares two implementations. In the first one, predictions are made by the expected value Erij = µ + Eci + Edj after 50 iterations of Gibbs sampling. In the second one, predictions are made by running the method for 50 additional iterations and averaging the samples (averaging the expectations leads to the same accuracy as the first method). After 50 + 50 iterations the following parametrization of the common prior distributions was obtained: c = 0.07, d = −0.30, τc² = 0.19, τd² = 0.26, σ² = 0.84, which corresponds to the regularization coefficients λc = σ²/τc² = 4.42 and λd = σ²/τd² = 3.23.
The Variational Bayesian method assumes independence of the ci and dj in the posterior distribution. The posterior of ci is approximated by log p̂(ci|D) = C + E_{p̂(d|D)} log p(ci, d|D), where p̂(d|D) is the approximation of the probabilities p(dj|D, c) calculated in the previous iteration of the algorithm. In the bias model the VB method boils down to putting Edj in place of the dj in equation (11), and Eci in place of the ci in equation (15), hence it is called "alternating expectation" in table 7. This method needed four iterations to converge, reaching RMSE_probe = 0.9829. The algorithm iterates between calculating the approximate posterior distributions of ci and dj and re-estimating the hyperparameters:
1. for (iter in 1..4)
2.   for (i in 1..480189) // loop over users
3.     update ci according to (10),(11)
4.   for (j in 1..17770) // loop over movies
5.     update dj according to (14),(15)
6.   update c, d, τc², τd², σ² according to (12),(13),(16),(17),(19)
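For illustration, below is a minimal numpy sketch of these alternating-expectation updates, with the same assumed arrays as in the earlier sketches. As a simplification it re-estimates σ² from the residuals of the expectations alone; a fuller VB treatment would also add the posterior variances to the σ² update.

import numpy as np

MU = 3.6033

def vb_biases(user_idx, movie_idx, r, n_users, n_movies, n_iter=4):
    Ec, Ed = np.zeros(n_users), np.zeros(n_movies)
    c = d = 0.0
    tau2_c = tau2_d = 0.1
    sigma2 = 1.0
    cnt_u = np.bincount(user_idx, minlength=n_users)
    cnt_m = np.bincount(movie_idx, minlength=n_movies)
    for _ in range(n_iter):
        # (10),(11): posterior of c_i with each d_j fixed at its expectation
        var_c = 1.0 / (cnt_u / sigma2 + 1.0 / tau2_c)
        s = np.bincount(user_idx, weights=r - MU - Ed[movie_idx], minlength=n_users)
        Ec = (s / sigma2 + c / tau2_c) * var_c
        # (14),(15): posterior of d_j with each c_i fixed at its expectation
        var_d = 1.0 / (cnt_m / sigma2 + 1.0 / tau2_d)
        s = np.bincount(movie_idx, weights=r - MU - Ec[user_idx], minlength=n_movies)
        Ed = (s / sigma2 + d / tau2_d) * var_d
        # (12),(13),(16),(17): empirical-Bayes estimates of the shared priors
        c, tau2_c = Ec.mean(), (Ec**2 + var_c).mean() - Ec.mean()**2
        d, tau2_d = Ed.mean(), (Ed**2 + var_d).mean() - Ed.mean()**2
        # simplified sigma^2 re-estimation from residuals of the expectations
        resid = r - MU - Ec[user_idx] - Ed[movie_idx]
        sigma2 = (resid**2).mean()
    return Ec, Ed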
Comparing the accuracy of the presented parametric methods, we can wonder why the
method “MAP simultaneous, opt.” gave the best accuracy. A probable reason is inaccuracy
of the assumed model.
The last method described is biases with nonparametric priors. A question arises whether the Gaussian distribution is the right choice for the prior distribution of the parameters ci, dj, and if not, whether it is worth modelling the deviation accurately. To verify whether the choice of Gaussian priors is correct, I assumed a prior in the nonparametric form of a Dirichlet process. We have enough tasks sharing the same prior (over 480 thousand ci and over 17 thousand dj) to make such an attempt at nonparametric modelling. The nonparametric method used is similar to those described in [Yu04, Tre04, Xue07]. I used similar methods in sections 4.2.2 and 4.2.5.
The nonparametric model is formed by changing the priors N(c, τc²) and N(d, τd²) to Dirichlet processes with concentration parameter zero (i.e. without choosing a hyperprior probability measure). A Dirichlet process (DP) is a generalization of the Dirichlet distribution to an infinite (of cardinality continuum) number of parameters, and can be understood as a distribution over probability distributions (probability measures). In this model there are no hyperparameters c, d, τc, τd. The parameter σ² remains, but its exact choice has little influence on the result of the algorithm, and it was set to σ² = 0.5.

A simplification made is assuming a grid of possible values. Our prior distribution in the form of a DP generates probability distributions over the points of the grid, one distribution for every task (for one DP prior the tasks are users, for the second prior the tasks are movies). The assumed grid of values for both priors consists of uniformly spaced points on the range [−5, 5], in increments of 0.02 (about 500 points). The method can be seen as repeatedly using a histogram to estimate the density of the priors for the hidden variables. The nonparametric method used here and in other places in this work (output modelling, priors for SVD with one feature) has its disadvantages, such as unclear behavior in the tails, but it was good enough for visualization and for assessing what would be a good parametric form of the priors.
There are multiple ways to realize the algorithm with DP priors. I decided on an approach where the values of ci and dj are sampled from their posterior distributions using the Metropolis-Hastings algorithm [Tar05, Mos95]. The likelihood of ci depends on the values of dj, and vice versa, so I use Gibbs sampling here, alternating between using the sampled dj to sample ci, and using the sampled ci to sample dj. We can say that in this nonparametric method the set of posterior samples from one iteration is the prior distribution in the next iteration.

Instead of sampling with replacement and obtaining independent samples from the prior, I used sampling without replacement (realized with the function random_shuffle). If, for example, some value appears 10 times in the prior distribution, it will appear exactly 10 times in the proposal set after one value is proposed for each user (or movie).
The pseudocode of the algorithm is as follows (the numeric constants are left unnamed to save space and improve clarity):
1. initialize c and d to random values from
   the discrete set [−5, −4.98, −4.96, −4.94, ..., 4.98, 5]
2. for (iter in 1..100)
3.   cnew = random_shuffle(c)
4.   for (i in 1..480189) // loop over users
5.     l1 = log_likelihood_c(ci, Ji)
6.     l2 = log_likelihood_c(cnew_i, Ji)
7.     if (exp(l2 − l1) > random())
8.       ci = cnew_i
9.   dnew = random_shuffle(d)
10.  for (j in 1..17770) // loop over movies
11.    l1 = log_likelihood_d(dj, Ij)
12.    l2 = log_likelihood_d(dnew_j, Ij)
13.    if (exp(l2 − l1) > random())
14.      dj = dnew_j
15. c = average of c samples from iterations >= 50
16. d = average of d samples from iterations >= 50
The function random_shuffle(c) returns a random permutation of the vector c, each permutation with the same probability. The function random() returns a sample from the uniform distribution U(0, 1).

The function log_likelihood_c(x, Ji) returns the log-likelihood (up to an additive constant) of ci = x given the observed data, assuming that the d parameters are fixed. Because we assume that the data come from the distribution N(µ + ci + dj, σ²),

log_likelihood_c(x, Ji) = −(1/(2σ²)) Σ_{j∈Ji} (rij − µ − x − dj)²

I assumed σ² = 0.5 (a more accurate choice of σ² does not make a visible difference to the results). In the implementation it is necessary to prevent overflow of exp(l2 − l1), for example, by clipping l2 − l1 to the range [−10, 10].
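A vectorized Python sketch of one pass of this sampler follows. Within a pass, each bias's likelihood depends only on the other (fixed) bias vector, so the per-entity accept/reject steps can be vectorized; the sum of squared residuals is expanded into per-entity sufficient statistics. Names and data layout are illustrative, as in the earlier sketches.

import numpy as np

MU, SIGMA2 = 3.6033, 0.5

def mh_shuffle_pass(vals, idx, resid, n_entities, rng):
    """One shuffle-proposal Metropolis-Hastings pass over one bias vector.
    vals: current bias per entity; idx: entity index per rating;
    resid: r - mu - other_bias per rating (the other biases stay fixed)."""
    prop = rng.permutation(vals)
    s = np.bincount(idx, weights=resid, minlength=n_entities)  # sum of residuals
    n = np.bincount(idx, minlength=n_entities)                 # rating counts
    # log-lik difference: -1/(2 s2) sum (res - x)^2 depends on x via -2xs + nx^2
    dll = (2 * s * (prop - vals) - n * (prop**2 - vals**2)) / (2 * SIGMA2)
    accept = np.exp(np.minimum(dll, 0.0)) > rng.random(n_entities)
    return np.where(accept, prop, vals)

Alternating mh_shuffle_pass for c (with resid = r − MU − d[movie_idx]) and for d, and averaging the samples from iteration 50 onwards, reproduces the pseudocode above.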
Now let's visually examine the learned prior distributions in the MCMC models with Gibbs sampling, with a parametric prior (Gaussian distribution) and with a nonparametric prior (Dirichlet process). The six plots in figure 26 visualize the learned priors. Plots A) and B) are histograms of the samples of c and d coming from the learned bias model with Gaussian priors. For comparison, the prior Gaussian distributions are marked with a dashed line. Plots C) and D) are analogous histograms of the samples of c and d in the nonparametric method. E) and F) are quantile-quantile plots, on which the distributions of c and d obtained from the parametric and nonparametric methods, after normalization (centering and dividing by the sample standard deviation), are compared with the standard normal distribution.
We see on the plots that there are places where the learned prior distributions differ from a Gaussian distribution. The most visible differences are that the distribution of c has a longer tail than a Gaussian on the negative side, while the distribution of d has a shorter tail than a Gaussian on its positive side. The second observation is that no large difference is visible in the posterior distributions between using a parametric Gaussian prior and a nonparametric prior, therefore correcting the Gaussian assumption should not make a large difference (but the automatic optimization of the regularization parameters suggests that something can still be improved; maybe another shape of the prior for movies would make a difference – this possibility was not investigated).
Figure 26: Learned priors of c and d in the parametric and nonparametric methods

Let's sum up the simplifications made in the methods used. I assumed shared Gaussian priors for the ci and dj, and treated the output rating as a Gaussian variable (the output choice was discussed in section 4.2.2). I assumed no important difference between the training and test sets (an issue mentioned in section 4.2.1). I assumed exchangeability of users and exchangeability of movies, and I applied an approach which can be understood as hierarchical Bayesian modelling, multitask learning, or empirical Bayes (the boundaries of those techniques overlap). The assumption of exchangeability (an equal prior) is inaccurate – we saw in section 3.5 that the average rating decreases with user frequency, but increases with movie frequency. The parameters of the priors (hyperparameters) were point
estimated in the empirical Bayes setting by maximum likelihood, instead of, for example, choosing hyperpriors and inferring the posterior distribution of the hyperparameters. Similarly, the parameter σ² was point estimated by maximum likelihood. I assumed equal variance σ² for all ratings; accuracy can be improved by modelling the variance individually for users and movies [Tom07]. A small simplification was that the VB method assumed independence of the groups of parameters ci and dj in the joint posterior distribution. There were also many simplifications in the nonparametric method used.
It turned out that the best prediction accuracy was obtained by alternating MAP estimation with regularization parameters optimized by automatic tuning ("MAP simultaneous opt."), which gave RMSE_probe = 0.9825. This method is capable of correcting some inaccuracies of the chosen probabilistic model, but the probabilistic methods were not much worse: alternating expectation and the two Gibbs sampling methods gave RMSE_probe between 0.9829 and 0.9831. A similarly good result, RMSE_probe = 0.9828 ("MAP simultaneous"), was obtained using regularization parameters λ taken from the single-bias models, thus optimal for other (similar) models. Summarizing, all five variants of parametric methods with simultaneous learning of biases turned out to have good accuracy here.
What does it mean that a movie is the best on average, or the best for an average user? Are all users equally important? Do they have the same weight in the averaging process? When is one movie better than another?
Let's simplify the task even more. Assume that we consider one user and decide between two movies for him. Suppose for a moment that ratings are continuous. Assume that the rating of one movie is drawn from a known distribution N(m1, s1²), and the rating of the second from N(m2, s2²). We can use here the framework of decision theory. The decision is to choose one of the movies, and each choice is evaluated by a utility function that tells how the user values each rating distribution N(m, s²). How do we choose the right utility function for the given user? If one distribution has a larger mean than the second, but also a much larger variance, should it be preferred? Clearly, a proper utility function will differ between users. Different users can tolerate different levels of uncertainty about the predicted item rating. One user may want high precision, that is, to be sure that on the list of top items there will be few mistakes – items that turn out to be weak (e.g. have a low average once they are rated by more people). Another user may expect high recall, that is, if there is a good movie somewhere, it should not be left off a list of top items, even if its predicted rating is uncertain because, for example, it was computed from few gathered user ratings (these are different definitions of precision and recall than the usual definitions for a binary indicator of relevance). Similar dilemmas were mentioned in section 3.1, in the discussion of balancing a recommender system between recommending popular, proven content and long-tail content, and in section 3.4 "Evaluation". The user's decision is psychological, and to make a reasonable choice of utility function one could conduct surveys on a group of users and then, with machine learning techniques, extrapolate the results to all users. I do not have this kind of additional data, so my analysis has to be simplified and based on the data I have.
The analyzed situation becomes more complicated when comparing a large set of candidate movies. If I want to test whether a given movie is the best in a set, I must compare a Gaussian distribution with the maximum of multiple Gaussian distributions, which has the form of an extreme value distribution (it should not make much difference if we treat this maximum as a fixed threshold, ignoring its uncertainty). Such a test is performed for all movies, creating a situation of multiple comparisons, and hence greatly amplifying the influence of the uncertainty of predictions. Considering whole lists instead of only the one top movie complicates the problem further. Finally, a large issue in determining top movies is the missing data distribution, that is, what causes a movie to be watched and rated – because this can significantly influence the observed average rating. Various measures evaluating rankings have been proposed that can be used to evaluate lists of top items (see also sections 3.1 and 3.4), such as mean average precision or NDCG, and one can search for lists optimizing the chosen measure. Those measures have the disadvantage that optimizing them directly has large computational complexity, and the larger disadvantage that they use only the training data, ignoring the missing data mechanism. In real-world tasks we can usually find ways to bypass computational complexity issues, and it turns out that in practice simplified approaches work well enough: a way to evaluate the quality (score) of a single item is chosen, and items are sorted according to this score. Because in this work I describe methods that minimize RMSE, I am especially interested in indirect approaches based on rating prediction (as we shall see, besides the expected ratings of items, obtaining good quality lists of top-k items will also require assessing the uncertainty of predictions). Yet another issue to consider is what to do with similar items on a top list – whether they are desirable or undesirable.
Let's look first at the disadvantages of the simplest methods for calculating top rankings. One of the easiest ways to calculate a top ranking is to sort movies simply according to
Table 8: Ranking by arithmetic mean: top 15 movies.
Title | Avg. of ratings (score) | Count of ratings | Freq. rank
1. Lord of the Rings: The Return of the King: Extended Edition 4.723 72600 306.
2. Lord of the Rings: The Fellowship of the Ring: Ext. Ed. 4.716 72274 303.
3. Lord of the Rings: The Two Towers: Extended Edition 4.702 73630 295.
4. Lost: Season 1 4.678 5758 2522.
5. Battlestar Galactica: Season 1 4.669 1436 5603.
6. Fullmetal Alchemist 4.597 1565 5774.
7. The Shawshank Redemption: Special Edition 4.593 137812 51.
8. Ghost in the Shell: Stand Alone Complex: 2nd Gig 4.586 174 12618.
9. Trailer Park Boys: Season 4 4.583 24 17761.
10. The Simpsons: Season 6 4.577 7967 2308.
11. Tenchi Muyo! Ryo Ohki 4.576 85 17183.
12. Veronica Mars: Season 1 4.570 1049 6490.
13. Lord of the Rings: The Return of the King: EE: Bonus Mat. 4.563 119 15627.
14. Arrested Development: Season 2 4.559 5763 2674.
15. Trailer Park Boys: Season 3 4.559 68 17522.
average of top 15 4.617 25355 7136
the arithmetic mean of the item's ratings in the training set: score_j = (1/|Ij|) Σ_{i∈Ij} rij. Table 8 shows the top 15 items by this criterion for the Netflix Prize data. We see that sorting by the average rating is not a satisfactory solution: the DVD "Trailer Park Boys: Season 4", with only 24 ratings, is in 9th place in the ranking.
Models with a movie bias allow for another global ordering of movies. For example, in the Bayesian model with user and movie biases µ + ci + dj described earlier, the posterior distribution of the per-movie variable dj is Gaussian with expected value Edj and variance Var dj. Let's see what happens if we sort movies according to Edj. In comparison to sorting by the average rating, the Edj estimates are regularized (here, shrunk towards zero) and account for user bias. The resulting ranking is shown in table 9. The last column lists the tripled standard deviation as a measure of uncertainty. If dj is drawn from the N(Edj, Var dj) distribution, dj lies in the range Edj ± 3√(Var dj) with probability larger than 99.7%.
a TV series with an average rating of 9.3/10 on IMDb (update: 8.6/10 as of 2012), but with so few ratings in the Netflix Prize dataset it should not be in the top 15 here.
A better solution is to take into account the posterior standard deviation of the learned parameter dj. We can suspect that a good scoring criterion used for sorting should reward a high mean rating but penalize a high variance of the prediction, because high variance increases the risk of making a bad recommendation for a user. We can try the following score function: score_j = Edj − 3√(Var dj) (see section 3.4 "Evaluation" for some justification of this formula). Then, if dj is actually drawn from N(Edj, Var dj), dj > score_j with probability larger than 99.8%. Table 10 shows the ranking sorted by score_j. The resulting ranking has visibly better precision than the two previous ones. The DVD "Veronica Mars: Season 1" has the fewest ratings: 1049.
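A minimal sketch of this variance-penalized ranking follows, taking Ed and var_d from, e.g., the Gibbs or VB bias model sketched earlier; titles is an assumed list of movie names.

import numpy as np

MU = 3.6033

def top_k_by_score(Ed, var_d, titles, k=15, alpha=3.0):
    """Sort movies by E d_j - alpha * sqrt(Var d_j) and return the top k."""
    score = Ed - alpha * np.sqrt(var_d)
    order = np.argsort(-score)[:k]
    return [(titles[j], MU + score[j]) for j in order]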
Table 10: Ranking by expected movie bias corrected by triple standard deviation, Edj − 3√(Var dj): top 15 movies.

Title | Avg. of ratings | Cnt. of ratings | Freq. rank | µ + Edj | 3√(Var dj) | µ + Edj − 3√(Var dj) (score)
1. LotR: The Return of the King: EE 4.723 72600 306. 4.673 0.010 4.663
2. LotR: The Fellowship of the Ring: EE 4.716 72274 303. 4.668 0.010 4.658
3. Lost: Season 1 4.678 5758 2522. 4.690 0.036 4.654
4. LotR: The Two Towers: EE 4.702 73630 295. 4.654 0.010 4.643
5. Battlestar Galactica: Season 1 4.669 1436 5603. 4.690 0.073 4.618
6. Arrested Development: Season 2 4.559 5763 2674. 4.629 0.036 4.593
7. The Sopranos: Season 5 4.532 20196 1200. 4.576 0.019 4.557
8. The Shawshank Redemption: SE 4.593 137812 51. 4.564 0.007 4.557
9. Veronica Mars: Season 1 4.570 1049 6490. 4.633 0.085 4.548
10. The Simpsons: Season 6 4.577 7967 2308. 4.567 0.031 4.536
11. Band of Brothers 4.512 36850 694. 4.544 0.014 4.530
12. The West Wing: Season 3 4.469 6433 2667. 4.543 0.034 4.509
13. The Simpsons: Season 5 4.542 17069 1423. 4.524 0.021 4.503
14. The Godfather 4.504 105707 130. 4.507 0.008 4.498
15. Seinfeld: Season 3 4.499 9084 2162. 4.527 0.029 4.498
average of top 15 4.590 38242 1922 4.312 0.028 4.571
Preparing lists of top items is a task commonly encountered in practice. I will now describe a method used currently (in 2011) by IMDb (The Internet Movie Database) to calculate the top 250 movies based on users' ratings. This method is also used on several other websites that gather ratings. The IMDb method scores items as follows (only ratings of regular voters are used):

score_j = (Nj/(Nj + λ)) · (1/Nj) Σ_{i: rij∈Rj} rij + (λ/(Nj + λ)) · µ = µ + (1/(Nj + λ)) Σ_{i: rij∈Rj} (rij − µ)

Movies with the number of ratings Nj < λ are discarded and do not appear in the top-K list.
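A minimal sketch of this IMDb-style score, assuming per-movie rating sums and counts as numpy arrays:

import numpy as np

def imdb_score(sum_r, n, mu, lam=1500.0):
    """Weighted score (sum_r + lam*mu)/(n + lam) = mu + sum(r - mu)/(n + lam);
    movies with fewer than lam ratings are discarded."""
    score = (sum_r + lam * mu) / (n + lam)
    return np.where(n < lam, -np.inf, score)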
In the version of the scoring used to calculate the top 250 movies on IMDb, the parameter λ was set to 1500. The algorithm can be understood as assuming a common prior distribution for all movies, and MAP point estimation for every movie. A large value of λ corresponds to a strong a priori expectation that a rating will be close to the global average µ (in other words, the variance of the prior distribution is small). The chosen strong prior makes it difficult to manipulate the ranking with shilling-type attacks, in which multiple users agree to rate a chosen item high or low. I can speculate that λ in the IMDb algorithm is artificially high to counteract both the uncertainty of the MAP estimates (selecting movies for the top-K is a multiple comparisons task – we have multiple chances to make a mistake) and the possibility of ratings not being independent (as in shilling attacks, whose effects the strong prior is meant to prevent).
Table 11 shows the result of applying the IMDb sorting to the Netflix Prize data. We see that to get onto the top 15 list in the IMDb method of ranking, a movie must not only have a large average rating, but also has to be very popular. The least frequently rated movie on the top 15 list is "Band of Brothers", with 36,850 ratings.
Some machine learning methods (and in particular a large part of the collaborative filtering methods described in this work) can approximate the expected value but do not give an estimate of the variance, so it may be useful to have a way of calculating a good quality ranking of items from the expected values alone, without accurate estimates of the variances. For a movie bias dj, the standard deviation of the posterior distribution should be roughly inversely proportional to the square root of the number of observations (ratings given to movie j). I correct the ranking in table 8 by subtracting C · Nj^{−0.5} from the average ratings. For large C the top 15 list becomes very similar to the list made by the IMDb method. Table 12 shows the ranking for C = 40.
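And a sketch of this correction computed from the averages alone:

import numpy as np

def corrected_average(sum_r, n, C=40.0):
    """Average rating penalized by C/sqrt(N_j); C = 40 corresponds to table 12."""
    return sum_r / n - C / np.sqrt(n)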
The IMDb approach (table 11) corrects the average rating by an O(1/Nj) term, where Nj is the number of ratings of movie j, and the method in table 12 corrects the average rating by an O(1/√Nj) term, a crude estimate of the difference between the posterior standard deviations of items. The difference between these two ways of correcting the average seems large, but the resulting rankings in tables 11 and 12 are almost identical (only places 12 and 13 are swapped). Both approaches use scores that with very high probability bound the true unknown average (for the data distribution) from below. The precision (understood as the chance of the top 15 containing no mistakes – no items that a user would rate low) is higher than with the correction by tripled standard deviation used earlier in table 10, at the cost of skipping items with larger but uncertain predicted means.
The structure of which movies get rated (the structure of the missing data) influences the observed statistics, such as the observed average rating. If randomly selected users were forced to rate a given movie (or better, forced to watch and rate it), the movie average would be much lower than the observed one (see [Mar08]). It makes a difference whether an item is exposed to a group of users who like it or to a group of users who do not. The observed rating average depends on the website traffic patterns, the distribution channel (movie theater, TV, DVD or VOD), item recurrence (a single movie, a movie sequel, a TV series), and so on.
Table 12: Ranking by arithmetic mean corrected by a multiple (C = 40) of 1/√Nj: top 15 movies.

Title | Avg. of ratings r̄ | Count of ratings | Freq. rank | C·Nj^{−0.5} | r̄ − C·Nj^{−0.5}
1. LotR: The Return of the King: EE 4.723 72600 306. 0.148 4.574
2. LotR: The Fellowship of the Ring: EE 4.716 72274 303. 0.149 4.568
3. LotR: The Two Towers: EE 4.702 73630 295. 0.147 4.554
4. The Shawshank Redemption: Special Ed. 4.593 137812 51. 0.108 4.485
5. LotR: The Return of the King 4.546 133597 64. 0.109 4.436
6. Star Wars: Ep. V: The Empire Strikes Back 4.544 91187 192. 0.132 4.412
7. Raiders of the Lost Ark 4.504 117456 93. 0.117 4.387
8. The Godfather 4.504 105707 130. 0.123 4.381
9. Star Wars: Ep. IV: A New Hope 4.505 84480 232. 0.138 4.367
10. LotR: The Two Towers 4.461 150676 30. 0.103 4.357
11. Schindlers List 4.458 100518 155. 0.126 4.332
12. LotR: The Fellowship of the Ring 4.434 147932 33. 0.104 4.330
13. Star Wars: Ep. VI: Return of the Jedi 4.461 88041 215. 0.135 4.326
14. Finding Nemo (Widescreen) 4.415 139050 47. 0.107 4.308
15. Band of Brothers 4.512 36850 694. 0.208 4.304
average of top 15 4.539 103454 189 0.130 4.408
For example, if someone rates all seasons of a TV series, he is likely a fan of the series – if someone saw the first season and did not like it, it is unlikely that he saw and rated the remaining ones, and in effect ratings for TV series are usually higher than movie ratings (this effect is visible, for example, among the few thousand most frequently rated movies and TV series on IMDb). A similar effect can be spotted in movie sequels; for example, it may explain why "The Bourne Ultimatum" (2007), the third part of the Bourne trilogy, has the highest average rating of the three movies on IMDb. We see that there are justified doubts about the accuracy of top-K lists that do not model the structure of the missing data in any way. To create a more accurate list of top-K items, we should consider a problem such as: what would the average rating be if all users watched and rated the given movie? A few models described later in this work model the structure of the missing data to some degree, for example, Conditional RBM [Sal07a] or SVD++ [Bel07e], but those methods were optimized to predict ratings from the training data distribution.
Evaluating and comparing whole top lists can be useful, and from that perspective we see that accurate evaluation of the top items (and their ordering) is more important than accurate evaluation of items with likely low ratings. In [Kor08] the following ranking-based method of assessing the quality of top-K recommendations was proposed: for each of the 384,573 five-star ratings in the Netflix probe set a prediction is calculated and compared with the predictions for 1000 random movies to be rated by the same user. The 1000 + 1 items are sorted by predicted rating, and the resulting rank (percentile) of the known relevant item in the set is stored. After gathering the predicted values for all 384,573 relevant observations, we get a distribution of percentiles, which can be compared between algorithms. I used the evaluation criterion of [Kor08] to compare rankings for the model of two biases (the model used to produce the ranking in table 10), for different choices of α, when scoring items by Edj − α√(Var dj). The top-1.5% accuracy (top 15 of 1001) was 14% for α = 0 (the ranking in table 9), 16% for α = 3 (the ranking in table 10), and the optimal α was about 400, with 33% top-1.5% accuracy, that is, over 2.5 times larger probability of entering the top 1.5% than when sorting by the expected rating. The optimal coefficient is large (α ≈ 400) likely because selecting a top-K set among N items with uncertain scores (described by posterior rating distributions) is a multiple comparisons situation (many items have a chance to exceed the score threshold needed to enter the top-K list).
Similarly large differences in ranking accuracy should appear when applying variance-based corrections of different sizes to the more complex algorithms described later in this work. This was the only experiment I performed with evaluating rankings. In the rest of the work I focused on the well defined task of predicting the expected rating, and assumed that there exist efficient ways to adapt the resulting algorithms to calculate good quality personalized recommendations.
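A minimal sketch of this evaluation procedure follows; predict(i, j) stands in for any rating predictor, and five_star_pairs for the relevant (user, movie) probe pairs. Names are illustrative.

import numpy as np

def percentile_ranks(five_star_pairs, n_movies, predict, rng, n_random=1000):
    """For each relevant pair, the fraction of 1000 random movies whose
    prediction is at least as high; a lower percentile is better."""
    ranks = []
    for i, j in five_star_pairs:
        others = rng.integers(0, n_movies, size=n_random)
        p_rel = predict(i, j)
        p_others = np.array([predict(i, m) for m in others])
        ranks.append((p_others >= p_rel).mean())
    return np.array(ranks)

# approximate top-1.5% accuracy (top 15 of 1001):
# (percentile_ranks(...) <= 0.015).mean()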
The above method shares the disadvantages of other commonly used ranking-based evaluation measures, such as MAP (mean average precision – the average precision at the rank of each relevant item), ATOP (area under the top-k recall curve) [Ste10b], NDCG (Normalized Discounted Cumulative Gain), or relaxed versions of these measures [Yue07]. These typically used measures ignore the missing data structure, and direct optimization of ranking-based measures may have too large a computational complexity [Wei07] (simplifications can be useful, such as those based on neural networks [Bur05, Ren09, Jah11b]). Attempts to modify the ranking-based measures to account for missing data have not been completely satisfying so far; for example, the task in Track 2 of the KDD Cup 2011 [Dro11] is to distinguish high ratings from specially sampled missing data, a criterion that, in my opinion, puts too much weight on predicting the missing data (which does not have to be closely connected to user likes and dislikes), while ignoring the valuable information contained in low ratings.
The amount of similarity between items on top-K lists should also be evaluated. Filling a short list of top-K items (or a list of personalized recommendations) with very similar items is undesirable, and the scoring function should penalize the presence of similar items. We could design a criterion for evaluating lists that penalizes dependencies between prediction errors, for example by rewarding a high probability that at least one item from the proposed list will enter the true top-K list. Optimizing such a joint criterion may have too large a complexity, so a better idea may be to decide on a simpler heuristic solution: removing an item from the calculated top list if it is similar to another item higher on the list. Another heuristic solution (mentioned in [Pat07]) is to calculate a clustering of items, and allow only items from different clusters on a top-K list. A related requirement is that some items should not be recommended before other items are rated by the user; for example, a sequel typically should not be recommended before the first movie, and subsequent seasons of a TV series should not be recommended before the previous seasons are rated (watched).
Another issue in practical applications is that a system of personalized recommendations, and even a simple top item list, is a system with a feedback loop: the displayed lists of items depend on users' ratings, but the lists are part of website navigation, and which items are rated by a user depends on which items are displayed. The presence of this loop creates a risk that if we concentrate too much on the precision of recommendations, not allowing for discovery of very good items, then new items will not have a chance to get many ratings. In recommender systems based on implicit feedback, like clickstream data, there is yet another risk: the recommendation lists can become polluted by items that became popular accidentally, because they were exposed in website navigation or are inaccurately described, and whose popularity was further amplified by the recommender system (or by reaching a list of top items). In clickstream-only systems a user does not have an easy way to indicate that he does not like a recommended item.
In summary, the subject of learning to rank is broad and, looking at applications, connected to human-computer interaction, interface design, psychometrics, advertising research, search, and other domains. The main focus of this work is rating prediction, so I stop at the above preliminary analysis of the problem of ranking items. A more complete analysis would require more data, such as proper surveys or elicitations, which would tell us more about the user's perspective.
The most important conclusion from the simplified analysis is that to calculate a good quality list of top items (or a list of recommendations) it is not enough to estimate the expected rating of items. It is necessary to take into account the uncertainty of the prediction, which decreases with an increasing number of collected ratings for an item. The proposed correction has the form of decreasing the expected rating by a padding dependent on the predicted uncertainty of a rating. Two types of correction worked well: by a factor of O(1/√Nj) and by a factor of O(1/Nj), where Nj is the number of ratings of movie j. Which sorting criterion is the right one depends on how we evaluate errors in top-K lists (how we define our loss function). In this work I advocate using a heuristic: sorting by the predicted expected rating corrected by subtracting C/√(Nj + λ) (the form of the correction was to some extent justified in section 3.4 "Evaluation"). The amount of correction should be tuned to find the right balance between recommending popular, trustworthy content, and allowing for content discovery among new items or niche items from the "long tail" (items with few ratings). The amount of correction should be user-specific, because users can have different expectations about taking risks, but for simplicity we can use one global constant C. Because selecting a few items from a set is a multiple comparisons situation, the amount of correction C grows with the number of compared items – a different sorting criterion is needed when we calculate a top-10 list among 50 items, among 1000 items, or among a million items.
Other issues to consider are understanding and correcting for the missing data structure, which can contain artifacts caused by the interaction of recommendations with website navigation, amplified by loopback effects. Which groups of items are exposed to which groups of users influences the observed average rating. A solution ensuring diversity of the lists can also be needed.
Compared to selecting the global top-K by average rating, in my judgement the most important corrections are, in order of importance:
1. corrections for the uncertainty of predictions (posterior variance),
2. corrections for the missing data structure (these can be more important in other datasets, depending on how the ratings are gathered),
3. regularized estimates (the choice of priors in the Bayesian approach; regularization becomes more important in the algorithms for personalized recommendations).
The parameter vjk denotes the level of the k-th feature of movie j, and can be understood as an automatically learned movie genre. The parameter uik denotes the level of preference of user i for the k-th movie genre. Such automatically learned movie genres do not necessarily have much in common with the genres named by human experts, like action movie, comedy, or horror. A learned genre can be a combination of many named genres, or it may express some common trait of movies that has not yet been named.

The biases ci and dj are a variant of global effects [Fun06, Bel07b, Pot08, Tos08b]. Usually slightly better accuracy was obtained by treating the biases ci and dj as a component of the model [Pat07, Tak07a, Tak07b, Tak09a, Bel08] than by learning the biases separately from the main model, in a preprocessing phase of removing global effects.
This section examines a simplified version of the above model, containing only one feature (hidden genre):

r̂ij = µ + ci + dj + ui vj

I will compare different possible ways of learning the parameters from the perspective of predictive accuracy. I will reflect on what the best form of the prior distribution for the parameters is. At the end I will summarize experiments with nonparametric modelling of a two-argument function connecting user preferences with a movie feature; their goal was to verify whether multiplying ui·vj in SVD is the right choice for expressing the hidden relationship.

Note that the general problem of calculating an approximate SVD with missing data is NP-hard, even for a one-dimensional approximation [Gil10], but in practice, for the Netflix data, regularized SVD methods with the right choice of optimization method seem to converge to solutions close to global minima.
Table 13 shows a comparison of accuracy for different methods of learning the param-
eters c, d, u, v. In these experiments the entire probe set was used as the test set. Let’s
describe the methods listed in the table.
The first listed method, "SVD gradient descent, MAP biases once", is regularized SVD (with one feature) based on RMSE cost function minimization with regularization. The method was proposed in [Fun06], with the biases learned as preprocessing before learning the ui, vj parameters. The regularized cost function to minimize is the following:

l = Σ_{ij∈Tr} (rij − µ − ci − dj − ui vj)² + Σ_i (λu/2) Ni ui² + Σ_j (λv/2) Nj vj² + Σ_i (λc/2) ci² + Σ_j (λd/2) dj²
The parameters ui, vj are learned by following the negated first derivative of the cost function, resulting in the following updates:

resij = rij − µ − ci − dj − ui vj
ui += lrate · (resij · vj − λu · ui)
vj += lrate · (resij · ui − λv · vj)

The constant parameters were fixed to the values proposed in [Fun06]: lrate = 0.001, λu = λv = 0.02. The biases were optimized once as preprocessing, with constant regularization λc = λd = 5.0.
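For reference, a minimal sketch of this training loop follows (a sketch of the method of [Fun06], not the original implementation); the biases c, d are assumed precomputed as in the earlier sketches.

import numpy as np

MU = 3.6033

def svd_one_feature(user_idx, movie_idx, r, c, d, n_users, n_movies,
                    lrate=0.001, lam=0.02, n_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    u = 0.1 * rng.standard_normal(n_users)   # user feature
    v = 0.1 * rng.standard_normal(n_movies)  # movie feature
    for _ in range(n_epochs):
        for i, j, rij in zip(user_idx, movie_idx, r):
            res = rij - MU - c[i] - d[j] - u[i] * v[j]
            ui = u[i]  # keep the old value so both updates use it
            u[i] += lrate * (res * v[j] - lam * u[i])
            v[j] += lrate * (res * ui - lam * v[j])
    return u, v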
Accuracy is improved by treating the biases ci, dj as part of the model ("SVD gradient descent, MAP biases simul.") and learning them simultaneously with ui, vj [Pat07, Tak07a, Tak07b] ([Pat07] used a special regularization of the biases, (Ni + Nj)·λcd·(ci + dj)², chosen experimentally among several alternatives).
The next group of methods are approximations of Bayesian inference of the parameters, which are now treated as random variables. A prior distribution of the parameters θ = (ui, vj, ci, dj) is assumed and we want to calculate the posterior distributions of
Table 13: One-feature SVD with biases. Comparison of experimental results.

Method | Parameters | Iterations | RMSE
SVD grad. descent, MAP biases once | lrate = 0.001, λuv = 0.02, λcd = 5.0 | 100 | 0.9625
SVD grad. descent, MAP biases simul. | lrate = 0.001, λuv = 0.02, λcd = 5.0 | 100 | 0.9531
SVD Bayes., Gibbs, avg. expectation | τv² = 1 | 10 + 40 | 0.9520
SVD Bayes., Gibbs, avg. of samples | τv² = 1 | 10 + 40 | 0.9524
SVD Variational Bayesian | τv² = 1 | 20 | 0.9510
SVD Variational Bayesian | τu² = 1 | 20 | 0.9510
SVD MAP, opt. | λcd = 5, λu = (4.0, 0.04), λv = (6.0, 0.011) | 10 | 0.9511
SVD gradient descent w/opt. param. | lrate = 0.001, λcd = 5, λu = (4.0, 0.04), λv = (6.0, 0.011) | 100 | 0.9518
SVD gradient descent, reg. |ui|³, |vj|³ | lrate = 0.002, λcd = 5, λu = 1.5·(4.0, 0.04), λv = 1.5·(6.0, 0.011) | 100 | 0.9515
SVD MAP, regularization |ui|³, |vj|³ | λcd = 5, λu = 1.5·(4.0, 0.04), λv = 1.5·(6.0, 0.011) | 10 | 0.9512
SVD Bayesian, nonparam. prior | – | 20 + 50 + 50 | 0.9578
Nonparametric rel., avg. samples | – | 100 + 100 | 0.9498
these parameters according to Bayes' rule: p(θ|D) ∝ p(D|θ)p(θ), where D is the available data (the observations) and θ is the vector of all model parameters. Direct calculation by Bayes' rule has too large a computational complexity to be applicable in practice, and it is necessary to use approximations. Two kinds of approximation were used here for parameter inference in the one-feature SVD model: the first is MCMC with Gibbs sampling [Har07, Tom07, Sal08], and the second is Variational Bayes [Lim07, Rai07].

In the probabilistic model used, the prior distributions of the parameters are independent Gaussians, parameterized differently for each group of parameters c, d, u, v. We can say that the model has the property of exchangeability of users (users are indistinguishable a priori, before seeing the data) and exchangeability of movies.
Var ci = 1 / ( |Ji|/σ² + 1/τc² )        Var dj = 1 / ( |Ij|/σ² + 1/τd² )

Eci = ( Σ_{j∈Ji} (rij − µ − dj − ui vj)/σ² + c/τc² ) · Var ci

Edj = ( Σ_{i∈Ij} (rij − µ − ci − ui vj)/σ² + d/τd² ) · Var dj
The parameters c, d, τc², τd² are estimated, similarly to section 4.2.3, by maximum likelihood from the posterior samples of ci, dj (the method can be called Monte Carlo Empirical Bayes):

c = (1/N) Σ_{i=1}^{N} ci        τc² = (1/N) Σ_{i=1}^{N} (ci − c)²

d = (1/M) Σ_{j=1}^{M} dj        τd² = (1/M) Σ_{j=1}^{M} (dj − d)²
The conditional posterior distribution for ui in the Gibbs sampling has the form:

p(ui|D, vj, ci, dj) ∝ p(D|ui)p(ui) = f(ui) = exp( − Σ_{j∈Ji} (rij − µ − ci − dj − ui vj)²/(2σ²) − (ui − u)²/(2τu²) )

= exp( C + ui · ( Σ_{j∈Ji} (rij − µ − ci − dj)·vj/σ² + u/τu² ) − (ui²/2) · ( Σ_{j∈Ji} vj²/σ² + 1/τu² ) )
We see that, with all remaining parameters fixed, the posterior distribution of ui is
Gaussian: (u − Eu )2
i i
p(ui |D, vj , ci , dj ) ∝ exp −
2 V ar ui
1
V ar ui = P 2
( j∈Ji vj )/σ 2 + 1/τu2
P
j∈Ji (rij − µ − ci − dj ) u
Eui = + ∗ V ar ui (20)
σ2 τu2
The hyperparameters ū, τ_u² of the common prior distribution of the u_i’s are estimated, similarly to the hyperparameters of c and d, by maximum likelihood point estimation on a set of posterior samples of the u_i’s:

$$\bar u = \frac{1}{N}\sum_{i=1}^{N} u_i \qquad \tau_u^2 = \frac{1}{N}\sum_{i=1}^{N}(u_i - \bar u)^2$$
The formulas for v_j are analogous:

$$\mathrm{Var}\,v_j = \frac{1}{\left(\sum_{i \in I_j} u_i^2\right)/\sigma^2 + 1/\tau_v^2}$$

$$E v_j = \left(\frac{\sum_{i \in I_j}(r_{ij} - \mu - c_i - d_j)\,u_i}{\sigma^2} + \frac{\bar v}{\tau_v^2}\right) \cdot \mathrm{Var}\,v_j \qquad (21)$$
The hyperparameters v̄, τ_v² are point estimated from the sampled v_j values:

$$\bar v = \frac{1}{M}\sum_{j=1}^{M} v_j \qquad \tau_v^2 = \frac{1}{M}\sum_{j=1}^{M}(v_j - \bar v)^2$$
The parameter σ² is estimated using the sampled c_i, d_j, u_i, v_j:

$$\sigma^2 = \frac{1}{|Tr|}\sum_{ij \in Tr}(r_{ij} - \mu - c_i - d_j - u_i v_j)^2$$
Alternating Gibbs sampling of c_i, d_j, u_i, v_j was run for 50 iterations, with the parameter τ_v = 1 fixed. The first 10 iterations were skipped as a “burn-in” phase, and predictions were made in two ways. The first was averaging the sampled c_i + d_j + u_i v_j over the last 40 iterations, which gave RMSE_probe = 0.9524. The second was predicting by the average of the expected outputs, Ec_i + Ed_j + Eu_i·Ev_j, over the last 40 iterations, which gave RMSE_probe = 0.9520.
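For concreteness, a minimal sketch of one Gibbs sweep over the user parameters, following equation (20); the per-user data layout and the variable names are my assumptions, not the original implementation:

import numpy as np

def gibbs_update_u(u, v, c, d, mu, sigma2, u_bar, tau_u2, user_items):
    """One Gibbs sweep over users: sample u_i from its Gaussian
    conditional posterior (equation (20)), all other parameters fixed."""
    for i, items in enumerate(user_items):        # items: list of (movie j, rating r)
        js = np.array([j for j, _ in items])
        rs = np.array([r for _, r in items])
        var_ui = 1.0 / (np.sum(v[js] ** 2) / sigma2 + 1.0 / tau_u2)
        e_ui = (np.sum((rs - mu - c[i] - d[js]) * v[js]) / sigma2
                + u_bar / tau_u2) * var_ui
        u[i] = np.random.normal(e_ui, np.sqrt(var_ui))   # draw the posterior sample
    return u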
Variational Bayesian (VB) inference [Lim07, Rai07, Bis06] for the one-feature SVD model (“SVD Variational Bayesian” in table 13) results in formulas similar to MCMC. The VB method assumes independence of the four groups of parameters c, d, u, v in the posterior distribution: q(c)q(d)q(u)q(v). Using variational methods, a probability density function in the factorized form q(c)q(d)q(u)q(v) is found [Bis06] that minimizes the Kullback-Leibler divergence from the joint distribution p(c, d, u, v|D). The minimization leads to solutions of the form log q(u) = C + E_{c,d,v} log p(c, d, u, v, D), and to analogous expressions for q(v), q(c), q(d).

The resulting method is similar to MCMC, but instead of sampling, the distributions of c, d, u, v are alternately updated. Making the same assumptions about Gaussian priors as in the MCMC method, the posterior distributions of c, d, u, v in VB are the Gaussians N(Ec_i, Var c_i), N(Ed_j, Var d_j), N(Eu_i, Var u_i), N(Ev_j, Var v_j).
The parametrization of the posterior distributions of the biases c_i and d_j is the following:

$$\mathrm{Var}\,c_i = \frac{1}{|J_i|/\sigma^2 + 1/\tau_c^2} \qquad \mathrm{Var}\,d_j = \frac{1}{|I_j|/\sigma^2 + 1/\tau_d^2}$$

$$E c_i = \left(\frac{\sum_{j \in J_i}(r_{ij} - \mu - E d_j - E u_i\,E v_j)}{\sigma^2} + \frac{\bar c}{\tau_c^2}\right) \cdot \mathrm{Var}\,c_i$$

$$E d_j = \left(\frac{\sum_{i \in I_j}(r_{ij} - \mu - E c_i - E u_i\,E v_j)}{\sigma^2} + \frac{\bar d}{\tau_d^2}\right) \cdot \mathrm{Var}\,d_j$$
Point estimation of the hyperparameters c̄, d̄, τ_c², τ_d²:

$$\bar c = \frac{1}{N}\sum_{i=1}^{N} E c_i \qquad \tau_c^2 = \frac{1}{N}\sum_{i=1}^{N}\left((E c_i)^2 + \mathrm{Var}\,c_i\right) - \bar c^2$$

$$\bar d = \frac{1}{M}\sum_{j=1}^{M} E d_j \qquad \tau_d^2 = \frac{1}{M}\sum_{j=1}^{M}\left((E d_j)^2 + \mathrm{Var}\,d_j\right) - \bar d^2$$
The posterior distributions of u_i and v_j in VB are parameterized as follows (with y_ij = r_ij − µ − Ec_i − Ed_j):

$$\mathrm{Var}\,u_i = \frac{1}{\left(\sum_{j \in J_i}\left((E v_j)^2 + \mathrm{Var}\,v_j\right)\right)/\sigma^2 + 1/\tau_u^2} \qquad (22)$$

$$E u_i = \left(\frac{\sum_{j \in J_i} y_{ij}\,E v_j}{\sigma^2} + \frac{\bar u}{\tau_u^2}\right) \cdot \mathrm{Var}\,u_i \qquad (23)$$

$$\mathrm{Var}\,v_j = \frac{1}{\left(\sum_{i \in I_j}\left((E u_i)^2 + \mathrm{Var}\,u_i\right)\right)/\sigma^2 + 1/\tau_v^2} \qquad (24)$$

$$E v_j = \left(\frac{\sum_{i \in I_j} y_{ij}\,E u_i}{\sigma^2} + \frac{\bar v}{\tau_v^2}\right) \cdot \mathrm{Var}\,v_j \qquad (25)$$

Point estimation of the hyperparameters ū, v̄, τ_u², τ_v²:

$$\bar u = \frac{1}{N}\sum_{i=1}^{N} E u_i \qquad \tau_u^2 = \frac{1}{N}\sum_{i=1}^{N}\left((E u_i)^2 + \mathrm{Var}\,u_i\right) - \bar u^2$$

$$\bar v = \frac{1}{M}\sum_{j=1}^{M} E v_j \qquad \tau_v^2 = \frac{1}{M}\sum_{j=1}^{M}\left((E v_j)^2 + \mathrm{Var}\,v_j\right) - \bar v^2$$
Table 13 lists the results of running two variants of the VB method for 20 iterations. In the first variant τ_u = 1 was fixed, and in the second τ_v = 1 was fixed. Both gave an identical RMSE_probe = 0.9510, the best among the one-feature SVD methods tested here.
One inconsistency in the VB method used is that the prior distributions of u_i and v_j, the posterior distributions of u_i and v_j, and the output distribution of r_ij are all assumed Gaussian, although the product of two Gaussian variables is not Gaussian. Investigating this observation led to experiments with nonparametric priors and to proposing a different choice of priors (see later parts of the section).
Another observation, after comparing the equations (23), (25) of VB with the equations (20), (21) of MCMC, was that the VB method underestimates the posterior variances of u_i and v_j. Comparing (20) and (23), the sampled v_j²’s are replaced by (Ev_j)² + Var v_j, and y_ij v_j is replaced by y_ij Ev_j, where y_ij = r_ij − µ − Ec_i − Ed_j. The first change, to (Ev_j)² + Var v_j, is a good approximation of the influence of the v_j² term, but the second change, of y_ij v_j to y_ij Ev_j, decreases the variance. Approximate corrections of the posterior variances, of the form Var u_i := Var u_i + (Var u_i)² Σ_{j∈J_i} y_ij² Var v_j, and the analogous corrections of Var v_j, were too small to significantly influence the RMSE accuracy.
The Variational Bayesian SVD equations explain a phenomenon initially noticed experimentally [Fun06]: in the cost function minimization approaches to SVD (called here also the neural network approach to SVD), regularization whose amount grows linearly with the number of user ratings (movie ratings) works better than constant regularization, identical for each user (each movie). Constant regularization is the typical choice in machine learning, used in neural networks (where it is called weight decay), in structural risk minimization, in kernel methods, and in other approaches. In the Netflix Prize task the more refined regularization term avoids overfitting to a large extent and yields a large accuracy improvement. In formula (22) for the posterior variance of u_i there is a term Σ_{j∈J_i} Var v_j/σ² that grows linearly with the number of ratings collected (and similarly in formula (24) for the posterior variance of v_j).
If we assume the hyperparameters ū = v̄ = 0, the VB solution can be approximated by alternating MAP (maximum a posteriori) estimation with a specially selected linear regularization λ^(1) + λ^(2)|I_j|. Two SVD methods with linear regularization (but without the constant terms, λ_u^(1) = λ_v^(1) = 0) were described earlier and are listed in table 13 as “SVD gradient descent”. The method “SVD MAP, opt.” from table 13 uses nearly the same cost function but, instead of optimizing it with gradient descent, the four groups of parameters c, d, u, v are alternately optimized by a jump to the marginal minimum. Regularization is constant for the parameters c_i and d_j, but increases linearly with the number of user ratings when estimating u_i, and with the number of movie ratings when estimating v_j.
$$y_{ij} = r_{ij} - (\mu + c_i + d_j + u_i v_j)$$

$$c_i := \frac{\sum_{j \in J_i}(y_{ij} + c_i)}{|J_i| + \lambda_c} \qquad d_j := \frac{\sum_{i \in I_j}(y_{ij} + d_j)}{|I_j| + \lambda_d}$$

$$u_i := \frac{\sum_{j \in J_i}(y_{ij} + u_i v_j)\,v_j}{\sum_{j \in J_i} v_j^2 + \lambda_u^{(1)} + \lambda_u^{(2)}|J_i|} \qquad v_j := \frac{\sum_{i \in I_j}(y_{ij} + u_i v_j)\,u_i}{\sum_{i \in I_j} u_i^2 + \lambda_v^{(1)} + \lambda_v^{(2)}|I_j|}$$
In this method there are six regularization parameters: λ_c, λ_d, λ_u^(1), λ_u^(2), λ_v^(1), λ_v^(2). Their values were set by minimizing RMSE on the test set with the Praxis optimizer [Bre71] from the Fortran Netlib library. The resulting RMSE_probe of the method is 0.9511. The RMSE-minimizing parameters were λ_cd = 5.0, λ_u = (4.0, 0.04), λ_v = (6.0, 0.011) (the parameters were rounded; the rounding did not change RMSE_probe).
The method “SVD gradient descent w/opt. param.” optimizes exactly the same cost function as “SVD MAP, opt.”, but with gradient descent (a first-order method) instead of alternating jumps to the minimum (a second-order method).

The methods “SVD gradient descent, regularization |u_i|³, |v_j|³” and “SVD MAP, regularization |u_i|³, |v_j|³” use a special regularization suggested by the experiments with nonparametric priors.
The method denoted in table 13 as “SVD Bayesian, nonparam. prior” explores the idea of using nonparametric priors. The methods described so far in this section were more or less accurate approximations of the Bayesian approach in a model that assumes independent Gaussians as the prior distributions of the hidden parameters. Good practice when identifying or verifying a model is to first use a model with more parameters than necessary, and then to exploit noticed regularities, patterns, dependencies and known distributions to simplify the model and decrease the number of parameters. One may wonder whether a Gaussian is the right choice for the prior distributions. I attempted to verify it by learning the priors in a nonparametric form and inspecting the learned distributions, as I did for the model of biases in section 4.2.3. Again, I use here a prior in the form of a Dirichlet process with the concentration parameter set to zero. The prior distribution is initialized to a set of values sampled uniformly from a 501-value grid of equidistant points in the range [−5, 5]. The method is called nonparametric because the grid density can be increased (each grid value is a new parameter) with an increasing number of users (movies). The resulting algorithm is very similar to the nonparametric algorithm for biases described earlier in section 4.2.3:
1. initialize c, d, u, v to random values from
   the discrete set {−5, −4.98, −4.96, −4.94, ..., 4.98, 5}
2. for (iter in 1..100)
3.   update_c(c)
4.   update_d(d)
5.   update_u(u)
6.   update_v(v)
7. c̄, d̄, ū, v̄ = means of the c, d, u, v vectors from iterations >= 50
8. predict using c̄_i + d̄_j + ū_i·v̄_j
The functions update_u(), update_v(), update_c(), update_d() are similar to the update functions in the algorithm for biases in section 4.2.3. The function update_u() is as follows (the remaining ones are analogous):
1. update_u(u):
2.   unew = random_shuffle(u)
3.   for (i in 1..480189)            // loop over users
4.     l1 = log_likelihood_u(u_i, R_i)
5.     l2 = log_likelihood_u(unew_i, R_i)
6.     if (exp(l1 − l2) > random())
7.       u_i = unew_i
The function log_likelihood_u(x, J_i) measures the likelihood of u_i = x given the observed data, assuming that all c_i, d_j, v_j parameters are fixed. I assume here that the data come from the distribution N(µ + c_i + d_j + u_i v_j, σ²); hence log_likelihood_u(x, J_i) = (1/(2σ²)) Σ_{j∈J_i} (r_ij − µ − c_i − d_j − x·v_j)², which is the negative log-likelihood up to an additive constant, so the acceptance test exp(l1 − l2) > random() accepts the proposed value with the Metropolis probability min(1, p(unew_i)/p(u_i)). The function random() returns a sample from the uniform distribution on [0, 1], and the function random_shuffle(x) returns a random permutation of the vector x, each permutation with the same probability.
As in the algorithm in section 4.2.3, I assume σ² = 0.5. Changing this parameter to a more accurate value has little influence on the resulting distributions.
Let’s visualize the posterior distributions, comparing the results of learning with the parametric and the nonparametric method. The resulting distributions of the biases c, d have a shape similar to those learned in the “only biases” model (section 4.2.3), so to save space I skip their visualization. Plots A, B in figure 27 visualize the posterior distributions of u and v for the parametric method (the one with Gibbs sampling). Plots C, D visualize the distributions for the nonparametric method (the set of samples from the posterior distributions becomes the new prior in the subsequent iteration). Plots E, F are quantile-quantile plots comparing the normalized (centered and standardized) posterior distributions of u and v with the standard Gaussian N(0, 1).

Summarizing charts A and B in figure 27, for the parametric method the distribution of the posterior samples of u_i roughly coincides with the prior distribution of u_i (estimated in the previous iteration of the algorithm). In turn, the posterior samples of v_j are concentrated in a narrow range compared to the v_j prior, whose variance was fixed to τ_v² = 1.
The shape of the learned nonparametric priors of u and v, seen in charts C, D, E, F, is interesting. Noticeable are the kurtosis values γ_u = −0.73 and γ_v = −0.78 (the estimates may be inexact). In the parametric methods Gaussians, for which the kurtosis is zero, were used as priors. One attempt at different priors used a generalized normal distribution, p(x) ∝ exp(−C|x|^α), which has negative kurtosis for powers α > 2 (the density is then flatter-topped than a Gaussian). Initial experimental results showed a small accuracy improvement with α = 3 over α = 2 in the gradient descent approach (“SVD gradient descent, regularization |u_i|³, |v_j|³” in table 13) and a small accuracy reduction when using alternating minimization (“SVD MAP, regularization |u_i|³, |v_j|³” in table 13). More experiments are needed to conclude whether a change of priors can significantly improve the accuracy of the regularized SVD methods.
The learned nonparametric distributions of u and v with negative kurtosis, the form of the model N(c_i + d_j + u_i v_j, σ²) containing the u_i v_j term, and the form of the learned nonparametric priors of c_i and d_j (close to Gaussians) suggest that the term u_i v_j should be approximately Gaussian a priori. This thought leads to an interesting, simple question in the field of probability: which two symmetric iid variables X and Y, after multiplying, give a standard Gaussian variable XY = Z ∼ N(0, 1)? As often happens, simple-to-state questions about random variables can lead to complicated solutions. A question to which I do not know the answer is what the pdf of the X and Y variables is. What is known is that their moments are square roots of the moments of the standard Gaussian distribution (the odd moments are zero), so it is easy to calculate the kurtosis and compare it with the observed values. The even moments of N(0, 1) are m_{2p} = (2p − 1)!! = Π_{i=1}^{p}(2i − 1), so the even moments of the desired distribution of X and Y are √((2p − 1)!!). The kurtosis of the Gaussian distribution is m₄/m₂² − 3 = 0, and for the distribution of X and Y it is m₄/m₂² − 3 = √3 − 3 ≈ −1.268, smaller than the value observed in our nonparametric method (with the caveats that our procedure for estimating the kurtosis may be inexact, and that the output distribution is not precisely Gaussian).

Figure 27: Learned priors of u and v in the parametric and nonparametric methods.
Summarizing the experiments so far, priors better than Gaussians can be chosen in the probabilistic SVD model. Another way to improve the approximation of the MCMC method is to use a method different from VB with KL-divergence minimization. Minimizing the KL divergence with Gaussian priors leads to Gaussian posteriors, and hence to approximating ratings with sums of spike-shaped distributions – normal product distributions. VB SVD minimizing the Hellinger distance could be tried out (but the calculations become more complicated).
To gain further insight into the SVD methods used here, I looked at synthetic data of the form u_i v_j + ε_ij, where u_i and v_j are drawn once from N(0, 1), and each error ε_ij is also drawn from N(0, 1). A noticeable problem is that the influence of the priors of u and v (here N(0, 1)) on the obtained posteriors is small. Over 10 different draws of the data, re-running a regularized SVD method gave the following estimates of the variance τ̂_u²: (0.11, 0.98, 0.01, 0.77, 0.42, 1.34, 0.22, 1.64, 0.55, 0.43). In multi-feature SVD another problem appears – multicollinearity in one of the columns or rows, occurring once per several draws of the data and causing a large increase of RMSE. We can therefore speak of bad conditioning of the task of recovering the u and v parameters when each u_i and v_j is drawn from a Gaussian and the data matrix is drawn from N(u_i v_j, σ²).
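A minimal sketch of this kind of synthetic experiment, using a fully observed matrix and a plain alternating ridge update as a stand-in for the actual fitting method used in the experiments:

import numpy as np

def synthetic_tau_u(n=300, m=300, sigma=1.0, iters=50, lam=1.0, seed=0):
    """Draw r_ij = u_i*v_j + eps_ij, recover u, v by alternating ridge
    updates, and return the estimated prior variance of u."""
    rng = np.random.default_rng(seed)
    u_true, v_true = rng.normal(size=n), rng.normal(size=m)
    r = np.outer(u_true, v_true) + rng.normal(scale=sigma, size=(n, m))
    u, v = rng.normal(size=n) * 0.1, rng.normal(size=m) * 0.1
    for _ in range(iters):
        u = r @ v / (v @ v + lam)       # marginal minimum for all u_i at once
        v = r.T @ u / (u @ u + lam)     # marginal minimum for all v_j at once
    return np.var(u)                    # estimate of tau_u^2

print([round(synthetic_tau_u(seed=s), 2) for s in range(10)])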
The last method listed in table 13 is “nonparametric relationship, avg. samples”. All SVD methods presented earlier in this section contained the term u_i v_j. The choice of multiplication as the expression connecting numeric user preferences with movie features (hidden genres) is arbitrary and should be verified (note the redundancy between changing the function connecting u_i with v_j and changing the priors of u_i and v_j). To see which function f(u_i, v_j) is the proper one, I try nonparametric modelling. The method used is a two-dimensional variant of the K-means algorithm or fuzzy C-means: each user is assigned to one of 100 clusters and, similarly, each movie is assigned to one of 100 other clusters. The model used is the following: each centre µ_ab of the 100 × 100 rating clusters is drawn from N(0, σ_µ²). The cluster a(i) of each user i and the cluster b(j) of each movie j are drawn uniformly from the integer set 1..100. Then all ratings are drawn from N(µ + c_i + d_j + µ_{a(i)b(j)}, σ²). Predictions are made using the cluster assignments a(i), b(j) sampled from their posterior distributions after observing the data, and using the current iteration’s estimates of µ_ab. The function µ_ab = g(a, b) modelling the relationship between clusters is estimated by a regularized average over A_ab, the set of ratings given by users in cluster a to movies in cluster b:

$$\mu_{ab} = g(a, b) = \frac{\left(\sum_{ij \in A_{ab}}(r_{ij} - \mu - c_i - d_j)\right) + \mu\lambda}{N_{ab} + \lambda}$$
with λ fixed to 100. In each subsequent iteration, the cluster assignments of users and movies are sampled using the Metropolis-Hastings method, which compares the likelihoods of two clusters: the previous one and a new one chosen uniformly at random; a sketch of both steps follows below. 100 iterations were run as burn-in, and an additional 100 to calculate the averaged predictions. The averaged predictions gave a very good RMSE of 0.9498, the best among all methods listed in this section, but it should be noted that the learned function g(a, b) encapsulates more information than only one user-movie feature. Repeatedly training the algorithm on the residuals of the same algorithm (in order to learn a new function representing more features) did not improve RMSE, likely because of overfitting.
The result of running the method for 200 iterations is a final assignment of users and movies to clusters and a function g(a, b) with estimates of the average rating in the 100 × 100 clusters. After learning, the cluster coordinates a, b are unordered. To visualize the content of the matrix, I chose the largest value g(a, b) in the matrix and swapped its row a with row 100 and its column b with column 100. Then I sorted the matrix rows by g(a, 100) and the matrix columns by g(100, b), yielding a permutation of the original matrix: A_new = P₁AP₂. In the new matrix the largest value is in the cell g(100, 100), and g(a, 100) and g(100, b) are ordered sequences. Figure 28 shows selected columns of the permuted matrix (g(a, b) is renamed rel(i, j) there). A plot varying the second coordinate b is similar to figure 28 for index a; to save space it is skipped.
Figure 29 shows four 3D visualizations of the sorted relation g(a, b), with neighboring groups of 10 rows and 10 columns averaged. The first chart plots the averaged g(a, b) in coordinates X1, Y1 chosen so that the averaged values of g(a, b) on two edges are straight lines. The second chart uses coordinates X2, Y2 in which the averaged g(a, b) on the other two edges are straight lines. The third and fourth (bottom) charts use coordinates U, V from the SVD of g(a, b) (the coordinates averaged in groups of 10). The third chart plots the averaged g(a, b) values, and the fourth chart plots the u_a v_b approximation of g(a, b).

We see that the learned function is close to the multiplication operation for some choice of values u_a assigned to the user clusters a and some choice of values v_b assigned to the movie clusters b, but it is not exactly multiplication. After fitting the parameters of several functional forms with the general-purpose optimization procedure nlm() in R, and after removing terms of low significance, the best approximation of g(a, b) found was the following:
$$\hat r_{ab} = \alpha + u_a v_b + \beta u_a^2 + \gamma v_b^2$$

where all parameters u_a, v_b and α, β, γ were fitted by the nlm() optimizer in R. The resulting weights were α = 0.04, β = −0.16, γ = −0.24. The learned function g(a, b) seems to lie between u_a v_b from the regularized SVD model and (u_a − v_b)² from the Euclidean embedding used in [Kho10] (of course, to confirm or reject this hypothesis, experiments on SVD with more features are needed). The observed results justify, to some degree, the use of multiplication in SVD-type models for the Netflix data.
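A sketch of an analogous fit in Python, with scipy’s least_squares standing in for R’s nlm(); g is assumed to be the 100 × 100 matrix of learned values:

import numpy as np
from scipy.optimize import least_squares

def fit_relationship(g):
    """Fit r_ab = alpha + u_a*v_b + beta*u_a^2 + gamma*v_b^2 to g(a, b)."""
    n, m = g.shape

    def residuals(theta):
        alpha, beta, gamma = theta[:3]
        u, v = theta[3:3 + n], theta[3 + n:]
        pred = (alpha + np.outer(u, v)
                + beta * u[:, None] ** 2 + gamma * v[None, :] ** 2)
        return (pred - g).ravel()

    theta0 = np.concatenate(([0.0, 0.0, 0.0],
                             np.random.default_rng(0).normal(scale=0.1, size=n + m)))
    return least_squares(residuals, theta0).x[:3]   # alpha, beta, gamma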
Summarizing the experiments with nonparametric learning of the priors and nonparametric learning of the connecting two-argument function, the results suggest choosing different priors for the u, v parameters in regularized SVD, and possibly changing the multiplication u_i v_j to another function (but one close to multiplication). Whether the predictive accuracy of regularized SVD-type methods can be improved by following those suggestions is a topic for further research.
The crucial subproblem appearing in regularized SVD methods is repeatedly solving a Bayesian linear regression with errors in predictors: y_ij = u_i(v̄_j + ε_j) + ε (and an analogous equation for v_j). The choices to make here are the right prior for u_i (a proper short-tailed distribution), the right form of the error ε and of the errors in predictors ε_j (assuming that the predictors v̄_j are constant), and the right computational procedure to calculate the posterior of u_i. The error term may need to be modelled separately per observation, as ε_ij [Tom07]. The best performing one-feature SVD method in this section was the VB approximation with the assumptions of a Gaussian prior for u_i, a Gaussian error ε, and Gaussian errors in predictors ε_j (and, correspondingly, all Gaussians for the inference of v_j). More experiments on the Netflix data are needed to find out whether the assumptions of the model can be improved, and whether the VB approximation of the Bayesian approach gives the best possible results here.
Let’s summarize the whole section. I described several ways of learning the one-feature SVD model with biases, c_i + d_j + u_i v_j, and compared the resulting RMSEs. I examined neural-networks-like approaches based on minimizing a regularized cost function, and Bayesian, probability-based approaches. In the cost function minimization approach I considered various forms of regularization and different methods of optimization: gradient descent (a first-order method) and alternating jumps to the minimum for subsequent parameters (a second-order method). It was important to use regularization increasing linearly with the number of observations. In the Bayesian approach I used MCMC and VB approximations, assuming Gaussian priors for the parameters in the model. The Gaussian assumption was verified by using a prior in nonparametric form; it turned out that the nonparametric method suggests priors other than Gaussians for u_i and v_j. I also attempted to verify whether the multiplication u_i v_j is the right operation connecting numerical user preferences with a movie feature (learned genre), and the experiment to a large degree justified the choice of multiplication.

Figure 28: Ordered columns of the learned function (g(a, b) renamed rel(i, j)).

Figure 29: Visualization of the learned function. Panels: the learned function in coordinates (X1, Y1) and (X2, Y2) (vertical axis REL), and, in the SVD coordinates (U, V), the averaged g(a, b) values (REL) and their u_a v_b approximation (UV).
4.3 Regularized Singular Value Decomposition
In section 4.2 I described simple models, ending with the analysis of the regularized SVD model in its simplest form, with only one feature. Now I extend the analysis to regularized SVD with 30-200 features. Variants of regularized SVD, extended with time information (see section 4.3.6) and with improved modelling of user preferences (4.3.5), were the methods with the best accuracy on the Netflix Prize task.
SVD-type methods (and matrix factorization methods in general) contain a component u_iᵀv_j, where the vector of real numbers v_j represents automatically learned features of movie j, and the vector u_i represents the learned user preferences for each of the corresponding features v_jk.

As a naming convention, I will call “regularized SVD” those methods containing the factorization component that are similar, in the form of the model and in the result of learning the parameters, to the standard SVD from linear algebra. In particular, regularized SVD variants sort features roughly from the largest to the smallest magnitudes. I will call “matrix factorizations” methods that may learn any rotation U'V'ᵀ = (UC)(VC)ᵀ = UCCᵀVᵀ = UVᵀ (for an orthogonal matrix C), without sorting the columns according to their magnitudes, and with the columns of U and V not necessarily close to orthogonal. I will also count as matrix factorizations methods restricting the values of U and V, like non-negative MF, as well as generalized factorizations, where the output depends on the factorization component through a nonlinear link function y_ij = g(u_iᵀv_j). Matrix factorizations are one possible way to capture subsets of matrix entries (user, movie) whose expected rating is larger (or smaller, respectively) than explained (predicted so far) by global effects and by the remaining part of the model.
Before moving on to the extensive description of the full-featured regularized SVD in the next sections, let’s first recall the standard SVD method known from linear algebra. Singular Value Decomposition factorizes a matrix A of size n × m into a product of three matrices, A = UΣVᵀ, where U of size n × k and V of size m × k are matrices with orthonormal columns (UᵀU = VᵀV = I_k), and Σ is a diagonal matrix of size k × k, k = min(n, m). The values on the diagonal of Σ are sorted in decreasing order, along with the corresponding columns of U and V. SVD is related to the eigenvalue decomposition: the matrices AᵀA and AAᵀ factorize into AᵀA = VΣ²Vᵀ and AAᵀ = UΣ²Uᵀ. Principal Components Analysis (PCA), a popular technique of statistical data analysis that uses the eigenvalue decomposition, can also be performed by the SVD of the data matrix with centered columns.
If we keep only the k ≤ min(n, m) largest singular values in the SVD, the matrix Â^(k) = U_k Σ_k V_kᵀ is an optimal low-rank approximation of A as measured by the Frobenius norm: among all matrices of rank at most k, it minimizes

$$||A - \hat A^{(k)}||_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m}\left(a_{ij} - \hat a_{ij}^{(k)}\right)^2}.$$

This fact is convenient, because our Netflix Prize task is to minimize test RMSE, which is the Frobenius norm calculated over some subset of matrix cells. If only a small fraction of values were missing in the data matrix, standard linear algebra SVD could work well, with the missing values repeatedly imputed from the results of the SVD approximation. The first attempts to use SVD-type algorithms for collaborative filtering were to treat missing values as zeros [Bil98], fill missing values with the average item rating [Sar00], use a dense subset of the data [Gol01], or use a sparse SVD without regularization, imputing missing values with expectation-maximization-like approximations in a computationally efficient way [Bra02, Bra03, Zha05, Kur07, All10]. For the Netflix Prize data, where about 98.9% of matrix entries are missing, the problem is that imputation methods overfit the data and give bad prediction accuracy. Proper approximate Bayesian inference is much more accurate, leading to regularized SVD-type methods, which have a form roughly similar to linear algebra SVD with sorted singular values, but lose properties like exact orthogonality of the columns of U and V, and their optimization task may have local minima.
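As a quick illustration of the optimal low-rank property on a small dense matrix (numpy here; the matrix A and the rank k are arbitrary examples, not part of the original experiments):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 15))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation

# the Frobenius error of the truncation equals the norm of the dropped singular values
print(np.linalg.norm(A - A_k), np.sqrt(np.sum(s[k:] ** 2)))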
The models that are here called regularized SVD, described in detail in the following sections, all have the form:

$$\hat r_{ij} = \mu + c_i + d_j + \sum_{k=1}^{K} u_{ik} v_{jk} = \mu + c_i + d_j + u_i^T v_j.$$
Similarly to the idea of centering columns in PCA, biases are used here, which, in a sense, center the columns and rows simultaneously (more about global effects in section 4.3.1). There are many possible ways to learn the parameters, resulting in varying predictive accuracy, running time, degree of simplicity of the algorithm, robustness, and other characteristics. The parameters u_i, v_j can be interpreted as random variables in approximate Bayesian approaches (Variational Bayes and MCMC methods were used here, see section 4.3.3), or as parameters to optimize inside a regularized cost function in neural-networks-like approaches (section 4.3.2). Another interesting option, not yet extensively explored, is to regularize the whole approximating matrix UVᵀ by a matrix norm (section 4.3.7). As for the optimization algorithms, a choice to make was either to use a first-order method, like gradient descent, following the first derivative of the cost function, or a second-order method, like alternating optimization of the parameters by jumping to the marginal minimum of the quadratic cost function. In all regularized SVD methods, approximate Bayesian and NN-like, there was a choice either to learn the group of all user parameters u_i at one time (all movie parameters v_j, respectively), or to learn each parameter u_ik separately. The biases c_i, d_j were learned in a preprocessing phase, or treated as part of the model, like the parameters u_i, v_j. The output of the predictive models was clipped to the range 1-5 (it is also possible to use a smooth sigmoidal function to restrict the range of the output).
The models used usually contained a few additional parameters (regularization constants, hyperparameters, learning rates) to optimize, and different choices of their values resulted in varying RMSE accuracy. In my implementations those parameters were optimized roughly by hand-tuning or, in a few cases, automatically with the Praxis procedure [Bre71] from the Netlib library.

The structure of the missing data has to be taken into account [Mar08], particularly the property that users tend to watch and rate movies they like. Modelling the missing data structure is probably more important for generating recommendations, but it also helped to some extent to improve prediction accuracy on the task of minimizing test RMSE (see section 4.3.5).

An additional advantage of the regularized SVD methods is the possibility of easy parallelization.
4.3.1 Global effects

Global effects alone explain a substantial part of the rating variability. In the Netflix task, as summarized in [Dro11], 57% of the variance of ratings is left unexplained, 33% is explained by the best models of biases [Kor09b], and 10% is explained by the model part capable of personalization, such as matrix factorizations.
The idea behind using biases and other kinds of global effects is that modelling non-complex global patterns, trends or dependencies noticed in the data requires only a few parameters. More complex models, like regularized SVD (which contains millions of parameters when applied to the Netflix data), are usually capable of modelling the same global effects, but modelling an effect with more parameters than necessary can overfit the data. Directly modelling the noticed global effects reduces the number of model parameters needed and thus often improves predictive accuracy.
Let’s list the global effects most commonly used to augment various methods for the Netflix Prize task. Proper modelling and learning of global effects was useful not only in matrix factorizations, but also in other accurate methods used in the Netflix task, like RBM or K-NN. The three most frequent, sometimes called the baseline methods, were: the global mean, and the user and movie biases. Models using only those three effects were examined more closely in section 4.2 “Simple models”. In [Bel07b] several compound global effects were proposed, of the form User × f_j, denoting a separate parameter learned for each user i, multiplied by a fixed per-movie effect f_j; the case of the user bias is obtained by setting f_j = 1. Movie compound effects are constructed similarly: Movie × f_i, where one parameter per movie j is learned, multiplied by a fixed per-user effect f_i. The best performing compound global effects were: User × {Movie average, Movie standard deviation, Movie support, Time(user)^0.5, Time(movie)^0.5} and Movie × {User average, User standard deviation, User support, Time(user)^0.5, Time(movie)^0.5}. Time(user) denotes the time elapsed (the number of days) since the first rating of the user, and Time(movie) denotes the time elapsed since the first rating given to the movie. It is an open question how to automate the search for simple or compound global effects in data; a sketch of estimating a single compound effect follows below.
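A sketch of learning one compound global effect of the User × f_j form by a shrunk one-variable least-squares fit on residuals; the shrinkage form and the constant lam are my assumptions, in the spirit of the one-variable ridge regressions discussed later:

import numpy as np

def fit_user_x_f(residuals_by_user, f, lam=25.0):
    """For each user i, fit theta_i in res_ij ~= theta_i * f_j by shrunk
    least squares: theta_i = sum(res*f) / (sum(f^2) + lam).
    residuals_by_user: list of [(movie j, residual res_ij), ...] per user."""
    theta = np.zeros(len(residuals_by_user))
    for i, items in enumerate(residuals_by_user):
        fs = np.array([f[j] for j, _ in items])
        res = np.array([r for _, r in items])
        theta[i] = np.sum(res * fs) / (np.sum(fs ** 2) + lam)
    return theta   # prediction contribution for a pair (i, j): theta[i] * f[j]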
An additional possibility for improving accuracy was convolving the global effects with an appropriately chosen kernel expressing the distance in time between two ratings of a user (see [Tos08b], part 16, “Global Time Effects”, for details). The paper [Tos08b] reports RMSE_qual = 0.9450 and RMSE_probe = 0.9509 for 19 global time effects learned sequentially, with each effect having 5 parameters set by automatic parameter tuning (the tuning method is described in [Tos08b]).

[Kor09b] obtained a global effects model with RMSE_qual = 0.9278 by splitting the item bias according to the rounded logarithm of the frequency effect (frequency was defined as the number of ratings given by a user on a given day [Pio09]).
One decision to make is which effects to use. Another is to choose the best way of learning the parameters. In section 4.2.3 “Only biases” I compared several ways of treating the parameters of a simple model of user and movie biases, and several ways of optimization. For the simple model of biases the results of learning are similar for approximate Bayesian methods, such as Variational Bayes, MCMC and simple MAP point estimation, and for regularized cost function optimization (here first-order methods like gradient descent and second-order methods work equally well; the latter boil down to alternating shrunk estimation of the effects, or to alternating ridge regressions with few parameters). The experiments with nonparametric shared priors supported the assumption of approximately Gaussian priors for the biases.
There are several possible ways of combining global effects with more complex models: global effects can be learned sequentially, one on the residuals of another, and used as preprocessing; or GE can be learned simultaneously and used as preprocessing for other models; or GE can be treated as a component of the larger model and learned simultaneously with its remaining parameters. As seen in section 4.2.3, simultaneous learning gives better accuracy than sequential learning, but the accuracy improvement is smaller when combining GE with larger models, and even smaller when comparing the ensemble accuracy. Sequential preprocessing was used in [Fun06] to learn biases, in [Bel07b] to learn compound GE, and in [Tos08b] to learn GE with a time convolution term. GE were used as a component of the model in [Tak07a, Tak07b] (biases and other fixed effects) and in [Pat07] (biases only). In [Pot08] biases were learned simultaneously, a user day bias was added, along with a correction for the variance of user ratings, obtaining RMSE_qual = 0.9488 using global effects only. All three methods of learning GE parameters (sequential preprocessing, simultaneous preprocessing, and as part of the model) were used in the methods described later in this work.
Learning can also vary depending on how we want to evaluate the method. We can optimize the predictive accuracy of the global effects alone, optimize the accuracy of the larger model combined with the global effects (as preprocessing or as part of the model), or learn the parameters of the global effects to optimize the accuracy of a blended ensemble of many methods [Pio09, Tos09]. Different choices of the learning scheme can give different accuracy, computation time, or ease of tuning (manual or automatic). In this work mostly the first two evaluation criteria (accuracy of GE alone and accuracy of a larger model combined with GE) were used to choose the best GE for a method, but the ensemble accuracy was a criterion of model selection and gave hints where to focus efforts when exploring the space of possible models.
4.3.2 Neural-networks-like SVD

Among the choices to make in the cost function minimization approach are:

• how to tune the additional parameters, such as regularization constants for different groups of model parameters, or learning rates.
The considered cost functions, defined on a training set Tr, have the form:

$$l(\theta_1, \ldots, \theta_P) = \frac{1}{2}\sum_{ij \in Tr}\left(r_{ij} - \hat r_{ij}(\theta_1, \ldots, \theta_P)\right)^2 + \frac{1}{2}\sum_{p=1}^{P}\lambda_p\theta_p^2 \qquad (26)$$
After specifying the form of the function r̂_ij, which aims to approximate the ratings r_ij from the training set and to predict new ratings, we can learn the parameters of r̂_ij by minimizing the cost function l. One way to minimize is gradient descent, that is, following the negated gradient of the function l:

$$\frac{\partial l}{\partial \theta_p} = -\sum_{ij \in Tr}\left(r_{ij} - \hat r_{ij}(\theta_1, \ldots, \theta_P)\right)\frac{\partial \hat r_{ij}}{\partial \theta_p} + \lambda_p\theta_p \qquad \theta_p \mathrel{-}= \eta\,\frac{\partial l}{\partial \theta_p}$$
With the rating estimation by (27), the resulting cost function (26) is quadratic with
respect to each parameter when the remaining parameters are fixed.
Different realizations of the neural-networks SVD were tried out, with different choices of learning algorithms and regularization. The non-regularized attempts [Sar00, Kur07] overfit the data heavily and have accuracy inferior to the regularized ones. Moreover, for the Netflix dataset with 99% of the data missing, sparse learning methods that ignore the missing data were more accurate (on the Netflix task) and much faster than methods using data imputation.
The best-known implementation of regularized SVD, [Fun06], used linear regularization, learning the biases and each feature once. In [Fun06], first the movie biases d_j are learned, then the user biases c_i on the residuals of d_j, both by a jump to the marginal minimum of the cost function with a constant regularization parameter λ_cd = 25. For each k, the pair of features u_ik, v_jk is learned on the residuals of the biases and of the features for all previous k' < k,

$$\hat r_{ij}^{(k-1)} = \mu + c_i + d_j + \sum_{k'=1}^{k-1} u_{ik'} v_{jk'},$$

by minimizing the cost function with stochastic gradient descent. The regularization parameters for u_ik, v_jk depend linearly on the number of observations of the user and the movie: λ_{u_ik} = 0.02 N_i, λ_{v_jk} = 0.02 N_j.
The stochastic gradient descent update rules for each observation (user i gives rating r_ij to movie j) are the following:

e_ij = r_ij − r̂_ij^(k−1) − u_ik·v_jk
u_ik += η·(e_ij·v_jk − 0.02·u_ik)
v_jk += η·(e_ij·u_ik − 0.02·v_jk)

The per-observation decay 0.02·u_ik, applied once for each of the N_i ratings of user i in an epoch, corresponds to the linear regularization λ_{u_ik} = 0.02 N_i above (and analogously for v_jk). The learning rate parameter η was fixed at 0.001. The above algorithm gave RMSE_qual ≈ 0.91 [Fun06]. As the author of that method pointed out later [Fun07], better results are obtained by learning u_ik, v_jk simultaneously for all k.
An algorithm related to regularized SVD, Maximum Margin Matrix Factorization (MMMF) [Sre05], regularizes the matrix factorization with the matrix trace norm. The MMMF proposed in [Sre05] for collaborative filtering used the hinge loss, but with MSE loss it would be equivalent to a regularized SVD of a dense matrix, with a constant regularization parameter and with the dimensionality constraint removed.
In [Pat07] the biases c_i, d_j were learned simultaneously with u_ik and v_jk for subsequent k, with a special regularization of the biases, obtaining RMSE_qual = 0.9070 for K = 96 features (without including 15% of the probe set in the training set).
The works [Tak08a, Tak08c] proposed the algorithm BRISMF (Biased Regularized In-
cremental Simultaneous Matrix Factorization), which is similar to [Fun06], but all param-
eters, including biases, are learned simultaneously. The method gave RMSEqual = 0.8962
for K = 100 features.
[Sal07a] used a momentum term to speed up the gradient descent, and their imple-
mentation learned parameters in batches.
Instead of using gradient descent, another method of optimizing the cost function (26) was marginal optimization of single parameters or groups of parameters, by jumping directly to the minimum of the cost function with the remaining parameters fixed. The following methods optimized groups of features at a time – all preferences of a given user, and all features of a given movie. [Bel07d] optimized (26) with non-negative least squares. [Pat07] used one-time ridge regression to post-process gradient descent SVD (nonlinear kernel ridge regression worked better). [Zho08] optimized the cost function (26) with alternating ridge regression; a sketch of such an update follows below. [Pio09] used ridge regression with a special diagonal regularization matrix:

$$u_i = \left(V_i^T V_i + (\lambda_1 + \lambda_2 N_i)\bar V\right)^{-1}\left(V_i^T y_i + \alpha\bar u\right)$$

where V_i is the matrix of features of the movies rated by user i, ū is a weighted average of the u_i from the previous iteration, weighted by the user support, and V̄ is a diagonal matrix containing the diagonal values of V_iᵀV_i, averaged over all users i.
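For reference, a plain alternating ridge regression update for one user’s preference vector, in the simple constant-λ form (variable names and the default λ are mine):

import numpy as np

def ridge_update_user(V_i, y_i, lam=0.05):
    """Solve the K-dimensional ridge regression for one user:
    u_i = argmin ||y_i - V_i u||^2 + lam*||u||^2.
    V_i: (N_i, K) features of the rated movies; y_i: (N_i,) residual ratings."""
    K = V_i.shape[1]
    return np.linalg.solve(V_i.T @ V_i + lam * np.eye(K), V_i.T @ y_i)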
Generally, gradient descent learning converged to better minima than alternating ridge regression over groups of parameters, but the accuracy and convergence of alternating ridge regression improve greatly when a good prior estimate of the user preferences is used – see section 4.3.5. [Cic09] observed a similar difficulty in alternating minimization for non-negative matrix factorizations, and [Mis11] proposes algorithms for learning a factorized distance matrix that circumvent similar difficulties of second-order optimization methods.
The cost function (26) can also be optimized one parameter at a time by a jump to the local minimum, and this learning method in turn gives accuracy similar to gradient descent learning. A method of this type was proposed in section 4.3 of [Bel07a]. Instead of shrinking estimation with one-variable ridge regression, [Bel07a] used a special weighting of the residuals, in the form of shrinking the residuals, a method close to one-variable ridge regression with constant regularization. Optimizing one parameter at a time was also used in [Pil10].
Throughout this work I implemented several variants of regularized SVD, starting from multiple variants of gradient descent SVD with biases [Pat07], through alternating ridge regressions with various forms of regularization, learning one parameter or multiple parameters at a time, to approximate Bayesian learning. Most experiments with SVD after the previous publication [Pat07] were attempts to approximate the Bayesian model [Tom07] (described in the next section, 4.3.3) and its Variational Bayesian version. The most accurate SVD methods in my experiments, among the cost-function-based approaches, used constant-linear regularization and learned one parameter at a time, with each feature fully learned (15-20 iterations) before learning the rest, and with all iterations repeated 2-4 times. The methods SVDM and SVDMT, described in section 4.3.6 “Time effects”, were time-enhanced variants of regularized SVD in the cost function approach, with improved user preferences and with the regularization enhanced by a shared prior variance of the user parameters. The earlier-mentioned SVD variants [Fun06, Pat07, Tak07a, Bel07d, Zho08] used linear regularization, λ_u N_i for users and λ_v N_j for movies, which is better than regularization with constants λ_u, λ_v, but inferior to linear regularization with an additional constant term: λ_u1 + λ_u2 N_i for a user i who rated N_i movies, and λ_v1 + λ_v2 N_j for a movie j rated N_j times (the constant-linear form of regularization is suggested by the approximate Bayesian inference, see section 4.3.3). The regularization parameters in my implementations were most often tuned automatically by the Praxis procedure from the Netlib library, or sometimes tuned by hand.
The method SBRISMF [Tak08b, Pil09d] adds regularization terms beyond the constant and linear ones: √N_i and N_i/log N_i. With automatic tuning of all regularization parameters (seven times more regularization parameters than BRISMF), RMSE_quiz = 0.8905 was obtained for K = 1000 features.
An advantage of using the same dataset and evaluation criterion is the possibility of comparing the solutions of different authors. Table 14 lists various neural-networks-like SVD implementations by various authors, with accuracy measured by RMSE_quiz on the Netflix Prize dataset.
RSVD, RSVD2, BASIC, and SVDRR were described in [Pat07]. BASIC is a set of six predictors: empirical user probabilities, and the movie mean. RSVD is the original regularized SVD [Fun06] with global effects from BASIC. RSVD2 is regularized SVD with biases. SVD1N1 is RSVD2 with fewer features and a changed number of iterations. SVD1N9A is RSVD2 with regularization (shrinking) of the sum of movie features. SVD1N19 is RSVD with nonlinear postprocessing. SVDRR is RSVD2 postprocessed by ridge regression on normalized movie feature vectors. SRR4 is RSVD2 postprocessed by ridge regression, and SRR5 is RSVD2 postprocessed by approximate ridge regression. In the listed variants of RSVD and RSVD2 the features were learned once, though with the biases in RSVD2 learned simultaneously with the features. This way of learning was inferior to repeating the learning of all features 2-3 times, or to simultaneous learning of all features (both used in the matrix factorization variants in my later experiments). All of the above methods were part of the ensemble in [Pat07] (that paper described the most contributing of the 56 predictors in the ensemble), and they also made minor contributions to the newer ensemble, listed in chapter 5 “Experimental results”.
We have seen that the cost function (26), (27) of the regularized SVD can be optimized in many different ways, resulting in different accuracy. As the experiments in section 4.2.5 showed, the priors of the regularized SVD parameters, usually assumed to be Gaussians for simplicity, should really be different distributions. The cost function used is an inaccurate approximation of Bayesian inference in an inaccurate probabilistic model, and optimizing it in different ways with early stopping ends up in different regions of the parameter space, with different hold-out test error and with the data underfitted or overfitted. RMSE accuracy is affected by the number of iterations, the order of learning the parameters or groups of parameters, the choice of learning rates, the batch sizes, and the initial values of the parameters.
Rotating the matrix of item features gives equivalent matrix factorizations UVᵀ = (UR)(VR)ᵀ for any orthogonal matrix R, and different ways of optimization result in different factorizations. Learning the features sequentially, one k-th feature at a time (even with re-learning the features several times on the residuals of the remaining ones), results in a factorization similar to linear algebra SVD, with the singular values sorted from the largest magnitudes (see section 4.3.4 for an attempt to explain the meaning of the most significant features). In turn, learning all features simultaneously can learn any rotation. Because the cost function can have multiple local minima (also, marginal minimization of groups of features can cause the optimization algorithm to get stuck), and because all the algorithms use early stopping, the algorithm deciding on different rotations in subsequent iterations influences the final values of the parameters and the RMSE accuracy.
On the basis of my own experiments, table 15 summarizes the advantages and disadvantages of the different choices from the perspectives of predictive accuracy, amount of computation, and my judgement of usage convenience in a recommender system. The “alt.min., single” method denotes the earlier-mentioned method with constant-linear regularization and learning one feature at a time, with the learning of all features repeated three times. The MCMC and Variational Bayes (VB) methods are described later, in section 4.3.3.
A disadvantage of gradient descent methods is the need to additionally tune the learning rates. Too small learning rates increase computation time, and too large learning rates cause the algorithm to diverge. Another disadvantage is that regularities in the input data can periodically bias the parameters during learning; for example, in my implementations users and movies were sorted by frequency, so the set of movie features differed at the time of learning the parameters of different groups of users.
As for the computational complexity, the number of iterations in gradient descent SVD was usually 100-250 (it can be reduced to only a few iterations by using variable learning rates [Tak08a, Kul09]). In the gradient descent methods that learn all features at a time, the computational complexity is O(LNK), where L is the average number of iterations, N is the amount of data – the number of ratings observed (100 million), and K is the number of features to learn (typically 30-200). In the gradient descent methods that learn one feature at a time, the limiting factor is reading all values in the matrix LK times, and the number of iterations increases, because the learning of features has to be repeated several times on the residuals of the remaining ones. In some cases the limiting factor in stochastic gradient descent methods can be reading from and writing to the parameter matrices U and V LN times.
Second-order methods that learn all features at once have complexity O(L·(M·K³ + P·K³ + N·K²)), where M is the number of users (480,189), P is the number of movies (17,770), and the number of iterations L is typically about 15. For users who rated few movies, instead of ridge regression, which requires storing and inverting a K × K matrix, one can use kernel ridge regression, inverting an M_i × M_i matrix, where M_i is the number of ratings of user i [Pil09a]. When learning one feature at a time with a second-order method (alternating jumps to the marginal minimum of the cost function), fully learning one feature before moving on to the next one, the complexity is O(L₂·L·(M·K + P·K + N·K)), where the number of iterations L is 10-20 and the entire learning process is repeated L₂ times, usually 2-4 times, with every feature learned on the residuals of the rest of the model.
The calculations can easily be sped up by parallelization, because the tasks of calculating the user preferences are independent of each other (except that they read the same movie features), just as the tasks for the movie features are independent (except for reading the same user preference vectors). One easy way of parallelizing is the “parallel for” instruction of the OpenMP framework.
Summarizing the topic of neural-networks-like regularized SVD, the most important observation is that the constant-linear regularization λ₁ + λ₂n is better than the linear term alone, λn, and the linear regularization in turn is much better than the constant regularization λ. This fact is explained by the form of the approximate Bayesian inference presented in the next section. The constant-linear regularization term may be useful in other neural-network-like applications, where it may work better than the constant regularization typical in neural networks.

Because the approximated underlying probabilistic model is inaccurate for the gathered rating data (see section 4.2.5), identifying the underlying model better can possibly lead to neural-networks-like simplified methods with even better accuracy.
ing team’s ensemble in the year 2007 with the best method having RMSEquiz = 0.8888
[Bel07c]). JT’s Bayesian PCA alternated between sampling the model variables, one variable at a time, according to their conditional posterior distributions p(u_ik|D, θ\u_ik), p(v_jk|D, θ\v_jk). To save space I skip the inference equations for u_ik and v_jk, which are analogous to the MCMC model described earlier in section 4.2.5 “One-feature regularized SVD with biases”, but with the residuals including the remaining sampled features k' ≠ k subtracted from the rating:

$$y_{ij}^{(k)} = r_{ij} - c_i - d_j - \sum_{k'=1,\,k' \neq k}^{K} u_{ik'} v_{jk'}$$

JT’s Bayesian PCA also contained time effects, such as a user day bias, and modelled the output variance parameter σ_ij² (the variance of the error term in the model) as differing between users and movies.
The models [Har07, Sal08, Mac10, Sha10] realized approximate matrix factorization by alternately sampling whole vectors of features in the MCMC algorithm. The basic SVD model in both the MCMC and VB methods was:

$$r_{ij} \sim N(\mu + c_i + d_j + u_i^T v_j, \sigma^2) \qquad u_i \sim N(\bar u, S_u) \qquad v_j \sim N(\bar v, S_v)$$

Some models assumed ū = v̄ = 0. The matrices S_u and S_v of the shared priors were assumed to be either diagonal or dense, and either flat hyperpriors or inverse Wishart hyperpriors were used. A necessary assumption is constraining one of the matrices S_u and S_v, for example fixing it to the identity matrix, or repeatedly normalizing the matrix while running the algorithm; if both matrices are unconstrained, the values in one matrix usually tend to grow to infinity while those in the second decrease to zero. The models proposed by different authors also vary in the choice of the method of learning the biases.
The MCMC algorithm with point estimation of the hyperparameters (the method can be called Monte Carlo empirical Bayes) iteratively repeats the following:

• sample from the posterior distribution p(U|R, V, θ_U),
• point-estimate the hyperparameters θ_U of the prior p(U|θ_U),
• sample from the posterior distribution p(V|R, U, θ_V),
• point-estimate the hyperparameters θ_V of the prior p(V|θ_V),

where θ_U = (ū, S_u) and θ_V = (v̄, S_v).
The conditional posterior distribution p(U|R, V, θ_U), having sampled V and point-estimated the prior of U, is calculated using Bayes’ rule:

$$E u_i = \left(V_{(i)}^T y_i / \sigma^2 + S_u^{-1}\bar u\right)\,\mathrm{Cov}\,u_i$$

$$\mathrm{Cov}\,u_i = \left(\frac{1}{\sigma^2}V_{(i)}^T V_{(i)} + S_u^{-1}\right)^{-1}$$

where V_(i) is the matrix of sampled features of the movies rated by user i. Calculating the conditional posterior of V is analogous, with V_(i) replaced by U_(j) (the matrix of sampled preferences of the users who rated movie j).
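A sketch of this step for one user, a direct transcription of the two formulas above (the function and argument names are my assumptions):

import numpy as np

def sample_user_vector(V_i, y_i, S_u_inv, u_bar, sigma2, rng):
    """Sample u_i from its conditional Gaussian posterior, given the
    sampled movie features V_i (N_i x K) and the residuals y_i (N_i,)."""
    cov = np.linalg.inv(V_i.T @ V_i / sigma2 + S_u_inv)
    mean = cov @ (V_i.T @ y_i / sigma2 + S_u_inv @ u_bar)
    return rng.multivariate_normal(mean, cov)

# usage: u[i] = sample_user_vector(V[rated[i]], y_i, S_u_inv, u_bar, sigma2,
#                                  np.random.default_rng(0))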
Variational Bayesian learning, in turn, assumes independence of the groups of parameters (in our case U and V) in the posterior distribution, and approximates the posterior under that constraint by minimizing the Kullback-Leibler divergence of the approximation from the true joint posterior [Bis06]. In the resulting algorithm [Lim07, Bis99b], instead of sampling U and V as in MCMC, expected values are calculated: E_V log p(U|R, V), alternately with E_U log p(V|R, U).
The resulting formulas for the posterior distributions are very similar to the MCMC method, only instead of sampled U and V their expected values are used, and an additional component appears, summing the variances Cov v_j and Cov u_i:

$$E u_i = \left(E V_{(i)}^T y_i / \sigma^2 + S_u^{-1}\bar u\right)\,\mathrm{Cov}\,u_i$$

$$\mathrm{Cov}\,u_i = \left(\frac{1}{\sigma^2}E V_{(i)}^T E V_{(i)} + \frac{1}{\sigma^2}\sum_{j \in J_i}\mathrm{Cov}\,v_j + S_u^{-1}\right)^{-1}$$

$$E v_j = \left(E U_{(j)}^T y_j / \sigma^2 + S_v^{-1}\bar v\right)\,\mathrm{Cov}\,v_j$$

$$\mathrm{Cov}\,v_j = \left(\frac{1}{\sigma^2}E U_{(j)}^T E U_{(j)} + \frac{1}{\sigma^2}\sum_{i \in I_j}\mathrm{Cov}\,u_i + S_v^{-1}\right)^{-1}$$
The components containing Cov v_j, Cov u_i grow linearly with the number of observations for a user or a movie, which justifies the use of a linear regularization component in the most accurate neural-networks-like SVD variants (see section 4.3.2 and section 4.2.5).
The assumption of independence of the posteriors of U and V in VB is reasonable, but the choice of minimizing the KL-divergence is disputable. As discussed in section 4.2.5, an inconsistency between the model used and VB with the KL divergence is that, with the output variable r_ij assumed Gaussian, we assume Gaussian priors for u_ik, v_jk (the variables multiplied in the model), and the approximated posterior distributions of u_ik, v_jk are also Gaussians. This way the output variable r_ij is approximated with a sum of “spike” variables – normal product distributions.

On fully observed matrices, VB matrix factorization has an analytic solution, assuming independence of the features (columns of U and columns of V) in the posterior distribution [Nak11a, Nak11b].
Alternating MAP point estimation was also tried for probabilistic models similar to the above [Tip99, Row98, Bis99a, Bel07c, Bel07d, Sal07a], with accuracy inferior to both approximate Bayesian SVD and neural-networks-like SVD with proper regularization. Expectation-Maximization learning, where the user features are treated as in VB and the movie features as in the MAP method, also gave worse accuracy than the best obtainable [Rob10, McM96, Can02].
Table 16 lists the results of Bayesian SVD obtained by various authors. I implemented several variants of VB SVD, learning one or multiple features at a time, with different choices of priors, but they did not improve the ensemble in the presence of JT’s Bayesian PCA. Some implemented variants of neural-networks-like SVD were heavily inspired by the Bayesian learning, such as the SVDMT2 method described in section 4.3.6 (see also chapter 5).
I described the methods realizing the basic probabilistic matrix factorization model.
They can be further improved by using time effects (section 4.3.6), better priors on user
preferences (section 4.3.5), and using more refined global effects than the simple biases
(section 4.3.1).
Figure 30: Example movies with their learned feature values, part 1. [Each panel plots feature value against feature index (1–30); columns: movie 1, a movie positively correlated with movie 1, a movie negatively correlated with movie 1. First row: Braveheart, Gladiator (cor = 0.92), Coffee and Cigarettes (cor = −0.29).]
Figure 31: Example movies with their learned feature values, part 2. [Each panel plots feature value against feature index (1–30); columns as in figure 30. First row: Breathless, Easy Rider (cor = 0.67), Daredevil (cor = −0.47).]
Table 17: Minimal and maximal values of features, among the 4000 most popular movies.

          Feature 1                                     Feature 4
Value     Rank   Movie                       Value     Rank   Movie
 0.6310    373   Daredevil                    0.5630    130   The Godfather
 0.5550      6   Pretty Woman                 0.5050    381   Basic Instinct
 0.5450    568   Alien vs. Predator           0.4820    739   Mad Max
 0.4920     44   Titanic                      0.4590     60   Braveheart
    ...    ...   ...                             ...    ...   ...
-0.7210   1731   Coffee and Cigarettes       -0.2250    373   Daredevil
-0.7450    913   Pi: Faith in Chaos          -0.2360   1947   Firefly
-0.7630    255   Taxi Driver                 -0.2640   1731   Coffee and Cigarettes
-0.8180   2587   Breathless                  -0.2750    491   Charlies Angels
The next table, table 18, shows my interpretation of the first six automatically learned features in the regularized SVD, which explain most of the variability of ratings.
We could also try to name the positive and negative part of a feature with single words.
My attempt to name the first six features: Idealization vs. Realism, Safety vs. Surprise,
Distrust vs. Fairy Tale, Testosterone vs. Feminism, Innocence vs. Heroism, Journey vs.
Growing Up.
The above interpretation of features was created based on observation of sorted lists
of movie titles. I performed an additional experiment that confirmed to some extent the
proposed interpretation – table 19 lists the keywords from IMDb database that are the
best predictors of the first six SVD features, according to a chosen criterion. Each feature
was predicted by ridge regression with a regularization constant λ = 10, with predictors
chosen by greedy feature selection among 21 genres and 579 keywords from the IMDb
that appear at least 50 times among 2000 movies with the largest support in the Netflix
database. The scoring criterion for features in the greedy feature selection was the sum of the variance explained and a fraction of the correlation between the IMDb feature and the predicted SVD feature:

$$\sum_i (\hat{y}_i - \bar{y})^2 \Big/ \sum_i (y_i - \bar{y})^2 + 0.05\, |\mathrm{Cor}(X_k, Y)|$$

Each of the six features from the regularized SVD was predicted by 30 IMDb features (genres and keywords) chosen by the greedy feature selection.
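A minimal sketch of this greedy selection, using scikit-learn's Ridge; X (a binary matrix of IMDb genres/keywords) and y (one SVD feature) are hypothetical array names, and the loop is written for clarity rather than speed:

import numpy as np
from sklearn.linear_model import Ridge

def greedy_ridge_selection(X, y, n_select=30, lam=10.0, cor_weight=0.05):
    # greedily pick columns of X predicting y, scoring each candidate by
    # variance explained plus a fraction of |Cor(X_k, y)|, as in the criterion above
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best_score, best_k = -np.inf, None
        for k in remaining:
            cols = selected + [k]
            y_hat = Ridge(alpha=lam).fit(X[:, cols], y).predict(X[:, cols])
            # variance explained on the training data
            explained = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
            score = explained + cor_weight * abs(np.corrcoef(X[:, k], y)[0, 1])
            if score > best_score:
                best_score, best_k = score, k
        selected.append(best_k)
        remaining.remove(best_k)
    return selected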
As we see in table 19, a relatively large amount of variance was explained by the binary IMDb tags, although perhaps more tags are needed to describe features 4, 5, 6. The listed values of variance explained were estimated on the training data, and are inflated due to overfitting.
When “folksonomy” tagging is used (tags edited by community, as in IMDb), the
quantity and quality of tags increase with movie popularity. For the 2000 most popular
movies from the Netflix dataset, as we see from the amount of variance explained, the tags
predict the SVD features well. Tags can be used, for example, to produce good priors for
item features in SVD-type algorithms, which can be useful in cold start situations for items,
often encountered in recommender systems (see the use of IMDb tags to predict features
of movies without any ratings in the Netflix database, described in section 4.7). But for movies with more ratings, such as the movies in the Netflix Prize dataset, experiments gave the counter-intuitive result that augmenting the movies with metadata has no effect on accuracy – a small number of ratings is more informative than any amount of metadata [Pil09b, Lee08].
One can try to interpret the automatically learned features from the viewpoints of psychology, anthropology, sociology, culture, and ethics. Each movie can tell something about the groups of people who rate it: it can imply a capability of feeling emotions, and it can tell about personal value systems, moral codes, worldviews, and ideological beliefs. The most significant features mark out traits of many users, indicated by many movies. We can surmise that
the features depend on gender (features 2+ and 4- likely indicate female, and 4+ male),
young age (6+), character traits like sensitivity, neuroticism (feature 1-), or life attitudes
like conformism, acceptance, orderliness (feature 1+). We can trace the meaning of features
to capability of feeling emotions, like fear or anxiety (feature 5+), or hormone levels heavily
influencing emotions, such as high testosterone level (features 4+, 5-) or dopamine (2-,
6+). We could also search for interpretations of the automatically learned information by
looking at directions other than the unit vectors in the space of the most meaningful SVD
features.
The above interpretations of features, although supported by the observed data, are of
course only guesswork, and would require confirmation by conducting appropriate surveys
by a trained psychologist.
We can spot in the above examination some connection to movie story types. In the
Hollywood Stories dataset, released in 2011, in addition to the standard genres, movies
were annotated with the following 22 story types: comedy, love, monster force, quest,
rivalry, discovery, pursuit, revenge, transformation, maturation, rescue, escape, the riddle,
journey and return, underdog, sacrifice, temptation, fish out of water, metamorphosis,
tragedy, wretched excess, rags to riches.
Prediction in the opposite direction is also possible – predicting the IMDb tags using the SVD features as predictors, for example with logistic regression. Such a prediction can
be useful for identifying missing tags, recommending tags in a folksonomy tagging process,
or discovering rotations of the feature space – new features most understandable for users.
Table 20 shows 50 tags most accurately predicted by the 32 SVD features. Accuracy was
measured by the deviance ratio in the logistic regression. The tags tested were a subset
of all IMDb tags with support at least 50 among the 2000 most popular movies in the
Netflix database.
Table 21 shows 50 tags (with support at least 50) worst predicted by the SVD features.
Inaccurate prediction indicates that the meaning of those tags is distant from the meaning
of the features important for prediction. Those tags are likely needless for predicting ratings
or related purposes, and are candidates for removal from the set of tags.
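A minimal sketch of how one tag can be scored this way, using scikit-learn's logistic regression on the 32 SVD features; the exact deviance-ratio definition used here (one minus residual deviance over null deviance, so that higher means better predicted) is an assumption about the criterion:

import numpy as np
from sklearn.linear_model import LogisticRegression

def tag_deviance_ratio(svd_features, tag):
    # tag: a 0/1 vector over movies; svd_features: (n_movies, 32)
    model = LogisticRegression(max_iter=1000).fit(svd_features, tag)
    p = np.clip(model.predict_proba(svd_features)[:, 1], 1e-12, 1 - 1e-12)
    deviance = -2.0 * np.sum(tag * np.log(p) + (1 - tag) * np.log(1 - p))
    p0 = tag.mean()  # intercept-only (null) model
    null_deviance = -2.0 * len(tag) * (p0 * np.log(p0) + (1 - p0) * np.log(1 - p0))
    return 1.0 - deviance / null_deviance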
Now we will look closer at what the SVD features tell us about the standard movie genre classification. Going back to explaining SVD features by metadata, table 22 is similar to the previous table 19, but here the predictors were only genres. The 6 predictors for each SVD feature were chosen from the 21 standard genres (genre classification by IMDb), as earlier, using greedy feature selection in ridge regression.
Some of the six most meaningful SVD features are less well explained than the others by the standard genre labels. This suggests that the standard set of genres needs to be augmented by creating new meaningful genres (bearing in mind that binary 0-1 genres have limited power of expression compared to the SVD features, which can take positive or negative values on a continuous scale). For example, only 9% of the variance of the
fourth feature was explained by existing genres – the fourth feature was interpreted earlier
(table 18) as “violence, aggression, men’s world, patriarchy vs. independent women, fem-
inism, matriarchy”, or shorter, as “Testosterone vs. Feminism”. Based on these guessed
meanings, one or two new binary genres could be created to explain better the fourth
feature. Features 1, 5, 6 may also suggest new genres. Features 3 and 2 appear to be suffi-
ciently well described by the standard genres. Other linear combinations (other than the
unit vectors) of the top SVD features can also be examined and interpreted, to discover
important characteristics of movies, unnamed by the standard taxonomies.
One might wonder whether some of the standard genres are unnecessary and can be removed. To find out, a procedure reverse to the previous one was carried out – predicting the binary genres with logistic regression, using all 32 SVD features as predictors
(similar prediction of genres was carried out in [Sel11] on IMDb matched with the Movie-
Lens dataset). The same method was used as in the previous experiment (tables 20, 21),
where SVD features were used to predict IMDb keywords. If we assume that all impor-
tant information about movies is included in the SVD features, the relevant genres should
be predicted well by the SVD features. The genres least accurately predicted by logis-
tic regression, as measured by the ratio of deviance vs. null deviance, were: Adventure,
Biography, Crime, Drama, Music, Mystery. Some of these genres are less well predicted,
because they are very common, and thus have more “fuzzy” meaning (for example, Drama
appears in 1012 movies among the 2000 most popular used in this study). On the basis
of the results, in my judgement the candidates to remove are three of the less common
genres: Biography, Music (leaving Musical), and Mystery.
To complement the examination of the standard genres, table 23 lays out the average values of the first six SVD features for movies (among the 2000 most popular) having a given genre tag. Because columns 4 and 6 contain only a few small negative values, this suggests that the standard genre labelling cannot express the SVD genres 4- (the presumed meaning is “independent women, feminism”) and 6- (“youth, growing up, friendship”).
Table 24 shows the 20 most correlated pairs of genres, and the 20 least correlated
pairs. Visible large positive and negative correlations suggest possible inefficiencies and
redundancy in the standard set of genres (but not necessarily – correlated directions can
have the same expressive power as uncorrelated ones).
One pattern noticed in the feature values is that the average taste changes with pop-
ularity. The global average of normalized vectors of item features is (-0.155 -0.032 -0.172
-0.029 0.239 0.212 ...), and the average for the most popular 200 movies is (0.19 0.013
0.057 0.0077 -0.18 -0.11 ...). Various interpretations of the observed pattern are possible:
one is that the pattern represents the notion of popular taste. One may wonder here whether the relationship is causal – whether a produced movie has a better chance to become popular when it has features 1+ (idealization), 5- (heroism) and 6- (growing up), and whether the features to avoid are 1- (realism), 3- (fairy tale), 5+ (innocence) and 6+ (journey). Another interpretation is that the pattern is a result of different inaccuracies of the regularized SVD algorithm appearing for groups of items with largely different amounts of ratings, and the automatically learned hidden genres may correct the prediction for those inaccuracies.
Table 24: Largest correlations between average genre vectors.
More experiments are needed on varied data to decisively explain this observed pattern.
The above analyses relating SVD features and IMDb genres were heuristic, and should additionally be confirmed by directly modelling the influences of genres and keywords on ratings – by building and training an additional model, instead of utilizing only the SVD results. The notion of “genre” should also be rethought and more precisely defined. The standard genre theory [Cha97] does not give a precise definition of what a genre is, so I assumed here that genres are named subsets of movies, just like other keywords in the IMDb dataset.
Summarizing, I examined the first six features from regularized SVD, using the Netflix data augmented by IMDb keywords and genres. I listed movies with extreme values of the six features, calculated representations of the features with IMDb keywords and genres, proposed an interpretation of the six features as new genres, and also examined the standard set of genres by IMDb, considering where it should be augmented by new genres, and which standard genres seem unneeded. The subspace of features learned by an SVD-type algorithm can be spanned by different vectors than the unit vectors – we could choose different rotations of the SVD features that would be more “clean”, understandable and intuitive for a human. One might wonder what causes humans to decide on a taxonomy such as “action”, “comedy”, “horror” as the dimensions of the space in which to place movies. It is a question entering the domains of psychology and culture, perhaps related to the concept of the meme. Understanding the formation mechanism of such taxonomies could result in dimensionality reduction methods with better explainability, which, as user studies show, is a desirable feature in recommender systems. A similar need to examine, improve or create taxonomies with the help of dimensionality reduction methods appears also in other domains, e.g. for the factor analysis methods used in psychometrics.
they like. This pattern can be used to improve the accuracy of collaborative filtering
methods.
As noted in [Mar04, Mar09], there are two basic ways of encompassing missing data
structure in a model. One is to treat missing data indicators as additional output variables
in a generative model p(Ratings, Indicators|Hidden) (the methods in [Mar04, Mar09] use
that approach). Another way, used extensively in the Netflix Prize task was to treat missing
data indicators as a fixed structure (not generated variables), on which parts of the model
are conditioned p(Ratings|Hidden)p(Hidden|Indicators).
The paper [Sal07a] proposed the Conditional RBM model (described in section 4.4.1), which improves learning of hidden user variables in an RBM by conditioning them on the vector of binary indicators. The indicators influence each hidden user variable through a sum of weights $\sum_{j \in J_i} w_{jk}$ corresponding to the observed ratings (note that we can also use here the information from the test set [Sal07a] – an idea called transductive learning). The weights $w_{jk}$ are shared by all users.
SVD-type algorithms can be similarly enhanced, as hidden user variables in RBM are analogues of hidden user variables in SVD. [Sal07b] proposes a method, Constrained PMF, which enhances PMF (alternating MAP estimation in a probabilistic SVD model) by adding a term $|J_i|^{-1} \sum_{j \in J_i} w_{jk}$ to the user preferences $u_{ik}$. A similar method, SVD++ [Bel07e], combined matrix factorization (regularized SVD) with NSVD1 [Pat07] by adding the term $u^{(0)}_{ik} = |J_i|^{-0.5} \sum_{j \in J_i} w_{jk}$ to $u_{ik}$. The parameters of SVD++ were learned by minimizing the regularized MSE cost function on the training data using gradient descent.
Let's first describe the NSVD1 (“New SVD 1”) and NSVD2 (“New SVD 2”) methods [Pat07], several variants of which were included in my final ensemble. The methods are based on the idea of replacing the user parameters $u_{ik}$ with a function of the binary vector indicating which movies were rated. It turns out that even without looking at the user ratings (except for estimating the user bias) we are able to explain a large part of the user-specific variability of ratings. In NSVD1 a function of $MK$ new parameters $w_{jk}$ is used. The NSVD1 model, also called the asymmetric factor model, is as follows:

$$\hat{r}_{ij} = \mu + c_i + d_j + (|J_i| + 1)^{-0.5} \Big( \sum_{j_2 \in J_i} w_{j_2} \Big)^T v_j$$

Noticing the correlation between $w_{jk}$ and $v_{jk}$, the model NSVD2, using $v_{jk}$ in place of $w_{jk}$, was proposed:

$$\hat{r}_{ij} = \mu + c_i + d_j + \Big( \sum_{j_2 \in J_i} v_{j_2} \Big)^T v_j$$

All parameters of the models NSVD1 and NSVD2 were trained with stochastic gradient descent, but with the sums $\sum_{j_2 \in J_i} w_{j_2}$ calculated for each user only once per iteration. NSVD1 and NSVD2 do not give very accurate predictions, but they combined well with ensembles of other methods (also in prediction ensembles for other datasets, such as music ratings [Jah11a, Jah11b]).
Blending regularized SVD with NSVD1 by linear regression gave better accuracy than using regularized SVD alone. Such a situation is an indication that it is possible to combine both methods into a single method with accuracy at least as good as the blend of the two methods. And indeed, the SVD++ methods [Bel07e, Bel07f, Bel08] were a way to combine SVD with NSVD1 with a good outcome. The SVD++ model has the form:

$$\hat{r}_{ij} = \mu + c_i + d_j + \Big( u_i + |J_i|^{-0.5} \sum_{j_2 \in J_i} w_{j_2} \Big)^T v_j$$
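To make the model concrete, here is a minimal numpy sketch of the SVD++ prediction for one user-item pair; the array names (mu, c_i, d_j, u_i, v_j, W, rated_items standing for $J_i$) are hypothetical, and all parameters are assumed already learned:

import numpy as np

def svdpp_predict(mu, c_i, d_j, u_i, v_j, W, rated_items):
    # implicit part: |J_i|^{-0.5} * sum of w_{j2} over the user's rated movies
    implicit = W[rated_items].sum(axis=0) / np.sqrt(len(rated_items))
    # r_hat_ij = mu + c_i + d_j + (u_i + implicit)^T v_j
    return mu + c_i + d_j + (u_i + implicit) @ v_j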
The initial idea behind NSVD1 and NSVD2 models was to reduce the number of
learned parameters (if not counting the use of implicit information about movies selected),
as fewer parameters can lead to reduced overfitting. In NSVD1 and NSVD2 the learned user preferences depend on the user ratings only through the weights $w_{jk}$, which are common for all users. This leaves less opportunity to overfit, at the cost of a reduced capability to accurately model user preferences (the normalized sum of weights $w_{jk}$ does not even have the expressive power to predict a fixed constant, $|J_i|^{-0.5} \sum_{j \in J_i} w_j \approx 1$, and introduces noise – on a side note, similar limitations imposed by the structure of summations exist in the multitask ridge regression used to calculate user preferences in the regularized SVD).
Moving to the topic of improving regularized SVD: in my experiments I tried out several ways of extending the SVD model by predicting the mean preferences of each user. Most of the tried-out methods did not improve RMSE significantly, for example, using for prediction the K-means clustering of user features. The only methods that improved RMSE were those using missing data in a similar way to SVD++ and NSVD1. Predicting the means was used both in the neural-networks-like models and in the Bayesian SVD models. In the Bayesian models the predicted means $u^{(0)}_{ik} = |J_i|^{-0.5} \sum_{j \in J_i} w_{jk}$ were used in the prior distribution of $u_{ik}$, with the remaining regularization parameters chosen by automatic parameter tuning (it is also possible to estimate the prior variance of $u_{ik}$, different for each user, instead of tuning regularization). In neural-networks-like regularized SVD, the term $u^{(0)}_{ik}$ was added to the learned user preference parameter: $u_{ik} + u^{(0)}_{ik}$. The difference between my implementations and the formulation of SVD++ [Bel07e] was that I repeatedly optimized the parameters $w_{jk}$ to minimize $\sum_{ij \in Tr} (u_{ik} - |J_i|^{-0.5} \sum_{j \in J_i} w_{jk})^2$, instead of adding the $u^{(0)}_{ik}$ term to the global cost function $\sum_{ij \in Tr} (r_{ij} - \hat{r}_{ij})^2$. The weights $w_{jk}$ were optimized by gradient descent, with varied methods used to optimize the remaining parameters $u_{ik}$, $v_{jk}$, $c_i$, $d_j$.
Table 25: Methods modelling user preferences using the missing data structure. Experimental results.
than |Ji |−0.5 , but the methods with learning the five constants did not improve accuracy
significantly.
[Pio09] proposed a modification of SVD++ called “milestone”, which gave an accuracy improvement by splitting the parameters $w_{jk}$ in the sum $\sum_{j \in J_i} w_{jk}$ into a convex combination of two sets of parameters: $\sum_{(j,t) \in J_i} \big( a_{i,f_{it}} w_{jk} + (1 - a_{i,f_{it}}) w'_{jk} \big)$, the parameters being weighted by frequency-dependent per-user variables $a_{i,f_{it}}$ (the frequency $f_{it}$ is the number of movies rated by user i on a given day t).
Table 25 lists the methods that extend the user preferences with the use of binary missing data information. The methods NSVD1 and NSVD2 in the table were described in [Pat07]. “NSVD2 w/o qual” does not use the information from qualifying.txt about movies selected for rating. NSVD1B is NSVD1 with weights $e_i = 1$. NSVD1R is an NSVD1 version that uses residuals of ratings instead of binary information on the movies selected. QNSVD1 is NSVD1 with weights $e_i = 1$, using only the information on movies selected from the probe and qualifying sets (without the information on movies selected from the training set). All the above methods were part of the ensemble in [Pat07], and also contributed to the final ensemble of this work (chapter 5 “Experimental results”).
Many of the most accurate methods in my ensemble were based on SVD++ variants,
enhanced by time effects, and combined with K-NN or postprocessed by KRR. These
SVD++ variants were listed in other sections: 4.3.6 “Time effects”, 4.5.3 “Kernel meth-
ods”, and 4.8.1 “Preprocessing and postprocessing”.
Figure 32 shows three scatter plots demonstrating the relationship between the parameters $v_{jk}$ and $w_{jk}$ in an SVD++ variant, for features 1, 5, 10, and a scatter plot of $\sum_{j \in J_i} |w_{jk}|$ plotted against the square root of movie support. Further examination showed that the source of the pattern observed in the first three plots is that the correlation between $v_{jk}$ and $w_{jk}$ increases with the increasing number of user ratings.
A “flipped”, movie-oriented model was tried out in [Tos08b, Tos09], where weights $w_{ik}$ were summed to model movie features: $|I_j|^{-0.5} \sum_{i \in I_j} w_{ik}$. Flipped versions of SVD++ were very accurate on a task of separating high ratings from unrated items in music ratings data [Jah11b].
An open question left is how to properly use the missing data information to improve
multitask Kernel Ridge Regression (Gaussian Processes) approach (see section 4.5.3).
There were several attempts to use PCA of the binary matrix of indicators, for exam-
ple, BSRM/F [Zhu08].
The described way of using missing data turned out to improve the accuracy of models in the task of predicting ratings from held-out training data, but in situations when we need to predict ratings for non-user-selected items, as is the case when calculating personalized recommendations, it is likely that a more sophisticated way of using missing data will be needed.
where $\hat{y}_{ij}$ is the residual of the remaining part of the model. In [Kor09b] an additional per-day scaling parameter was proposed, shared by several effects:

$$c_{it} = s_{it} \frac{\sum_{j \in J_{it}} \hat{y}_{ij}}{|J_{it}| + \lambda}$$
To model correlations longer than one day, we can split the parameter into longer periods of time, using fewer parameters (a method called binning), or use methods such as
exponential smoothing or convolving the user bias with time-dependent kernels [Tos08b]. In some of the implemented methods I used a one-directional exponential moving average of residuals:

$$c_{it_2} = \frac{\sum_{j,t \in J_i : t \le t_2} \exp(C|t_2 - t|)\, \hat{y}_{ij}}{\sum_{j,t \in J_i : t \le t_2} \exp(C|t_2 - t|) + \lambda}$$

or a bidirectional moving average:

$$c_{it_2} = \frac{\sum_{j,t \in J_i} \exp(C|t_2 - t|)\, \hat{y}_{ij}}{\sum_{j,t \in J_i} \exp(C|t_2 - t|) + \lambda}$$
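A small numpy sketch of the one-directional version, for one user and one target date t2; decaying weights correspond to C < 0 in this parameterization, and the default constants are placeholders:

import numpy as np

def user_day_bias(dates, residuals, t2, C=-0.1, lam=20.0):
    # one-directional exponential moving average of one user's residuals:
    # only ratings with t <= t2 contribute; C < 0 makes older days count less
    mask = dates <= t2
    w = np.exp(C * np.abs(t2 - dates[mask]))
    return np.sum(w * residuals[mask]) / (np.sum(w) + lam)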
We should note that the above idea of capturing trends using factorization does not work so well, and other time effects give much larger accuracy improvements.

There was also a proposal to add a global effect – a bias $b_t$ for each day t, common for all users. As seen on plot 17 in section 3.5, the average of ratings visibly changes over time. It turns out that the global per-day variations are rather well explained by the remaining variables of the model, so the day bias $b_t$ did not give a significant accuracy improvement, and was not part of the most accurate models [Kor09b, Tos09, Pio09].
Methods similar to those used for biases can be used to model the variability of user preferences $u_i(t)^T v_j$ over time. The following formula was proposed in [Bel08, Kor09b] to model user preferences:

$$u_{ik}(t) = u_{ik} + \alpha_{ik}\, \mathrm{dev}_i(t) + u_{ik,t}$$

where $\mathrm{dev}_i(t) = \mathrm{sign}(t - t_i^{(0)})\, |t - t_i^{(0)}|^{0.5}$. Hence two time effects are added to the ordinary user preferences $u_{ik}$: a variable $\alpha_{ik}$ capturing the trend, and a variable $u_{ik,t}$ capturing the one-day correlation of residuals. Adding the term $u_{ik,t}$ resulted, on average, in about 40,000 additional variables per user in the model. Modelling time correlations is easier in the dual method to ridge regression, KRR (see section 4.5.3 “Kernel methods”), where capturing correlations in time reduces to modifying the covariance matrix between residuals of item ratings.
In addition to user time effects, some accuracy improvement resulted from modelling per-movie time effects. Simple binning was most commonly used for that purpose:

$$c_{j,bin} = \frac{\sum_{i \in I_{j,bin}} \hat{y}_{ij}}{|I_{j,bin}| + \lambda}$$

where $I_{j,bin}$ are the users who rated movie j on dates $t \in bin$. In the paper [Kor09b] dates were divided into 30 equal-sized bins, each spanning about 10 weeks. Also, factorization of the movie bias was proposed, expressing the change of the bias in time with a linear combination of jointly learned trends:

$$d_{jt} = x_j^T z_\omega \tag{31}$$

where $\omega = t - t_j^{(0)}$ ($t_j^{(0)}$ is the date of the first rating of movie j).
A group of time effects whose inclusion improved accuracy were the frequency-based effects [Pio09], that is, effects including the frequency component $F_{it} = |J_{it}|$ (the number of ratings given by user i on one day). A plausible explanation [Pio09, Kor09b] of why the frequency effects exist in the Netflix data is that a user rates movies differently when he watched the movie recently and when he watched the movie a long time ago. The time elapsed since viewing the movie is not observed, but is indicated indirectly by the frequency – if a user gives few ratings on one day, it indicates that the user likely watched the rated movies recently. The interactions that turned out to be useful were interactions of frequency with item-side variables [Pio09, Kor09b, Tos09]: movie biases $c_j$, movie features $v_j$, NSVD-part variables $w_j$, and global neighborhood weights $z_j$. In BK4 models [Pio09] and SBRAMF-* [Tos09] the value $f_{it} = F_{it}$ was used directly to create a separate set of new variables for each possible value of the frequency; for example, the movie bias $d_j$ becomes $d_{j,f_{it}}$. [Kor09b] used the rounded logarithm of frequency, $f_{it} = \lfloor \log F_{it} \rfloor$, to index the $d_{j,f_{it}}$ terms, and in [Pio09] frequency was split into intervals of different sizes (8–36 intervals).
I will now list several very accurate variants of time-dependent regularized SVD proposed by the winners of the Netflix Grand Prize [Bel08, Tos08b, Kor09b, Tos09, Pio09] and by other teams [Xia09]. The parameters in the following models were trained by gradient descent minimization of a regularized SSE (sum of squared errors) cost function on the training set.

TimeSVD++ [Bel08, Kor09b] adds time effects to user biases, movie biases, and user preferences (in the notation of this work):

$$\hat{r}_{ui}(t) = \mu + c_i + c_i' \mathrm{dev}_u(t) + c_{it} + d_j + d_{j,Bin(t)} + \Big( u_i + u_i' \mathrm{dev}_u(t) + u_{it} + |J_i|^{-0.5} \sum_{j_2 \in J_i} w_{j_2} \Big)^T v_j$$

where

$$\mathrm{dev}_u(t) = \mathrm{sign}(t - t_i^{(0)})\, |t - t_i^{(0)}|^{0.4}$$

and $t_i^{(0)}$ is the date of the first rating given by user i.
The method PQ2 [Kor09b] adds to TimeSVD++ [Bel08] per-user scaling, and, following [Pio09], adds additional movie-side parameters for different levels of the user daily frequency:

$$\hat{r}_{ui}(t) = \mu + c_i + c_i' \mathrm{dev}_u(t) + c_{it} + (d_j + d_{j,Bin(t)})(s_i + s_{it}) + d_{j,f_{it}} + \Big( u_i + u_i' \mathrm{dev}_u(t) + u_{it} + |J_i|^{-0.5} \sum_{j_2 \in J_i} w_{j_2} \Big)^T (v_j + v_{j,f_{it}})$$

The authors note in [Kor09b] that, in contrast to frequency-aware movie biases, frequency-aware movie features did not improve the model by much.
The method SBRAMF-UTB-UTF-MTF-ATF-MFF-AFF [Tos09], listed in table 26 as SBRAMF-*, is a variant of TimeSVD++ [Bel08, Kor09b] extended by frequency-based movie features and frequency-based asymmetric features. The method includes the extended regularization of parameters from the model SBRISMF [Tak08b] (see section 4.3.2). The model has the form:

$$\hat{r}_{ui}(t) = c_i + c_{it} + d_j + d_{j,Bin(t)} + \Big( u_i + u_{it} + |J_i|^{-0.5} \sum_{j_2 \in J_i} (w_{j_2} + w_{j_2,Bin(t)} + w_{j_2,f_{it}}) \Big)^T (v_j + v_{j,Bin(t)} + v_{j,f_{it}})$$
The method BK4 [Pio09] is TimeSVD++ [Bel08] integrated with the global neighbor-
hood model [Kor08], extended by frequencies, and factorization of time-dependent user
biases (30), movie biases (31), and per-user scaling. Because the BK4 model contains a
global neighborhood component, it is listed in section 4.8.2 “Integrated models”. The BK4
method postprocessed by K-NN [Pio09] (see section 4.8.1 “Preprocessing and postprocess-
ing”) has the best RMSE accuracy among the published methods.
Table 26 lists the method TimeSVD++ + STE [Xia09], which uses factorized time-dependent user biases (30), movie biases (31), and user preferences, a global time bias, and several additional time effects: year and month effects, and effects named loyalty, activity and popularity [Xia09].
The bottom of table 26 lists the time-dependent methods implemented by me. SVDM is a version of regularized SVD with a regularization form inspired by the learning equations of VB SVD, extended by a user day bias $c_{it}$, estimated by shrinked average daily residuals as in equation 28, with $\lambda = 20$, but in a leave-one-out version, skipping the current residual $\hat{y}'_{ij}$ during estimation. Predictions in SVDM have the form: $\hat{y}_{ij} = \mu + c_i + d_j + c'_{it(j)} + u_i^T v_j$. User preferences and movie features are estimated one parameter at a time by a jump to the marginal minimum of the regularized cost function:

$$u_{ik} = \frac{\sum_{j \in J_i} v_{jk}\, \hat{y}^{(k)}_{ij} + \bar{u}_{ik}\, \lambda_1 / \tau_k^2}{\sum_{j \in J_i} v_{jk}^2 + \lambda_1 / \tau_k^2 + \lambda_2 |J_i|} \qquad v_{jk} = \frac{\sum_{i \in I_j} u_{ik}\, \hat{y}^{(k)}_{ij}}{\sum_{i \in I_j} u_{ik}^2 + \lambda_3 + \lambda_4 |I_j|}$$

where the mean $\bar{u}_{ik}$ of the prior user preference is estimated by NSVD/SVD++ (section 4.3.5 “Improved user preferences”), with the NSVD term $C(|J_i|)\, |J_i|^{-0.5} \sum_{j \in J_i} w_{jk}$ rescaled in five groups of users with different levels of support, by the following constants found by automatic parameter tuning: $C(|J_i|) \in \{0.7, 1.66, 1.88, 1.45, 0.99\}$, and with analogous five distinct learning rates. The parameter $\tau_k^2$ approximates the variance $\mathrm{Var}\, u_{\cdot k}$ of the prior distribution of the given feature k. The current residuals $\hat{y}^{(k)}_{ij}$ are ratings with all global effects and features subtracted, except the current feature k: $\hat{y}^{(k)}_{ij} = r_{ij} - \hat{y}_{ij} + u_{ik} v_{jk}$. The number of features K was set to 30. The regularization parameters were chosen automatically by the Praxis procedure: $\lambda_1 = 0.77$, $\lambda_2 = 0.022$, $\lambda_3 = 3.14$, $\lambda_4 = 0.005$.
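A minimal numpy sketch of one sweep over a single feature k, following the update formulas above; it is vectorized over users and movies rather than updating one parameter at a time, and the flat arrays (users, items, resid_k, u_prior) are hypothetical names:

import numpy as np

def svdm_update_feature(u_k, v_k, resid_k, users, items, Ji_size, Ij_size,
                        u_prior, tau2_k, lam1=0.77, lam2=0.022, lam3=3.14, lam4=0.005):
    # jump to the marginal minimum for the user preferences of feature k,
    # with prior mean u_prior and constant-plus-linear regularization
    num_u = np.bincount(users, weights=v_k[items] * resid_k, minlength=len(u_k)) \
            + u_prior * lam1 / tau2_k
    den_u = np.bincount(users, weights=v_k[items] ** 2, minlength=len(u_k)) \
            + lam1 / tau2_k + lam2 * Ji_size
    u_k = num_u / den_u
    # the analogous jump for the movie features of feature k
    num_v = np.bincount(items, weights=u_k[users] * resid_k, minlength=len(v_k))
    den_v = np.bincount(items, weights=u_k[users] ** 2, minlength=len(v_k)) \
            + lam3 + lam4 * Ij_size
    return u_k, num_v / den_v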
SVDMT is SVDM extended by time-dependent user and movie biases – the biases $c_i$ and $d_j$ were split into 32 bins with a length of 70 days. SVDMT postprocessed by KNN (details in sections 4.8.1 and 4.5.1), with the training of SVDMT and KNN repeated twice on residuals of each other, gave the predictor SVDMT2KNN, the most accurate method developed in this work, with RMSE15 = 0.8819.

Table 26 compares the RMSE of the different methods. The time-aware methods that additionally use neighborhood models were moved to table 39 in section 4.8.2 “Integrated models”.
Summarizing, using time effects similar to the listed ones was one of the keys to passing the barrier of 10% improvement in RMSE over the reference algorithm, and to winning the Netflix Prize competition [Bel08, Tos08b, Kor09b, Tos09, Pio09].
It turned out that the time-dependent user bias was the effect whose inclusion improves RMSE the most, but in a real recommender system, at the moment of calculating a recommendation list, predictions for all movies have the same user bias. We can conclude
Table 26: Models with time effects.

Method                        Grouping  Learning      K    RMSE15   RMSEquiz
Bayesian PCA [Tom07]          single    MCMC          60            0.8805
TimeSVD++ [Bel08]             multi     gradient      200           0.8806
TimeSVD++ [Kor09a]            multi     gradient      200           0.8799
PQ2 [Kor09b]                  multi     gradient      200           0.8777
Integrated Model [Tos09]      multi     gradient      150           0.8806
SBRAMF-* [Tos09]              multi     gradient      150           0.8788
mfw31-60-10-120-m [Pio09]     multi     gradient      150           0.8883
TimeSVD++ + STE [Xia09]       multi     gradient      100           0.9027
BPTF [Xio09]                  multi     gradient      100           0.9044
SVDM                          single    MAP VB-like   30   0.8919   (∼ 0.8950)
SVDMT                         single    MAP VB-like   30   0.8909   (∼ 0.8940)
that the time effects on the user bias are less important than indicated by the RMSE criterion.
$$\min_{U,V} ||Y_S - (U V^T)_S||_2^2 + \lambda ||U||_2^2 + \lambda ||V||_2^2 \;=\; \min_X ||Y_S - X_S||_2^2 + \lambda ||X||_\Sigma \tag{32}$$
If the matrix Y is fully observed (S = 1N 1TM ), then the optimization of X in the cost
function with the trace norm regularization is a convex problem, in contrast to optimization
of U and V with the Frobenius norm regularization of U and V (but for the optimization
of U and V all local minima are global [Sre04]). The solution X̂ is a global minimum of
(32) if and only if [Cai08]:

$$Y - \hat{X} \in \lambda\, \partial ||\hat{X}||_\Sigma \tag{33}$$

where $\partial ||X||_\Sigma$ is the set of subgradients of the trace norm; for $X = U \Sigma V^T$ (the SVD of X):

$$\partial ||X||_\Sigma = \{ U V^T + W : U^T W = 0,\; W V = 0,\; ||W||_2 \le 1 \} \tag{34}$$
One can verify by substitution that (33) and (34) are satisfied by a shrinked result of a dense SVD of the matrix Y:

$$\hat{Y} = \sum_{k=1}^{K} \max(0, \gamma_k - \lambda)\, u_k v_k^T \tag{35}$$

where $u_k$ is the k-th column of U, $v_k$ is the k-th column of V, and $\gamma_k$ is the k-th singular value in the SVD of the matrix $Y = U \Sigma V^T$.
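A minimal numpy sketch of the shrinkage (35) and of the EM-like imputation loop of Singular Value Thresholding discussed next; the initial fill and the iteration count are assumptions:

import numpy as np

def shrinked_svd(Y, lam):
    # soft-threshold the singular values of a dense matrix Y, as in (35)
    U, gamma, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(0.0, gamma - lam)) @ Vt

def svt_impute(Y_obs, mask, lam, n_iter=10):
    # EM-like loop: alternate the shrinked SVD with imputation of missing entries
    X = np.where(mask, Y_obs, Y_obs[mask].mean())  # start by filling with the global mean
    for _ in range(n_iter):
        X_hat = shrinked_svd(X, lam)
        X = np.where(mask, Y_obs, X_hat)           # keep observed values, impute the rest
    return X_hat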
The above solution with shrinked singular values of a standard, linear algebra SVD works for a fully observed matrix Y. If relatively few data values are missing (and the data is missing uniformly), an EM-like data imputation algorithm can be used. The algorithm Singular Value Thresholding [Cai08, Liu09] repeatedly iterates between calculating the current approximation X of the matrix $Y_S$ of observed values, and filling the missing values with the result of the shrinked SVD (35). If, in turn, a large part of the matrix is missing, data imputation leads to inaccuracy, but if we ignore the missing data, U and V can be learned, for example, with gradient descent [Ren05, DeC06].
I tested a modification of the SVT [Cai08] algorithm on a subset of the Netflix Prize dataset with less than 50% of the data missing. The modification was inspired by the form of the solution of dense Variational Bayesian SVD [Nak09, Nak10a, Nak10b, Nak11a, Nak11b] (recall that sparse VB SVD variants are among the most accurate methods of matrix factorization for the Netflix Prize task).
In [Nak09], for the probabilistic model of matrix factorization $y_{ij} \sim N(u_i^T v_j, \sigma^2)$, with priors $u_{ik} \sim N(0, \tau_u^2)$, $v_{jk} \sim N(0, \tau_v^2)$, the analytic solution of the Variational Bayesian approximation is calculated for the case when all data is observed:

$$\hat{Y} = \sum_{k=1}^{K} \hat{\gamma}_k\, u_k v_k^T \qquad \hat{\gamma}_k = \max\Big( 0,\; \gamma_k - \frac{\max(M, N)\, \sigma^2}{\gamma_k} - \Delta_k \Big) \tag{36}$$
where U, V, $\gamma_k$ come from the standard linear algebra SVD of the training data. [Nak09] also gives bounds on $\Delta$ when $N \ne M$, and the exact solution when $N = M$. The form of the analytic solution (36) of VB suggests [Nak09] that in the assumed probabilistic SVD model, with a fully observed matrix, VB inference is equivalent to combining trace norm regularization, positive-part James-Stein (PJS) shrinkage, and Frobenius norm regularization.
The modification of SVT in my experiment, compared to [Cai08], was changing the shrinkage of singular values to $C_1 \gamma_k - C_2 / \gamma_k - C_3$, with the three parameters $C_1$, $C_2$, $C_3$ chosen adaptively in each iteration, using linear regression on the test set (one iteration fills the missing data with the result of the modified SVT from the previous iterations, and then calculates the new shrinked SVD on the augmented dense matrix). The form of shrinking $C_1 \gamma_k - C_2 / \gamma_k - C_3$ was chosen after trying several plausible forms. The modified SVT was run on a small subset of the Netflix Prize data which is 50% dense (56% dense together with the test set) – the selection of that dataset is described in section 3.3. Table 27 shows the result of the modified SVT. In chapter 5 it is compared with other algorithms trained on the same subset of the Netflix data.
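A sketch of one such iteration; since the prediction is linear in $(C_1, C_2, C_3)$, the fit can be done with least squares on the held-out entries. The positive-part cutoff and the variable names are assumptions of this sketch:

import numpy as np

def modified_svt_step(X_filled, Y_holdout, holdout_mask):
    U, gamma, Vt = np.linalg.svd(X_filled, full_matrices=False)
    gamma = np.maximum(gamma, 1e-12)  # guard the 1/gamma term
    # predictions are linear in (C1, C2, C3):
    # C1 * U diag(g) V^T  -  C2 * U diag(1/g) V^T  -  C3 * U V^T
    B1 = (U * gamma) @ Vt
    B2 = (U * (1.0 / gamma)) @ Vt
    B3 = U @ Vt
    A = np.column_stack([B1[holdout_mask], -B2[holdout_mask], -B3[holdout_mask]])
    C1, C2, C3 = np.linalg.lstsq(A, Y_holdout[holdout_mask], rcond=None)[0]
    gamma_shrunk = np.maximum(0.0, C1 * gamma - C2 / gamma - C3)
    return (U * gamma_shrunk) @ Vt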
After 10 iterations of the method, the learned weights (rounded) were $C_1 = 1.023$, $C_2 = 2560$, $C_3 = 16.6$, which turned out to be close to regularization by combining the trace norm and the PJS estimator, which corresponds [Nak09] to VB SVD. The resulting weights are an additional argument (in addition to all the experiments of the many teams working on the Netflix data) that the considered model of probabilistic factorization is close to the unknown optimal model.

Table 27: Models with matrix norm regularization. Experimental results.
A question remains whether to limit the approximation to a function of the singular vectors from the linear algebra SVD – whether it is worth going beyond the matrix norms that are functions of the singular values of the approximated matrix. One attempt was to use matrix completion with max-norm regularization [Cho11].
The modified SVT described above was tested on a subset of the Netflix Prize dataset that is 50% sparse. For the entire Netflix dataset, which is about 99.8% sparse, algorithms like regularized SVD, which do not impute missing values, are more appropriate. Also, a constant regularization $\lambda$ is inferior to a regularization amount increasing linearly with the number of observations (the number of user ratings to regularize U, and the number of movie ratings to regularize V). The rescaled regularization has the following form:

$$||D_U X D_V||_\Sigma = \min_{X = U V^T} \frac{1}{2} \left( ||D_U U||_F^2 + ||D_V V||_F^2 \right)$$

where $D_U$ is a diagonal matrix containing $\sqrt{\lambda_{U1} + \lambda_{U2} |J_i|}$ on the diagonal, and $D_V$ is a diagonal matrix containing $\sqrt{\lambda_{V1} + \lambda_{V2} |I_j|}$ on the diagonal ($|J_i|$ is the number of ratings given by user i, and $|I_j|$ the number of ratings given to movie j). Using a linear regularization can be justified not only experimentally, or by the form of approximate Bayesian SVD – as shown in [Sal10], the linear amount of regularization follows from considering the matrix norm of a matrix composed of disjoint submatrices of different sizes.
In summary, the cost-function-based formulations of the collaborative filtering that
use the matrix regularization by a spectral norm, ultimately come down to SVD-type
algorithms, and this is an additional justification for studying the algorithms in the form
of regularized SVD directly.
An advantage of the SVT algorithms is that they can be easily realized using the standard dense SVD implementations in libraries, avoiding the convergence and tuning issues that appear in gradient-descent-based algorithms. Algorithms similar to SVT may be useful when obtaining the best possible accuracy is not a priority, on datasets with a small amount of missing data. The drawbacks of SVT are the necessity of imputing missing values and the non-optimal, constant regularization used. A natural direction of further development of the matrix norm regularization algorithms is proposing an algorithm with regularization rescaled by constant-plus-linear terms – the form of regularization that worked best on the Netflix Prize task and similar collaborative filtering tasks.
product of hidden variables, but transformed by a function chosen suitably to the data
(for example, many variants of matrix factorizations were listed in [Sin08, Sin09, Del08]).
The idea of generalized matrix factorization is to change the model Y = U V T (or a version
with biases Y = C + D + U V T ) to yij = g(uTi vj ) (resp. yij = g(ci + dj + uTi vj )), where g
is a non-linear transformation function (link function). Generalized matrix factorizations were useful in the Netflix Prize task to bound the output of matrix factorization (the predicted rating) to the range 1–5. In this work, the only bounding of the output used was clipping predictions to the range 1 to 5 during training and when making predictions (except for two methods, SVDB1 and SVDB2, which used MF with a logistic transformation predicting the probability that a movie was rated by the user). Some accuracy improvement can be obtained by using a smooth transformation of the outputs; for example, in [Pio09] a shifted sigmoid transformation was used:
$$\sigma(x) = \begin{cases} 1 + \dfrac{2(\sigma_0 - 1)}{1 + \exp\big( -2(x - \sigma_0) / (\sigma_0 - 1) \big)} & \text{if } x < \sigma_0 \\[1ex] 2\sigma_0 - 5 + \dfrac{2(5 - \sigma_0)}{1 + \exp\big( -2(x - \sigma_0) / (5 - \sigma_0) \big)} & \text{if } x \ge \sigma_0 \end{cases}$$
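A small numpy sketch of this transformation, following the formula as reconstructed above (the sign of the second branch's fraction was chosen so that σ is continuous at σ0 and approaches 1 and 5 in the limits; the pivot s0 is a free parameter):

import numpy as np

def shifted_sigmoid(x, s0):
    # smooth squashing of predictions into (1, 5), continuous at the pivot s0
    x = np.asarray(x, dtype=float)
    low = 1.0 + 2.0 * (s0 - 1.0) / (1.0 + np.exp(-2.0 * (x - s0) / (s0 - 1.0)))
    high = 2.0 * s0 - 5.0 + 2.0 * (5.0 - s0) / (1.0 + np.exp(-2.0 * (x - s0) / (5.0 - s0)))
    return np.where(x < s0, low, high)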
The OrdRec model [Kor11] improves accuracy of SVD++ by treating the ratings as
being on an ordinal scale, with user-specific thresholds. Other ordinal MF models were
proposed in [Ste09, Paq10].
Another variation of matrix factorization is non-negative matrix factorization (NNMF) [Lee00, Zha06], which constrains the variables to be non-negative. NNMF contributed to the ensembles of several teams in the Netflix Prize [Wu07, Bel07c], and was the most accurate factorization in the ensemble in [Bel07c]. In [Pio09] non-negative matrix factorizations
were sometimes used with an inverted rating scale (1 switched with 5). In my experiments I tried imposing non-negativity on one of the variables in the regularized SVD, by modifying the model to $\sum_{k=1}^{K} \exp(u_{ik})\, v_{jk}$ or $\sum_{k=1}^{K} u_{ik} \exp(v_{jk})$. Those methods were present in the ensemble for some time, but were ultimately removed when they stopped improving accuracy.
Smooth convex polyhedron prediction (SMAX) [Tak09b] uses multiple sets of user
preferences, and combines the resulting factorizations using a selected smooth maximum
function.
In BMFSI with Dirichlet Process Mixtures [Por10a, Por10b], a variant of matrix fac-
torization, user preferences and movie features were augmented with side information
(similarly as in [Tak07a]), and several different priors for feature vectors were learned.
The prior for each vector was selected in the model by clustering users and movies with
Dirichlet Process Mixtures. The clustering gave a small accuracy improvement. A larger
improvement was obtained by using the following additional features as a side information:
the normalized date of a rating, ratings for the five nearest neighbors (the most similar
movies) and the previous two ratings given by the user.
Another accurate variant of MF was Mixed Membership Matrix Factorization (M3F)
[Mac10], in which user and item topics affect the rating through a contextual bias added to
the basic matrix factorization model. The user and item topics are realized as a mechanism
of mixed membership, that is using a discrete variable with learned user-specific and item-
specific Dirichlet priors, similarly as in the LDA algorithm [Ble03], and, for dyadic data,
in Bi-LDA [Por08, Por10b]. Two types of contextual bias were proposed in [Mac10]: Topic
Indexed Bias (TIB) and Topic Indexed Factor (TIF). Table 28 lists RMSE of the TIB
version. TIB augments the basic probabilistic matrix factorization model with additional
user biases, one for each user topic, and item biases, one for each item topic.
Nonparametric Bayesian Matrix Completion [Zho10b] is a matrix factorization model modified by drawing, for each column of the matrix, a separate set of binary selectors from a Bernoulli distribution, specifying which singular values are zeroed.
Matrix factorization models can be generalized to modelling simultaneously multi-
ple relationships [Zha09], for example, a user-item relationship (collaborative filtering)
together with a user-user relationship (such as a list of trusted users).
Bayesian Factorization Machines [Fre11] are able to model different kinds of patterns
(time effects, implicit information, K-NN, interactions between groups of features, etc.)
using binary or real-valued features. Features and combinations of two or more features are
weighted by factorized parameters, creating a tensor factorization model, with parameters
learned by unblocked Gibbs sampling. The variant BFM(u,i,t) [Fre11] listed in table 28
used the user id, the item id, and the date of rating as features. Feature-Based Matrix
Factorization [Che11b] is a restricted version of BFM.
Table 28 summarizes the RMSEs of the above-mentioned alternative variants of matrix factorization.
Summarizing the subject of matrix factorizations, the idea of multiplying hidden variables, $u_i^T v_j$, was chosen arbitrarily, but it turned out to be the foundation of the most accurate models in a large number of experiments carried out by many teams independently working on the problem. This gives reason to believe that the matrix factorization models are close to the optimal model, that is, to the unknown model about which we can say that it generated the collected data. Instead of using the bilinear term $u_i^T v_j$ we
can try modelling interactions between the hidden variables nonlinearly. This opportunity
is discussed in section 4.4. In experiments of many teams the most effective nonlinear
methods were variants of RBM [Sal07a], described in section 4.4.1. Although the accuracy
of individual RBM methods was worse than the accuracy of matrix factorization variants,
RBM improved ensembles of matrix factorizations, and we can conclude, that RBM mod-
els learn some aspects of data that are not captured by matrix factorizations (but the
improvement can be partially explained by the capability of modelling probability of each
output 1 − 5, and not necessarily by the structure of hidden variables).
hard to find another way than exploring possible models by trial and error, guided by the
resulting accuracy, individual and in combination with the ensemble.
The choice of the output variable for the data was discussed in section 4.2.2. The most
common choice was modelling the output as a Gaussian variable, clipped or transformed by
a sigmoidal function. Another choice is to model the output with a binomial distribution.
Yet another choice, used e.g. in RBM, is to use a multinomial output, and model the
probability of each rating 1-5 separately. The methods listed in this section use mainly
the multinomial output, but Gaussian visible units were used also in RBM [Tos08b].
A natural and well performing framework for collaborative filtering is multitask learn-
ing, with per-user tasks of predicting the user ratings for all items. In such multitask
models we have a set of variables expressing individual user preferences, separate for each
user, and a set of variables for each item, shared by all users. One can also consider the
analogous, “flipped” multitask learning setting with per-item tasks. Various types of regu-
larized SVD can be understood as multitask learning methods. For example, the multitask
learning method Bayesian PCA [Bis99a], a probabilistic model for collaborative filtering
with Gaussian variables as user preferences, trained by expectation-maximization, is close
to PMF [Sal08], probabilistic sparse matrix factorization with alternating MAP learning
of parameters.
Turning to the non-linear methods, Restricted Boltzmann Machines (RBM) is a multi-
task model with binary hidden units (per-task variables which learn the user preferences),
multinomial visible output units (there were also versions with Gaussian hidden units
[Sal07a] or with Gaussian visible units [Tos08b, Pio09]), and a set of per-item vectors of
weights, shared by all per-user tasks. Learning the item weights is not straightforward and presents computational difficulty, because there are exponentially many configurations of the hidden variables. In practice, the contrastive divergence (CD) method gives a good approximation of the maximum likelihood estimation of the weights.
A different realization of multitask learning is, instead of conditioning the generated rating on configurations of multiple hidden variables (user preferences), to condition the rating on only one hidden variable (user preference), which has the role of a selector. I will briefly describe two models of this kind: PLSA [Hof99b, Hof04] and URP [Mar04].
In PLSA [Hof99b, Hof04] (PLSI [Hof99c] is a similar model), also called the dyadic aspect model [Mar04], the set of ratings for all items of a given user is drawn from among a number of fixed patterns. The basic version of PLSA, used to model the co-occurrence of words w in documents d, had a simple form: $p(d, w) = p(w|d)\, p(d) = p(d) \sum_z p(w|z)\, p(z|d)$, or equivalently $p(d, w) = \sum_z p(z)\, p(d|z)\, p(w|z)$. In [Hof99a] learning w and z by Expectation-Maximization was proposed, in a Tempered EM version (TEM). A PLSA version adapted for rating prediction [Hof03], also called the triadic aspect model [Mar04], has the form $p(r|u, i) = \sum_z p(r|i, z)\, p(z|u)$, with a multinomial distribution p(z|u) of the hidden selector variable z, and with a Gaussian [Hof03] or multinomial [Mar04] output rating distribution p(r|i, z), for a given item i and hidden variable z. PLSA does not give satisfying accuracy on the Netflix Prize task. In my experiments, a variant of PLSA with multinomial output, learned with TEM, with annealing of the learning rate and a MAP step instead of ML (penalized EM – regularized estimates), gave RMSE15 = 0.9449 and did not improve the ensemble.
In the User Rating Profile (URP) model [Mar04], similarly as in PLSA, ratings are
generated through choosing among a fixed number of rating patterns (distributions of all
ratings), but in URP the rating pattern, called there the user attitude, is not drawn once
to generate all ratings of a user, but is drawn for each item separately. The generative
process in URP is as follows: for each user, sample the vector of parameters θ from a
Dirichlet prior (the prior is shared among all users), then for each item i draw zi from the
multinomial distribution p(zi |θ) and generate the rating from a multinomial distribution
p(r|zi , i) (estimated jointly for all users for every item i). A similar model to URP, but with
a uniform prior instead of the jointly learned Dirichlet prior over user attitudes, was called
the vector aspect model [Mar04]. Variational Bayesian inference was used in [Mar04] for
learning the URP model. URP was one of the most accurate methods among the compared
in [Mar04] on the EachMovie and GroupLens datasets. I have not experimented with URP
on the Netflix dataset in this work.
The Bi-LDA model [Por08] combines a user-wise LDA [Ble03, Mar04] model with a movie-wise LDA. Bi-LDA assigns to each user a multinomial distribution on user-side
clusters, and assigns to each movie a multinomial distribution on movie-side clusters. To
generate each rating the user cluster and the movie cluster are drawn from their respective
multinomial distributions, and the rating is drawn from a multinomial distribution corre-
sponding to the pair of clusters. Bi-LDA had RMSE= 0.933 on the Netflix Prize dataset
[Por08]. A related, but better performing model is Biased LITR [Har11].
In addition to probabilistic models, simplified neural-networks-like approaches are pos-
sible, which define the predicted rating as a chosen function of unknown parameters,
and the parameters are learned by minimizing a regularized cost function. I implemented
several neural-network models, which did not improve ensemble accuracy – I skip their
description.
In the following section I describe in more detail the RBM model [Sal07a], which im-
proved accuracy of many ensembles [Bel07c, Bel08, Tos08b, Kor09b, Tos09, Pio09], and
improved also the ensemble in this work, in a directed version. Especially good accuracy
was obtained by postprocessing RBM methods by variants of K-NN [Bel07c]. The remain-
ing nonlinear methods listed above are less accurate and are not known to visibly improve
the ensembles for the Netflix task, but of course for the Netflix data we cannot exclude the
existence of some model with nonlinearly combined hidden variables, which, with the right
learning method, could have superior accuracy to matrix factorizations. The mentioned
well performing combination of RBM with K-NN is a suggestion where to look for such a
model.
and

$$p(v_j = 1|h, W) \propto \exp\Big( -\sum_{k=1}^{K} w_{jk} h_k \Big)$$
This allows sampling from the joint distribution p(v, h|W) with an MCMC method. Now, how to learn the parameters W? Let's assume that our observed data D is a set of vectors $v^{(i)}$ for i = 1, ..., N, and that we want to learn the parameters W as maximum a-posteriori (MAP) point estimates, that is, by maximizing with respect to W the function:

$$p(W|D) \propto p(W)\, p(v^{(1)}, ..., v^{(N)}|W) = p(W) \prod_{i=1}^{N} \sum_{h \in \{0,1\}^K} p(v^{(i)}, h|W)$$
It was shown that, under some assumptions, approximate estimation of the output of a similar RBM model is NP-hard [Lon10], and a similar intractability result was shown for approximate sampling.
In practice, for fixed real-world data, learning the parameters W by maximizing the approximate log-likelihood function with gradient ascent works well (it is also possible to add regularization by assuming a non-flat prior p(W)):

$$\frac{\partial \log p(v^{(i)}|W)}{\partial w_{jk}} = \frac{\partial \log \sum_h \exp(-v^{(i)T} W h)}{\partial w_{jk}} - \frac{\partial \log \sum_{v,h} \exp(-v^T W h)}{\partial w_{jk}} = v_j^{(i)}\, p(h_k = 1|v^{(i)}) - p(v_j = 1, h_k = 1)$$
Here a difficulty appears with calculating the second term $p(v_j = 1, h_k = 1)$ in the gradient ascent. One could use Gibbs sampling, but it is computationally expensive. In [Hin02] the method of contrastive divergence (CD) was proposed, which is to use a very small number of sampling iterations inside one iteration of gradient ascent. Sampling is initiated with the current observation $v^{(i)}$, and then even using only one iteration of Gibbs sampling appears to work well in practice.
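A minimal sketch of one CD-1 update for a binary RBM; the minus signs inside the sigmoids follow this section's exp(−v^T W h) convention, the gradient direction follows the formula stated above, and the learning rate is an assumption:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.01):
    # positive phase: hidden probabilities for the data vector v0
    ph0 = sigmoid(-v0 @ W)
    h0 = (np.random.rand(W.shape[1]) < ph0).astype(float)
    # one Gibbs step: reconstruct the visibles, then the hidden probabilities again
    v1 = (np.random.rand(W.shape[0]) < sigmoid(-W @ h0)).astype(float)
    ph1 = sigmoid(-v1 @ W)
    # CD-1 approximation of the log-likelihood gradient above
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    return W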
Instead of using contrastive divergence, it is possible to use Gibbs sampling p(v, h|W )
common for all gradient ascent updates, for all subsequent observation vectors, ignoring
that the weights W change during the sampling (also the pattern of missing data changes).
I have not tested this method on the Netflix Prize data.
Enhanced versions of the above RBM model were used for the Netflix Prize task. In
[Sal07a] the use of RBM for collaborative filtering was proposed, as a multitask model,
with each task being learning all ratings of one user, and with weights W shared among
all tasks (a symmetric setting with one task per movie is also possible). Each user has a
separate set hi of hidden variables, and the weights wkjl , connecting the hidden variables
hik with the outputs vjl , are shared by all users. The outputs vjl represent the observed
rating for a movie j, and are generated from a multinomial distribution, with exactly one
binary variable vjl (l ∈ 1..5) set to 1. Comparing RBM with SVD, the hidden variables hi
correspond to user preferences, and the weights W correspond to movie features, except
that there are 5 times more movie parameters in RBM than movie parameters (features) in SVD with the same K. The increased number of parameters allows modelling patterns in ratings – we observed in section 4.2.2 that there are individual per-user patterns in the data that cannot be generated by single-parameter or two-parameter output modelling.
The RBM model in [Sal07a] was enhanced by adding biases for visible units, adding
biases for hidden units, and also enhanced by using dimensionality reduction to regular-
ize the parameters of W (called “Factored RBM”), and making use of the missing data
patterns to improve modelling of hidden variables – a method called Conditional RBM. I will now outline the concept of Conditional RBM [Sal07a].
It turns out, that a significant improvement of RMSE on the hold-out set is obtained by
using the information contained in the structure of missing data. Conditional RBM [Sal07a]
conditions the hidden variables hik on a binary vector si indicating which variables are
observed by user i. The influence of each observed item on hidden variables is modelled
133
using an additional set of weights bjk , incorporated into the sum p(hik |v, b, W ) = σ((vi ◦
s0i )T Wk + bT si ), where σ(x) = 1/(1 + e−x ). A similar enhancement was applied to SVD-
type models (see section 4.3.5 “Improved user preferences”). The article [Kor09b] proposed
adding conditional visible units to RBM, and enhancements by time effects, including
frequency effects.
In rating prediction the observed vectors of user ratings are sparse, and a probabilistic
way of dealing with missing data is desirable. A simplified approach would be assuming
that data is missing uniformly at random and integrating out the missing data. An ap-
proximation proposed in [Sal07a] is simply ignoring the weights leading to the missing
items while sampling the hidden variables (this method works well in practice, but, as
noted in [Hin10], the result is not precisely correct). As outlined in section 3.4, the prob-
lem of missing data is more complex. [Mar08] proposed a model called cRBM/E-v, which
to some extent corrects predictions of the Conditional RBM model by using additional
biases, active on missing values.
To speed up learning with contrastive divergence I developed an RBM version with directed weights. The idea is that weights learned by contrastive divergence, even with only one step of Gibbs sampling, give hidden variables good enough to make accurate predictions, provided that a separate set of weights is used to predict the outputs from the hidden variables. In the directed RBM two separate sets of weights are learned: $W^{up}$, from visible nodes to hidden nodes, and $W^{down}$, from hidden nodes to visible nodes. The conditional probabilities are the following:

$$p(h_k = 1|v, W^{up}) \propto \exp\Big(-\sum_{j=1}^{M} w_{jk}^{up} v_j\Big)$$

$$p(v_j = 1|h, W^{down}) \propto \exp\Big(-\sum_{k=1}^{K} w_{jk}^{down} h_k\Big)$$
In the proposed directed RBM, let $v^{(0)} = v^{(i)}$ (with the missing values in $v^{(i)}$ set to zeros); $h^{(0)}$ is sampled from $p(h|v^{(0)}, W^{up})$, $v^{(1)}$ is sampled from $p(v|h^{(0)}, W^{down})$, and $h^{(1)}$ is sampled from $p(h|v^{(1)}, W^{up})$. Then the updates of $W^{up}$ with the CD method have the standard CD-1 form, a data term minus a one-step reconstruction term:

$$\Delta w_{jk}^{up} \propto v_j^{(0)}\, p(h_k = 1|v^{(0)}) - v_j^{(1)}\, p(h_k = 1|v^{(1)})$$
The $W^{down}$ weights are learned to reconstruct the observed data with the highest probability, with $h^{(0)}$ treated as fixed at the moment of learning. Maximizing the likelihood of the observed data gives a delta rule similar to wake-sleep [Hin95]:

$$\Delta w_{jk}^{down} \propto \big(v_j^{(0)} - p(v_j = 1|h^{(0)})\big)\, h_k^{(0)}$$
The directed RBM methods improved RMSE by about 0.0002-0.0003 in comparison with an analogous undirected RBM learned by contrastive divergence with T = 1. Three versions of the directed RBM were part of the final ensemble, listed also at the bottom of table 29. DRBM was a conditional version. DRBM2 and DRBM3 were unconditional versions with different settings of the learning rate and weight decay parameters. Compared with the RBM from [Sal07a], DRBM was not factored, used weight decay, used a decreasing learning rate, similarly to simulated annealing methods, and contained biases for visible units (five for each movie), but did not contain biases for hidden units. DRBMKNN in table 29 is DRBM postprocessed by my modification of KNNMovieV3 [Tos08b], described in section 4.5.1.
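A sketch of one directed-RBM update is given below; it assumes the standard CD-1 form for $W^{up}$ and the wake-sleep-like delta rule for $W^{down}$ as written above, so it should be read as my reading of the update rules rather than a literal transcription of the DRBM implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    def directed_rbm_update(W_up, W_down, v0, lr=0.01):
        """One update on observation v0 (missing values already set to zero)."""
        ph0 = sigmoid(v0 @ W_up)                               # recognition pass
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(W_down @ h0)                             # generative reconstruction
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W_up)
        # Assumed CD-1 form for the recognition weights W_up.
        W_up += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        # Wake-sleep-like delta rule for W_down, with h0 treated as fixed.
        W_down += lr * np.outer(v0 - pv1, h0)
        return W_up, W_down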
Because RBM methods give as a result a probability distribution of ratings, we can examine the classification performance of the prediction of each rating 1-5. Figure 33 shows three plots of the results of DRBM: receiver operating characteristic (ROC), precision-recall plot, and F-score vs. recall.

[Figure 33: DRBM classification accuracy – three panels (recall vs. false positives rate (1 - specificity); precision vs. recall; F-score vs. recall), with one curve per rating 1-5.]
Good results were obtained by postprocessing RBM with K-NN methods. The best
single method in the ensemble [Bel07c] was RBM postprocessed by a K-NN variant with
jointly derived weights [Bel07c].
Hidden variables were binary and the observed variables were multinomial in [Sal07a,
Mni10], but using other distributions is possible, e.g. Gaussian observed variables or Gaus-
sian hidden variables. CRBM with Gaussian observed variables were used in [Bel07c,
Tos08b].
Table 29 lists RBM implementations by various authors.
The experience of many teams working on the Netflix Prize task was that including RBM methods significantly improves the accuracy of ensembles [Bel07c, Kor08, Tos08b, Kor09b, Tos09, Pio09], which shows that RBM capture some aspects of the data not captured by other methods.
There are many questions left about RBM methods. Is the underlying computational
complexity necessary? Does any similar model to RBM exist that has close accuracy, but
with smaller computational complexity? Do RBM have equivalent forms with matrix regu-
larization, as regularized SVD variants are equivalent to matrix trace norm regularizations?
What is the kernel (dual) version of the RBM model or similar models?
An untested idea is to model jointly the ratings p(ri |hi , W ) and the structure of missing
data p(si |hi , W ), where ri is the vector of user ratings (further split into observed ratings,
and missing ratings), si is the vector of binary indicators of which movies were rated, hi is
a vector of hidden user preferences, and W is the matrix of movie features, shared by all
users. The predicted probability of selecting the item to rate could help in ranking missing
items to create lists of personalized recommendations. Because predictions are made for
unobserved data, it would be helpful to tune the model using a sample of rated random
items, if such data were available.
Among the most accurate methods for the Netflix Prize dataset the following pattern can be noticed: the global effects layer has O(N + M) parameters, the dimensionality reduction layer has O(K(N + M)) parameters, and the neighborhood layer has O(M^2) parameters, where N is the number of users, M is the number of movies, and K is the dimension in dimensionality reduction (all bounds with the remark that modelling the time effects can increase the number of parameters up to several tens of times). Both K-NN and kernel methods are good at capturing (explaining) local similarities, that is, correlations between very similar movies, as opposed to middle-level effects, where the dimensionality reduction used does not allow capturing a large number of different local relationships, even when those relationships are very significant.
Often in various applications it is easy to construct a content-based similarity between items. Such similarity information is often local – for each item we can recognize only a few similar items. In such cases, if we do not have additional data about users' preferences, distance-based methods are likely to be more accurate than methods performing dimensionality reduction.
In the following discussion I will focus mostly on distances (or similarities) between movies. Some authors also considered distances between users, but those methods contributed less to the accuracy of ensembles, and they need more computation time. It would be best to use the distance between observation pairs (user, movie), but such methods would have too large a computational complexity.
The basic item-item K-NN prediction, applied to ratings with global effects removed, has the form:

$$\hat y_{ij} = \frac{\sum_{j_2 \in N(i,j)} s_{j_2 j}\, y_{ij_2}}{\sum_{j_2 \in N(i,j)} s_{j_2 j}} \qquad (37)$$

where $s_{j_2 j}$ is the similarity between movie $j_2$ and movie j, and N(i, j) is the set of K movies with the largest similarity to movie j among the movies rated by user i. Most commonly chosen as the similarity measure were variants of the Pearson correlation between common observations [Sar01] (also called cosine similarity), on the centered data, that is, with the movie mean removed, $y_{ij} = r_{ij} - \mu - d_j$, or generally, with global effects removed:

$$s_{j_2 j} = \frac{\sum_{i \in I_{j_2 j}} y_{ij_2}\, y_{ij}}{\sqrt{\big(\sum_{i \in I_{j_2 j}} y_{ij_2}^2\big)\big(\sum_{i \in I_{j_2 j}} y_{ij}^2\big)}}$$
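As an illustration, a direct NumPy transcription of this correlation on centered residuals (variable names are illustrative):

    import numpy as np

    def pearson_sim(y_a, y_b, mask_a, mask_b):
        """Similarity of two movies from centered residuals y (one entry per user);
        mask_* mark which users rated each movie."""
        common = mask_a & mask_b                # users who rated both movies
        if common.sum() == 0:
            return 0.0
        ya, yb = y_a[common], y_b[common]
        denom = np.sqrt((ya ** 2).sum() * (yb ** 2).sum())
        return float((ya * yb).sum() / denom) if denom > 0 else 0.0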
Selecting nearest neighbors is a ranking problem, and involves both using regularization (selecting the right prior) and correcting for the confidence of the calculated similarities. Another improvement of accuracy was obtained by clipping correlations to non-negative values [Tak07a, Bel07a, Tos08a].
Among the non-model-based techniques (like those described above), the best-working in my experiments was the KNNMovie version [Tos08a, Tos08b], which applies a logistic transformation to regularized Pearson or set correlations, clipped to non-negative values. In the version KNNMovieV3 [Tos08b], which includes the distance in time between observations, the similarities inserted into formula (37) have the form:

$$s_{j_2 j} = \frac{|I_{j_2 j}|}{|I_{j_2}|\,|I_j|}$$
where the weights $s_{j_2 j}$ are calculated by non-negative least squares minimization:

$$\min_{s_{j_2 j} \ge 0} \sum_{i \in I_j} (r_{ij} - \hat r_{ij})^2$$
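This minimization can be done directly with an off-the-shelf non-negative least squares solver; a minimal sketch, assuming SciPy is available and using placeholder data:

    import numpy as np
    from scipy.optimize import nnls

    # X: residuals of candidate neighbor movies for the users in I_j
    # (one column per neighbor); r: ratings of movie j for the same users.
    X = np.random.rand(1000, 20)      # placeholder data, not the Netflix set
    r = np.random.rand(1000)
    weights, residual_norm = nnls(X, r)   # enforces s_{j2 j} >= 0 componentwise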
Table 30: Standalone KNN methods.
Method K (NN) RMSEquiz
Pearson-KNN on global effects [Bel07b, Bel07d] 20 0.9364
KNN, jointly derived weights [Bel07b, Bel07d] 50 0.9174
KNN, jointly derived weights, item-item [Bel07a] 50 0.9075
KNN, jointly derived weights, user-user [Bel07a] 50 0.9180
KNN, jointly derived weights [Bel07b] 50 0.9082
KNN, jointly derived weights, w/o global effects [Bel07c] 25 0.9496
Bin-KNN/CorrSim [Bel07c] 50 0.9215
Corr-KNN [Bel07c] 0.9170
Mse-KNN [Bel07c] 0.9237
Supp-KNN [Bel07c] 0.9335
Pearson-KNN [Tos08b] 30 0.9229
KNNMovieV3 [Tos08b] 55 0.9102
KNNMovieV4, jointly derived weights [Tos08b, Bel07b] 50 0.9112
NB [Tak07a] 16 0.9313
NB-CORR [Tak08b] 15 0.9280
To get the most accurate prediction we have to properly integrate nearest neighbor models with the matrix factorization into one model. Such methods were listed in section 4.8.2 “Integrated models”. One way to do it is augmenting SVD with local item-item similarity information [Bel07a, Bel07c]. Another way is to add a neighborhood modelling component to the SVD formula. Here the best results were obtained by merging the matrix factorization with two linear models (they can be seen as K-NN with shared O(M^2) weights) [Kor08].
Good results were obtained [Pio09, Tos09] by tuning the parameters of a method to optimize the RMSE of the whole ensemble, instead of optimizing individual RMSEs. [Pio09] emphasized that tuning neighborhood models this way gave an especially large improvement of the ensemble accuracy.
Table 31 lists chosen K-NN methods applied to residuals of other methods, K-NN methods that use similarity based on SVD features, and factorized K-NN methods. The K-NN methods on residuals of RBM were moved to table 29.

At the bottom of the table are listed the methods from my ensemble. KNN0 is 1-NN prediction, using as the distance the cosine similarity between SVD features. KNN1 is KNN0 weighted by the exponential of the SVD-based distance. KNN20 is the average residual of 20-NN with the distance as in KNN1. SVDMT2KNN is SVDMT2 (see section 4.3.6) postprocessed with the modified KNNMovieV3 [Tos08b], described above.
On the Netflix Prize task mainly item-item K-NN was used, and occasionally user-user K-NN. One other opportunity was left unexplored – similarity between observations (user-item pairs). I will describe a heuristic approach, which can be classified as a near-neighbour method based on similarity between observations, and which is fast enough to run on the Netflix data (although the experiment was performed only on a small subset of the Netflix data). In regularized SVD the prediction for a given feature k, assuming a constant regularization $\lambda$ and estimating one variable at a time, has the form:

$$u_{ik} = \frac{\sum_{j_2} v_{j_2 k}\, y_{ij_2}^{(k)}}{\sum_{j_2} v_{j_2 k}^2 + \lambda} \qquad v_{jk} = \frac{\sum_{i_2} u_{i_2 k}\, y_{i_2 j}^{(k)}}{\sum_{i_2} u_{i_2 k}^2 + \lambda}$$
Now the heuristic idea is to assume that $u_{ik} v_{jk}$ explains a fixed percentage C of the residual $y_{ij}^{(k)}$: $u_{ik} v_{jk} \approx C y_{ij}^{(k)}$ (the index k will be skipped below, and $y_{ij}$ denotes a residual in the current iteration). Roughly approximating:

$$u_i v_j \approx \frac{\sum_{i_2 j_2} y_{ij_2}\, y_{i_2 j}\, u_{i_2} v_{j_2}}{\sum_{i_2 j_2} (u_{i_2} v_{j_2})^2 + \lambda_2} \approx \frac{C \sum_{i_2 j_2} y_{ij_2}\, y_{i_2 j}\, y_{i_2 j_2}}{C^2 \sum_{i_2 j_2} y_{i_2 j_2}^2 + \lambda_3}$$
Table 31: KNN methods as postprocessing.
After writing down the above formula using matrices, we get a fast algorithm, which turns out to have good accuracy. Each iteration k of the algorithm has the form:

$$Y_k = R - \sum_{k'=1}^{k-1} \beta_{k'} \hat Y_{k'} \qquad \hat Y_k = \frac{Y_k^T Y_k Y_k^T}{A^T (Y_k \circ Y_k) A^T}$$

where the division is entrywise, $\circ$ is the entrywise (Hadamard) product, and A is a 0-1 matrix indicating which ratings are observed in the training set. The coefficients $\beta_{k'}$ are repeatedly calculated using linear regression on the test set.
The algorithm can be understood as calculating a weighted average of the ratings (residuals) that are “close” to the predicted rating $r_{ij}$, where the “close” set of observations is the data matrix restricted to the movies rated by user i and to the users who rated movie j. The algorithm was run for K = 6 iterations. The experiment was conducted on a small subset (see section 3.3) of the Netflix data, where the training set is exactly 50% dense.

Method K RMSEsmall
NNObs 6 0.7482

The result is compared with the RMSEsmall of other methods in chapter 5 (the best result, RMSEsmall = 0.7415, was obtained by KRR).
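A sketch of one iteration in NumPy follows (using one consistent matrix orientation of the formula above, with R and A as users x movies; names are illustrative):

    import numpy as np

    def nnobs_iteration(R, A, Yhat_list, betas, lam=1.0):
        """One iteration of the observation-similarity heuristic."""
        Y = R - sum(b * Yh for b, Yh in zip(betas, Yhat_list))  # current residuals
        Y = Y * A                                               # zero the missing entries
        num = Y @ Y.T @ Y                    # sum_{i2,j2} y_ij2 y_i2j2 y_i2j
        den = A @ (Y * Y).T @ A + lam        # sum of y_i2j2^2 over the "close" set
        return num / den                     # entrywise division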
For the sparse Netflix Prize dataset the fastest method of learning the parameters of this model was stochastic gradient descent, here run separately for each movie, for 16-21 iterations. In [Pat07] I described only the most contributing linear model, but the ensemble in [Pat07] contained several similar linear models with different constants (1, $\log |J_i|$, $|J_i|^{-0.5}$), some using residuals of ratings instead of indicators of observed ratings, some weighting users with the term $|J_i|^{-0.5}$ while learning, and some using the qualifying set for transductive learning. They are listed in table 33 as RRBA, RRBQ, RRBAS, RRBAS2, RRBAL, RRRAS, and were also included in this work's ensemble (chapter 5). The above linear model is close to the Slope One prediction [Lem05], which averages, over all movies rated by the user, the average difference in ratings between users who rated both movies.
The work [Kor08] describes a rating-based version:

$$\hat r_{ij} = d_j + |J_i|^{-0.5} \sum_{j_2 \in J_i} w_{j_2 j} (r_{ij_2} - d_{j_2}) \qquad (39)$$
The above linear models contribute to ensembles, but do not have good individual
accuracy.
As noticed in [Kor08], combining the models (38) and (39) performs well:

$$\hat r_{ij} = d_j + |J_i|^{-0.5} \sum_{j_2 \in J_i} w_{j_2 j} + |J_i'|^{-0.5} \sum_{j_2 \in J_i'} w'_{j_2 j}\, (r_{ij_2} - d_{j_2}) \qquad (40)$$

This method is listed as “Time-aware NB” in table 33, and its modification as PQ3 [Kor09b].
Table 33 lists RMSE’s of linear models given by various authors. At the bottom are
listed my implemenations. Some of the results are blended with a set of six predictors BA-
SIC2 (five regularized empirical user probabilities, and a regularized movie mean learned
on the residuals of user mean; slightly different from the set BASIC used in the previous
ensemble [Pat07]).
The per-item linear models have a similar form to K-NN with jointly learned weights from multiple tasks [Bel07a], except that there is no limit on the number of nearest neighbors; instead there is a normalizing factor, usually of the form $|J_i|^{-0.5}$, and stochastic gradient descent is used for learning instead of calculating the optimal weights in one step, which is feasible only for a small number of near neighbors. Because of the connection with K-NN methods, per-item linear models were called global neighborhood models in [Kor08], and I followed this convention here, placing their description in the section “Distance-based methods”. The work [Kor08] also notices that factorized versions of linear models are closely related to NSVD, and proposes methods with factorized similarity, binary-input-based and rating-based, in item-oriented and user-oriented versions. The factorized methods were described in section 4.3.5 “Improving user preferences”.

The per-item linear models, serving as methods that capture local item-item correlations, complemented well the matrix factorization models [Bel08] (see sections 4.8.1 and 4.8.2).
Table 33: Per-item linear models.
Method Version Weighting Dataset RMSE15 RMSEquiz
NB-LS-BIN [Tak08b] 0.9605
Neighborhood model [Kor08] 0.9002
Regression on sim. [Tos08b] 0.9229
Regr. on sim., w/unknown [Tos08b] RMSEprobe = 0.9278
Regr. on fact. sim., item [Tos08b] RMSEprobe = 0.9313
Regr. on fact. sim., user [Tos08b] RMSEprobe = 0.9371
Time-aware NB [Kor09b] 0.8885
PQ3 [Kor09b] 0.8870
RRBA binary 1 TrPrQu 1.0129 (∼1.0160)
RRBQ binary 1 PrQu 1.0423 (∼1.0450)
RRBAS+BASIC2 binary, weight. |Ji + 1|−0.5 0.9511 (∼0.9540)
RRBAS2+BASIC (LM [Pat07]) binary |Ji + 1|−0.5 0.9506 (∼0.9535)
RRBAS2+BASIC2 binary |Ji + 1|−0.5 0.9539 (∼0.9535)
RRBAL+BASIC2 binary 1/log(|Ji | + 1) 0.9505 (∼0.9560)
RRRAS+BASIC2 ratings, weight. |Ji + 1|−0.5 Tr 0.9361 (∼0.9390)
RR2 [Kor08] binary+ratings |Ji + 1|−0.5 0.9115 (∼0.9145)
$$\hat r_{ij} = \mu + c_i + d_j + u_i^T v_j$$

When the movie features $v_{jk}$ are learned and fixed, the user preferences $u_i$ can be estimated by ridge regression:

$$u_i = (V_i^T V_i + \lambda_i I)^{-1} V_i^T y_i$$

where the predicted outputs $y_i$ are the ratings with global effects removed:

$$y_{ij} = r_{ij} - \mu - c_i - d_j$$

and $V_i$ is a slice of the matrix of movie features V, with rows selected by the vector of indicators of observations $J_i$. An equivalent dual method is prediction by kernel ridge regression (KRR):

$$\hat y_{ij} = v_j^T V_i^T (V_i V_i^T + \lambda_i I)^{-1} y_i$$
Kernels other than the linear one, $k_{j_2 j} = v_{j_2}^T v_j$, can be used in the KRR prediction equation:

$$\hat y_{ij} = K_{j,J_i} (K_{J_i,J_i} + \lambda_i I)^{-1} y_i$$

A kernel K corresponds to ridge regression with some choice of features $\phi(v_j)$, which are not defined directly. Among several tried-out methods, a Gaussian kernel defined on normalized vectors from regularized SVD, $x_j = v_j/\|v_j\|_2$, performed best in experiments [Pat07]:

$$k_{j_2 j} = \exp(-C_0 \|x_{j_2} - x_j\|_2^2) = C_1 \exp(C_2\, x_{j_2}^T x_j)$$
This method is listed in table 34 as SVDKRRG, with the constants C1 = 2.5, C2 = 1, λ = 2, and SVDKRRG2 (named SVD KRR in [Pat07]), with the constants C1 = 2, C2 = 2, λ = 0.5. The number of user observations (used user ratings) was limited to 500 per user (the most frequently rated items were selected). Multitask kernel learning with a Gaussian kernel (also called the RBF kernel) performed well also in other applications [Law03, Law05].
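For illustration, the per-user KRR prediction with this Gaussian kernel can be written in a few lines of NumPy (a sketch with the SVDKRRG constants; obs_idx and y_obs are illustrative names for the user's rated movies and their residuals):

    import numpy as np

    def krr_predict_user(V, obs_idx, y_obs, j, lam=2.0, C1=2.5, C2=1.0):
        """Predict the residual rating of one user for movie j by KRR."""
        X = V / np.linalg.norm(V, axis=1, keepdims=True)   # x_j = v_j / ||v_j||_2
        K = C1 * np.exp(C2 * (X @ X.T))                    # k(j2, j) = C1 exp(C2 x_j2^T x_j)
        K_oo = K[np.ix_(obs_idx, obs_idx)]
        alpha = np.linalg.solve(K_oo + lam * np.eye(len(obs_idx)), y_obs)
        return float(K[j, obs_idx] @ alpha)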
It turned out that it is easy to augment KRR with time information, largely improving both the accuracy of the method and the ensemble accuracy. The best results were obtained with kernels corrected for the dates of ratings, such as the KRR3 kernel given below. On the above residuals were learned the KRR variants listed in table 34 as KRRS2 and KRR3.
KRRS2 was KRRT based on a regularized SVD++ with K = 100 features. The biases were learned as preprocessing, the U and V variables were learned by alternating least squares, all features at a time, and the weights W in the NSVD1 part were learned by gradient descent. The number of ratings per user in KRRS2 was limited to 700.
KRR3 combined four different kernels and two ways of correcting the kernel for the dates of ratings:

$$k_{j_2 j} = \big(C_1 v_{j_2}^T v_j + C_2 v_{j_2}'^T v_j' + C_3 \exp(2(x_{j_2}^T x_j - 1)) + C_4 \exp(2(x_{j_2}'^T x_j' - 1))\big)\big(1 + C_5 [t_{j_2} = t_j][j_2 \ne j]\big) - C_6 \log(1 + |t_{j_2} - t_j|)$$
where $v_j$ come from a variant of SVD++ (RMSE15 = 0.8947; description skipped) with features learned by alternating least squares, and with a special version of the NSVD term. $v_j'$ are the 100 first eigenvectors of the regularized empirical covariance matrix

$$\mathrm{Cov}_{j_2 j} = \frac{\sum_{i \in I_{j_2 j}} y_{ij}\, y_{ij_2}}{|I_{j_2 j}| + 1000}$$

with the negative values set to zero and the diagonal set to zero, and the eigenvectors are multiplied by the square roots of the corresponding eigenvalues. The $y_{ij}$ values, on which the covariance matrix is calculated, are residuals of the biases and the NSVD term. $x_j$ and $x_j'$ are normalized features: $x_j = v_j/\|v_j\|_2$ and $x_j' = v_j'/\|v_j'\|_2$. The manually picked constants were C1 = C2 = C3 = C4 = 0.45, C5 = 1.0, C6 = 0.08, λ = 0.5.
The SVDKCOV method is similar to KRR3, but combined two kernels:

$$k_{j_2 j} = \big(C_1 v_{j_2}'^T v_j' + C_2 \exp(C_3 (x_{j_2}^T x_j - C_4))\big)\big(1 + C_5 [t_{j_2} = t_j][j_2 \ne j]\big) - C_6 \log(1 + |t_{j_2} - t_j|)$$

where $x_j = v_j/\|v_j\|_2$, $v_j$ are 30 features from the method SVDMT, described in section 4.3.3, and $v_j'$ are 70 features coming from the eigendecomposition of the covariance matrix of the residuals of SVDMT. The constants in SVDKCOV were chosen by automatic parameter tuning with the Praxis procedure, giving, after rounding: C1 = 0.9, C2 = 0.03, C3 = 6.8, C4 = 0.45, C5 = 0.75, C6 = 0.01. In the KRR in [Pat07] constant regularization was used, but further experiments showed that linear regularization worked a little better: $\lambda = \lambda_1 + \lambda_2 |J_i|$. For SVDKCOV, automatic parameter tuning selected the following regularization parameters (after rounding): $\lambda_1 = 16$ and $\lambda_2 = 0.032$.
Instead of using features from SVD, the parameters X in the reduced-dimensionality representation of the kernel can be learned in the multitask setting by gradient descent [Law09]. For the Gaussian kernel $k_{j_2 j} = \exp(C x_{j_2}^T x_j)$ used in [Pat07] (to simplify, I drop the constraint $\|x_j\|_2 = 1$), the gradient descent step for user i follows from differentiating the KRR prediction error with respect to the features $x_j$. I have not tested gradient descent learning on the entire Netflix dataset, only on a subset of the data (see the end of this section), and this method gave the best result among the five methods tested on the same data (see chapter 5 “Experimental results”).
The probabilistic model-based version of the above methods is multitask learning of Gaussian Processes; that is, we assume that all known and unknown ratings $r_i$ of a user i are drawn from a multidimensional Gaussian distribution with a covariance matrix K:

$$r_i^{a\text{-}priori} \sim N(\mu + c_i + d + m,\; K)$$

$$r_i \sim N\big(\mu + c_i + d + m + K_{:,J_i} K_{J_i J_i}^{-1} (y_i - m_{J_i}),\; K - K_{:,J_i} K_{J_i J_i}^{-1} K_{J_i,:}\big) \qquad (42)$$

where $y_i = r_{J_i} - \mu - c_i - d_{J_i}$. Note that, independently of the chosen kernel K, this method fits the observed data exactly.

Another possibility is adding noise on the diagonal:

$$r_i^{a\text{-}priori} \sim N(\mu + c_i + d + m,\; K + \sigma^2 I)$$

Adding noise is justified by experiments with re-rating items [Ama09], which show that a user often rates the same item differently; hence we do not observe the data exactly. The posterior distribution is the following:

$$r_i \sim N\big(\mu + c_i + d + m + K_{:,J_i} (K_{J_i J_i} + \sigma^2 I_{J_i})^{-1} (y_i - m_{J_i}),\; K + \sigma^2 I - K_{:,J_i} (K_{J_i J_i} + \sigma^2 I_{J_i})^{-1} K_{J_i,:}\big)$$
The kernel K can be learned in the multitask learning framework [Car97, Bak03] with one task per user. A simplification made here is treating observations as randomly missing, ignoring the specific structure of missing data. Several variants of multitask Gaussian Processes models learned by expectation-maximization have been proposed [Sch05, Yu05b, Yu07a, Yu07b, Yu09a, Yu09b]. It turned out that for the Netflix Prize data, with 480,189 users, the maximum likelihood step in EM methods gives good enough results, even without regularization and without dimensionality reduction. I will describe one variant of an EM-learned GP multitask model, the NPCA method [Yu09a], which had very good accuracy on the Netflix Prize data.
NPCA [Yu09a] learns the hyperparameters – the mean m and the covariance K – with expectation-maximization. In the expectation step, for each user the posterior distribution (42) is calculated, with m and K fixed. In the maximization step, new K and m are calculated as follows:

$$K_{new} = \frac{1}{N} \sum_{i=1}^{N} \big(E y_i\, E y_i^T + \mathrm{Cov}\, y_i\big) = K + \frac{1}{N} \sum_{i=1}^{N} K_{:,J_i} K_{J_i,J_i}^{-1} \Big[ (y_{J_i} - m_{J_i})(y_{J_i} - m_{J_i})^T - K_{J_i,J_i} \Big] K_{J_i,J_i}^{-1} K_{J_i,:} \qquad (43)$$

$$m_{new} = m + \frac{1}{N} \sum_{i=1}^{N} E y_i = m + \frac{1}{N} \sum_{i=1}^{N} K_{:,J_i} K_{J_i,J_i}^{-1} (y_{J_i} - m_{J_i})$$
After moving K outside on both sides of formula (43), to reduce computational complexity, we get an algorithm called Fast NPCA [Yu09a].

We can note that in NPCA the posterior variance at the points of observed data is always equal to zero (the algorithm fits the data exactly), independently of the choice of the kernel K. Because 99% of the data is missing, this has little importance for the estimation of K by maximum likelihood (the algorithm averages over predictions for all data, observed and missing).
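A direct NumPy transcription of one EM step (43) may clarify the computation (a sketch; the `users` data structure is illustrative, and no Fast-NPCA rearrangement is applied):

    import numpy as np

    def npca_em_step(K, m, users):
        """One EM step of NPCA; `users` is a list of (J_i, y_Ji) pairs with the
        observed indices and the residual ratings of each user."""
        M = len(m)
        K_acc = np.zeros((M, M))
        m_acc = np.zeros(M)
        N = len(users)
        for J, y in users:
            Kj_inv = np.linalg.inv(K[np.ix_(J, J)])
            g = K[:, J] @ Kj_inv               # M x |J| "gain" matrix K_{:,J} K_{J,J}^{-1}
            d = y - m[J]
            K_acc += g @ (np.outer(d, d) - K[np.ix_(J, J)]) @ g.T
            m_acc += g @ d
        return K + K_acc / N, m + m_acc / N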
Experiments with nonparametric estimation of single entries of the covariance matrix K, with a method similar to the one used in sections 4.2.2, 4.2.3 and 4.2.5, suggest that the prior distribution of the entries of K (if, to simplify, we assume that the entries are independent) is similar to a Laplace distribution, or a normal product distribution, and not to a Gaussian distribution. This suggests estimating the entries of covariance matrices with L1 regularization, but I have not tested that possibility (also, if we assume that the multiplied ratings used for estimation are Gaussian, we should change the likelihood function accordingly).
Related to NPCA are the algorithms BSRM [Zhu08] and NREM [Yu09b]. BSRM is based on the Stochastic Relational Model (SRM) [Yu06a], a Gaussian Processes model, where the covariance matrix is defined between the pairs (user, item):

$$Z \sim N(0,\; \Sigma \otimes \Omega)$$
$$y_{ij} \sim N(z_{ij},\; \sigma^2)$$

hence $\mathrm{Cov}(z_{i_2 j_2}, z_{ij}) = \Sigma_{i_2 i}\, \Omega_{j_2 j}$, where the matrices $\Sigma$ and $\Omega$ have inverse-Wishart Process priors [Zhu08]: $\Sigma \sim IW_\infty(\delta, \Sigma_0)$, $\Omega \sim IW_\infty(\delta, \Omega_0)$. BSRM [Zhu08] is a version of SRM with dimensionality reduction. A method similar to NPCA and BSRM is NREM [Yu09b] (Nonparametric Random Effects Model). For both BSRM and NREM the following use of the missing data structure was proposed: using as additional item features the feature vectors from PCA of the binary indicator matrix of observed data (see the method BSRM/F in [Zhu08] and NREM-2 in [Yu09b]). The reader is directed to the works [Zhu08, Yu09b] for detailed descriptions of the methods BSRM and NREM.
My implementations of Fast NPCA and NREM, with additional features taken from regularized SVD, confirmed the good individual accuracy, but did not improve the ensemble and were not included in the final ensemble listed in chapter 5. Similarly, earlier I implemented multitask GP [Yu07b, Yu05a] initialized with the SVD-based kernel used in KRR [Pat07]; it was the most accurate method among all methods implemented by me at that time, but it did not improve the ensemble accuracy.
The described kernel methods modelled m and K by maximum likelihood point estimation inside an EM algorithm, assuming identical m and K for all users, except for modelling the time (modelling $m|t_i$ and $K|t_i$ in most KRR variants). A subject for future research is how to properly make use of the structure of missing data, for example by modelling $m|J_i$ and $K|J_i$. An untested method is to use dimensionality reduction similar to NSVD:

$$\min_{W,Z} \sum_{ij \in Tr,\; j_2 \in J_i} \Big( y_{ij}\, y_{ij_2} - |J_i|^{-0.5} \sum_{k=1}^{K} z_{jk} z_{j_2 k} \sum_{j_3 \in J_i} w_{j_3 k} \Big)^2$$
Another unexplored possibility is predicting both the ratings and the indicators of observed data with a hybrid of kernel ridge regression and kernel logistic regression. KRR may be well suited to model the decrease of the expected rating for non-user-selected data, for example by using an appropriately defined distance from the group of items rated by the user.

Table 34 lists results of different kernel methods on the Netflix Prize dataset.
Table 35: Multitask kernel methods. Small dataset.
Method K RMSEsmall
KRRG [Law09] 100 0.7415
NPCA ∞ 0.7520
The choice of g does not influence the resulting predictions by much (it does not change the weighting by similarity much, and the nearest neighbors are the same independently of g).
Multitask Gaussian Processes can also be used with an ordinal output, as it was done
in the hierarchical Bayesian framework in [Yu06b].
So far in this chapter I have described mainly the methods that contributed to ensembles. A large majority of the methods I implemented did not improve the ensemble accuracy (about 90% of methods in my experience – the experience of other teams was similar), and perhaps it is worth telling a bit more about that part of the universe of methods which was explored but turned out not to be useful. Most of these methods were different modifications of regularized SVD and KRR. A few other attempts that did not improve my ensemble were, for example: multilayer neural networks, such as 3-layer and 4-layer autoencoders (in some tries initialized by RBM, as in deep networks [Sal09]); different ways of postprocessing SVD features, such as the random forests method, different kinds of regression, or KRR with kernels other than Gaussian; per-user regression using features learned by different methods; “best movie decomposition”, where instead of learning user preferences one most preferred movie was picked; an NN-like version of RBM; and variants of SVD with clustered users or movies.
Table 37: Metadata – the most frequent 20 features.
Feature Count Movie IDs
1. English 10364 0 4 7 8 9 11 12 14 15 16 17 18 20 21 23 25 27 28 29 ...
2. USA 8853 0 2 4 7 8 9 11 12 14 15 17 18 20 21 25 27 28 29 30 ...
3. Drama 4743 7 15 17 18 19 23 25 27 28 29 35 41 43 46 50 54 55 ...
4. Comedy 3471 7 8 11 19 21 27 29 50 53 64 67 72 77 83 94 110 116 ...
5. independent-film 2826 8 14 15 17 21 23 30 41 51 65 66 74 79 88 106 109 ...
6. character-name-in-title 1886 27 30 31 35 45 50 56 60 70 72 93 94 112 117 160 164 ...
7. Thriller 1881 15 16 23 25 40 54 79 82 88 92 104 107 108 117 121 ...
8. Action 1721 12 15 16 25 47 54 57 65 68 76 77 83 88 90 108 117 ...
9. Romance 1652 11 17 19 21 29 35 49 53 62 94 116 147 155 160 163 ...
10. based-on-novel 1603 12 25 35 44 55 76 94 151 196 209 211 251 256 273 ...
11. murder 1486 11 12 17 23 25 54 55 57 121 124 149 174 185 196 ...
12. UK 1472 16 17 35 56 63 109 112 161 186 204 219 223 228 ...
13. Los Angeles, CA, USA 1323 21 25 29 64 77 79 106 107 108 121 126 129 136 154 ...
14. Crime 1186 16 25 54 55 107 108 122 126 136 146 167 174 185 ...
15. Documentary 1167 0 4 7 9 14 30 31 51 60 70 93 95 105 112 118 164 175 ...
16. female-nudity 1122 8 17 29 51 65 106 109 136 150 167 196 203 204 282 ...
17. Adventure 1093 12 15 19 27 45 47 50 57 65 76 77 83 88 117 240 243 ...
18. Horror 1071 8 15 23 40 66 92 121 128 130 150 171 187 196 209 ...
19. Family 983 0 19 27 34 45 47 50 67 72 77 83 151 154 238 251 254 ...
20. Sci-Fi 905 15 27 40 47 67 76 121 130 188 208 215 216 243 275 ...
... ... ... ...
In comparison with other types of items, movies have relatively much meaningful metadata available, so it can be presumed that the conclusion about a small number of ratings being better than metadata generalizes, and holds in situations of evaluation and recommendation of items other than movies. Metadata can be more useful for accurately predicting the missing data structure (see the KDD'11 task, track 2 [Dro11]), or for adjusting rating prediction for the missing data structure.
The Netflix Prize data contains movies with many ratings – only 2 movies in the Netflix Prize dataset have fewer than 10 ratings. The real situation in recommender systems is usually different, with a “long tail” of items with very few ratings. To overcome the cold-start problem for items, and to be able to efficiently recommend rarely rated items or new items without ratings, content-augmented predictions are useful. Several differing approaches have been proposed. Metadata can be used in a hierarchical probabilistic model, where a shared prior for user weights is learned with an EM algorithm [Zha07a]. In a related approach, metadata is treated as fixed vectors of item features in a matrix factorization [Tak07a] – the probabilistic versions of this concept are Matchbox [Ste09], BMFSI [Por10a], RLFM [Aga09], and a specialized model, fLDA, that models topics when the item features are words [Aga10]. Another possibility is the already mentioned neural-network-type algorithm, movie-oriented NSVD1, adapted to model metadata [Pil09b]. Yet another way is predicting ratings directly, using linear regression with metadata as predictors, with different sets of weights learned for different clusters of users [Kag09]. Feature-Based Matrix Factorization [Che11b] unifies the use of internal and external features, encompassing, for example, biases, SVD++, neighborhood information and time effects. Bayesian Factorization Machines [Fre11] are a similar, more general framework. [Hid12] factorizes item metadata, and uses the factorized representation to extend a matrix factorization.
In this work a simple method of using metadata for content-based augmentation of SVD was used (see sections 4.3.4 and 6.2.3). IMDb features are used to predict the SVD item features using ridge regression, and the predictions of the regression are used as priors for the SVD item features. This method was used to annotate and better understand the automatically learned SVD features (section 4.3.4), and was also used for purely content-based recommendations for movies from after 2005 (for which no ratings are available), added to the recommender system described in section 6.2.3.
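A minimal sketch of this ridge-regression step (illustrative names; lam here is a placeholder constant, not the tuned value):

    import numpy as np

    def feature_priors_from_metadata(F, V, lam=10.0):
        """Predict SVD item features V (items x K) from binary metadata F (items x D)
        by ridge regression; the predictions can serve as priors for items with
        few or no ratings."""
        D = F.shape[1]
        B = np.linalg.solve(F.T @ F + lam * np.eye(D), F.T @ V)  # D x K coefficients
        return F @ B                        # prior mean for each item's feature vector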
A different simple heuristic of using metadata was used in the SVD-based recommender system described in section 6.1: adding artificial users who like a given feature or set of features. The advantage of artificial users is that they can be used to improve an already implemented collaborative filtering algorithm (for example, regularized SVD) without modifying it. The accuracy of this method was not evaluated, but it works satisfactorily in practice, fulfilling its goal of helping in cold-start situations. An analogous technique of creating artificial items can be tried out, with the artificial item ratings (or otherwise expressed artificial preferences) dependent on provided user metadata.
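The artificial-user heuristic itself is a one-line operation on the data matrices; a sketch, assuming a users x movies layout:

    import numpy as np

    def add_artificial_user(R, A, item_has_feature, rating=5.0):
        """Append one artificial user who gives `rating` to every item with the
        given metadata feature; R, A are the users x movies ratings matrix and
        observation mask. The augmented matrices are then fed unchanged to the
        collaborative filtering algorithm."""
        new_r = np.where(item_has_feature, rating, 0.0)
        new_a = item_has_feature.astype(float)
        return np.vstack([R, new_r]), np.vstack([A, new_a])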
In this work, as a preprocessing step, different kinds of global effects [Fun06, Bel07b, Pot08] were most commonly used (see section 4.3.1), and sometimes the NSVD term taken from SVD++ variants. The middle-level layer consisted of dimensionality reduction methods, such as regularized SVD, SVD++, or conditional RBM. On the Netflix data this layer has the largest modelling capability, and is very accurate even without the remaining few-parameter and highly-parameterized layers. As a postprocessing step, neighborhood modelling methods worked well, such as K-NN [Bel07b, Tos08b, Pio09] or per-item linear models, capturing large local item-item correlations. The method KRR/GP [Pat07] with a Gaussian kernel has properties of the last two layers.

A useful observation is that parameter tuning is easy in the methods used in the last, postprocessing layer, because there is no need to re-learn the remaining methods in the stack.
Table 38 lists selected accurate combinations of methods by postprocessing. Some other combinations were listed in table 31 in section 4.5.1, and in table 29 in section 4.4.1.

RBM + KNN was the most accurate stacked method in the ensemble [Bel07c]: RBM [Sal07a] was combined with a variant of K-NN with jointly learned weights [Bel07b]. NNMF + KNN [Bel07c] was non-negative matrix factorization postprocessed by the same KNN method [Bel07b].
Neighborhood-aware MF [Tos09, Tos08b, Tos08a] is a weighted combination of matrix factorization, an item-item K-NN model, and user-user K-NN, with predictions for the qualifying set inserted as additional ratings.
HYBRID1 with NB correction S1 [Tak08c] combines MF with NSVD1, and postprocesses the result with a neighborhood-based correction using cosine similarity between the learned item feature vectors from MF.

The method SBRAMF-* + KNNMovieV3-2 is SBRAMF-UTB-UTF-MTF-ATF-MFF-AFF [Tos09] (briefly described in section 4.3.6 “Time effects”), postprocessed by KNNMovieV3-2 [Tos09] (a K-NN method similar to KNNMovieV3 [Tos08b]; see section 4.5.1 “K-nearest neighbors”).
The method bk4-f200z4-nlpp1-knn3-1 [Pio09] is the integrated model bk4-f200z4 [Pio09]
(briefly described in the next section), with a nonlinear output transformation, and post-
processed by a K-NN variant with jointly derived weights [Bel07b], where the neighbors
were chosen so that the similarity does not exceed a fixed threshold. With RMSEquiz =
0.8713 it was the most accurate method for the Netflix Prize task among the methods
mentioned in this work. For a more detailed description of that method the reader is
directed to the paper [Pio09].
At the bottom of table 38 are listed my implementations. DRBMKNN is DRBM, described in section 4.4.1 “Restricted Boltzmann Machines”, postprocessed by the modified KNNMovieV3 [Tos09], described in section 4.5.1. SVDMT2KNN is SVDMT2, described in section 4.3.6 “Time effects”, postprocessed, like the earlier method, by the modified KNNMovieV3. SVDMT2KNN had the best accuracy among all single methods implemented by me (if we count stacked methods as single methods, and, e.g., methods combined by blending as separate methods).
Stacking has the advantage that the combined methods are implemented once and do not need to be modified. With relatively little effort multiple combinations can be tested (see, for example, the final solutions [Pio09, Tos09, Kor09b]). It is possible to automate to some degree the process of searching for good combinations, for example by checking the N^2 possible stackings of any two methods among the N methods implemented, but I have not tried this option.
Table 38: Stacked methods.
Method Features K (NN) RMSE15 RMSEquiz
NNMF + KNN [Bel07c] 30 180 0.8953
CRBM + KNN [Bel07c] 100 50 0.8888
CRBM + KNN + KNNMovieV3 [Tos08b] 150 55+122 0.8832
Neighborhood-aware MF [Tos09, Tos08b, Tos08a] 100 50 0.8856
HYBRID1 with NB correction S1 [Tak08c] 400 40 0.8845
SBRAMF-* + KNNMovieV3-2 [Tos09] 150 141 0.8758
bk4-f200z4-nlpp1-knn3-1 [Pio09] 200 ≤ 60 0.8713
DRBMKNN 100 50 0.8888 (∼0.8920)
SVDMT2KNN 30 50 0.8819 (∼0.8850)
[Sal07a].
Another combination that significantly improved accuracy was merging regularized SVD with neighborhood methods, like K-NN. Here the best results were obtained with per-item linear models [Kor08] (see section 4.5.2), which can be understood as K-NN with jointly learned weights, without the normalizing sum in the denominator and without the limit on the number of nearest neighbors. SVD methods with integrated neighborhood components are an example of learning at multiple scales, where the additional scale of neighborhood allows modelling many strong item-item relations, which, because of their number and limited scope, cannot be captured by dimensionality reduction methods like SVD and RBM. Other developed methods that modified the factorization model by modelling the item-item neighborhood were: matrix factorization with user factors augmented by item-item similarity [Bel07a], and applying kernel methods like KRR [Pat07] (see section 4.5.3).
The most accurate models discovered for the Netflix data [Pio09, Kor09b, Tos09] integrated multiple methods and many effects that improve accuracy (various global effects, the use of implicit information, modelling of the neighborhood, and temporal effects, including frequency effects). Some of the most complex integrated models were already described in sections 4.3.5 “Improved user preferences” and 4.3.6 “Time effects”. Table 39 compares the RMSE of selected models by different authors.

“SVD++ integrated with NB” [Kor08] integrates SVD++ [Bel07e] (section 4.3.5) with the global neighborhood term [Kor08] (per-item linear models, section 4.5.2). “Time-aware SVD++ integrated with NB” [Bel08] further adds time-changing user biases, movie biases, and user preferences.
Among the published models, the single model that obtained the best accuracy in the Netflix Prize was BK4 [Pio09], that is, the above-mentioned time-aware SVD++ integrated with the global neighborhood term [Bel08], extended by additional time effects: factorization of biases vs. time, time- and frequency-dependent movie features and movie biases, an implicit feedback term, a neighborhood term, and a per-user scaling term. Table 39 lists a variant of BK4 named bk4-f200z4 [Pio09]. The general form of the BK4 model, given in [Pio09], conveys its complexity level; for a detailed description of the parameters the reader is referred to [Pio09].
Much simpler than the above model BK4, while having close accuracy, is the model PQ2 [Kor09b], briefly described in section 4.3.6.

Finally, let us remark that there were accurate combinations of methods for which it is unclear how to create a proper integrated model, for example RBM postprocessed by K-NN [Bel07c].
An additional benefit of maintaining an ensemble of diverse methods was the possibility to evaluate how different methods combine with each other, which was useful in exploring the space of possible models, and gave insights into which direction to further develop the algorithms.
The methods to be blended were learned on the training set with a chosen hold-out test set excluded. Each method produced one or more predictors (sets of predictions) for the test set (ratings unused in training), and the resulting predictors were blended to predict ratings on the test set, with the goal of obtaining the best generalization accuracy on the additional validation set (RMSEquiz or RMSEtest).
The Netflix Prize dataset and various choices of the hold-out test set were discussed in sections 3.3 and 3.4. One way was using the entire probe set as the hold-out set, but a disadvantage of that choice was the necessity to retrain the methods on the entire training set after calculating the regression weights. A compromise between accuracy and convenience was using only a portion of the probe set as hold-out, so that the methods needed to be trained only once. For example, in [Pat07] and in this work 15% of the probe set is used as hold-out, and in [Tak08a] 10% of the probe set was used, chosen so that RMSE10 ≈ RMSEquiz.
In prediction contests such as the Netflix Prize, an even more efficient method of hold-out blending is typically possible. In [Kor09b, Tos09] a method of using the quiz (validation) set for blending was proposed, which allowed including the entire probe set in the one-time training of methods. The method is all the more interesting because no ratings were available for the quiz set, only the values of RMSEquiz reported by the Netflix Prize automatic evaluation system. The idea in [Kor09b, Tos09] was to reformulate the regression task so that the resulting regression coefficients depend only on RMSEquiz and on sufficient statistics not containing the response variable (ratings). Let X be the matrix of predictors and let y be the predicted response vector. To make predictions $\hat y_{new} = x_{new}^T \hat\beta$ we need to calculate $\hat\beta$:

$$\hat\beta = (X^T X + \lambda I)^{-1} X^T y$$
The only terms containing y are:

$$c_i = \sum_j x_{ij} y_j = \frac{1}{2} \Big( \sum_j x_{ij}^2 + \sum_j y_j^2 - \sum_j (x_{ij} - y_j)^2 \Big)$$

We see that $\hat\beta$ can be calculated using the MSE values of individual predictors on the validation set with unknown response values. The remaining statistics needed depend only on the predictors, whose values we have.
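A sketch of the resulting computation (assuming, additionally, that $\sum_j y_j^2$ is obtained from the reported RMSE of an all-zeros submission – one possible way to recover that statistic):

    import numpy as np

    def blend_weights_from_rmse(X, rmse, rmse_zero, lam=0.0):
        """Ridge-regression blending weights when only per-predictor quiz RMSEs
        are known. X: predictor values on the quiz set (n x p); rmse[i] is the
        reported RMSE of submitting predictor i alone; rmse_zero is the RMSE of
        an all-zeros submission, so that sum(y^2) = n * rmse_zero^2."""
        n, p = X.shape
        sum_y2 = n * rmse_zero ** 2
        # c_i = sum_j x_ij y_j, recovered from the identity above.
        c = np.array([0.5 * ((X[:, i] ** 2).sum() + sum_y2 - n * rmse[i] ** 2)
                      for i in range(p)])
        return np.linalg.solve(X.T @ X + lam * np.eye(p), c)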
In my experiments I used blending on 15% of the probe set, with methods trained once on the training set with that 15% of the probe set excluded. I did not use all produced predictors for blending – feature selection was necessary. Not all produced predictors improve the ensemble accuracy, and using a large number of noisy, unnecessary predictors can significantly decrease the accuracy. Considerations about model selection for linear models give criteria such as AIC and BIC (see section 2.4), which suggest modifying the likelihood of the observed data by a penalty growing linearly with each additional predictor. For a predictor to be useful, its contribution to explaining variance (or to increasing the likelihood of the data) must outweigh a certain threshold, to be determined, for example, by cross-validation (with the remark that one fixed threshold for all predictors is a simplification – it should depend, e.g., on the correlation with other predictors).
Feature selection was especially needed after extending the linear regression with multiplications of predictors (called two-way interactions). In this work's ensemble 69 predictors and 81 two-way interactions were used (in comparison to the previous ensemble [Pat07] with 56 predictors and 63 two-way interactions). The two-way interactions for the ensemble were chosen using two criteria: statistical significance in the linear regression, and the drop of the sum of squared errors. The selected interactions, with their contributions, are listed in chapter 5. In a similar way, in [Pio09] 5 multiplicative interactions selected by forward feature selection were added to their ensemble.
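Constructing the interaction features themselves is straightforward; a minimal sketch:

    import numpy as np

    def add_interactions(X, pairs):
        """Extend the predictor matrix X (n x p) with two-way multiplicative
        interactions for the given list of column-index pairs."""
        inter = np.column_stack([X[:, a] * X[:, b] for a, b in pairs])
        return np.hstack([X, inter])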
Feature-weighted linear stacking (FWLS) [Sil09] can be understood as ridge regression with two-way interactions. FWLS added to the ridge regression multiplications of all predictors with a chosen set of 25 meta-features, such as the logarithm of the movie support, the average number of movie ratings for the movies rated by the user, or the standard deviation of the SVD prediction. As a result, the accuracy improved from RMSEtest = 0.863377 to RMSEtest = 0.861405 – a similar order of improvement to that observed in the experiments in this work (chapter 5) after adding 81 two-way interactions to the linear regression.

For the predictors in my ensemble, linear regression without regularization gave accuracy very close to using a small amount of regularization (perhaps an exceptional situation, due to large collinearity between some predictors). Using regularization has the advantage that it makes it possible to use many more interactions (even with no feature selection, as in FWLS [Sil09]) without causing large overfitting.
The number of possible two-way interactions is large, and we could try to model them using fewer parameters than one parameter per interaction. In a similar fashion to Factorization Machines [Fre11], we could add to the linear model terms modelling interactions between groups of variables:

$$\Big( \sum_{i=1}^{P_1} \alpha_i x_{k_i} \Big) \Big( \sum_{j=1}^{P_2} \gamma_j x_{l_j} \Big)$$

I have not tried this option.
It turned out that well-chosen nonlinear blending methods improve accuracy. Very good accuracy was obtained by neural networks with a small number (10-30) of hidden variables. In [Tos08b] neural network blending improved RMSE by 0.0020 in comparison with linear blending. A method similar to the NN blending of [Tos08b] was used in [Pio09]. In [Tos09] NN blending was extended to two hidden layers.

[Tos09] further proposed Ensemble NN blending, which gave the best accuracy among all blending methods described in [Tos09]. Ensemble NN blending relied on training NNs on randomly drawn subsets of k = 4 predictors, with two additional support-based input predictors, $\log(|J_i| + 1)$ and $\log(|I_j| + 1)$. In the best setting, which gave RMSE = 0.8583 in [Tos09], NNs with 5 hidden neurons were used, the process of drawing the four predictors was repeated N = 1060 times, and the resulting 1060 new predictors were combined by binned ridge regression, on 4 support-based bins.
In [Pio09] a multi-stage classifier was used, which relied on learning several blending methods on the residuals of each other. The authors concluded that NN blending [Tos08b, Pio09] was superior to their multi-stage classifier.
Accurate, but slightly worse than NN blending, were variants of Gradient Boosted Decision Trees (GBDT) [Kor09b, Tos09] (also called Gradient Boosting Machine) and Bagged GBDT (BGBDT) [Tos09, Jah10]. GBDT can be used for prediction tasks with any loss function, as a black box, with minimal parameter tuning, and it often leads to good accuracy. This makes it a convenient first-choice technique in prediction contests.
Other blending methods used on the Netflix task were: polynomial regression, which adds higher-order terms of the predictors [Tos09]; kernel ridge regression with a Gaussian kernel [Tos09]; and binned linear regression using support-, date-, frequency-, and clustering-based bins [Bel08, Tos09]. In some cases features extracted from SVD, RBM and K-NN [Tos09, Kor09b] were used as additional predictors.

Most individual methods have at least several constant parameters to tune. Instead of tuning those parameters to optimize the RMSE of an individual method, the RMSE of the blend can be optimized. I have not used this approach in this work, but tuning to optimize the blend was used extensively in [Pio09, Tos09] with automatic parameter tuning. Optimizing blend accuracy was particularly effective for K-NN methods.
Summarizing the subject of blending predictors: linear regression worked well (regularization usually improved accuracy), and among the nonlinear blending methods the best results were given by neural network blending, ensemble neural network blending, and decision trees.

In practical applications there is rarely a need to use more than one method, because the cost of complicating the algorithm is rarely outweighed by a relatively small accuracy gain. Nevertheless, the framework of blending, in particular simple regression blending, is convenient for evaluating developed models and exploring the space of possible models in search of the most accurate ones.
“By three methods we may learn wisdom:
first, by reflection, which is noblest;
second, by imitation, which is easiest;
and third by experience, which is the bitterest.”
Confucius
5 Experimental Results
This chapter summarizes the results of my experiments on the Netflix Prize dataset. I list the developed ensemble of methods in the order of feature selection, discuss the results, and compare them with the results published by others.
All implemented methods were trained once on training.txt, with 15% of the probe set excluded. The 15% of the probe set serves as a hold-out set, on which RMSE is calculated (called RMSE15). The tables list the RMSE15 of the current part of the ensemble and, where applicable, the RMSE15 of individual methods. Other RMSE values appearing in this chapter are RMSEquiz, that is, the results of validation reported by Netflix's automatic evaluation system during the contest, and RMSEtest, that is, the final validation results, made available after the Netflix Prize contest ended. Besides the above, one table gathers RMSEsmall values from experiments on a small subset of the Netflix data, which is 50% dense (as opposed to the 1.1% dense whole dataset). The different types of RMSE were described in more detail (on which hold-out set each RMSE is computed, and what the corresponding training set is) in section 3.4 “Evaluation”.
The developed ensemble contained 69 predictors and 81 two-way interactions between
them, all blended by linear regression. Only a small part of all implemented variants of
methods (less than 10%) gave an improvement of accuracy large enough to add them to
the ensemble. Of those included, some methods produced several predictors, but most
produced only one predictor.
First I add to the ensemble the global effects that were used as preprocessing for more complex methods, which will be added to the ensemble at a later stage (including the global effects first will allow us to better assess the relative importance of the remaining methods in the ensemble). Table 40 lists 13 predictors coming from global effects, along with the cumulative RMSE15 after adding each subsequent predictor to the ensemble.
Table 41 lists all the remaining methods in the ensemble, added in the order of greedy forward feature selection (GFFS). As the next feature, the predictor is chosen that results in the best RMSE15 when combined with all predictors included earlier. The table lists the individual accuracy of each method as “RMSE15 individual” (where applicable), and the accuracy of the current fraction of the ensemble as “RMSE15 combined”. A simple implementation of GFFS was used, with computational complexity O(N P^3 + P^5), where N = 211,365 is the number of observations (15% of the probe set), and P is the number of considered features. The complexity of GFFS can be improved, if necessary, to O(N P^2 + P^4) [Pil09a].
Table 41: Predictors in the final ensemble, sorted by greedy feature selection
No  Method  Global Effects  RMSE15 individual  RMSE15 combined  Description  Described in
59 KME1AVG 0.86662 K-means users on res. of K-means 4.6
60 SVDF1TO6 7-8 0.9325 0.86662 Features 1-6 from RSVD2 4.3.2
61 SVD1N19 7-8 0.9073 0.86662 RSVD K=85, nonlinear postproc. 4.3.2
62 KNN0 0.86662 Residual on GE of 1-NN 4.5.1, [Pat07]
63 RSVD 1-6 0.9155 0.86661 RSVD K=100 4.3.2, [Fun06]
64 SVD4 0.86661 RSVD, modified
65 DRBM2 0.9175 0.86661 Directed RBM K=100 4.4.1, [Sal07a]
66 SVD1N1 11-13 0.9110 0.86661 RSVD2 K=30, changed no. of iter. 4.3.2
67 SVD5 0.86661 RSVD K=110, modified
68 NSVD1 0.9328 0.86661 NSVD1 K=40 4.3.5, [Pat07]
69 NSVD2 0.9624 0.86661 NSVD2 K=5 with qualifying data 4.3.5, [Pat07]
All experiments were performed on PCs with 1-2 GB RAM and a 1.8-2 GHz processor. Most methods needed several hours to train (per method), but a few methods needed up to several days to tune the parameters.

The full blend contained the 69 listed predictors and additionally 81 two-way multiplicative interactions, which improve RMSE15 from 0.86661 to 0.86492. Good practice in regression is to include all predictors that form interactions also as individual predictors, but ultimately three variables appeared only in two-way interactions and were not described in table 41. They are: TIME – the date of the rating; RMOV – the index of the movie in the original, unsorted training set; and NSVDSVD – NSVD1B with K = 10, postprocessed by RSVD2 with K = 70.
While developing subsequent methods, even having the best individual accuracy among all implemented methods did not guarantee that adding the method to the ensemble would improve the accuracy. Also, several methods that were accurate and different from the others, such as NPCA [Yu09a] and NREM [Yu09b], did not improve the ensemble accuracy.

Among the above-listed 69 + 3 predictors forming the final ensemble, the following 55 predictors remained from the previous ensemble [Pat07]: RSVD, RSVD2, SVDKRRG, SVDKRRG2, SVDRR, SRR4, SRR5, NSVD1, NSVD2, QNSVD1, NSVD1B, NSVD1R, NSVD2A, NSVDSVD, RRBA, RRBQ, RRBAS, RRBAS2, RRBAL, RRRAS, SVDF1TO6, SVDF7TO12, SVD1N1, SVD1N9A, SVD1N19, SVDB1, SVDB2A, SVD4, SVD5, KNN0, KNN1, KNN20, KNN4A, KME0AVG, KME1AVG, log(CMOV), CMEMB, log(CMEMB), MOV, MEMB, RMOV, EMOV, and all global effects 1-13. The following 17 predictors were added later: JT2, JT3, KRRS2, KRRT2, KRRT, SVDKCOV, KRR3, KRR3A, SVDM, SVDMT, SVDMT2KNN, DRBM, DRBM2, DRBM3, DRBMKNN, RR2, TIME.
Table 42 lists the used two-way interactions in the order of GFFS using the RMSE15 criterion (the same method used earlier for ordering the predictors in table 41).

The two-way interactions were chosen from among 69 · (68/2 + 4) = 2622 possible ones. This number seems large, but feature selection of new interactions was performed each time after developing a new method, and the set of considered interactions was largely narrowed by the first criterion of feature selection used – statistical significance in the linear regression.
While developing subsequent methods and adding and removing predictors and interactions to and from the ensemble, different criteria of feature selection were tried (among others, dropterm(), addterm() and stepAIC() in R, and the BIC and AIC criteria). Ultimately, based on experiments with an additional split of the hold-out set, I decided to use two criteria for the feature selection: the first was the mentioned statistical significance in the linear regression, and the second was the size of the drop of the sum of squared errors (SSE), calculated on the 15% of the probe set that served as the training set for the linear regression. If SSE dropped by less than a fixed threshold, the predictor (or interaction) was rejected.

Observing table 42, with interactions sorted by greedy feature selection, removing the last 20-30 interactions would probably not increase the validation error (and perhaps neither would removing 5-10 predictors from table 41, although they create some of the later added interactions).
Table 42: Two-way multiplicative interactions in the final ensemble, sorted by greedy
feature selection
Linear regression without regularization was used for blending because, for my ensemble, adding regularization gave no improvement on the validation set. It is possible that if I had used regularization (ridge regression), feature selection would not have been necessary. As a side note, ridge regression with two-way interactions is similar to kernel ridge regression (KRR) with a polynomial kernel, which in turn is close to KRR with a Gaussian kernel.
The ensemble listed in tables 40, 41 and 42, containing 69 predictors and 81 two-way interactions combined by linear blending, had RMSE15 = 0.86492. The ensemble was additionally blended in the proportion 90%:10% with the JT1 model [Tom07], which had RMSEquiz = 0.8805 (described in section 4.3.3 as Bayesian PCA). The final validation error after blending was RMSEquiz = 0.8694 and RMSEtest = 0.8703 (8.63% better than the reference algorithm, Netflix Cinematch, with RMSEtest = 0.9514), taking 34th place in the Netflix Prize contest among the 5169 competing teams that submitted at least one solution. Without the methods JT1, JT2 and JT3 (with the remark that they inspired multiple methods in the ensemble), the validation error of the ensemble was RMSEquiz = 0.8707 and RMSEtest = 0.8717 (8.48% better than Cinematch).
Comparing to the best obtained results: the Netflix Prize competition was won by a team of 7 people, Bellkor's Pragmatic Chaos, with an ensemble of more than 450 predictors [Kor09b, Tos09, Pio09, Kor08, Tos08b, Bel07c], which gave RMSEquiz = 0.8554 and RMSEtest = 0.856704 (10.06% better than Cinematch). Second place was taken by The Ensemble, a team of over 30 people, with RMSEquiz = 0.8553 and RMSEtest = 0.856714 (10.06% better than Cinematch).
Looking at the ensemble of the winners, the main reasons for its better accuracy, compared to our ensemble, seem to be: many more variants of methods implemented, the discovery of the frequency effect, more features in the dimensionality reduction methods, extensive automatic parameter tuning (including optimizing the accuracy of the blend), and better methods of blending. During the contest, our highest place was 2nd, at the moment of forming the two-person team in September 2007: the method JT1 was merged 50%:50% with the ensemble [Pat07] enhanced by KRRT. The advantages of our solution over others at that time were that the methods JT1 and KRRT used the date variable, and that the Bayesian model JT1 was exceptionally accurate, with RMSEquiz = 0.8805 (7.45% better than Cinematch).
Summarizing the obtained ordering of features: the first few methods from table 41 explain most of the explainable variance of ratings. The part of the ensemble containing global effects 1-13 and the first six methods 14-19 has RMSE15 = 0.86958, only 0.5% larger than the RMSE15 = 0.86492 of all methods (the difference in validation error is even smaller, because every feature added to the ensemble causes small overfitting, which decreases accuracy). As discussed earlier, the analysis of real-life data can be understood as follows: the data was generated by a certain unknown model, and the optimal method realizes Bayesian inference in this unknown model. We can suppose that the methods found and the effects identified in the data, through the effort of many people during almost three years of the Netflix contest, provide accuracy close to the best possible. This leads to the conclusion that the optimal method is likely close to some combination (not known precisely) of matrix factorization, RBM, kernel methods, and K-NN, all including time effects and using the structure of missing data (Conditional RBM, NSVD, SVD++).
To complement, table 43 lists all RMSEsmall values appearing in the previous chapters. Some methods have too large a computational complexity to conveniently train them on the whole Netflix Prize dataset, or to tune their parameters using the whole dataset. In particular, this applies to methods using straightforward missing data imputation. These methods were trained on a small, relatively dense subset of the Netflix data. The training set contained exactly 10,000 users and 500 movies, with 50% of its values missing. The test set contained the remaining 6% of values – 304,500 randomly selected ratings. How the small dataset was chosen is described in section 3.3 “Dataset”. The best accuracy among the four methods was achieved by KRR with a Gaussian kernel, learned by gradient descent. NPCA, which has very good accuracy on the whole Netflix dataset, gives worse results here, probably because the maximum likelihood estimation of the covariance matrix in the EM method overfits the data when averaging over only 10,000 users.
Summarizing: the contest ended with the artificial threshold of a 10% improvement over the reference algorithm being crossed, and many accurate prediction methods were developed, but some questions remain open: which single method is the best for the Netflix Prize task? Which probabilistic model generated the data? What is the best possible predictive accuracy?
Table 43: List of experiments on the small dataset.
The above-listed methods minimized RMSE by modelling the expected rating $E r_{ij}$. To produce lists of recommendations, I advocate in this work correcting the predicted expected item ratings by a multiple of their posterior standard deviations $\hat{s}_{ij} = \sqrt{\mathrm{Var}\, r_{ij}}$. I have not put much effort into modelling the standard deviation of predictions (I used a non-constant parameter $\sigma$ for the modelled ratings $r_{ij} \sim N(\mu_{ij}, \sigma^2)$ only in a few models inspired by the JT1 model, which were not included in the ensemble). A rough simplification is the estimate $\hat{s}_{ij} \approx \hat{s} + C/\sqrt{N_j + \lambda}$, and sorting items by $E r_{ij} - C/\sqrt{N_j + \lambda}$ to obtain the top items, with an optional additional correction for the missing data structure. Because sorting items is a situation of multiple comparisons, the constant $C$ should increase with the number of all items. Based on the experiments from section 4.2.4, I roughly estimate that for the Netflix dataset the best constant is in the range 50-500 (additionally, the constant should be larger for users with few ratings). The resulting personalized ranking of items (personalized recommendations) could be evaluated, but because I proposed only one, heuristic method of calculating recommendations, and because the popular measures for evaluating rankings are not completely satisfying (e.g. they ignore the missing data structure), I narrow the experiments to the well defined task of predicting movie ratings from the data distribution, evaluated by RMSE.
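A minimal sketch of this sorting heuristic follows; the constants C and lambda below are illustrative placeholders, not tuned values.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Item { int id; double expected_rating; long n_ratings; };

// Sort items by the adjusted score E r_ij - C / sqrt(N_j + lambda):
// the penalty shrinks with the number of ratings N_j of the item.
std::vector<Item> top_items(std::vector<Item> items, double C, double lambda) {
    std::sort(items.begin(), items.end(), [&](const Item& a, const Item& b) {
        double sa = a.expected_rating - C / std::sqrt((double)a.n_ratings + lambda);
        double sb = b.expected_rating - C / std::sqrt((double)b.n_ratings + lambda);
        return sa > sb; // higher adjusted score first
    });
    return items;
}

int main() {
    // A popular item with a slightly lower expected rating outranks a rarely
    // rated item whose prediction is uncertain.
    std::vector<Item> items = {{1, 4.6, 25}, {2, 4.4, 5000}};
    for (const Item& it : top_items(items, /*C=*/100.0, /*lambda=*/10.0))
        std::printf("item %d\n", it.id);
    return 0;
}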
“C’est en faisant n’importe quoi
qu’on devient n’importe qui”
Rémi Gaillard 3
6 Applications
In the analysis in earlier chapters, I paid attention to the perspective of using the outlined prediction methods in a real recommender system. A natural complement of the work is to describe the use of the developed collaborative filtering methods in deployed recommender systems. This chapter describes parts of two applications serving online personalized recommendations. The recommendations are based on the regularized SVD methods, whose analysis was the major focus of this work. The recommender system projects, fragments of which are described in this chapter, are independent undertakings and are not part of this work's experiments.
How precise a data mining solution needs to be depends on the application. We can distinguish the following levels of need for recommendations in different applications (see also the discussion of the importance of predictive accuracy in general in section 2.5):
1. No recommendations needed – this is the usual case.
2. Non-personalized recommendations – lists of top items.
3. Any personalized recommendations that fill recommendation slots – they can be of average quality; the ease and speed of deployment matter more.
4. Good quality personalized recommendations.
The two applications described in this chapter assumed the fourth level of needs.
The first project is a WWW application, where recommendations are calculated by a server. Recommendations take the form of a list displayed on an HTML website and are updated with an AJAX-like mechanism. User preferences and recommendations are calculated and displayed instantly after the user gives a rating, and item features are updated periodically.
The second project is a standalone Flash application containing a set of interactive visualizations, with server-less recommendations calculated within the application. Recommendations are marked on a 2D map of items. As in the first application, user preferences and recommendations are updated instantly after the user gives a rating. Item features, the item clustering, and the chosen 2D visualization are precomputed and fixed. The application also contains search by title and filtering by genres, including the non-standard genres defined by the learned SVD features. The set of movies from the Netflix data is extended by a small number of popular movies released after 2006, for which content-based recommendations are provided, based on IMDb keywords. The Flash application, including all data (over 2000 movies and TV series), has a very small size: less than 150 KB.
Observing various emerging implementations of recommender systems by different developers, one can notice common inefficiencies or mistakes, which I strove to avoid here: using K-NN methods instead of the more accurate and faster regularized SVD (matrix factorization); using SVD/PCA without regularization; using inaccurate regularization; using SVD for binary data instead of a generalized SVD; making recommendations by sorting by expected rating without adjusting for the variance (uncertainty) of predictions; predicting behavior on training data gathered from a recommender system without adjusting for the missing data structure; relying too much on implicit feedback (passively gathered data) without paying attention to the feedback loop that reinforces mistakes made by the recommender system; not ensuring diversity of recommendation lists; and not securing against shilling attacks.
3 “It's by doing whatever that you become whoever” or “Nonsense makes you become anyone”
The chapter is concluded by discussing the use of collaborative filtering in fields other
than recommender systems.
6.1 SVD-based recommender system
To simplify, I fully re-learned the parameters, using 20 learning iterations, every five minutes. The newly learned item features replaced the old ones after the learning completed, locking the database only for a very short moment.
To reduce the amount of computation and the memory footprint without a large loss in the quality of recommendations, we can take advantage of the usual long-tail structure of the gathered dataset of ratings. If there are many items to choose from, the set of items used in recommendations can be limited to some number of the most popular ones (because predictions for rarely rated items are burdened with a large variance, those items will rarely enter the list of recommendations anyway), or better, to a set of items that rank high in a non-personalized global ranking. Similarly, with many users, users with few ratings can be skipped in the phase of calculating item features. Because the calculations of the user preferences in SVD are independent tasks when the item features are fixed, and the calculations of the item features are independent when the user preferences are fixed, it is easy to parallelize the calculations on multiple processors, for example, using the “parallel for” directive from the OpenMP library. Also, in some situations, calculating recommendations (or parts of the calculation) can be moved to the client computer – this idea is realized in the server-less application described in the next section. If there are many items to evaluate, another possible speed-up is “lazy” computation: first calculating approximate personalized predictions for each item, for example, using only the biases and the first (most significant) 5-10 features, and subtracting the variance correction. The approximate scores serve as a filter for the full evaluation – if the approximate score is too low, it is unlikely that the item will enter the user's top-K list, and further calculations for that item can be skipped.
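The following C++ sketch illustrates the OpenMP parallelization: with the item features fixed, each user's preference update is an independent task, so a “parallel for” over users is safe. The inner gradient update is a simplification standing in for the regularized SVD learning described earlier; the constants and the number of features are illustrative. Compile with -fopenmp.

#include <vector>

struct Rating { int item; double value; };
const int K = 16; // number of SVD features (illustrative)

// One pass of regularized gradient updates of user preferences u, with the
// item features held fixed.
void update_user(std::vector<double>& u,
                 const std::vector<Rating>& ratings,
                 const std::vector<std::vector<double>>& item_features,
                 double lr, double reg) {
    for (const Rating& r : ratings) {
        const std::vector<double>& v = item_features[r.item];
        double pred = 0.0;
        for (int k = 0; k < K; ++k) pred += u[k] * v[k];
        double err = r.value - pred;
        for (int k = 0; k < K; ++k)
            u[k] += lr * (err * v[k] - reg * u[k]);
    }
}

void update_all_users(std::vector<std::vector<double>>& user_prefs,
                      const std::vector<std::vector<Rating>>& user_ratings,
                      const std::vector<std::vector<double>>& item_features) {
    // Item features are read-only here, so loop iterations do not conflict.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < (int)user_prefs.size(); ++i)
        update_user(user_prefs[i], user_ratings[i], item_features, 0.01, 0.05);
}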
The above-outlined concept of an SVD-based recommender system was realized in the mentioned WWW application, a “recommender system for everything” recommending websites. Users interact with the application through an HTML/JavaScript interface consisting of several pages: logging in, a list of recommended websites, a list of rated websites, and a list of skipped websites. Additionally, an input box allows users to add new websites to the system, and a search box filters recommendations by keywords from website descriptions (titles).
Links on the lists are displayed with a website thumbnail that slides into view on hover. Browsing the lists and the rating mechanism were realized with an Ajax-like interface, sending XMLHttpRequest queries to the server. The set of items was initialized to several hundred links, including: the websites most frequently visited according to alexa.com, Wikipedia pages, popular movies (IMDb), books, video clips, etc. The application was launched in December 2008 under the names “svdsystem” and “lolrate”, was active for several months, and had about 300 users.
During operation, recommendations were not adjusted for the prediction variance (the adjustment would have made no visible difference, because the differences between the numbers of item ratings were small). Later, the calculation of recommendation lists was modified by subtracting from the expected rating a multiple of the approximate standard deviation of the prediction. Such an adjustment is needed when the gathered item ratings have a “long tail” structure, as explained in sections 4.2.4 and 3.4.
On the implementation side, the “recommender system for everything” is a Linux-based HTTP server, implemented in C++/C with occasional use of Fortran libraries. The application serves several HTML and XML pages, updates the database, and calculates recommendations. To allow rapid access to the stored data, the database is realized as a file mapped to memory with the function mmap(), treated as shared memory with access guarded by a mutex lock. The server is process-based, similarly to the Apache server, but without the speed optimization of keeping a thread pool. Other realizations are possible, such as a thread-based server or a single thread listening with epoll(), but in these two variants an obstacle for a high-performance server would be the limit on the number of open files in Linux.
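A minimal sketch of this database layout follows (error handling omitted; the struct fields are placeholders): a file mapped with mmap() is treated as shared memory, guarded by a process-shared mutex so that the forked worker processes can safely update it.

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

struct Db {
    pthread_mutex_t lock; // must be initialized as PTHREAD_PROCESS_SHARED
    long n_ratings;
    // ... fixed-size arrays of users, items, ratings, SVD features ...
};

Db* open_db(const char* path) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    ftruncate(fd, sizeof(Db)); // size the backing file
    Db* db = (Db*)mmap(nullptr, sizeof(Db), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after closing the descriptor
    return db;
}

void init_lock(Db* db) { // run once, before forking the worker processes
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&db->lock, &attr);
}

void add_rating(Db* db) {
    pthread_mutex_lock(&db->lock);  // short critical section, as in the text
    ++db->n_ratings;                // ... append the rating ...
    pthread_mutex_unlock(&db->lock);
}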
The algorithm used was a variant of the basic regularized SVD with constant-linear regularization. The application contained a procedure performing automatic tuning of the regularization parameters, intended to be run periodically. An obvious shortcoming of the regularized SVD is not taking into account the structure of missing data. The importance of missing data is visible, for example, when calculating the average rating of movies (see the discussion in section 4.2.4). A partial fix is using methods like SVD++, which modify user preferences by a prior depending on the missing data. Another, complementary solution could be the heuristic used in the recommender system described in section 6.2.3, penalizing unknown (rarely rated) regions of the set of all items. It remains to be determined how important the correction for missing data is to the ultimate goal of producing an ordering of items useful and satisfying for the user.
When a recommender system starts with a small amount of data, the situation is called the “cold-start problem” (mentioned in section 3.1). In such a situation it is particularly important to make the best use of the available data, including additional information about items and users. In the described “recommender system for everything”, a heuristic that worked well for overcoming the cold-start problem was creating artificial users, who “like” a selected subset of items, e.g. action movies, and “do not like” a random subset of the remaining items. For larger groups of items (such as movies or books) 3-4 artificial users were added who “like” all items in the given category, and for smaller groups of items one artificial user was added per group. This heuristic has the advantage that it does not require building a specialized model using side information about items – a purely collaborative filtering recommender system can be used unchanged, and adding artificial users can be performed by a person untrained in machine learning. The heuristic of artificial users worked satisfactorily well, but it would be useful to evaluate it against model-based use of side information (more about using metadata in section 4.7). In services gathering additional user data (the described recommender service did not), an analogous technique of creating artificial items can be used to help with the cold-start problem for new users. The idea of adding artificial users or items is similar to using conjugate priors (which behave as additional observations) in probabilistic models.
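A minimal sketch of the artificial-users heuristic (the rating values and the sampling rate are illustrative):

#include <cstdlib>
#include <vector>

struct Rating { int user; int item; double value; };

// Add one artificial user who "likes" all items of a category and "does not
// like" a random sample of the remaining items.
void add_artificial_user(int user_id,
                         const std::vector<int>& category_items,
                         const std::vector<int>& other_items,
                         std::vector<Rating>& out) {
    for (int item : category_items)
        out.push_back({user_id, item, 5.0});     // "likes" the category
    for (int item : other_items)
        if (std::rand() % 10 == 0)               // random ~10% sample
            out.push_back({user_id, item, 1.0}); // "does not like"
}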
The above-described recommender system did not provide diversified recommendations (it did not avoid recommending very similar items). A simple idea to correct for diversity is to use clustering (a possibility mentioned in [Pat07]), for example, to limit the recommendation list to at most one item per cluster. Another possibility is recommending whole clusters (this option was used in the Flash application described in the next section). Instead of using a precalculated clustering, we could remove, one by one, items that are very similar to other items higher on the recommendation list. Another unimplemented feature, which could be useful for adapting the described recommender system to different domains, is detecting identical or equivalent items and removing them from recommendation lists, or merging them in the database. Many other issues appearing in a recommender system were listed in section 3.1.
One can wonder what the best number of features to learn in the regularized SVD is. Based on my own observations about what information the features learn, and what the impact of individual features is, my suggestion is: 16-32 features should be enough to sufficiently explain user preferences in most single-domain applications, such as movies, music, or books, but more features, up to 100-200, are needed in broader, cross-domain recommender systems, such as the described “recommender system for everything”. Because SVD is not capable of fully modelling a large number of local item-item correlations, it may be helpful to augment the factorization model with a neighborhood-based component.
6.2 Using distance between items
6.2.1 Clustering
Clustering, that is, grouping similar objects together, has several applications in recommender systems: clustering can be used to improve the user interface, it can help to add diversity to recommendations, as mentioned above, or it can speed up the collaborative filtering algorithm at the cost of accuracy.
The 2D recommender system described here uses clustering for two purposes. One is to reduce the number of points displayed on the 2D map, displaying clusters instead of individual items, which gives a more convenient and faster interface and reduces the overwhelming choice for the user. The second purpose is that whole clusters are recommended: the 2D recommender system recommends 10 clusters of movies at a time, instead of 10 movies. A list of best clusters is more diverse than a list of best movies, and personalized predictions within one cluster are similar anyway.
How do we create a good quality clustering of the Netflix data, and what does “good quality” clustering mean? As with other data exploration tasks, the answer depends on the application and is partially based on experience. One could try to formulate a criterion for evaluating a clustering that rewards situations where items in one cluster are rated similarly by users, and rewards clusters of a certain size – penalizing too big and too small clusters. But instead of optimizing a chosen cost criterion, I tried out several commonly used clustering algorithms on the K-dimensional space of item features coming from the learned regularized SVD (the model used for recommendations). In [Pat07] it was mentioned that good results are obtained with single linkage hierarchical clustering using the Euclidean distance between item features. Later experiments showed that a more balanced clustering is obtained using the k-means method, initialized with a clustering calculated by a greedy heuristic. The resulting clustering was used in the sub-applications described in the next two subsections (interactive visualization and 2D recommender system). The k-means clustering can be further improved by limiting the maximal size of a cluster, for example, by repeatedly, randomly splitting the largest cluster in the last iterations of k-means.
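A sketch of this procedure follows; it is illustrative rather than the code used. The greedy initialization mentioned above is omitted, and the size limit is enforced by adding a new center at a random member of the largest cluster during late iterations.

#include <cstdlib>
#include <limits>
#include <vector>

using Vec = std::vector<double>;

double dist2(const Vec& a, const Vec& b) {
    double d = 0.0;
    for (size_t k = 0; k < a.size(); ++k) d += (a[k] - b[k]) * (a[k] - b[k]);
    return d;
}

// One k-means iteration on the item feature vectors: assign each item to the
// nearest center, then move each center to the mean of its items.
void kmeans_step(const std::vector<Vec>& items, std::vector<Vec>& centers,
                 std::vector<int>& assign) {
    for (size_t i = 0; i < items.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (size_t c = 0; c < centers.size(); ++c) {
            double d = dist2(items[i], centers[c]);
            if (d < best) { best = d; assign[i] = (int)c; }
        }
    }
    size_t dims = items[0].size();
    std::vector<int> count(centers.size(), 0);
    for (auto& c : centers) c.assign(dims, 0.0);
    for (size_t i = 0; i < items.size(); ++i) {
        count[assign[i]]++;
        for (size_t k = 0; k < dims; ++k) centers[assign[i]][k] += items[i][k];
    }
    for (size_t c = 0; c < centers.size(); ++c)
        if (count[c] > 0)
            for (size_t k = 0; k < dims; ++k) centers[c][k] /= count[c];
}

// In the last iterations: if the largest cluster exceeds max_size, split it
// by adding a new center at a randomly chosen member of that cluster.
void split_largest(const std::vector<Vec>& items, std::vector<Vec>& centers,
                   const std::vector<int>& assign, int max_size) {
    std::vector<int> count(centers.size(), 0);
    for (int a : assign) count[a]++;
    size_t largest = 0;
    for (size_t c = 1; c < centers.size(); ++c)
        if (count[c] > count[largest]) largest = c;
    if (count[largest] <= max_size) return;
    std::vector<int> members;
    for (size_t i = 0; i < items.size(); ++i)
        if (assign[i] == (int)largest) members.push_back((int)i);
    centers.push_back(items[members[std::rand() % members.size()]]);
}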
6.2.2 2D visualization
The described set of Flash tools contained an embedding of movie clusters on a plane, used in interactive visualizations to explore similarities between movies, and used in the associated “2D recommender system”.
Visualizations on a plane are used very often in data exploration. Visualizations are addressed to a human, and their purpose is to communicate information in an efficient way and to emphasise important aspects of the data. Locations of points, colors of points, size, and shape seem to be a more natural medium of communication (their reception requires less effort) than words, numbers, or lists.
Here we focus on visualizing data in two dimensions using simple media: points and lines, with varying location, color and size. A properly chosen visualization presents the data in a way that reveals its interesting properties, where what “interesting” means needs to be clarified – it depends on the application and on what we expect to find in the given kind of data. The perspective of a user unaccustomed to mathematics and machine learning, who just wants to explore similar movies or get personalized recommendations, differs from the perspective of a data analyst, who wants to use visualizations to improve a developed data mining solution and gain deeper insights into the data. Both perspectives were important while developing the tool. As for the data analyst perspective, the intention here is to find ways to further improve the regularized SVD models, find new effects, reduce the number of dimensions from the initial 32, discover a meaningful clustering of items, identify the right nonlinear transformations of features, and identify probability distributions in the data. Realizing exploration and model identification goals such as those mentioned above can be difficult when working only with numerical data. Usually it is more effective and convenient to utilize the pattern matching abilities of the human brain and use visualizations. Various kinds of visualizations, from histograms, boxplots, and quantile-quantile plots, to scatter plots, spline smoothing, and regression diagnostics, were extensively used while developing the ensemble described earlier in this work.
The visualization developed here compresses the 32-dimensional vectors of movie features, coming from the regularized SVD, into two dimensions. The 32-dimensional feature representation from SVD happens to define a well working distance between items, by taking the cosine similarity between feature vectors [NF07, Pat07]. Differing forms of the regularized SVD model with different learning methods give much the same distances between movies. I decided on a Variational Bayesian version of SVD, fully learning one feature at a time (requiring 10-20 iterations per feature), and repeating the learning of all features three times. The property that the features are roughly sorted by the variance explained was useful in the second implemented heuristic visualization (not described here), which presents a separate map for each cluster, centered on that cluster. Both visualizations were designed to make it easy to spot the nearest movies (I presumed that the user is interested primarily in searching for similar movies).
Besides visualization, distances can also be used directly to make predictions, as in the K-NN and kernel methods described earlier (section 4.5).
The first of the interactive visualizations (the description of the second one is skipped) is a map of 2000 movies grouped in 487 clusters. Clusters are displayed as circles with diameter proportional to the cubic root of the cluster size. The visualization is interactive – after selecting a movie in the Flash application, all clusters are colored according to the distance from the chosen cluster, and additionally lines are drawn to the four nearest clusters. As the movie-movie similarity (distance), the cosine similarity between cluster features was used: $s_{ij} = \frac{v_i^T v_j}{\|v_i\|_2 \|v_j\|_2}$. The cluster features $v_i$ are averages of the normalized features of all movies included in the cluster. I stored their normalized versions in the application: $x_i = v_i / \|v_i\|_2$.
The visualization task we set ourselves is to choose one of the possible ways of representing the 32-dimensional vectors in two dimensions, so that objects close in the 32-dimensional space are also close on the plane. Such visualization tasks are often formulated as force-driven algorithms (also called force-directed or force-based): forces acting on each point are chosen, and the resulting dynamical system is simulated, for example, in discrete time increments. The item-item distances in the 32-dimensional SVD space can be turned into forces, for example, according to Hooke's law, as if the items were connected by springs, and too-close distances on the plane can be penalized, for example, by forces modelled after electrical repulsion, following Coulomb's law. Force-directed graph drawing was used to visualize the Netflix data in [Ste10a], where the graph was defined by similarities calculated directly from ratings, and in the visualization [Gan09], where the graph was defined by the 10 nearest neighbors according to the matrix factorization similarity.
Commonly used alternatives to force-driven methods are multidimensional scaling and spectral embeddings – see the visualizations of similarities between music artists in [Gle06], or visualizations of graphs (networks) in other domains [New03].
I used a simpler, heuristic iterative approach, where each iteration consists of three phases. In the first phase, for each subsequent cluster, its coordinates are updated to the weighted average of the four closest points according to the similarity $s_{ij}$, with weights $w_{ij} = \exp(10 s_{ij})$, where $s_{ij}$ is the cosine similarity between the feature vectors of clusters $i$ and $j$. The second phase is repulsion between points, where the magnitude of the change is $v_{ij} = \hat{d}_{ij}^{-3}$, with $\hat{d}_{ij}$ the Euclidean distance between clusters $i$ and $j$ on the plane. In the third phase, the coordinates are scaled and shifted so that they fit within a specified part of the plane. The resulting visualization, selected from one of several runs of the algorithm, is shown in figure 35.
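One iteration of the three phases can be sketched as follows (assuming at least five clusters; the repulsion step size and the target box are illustrative):

#include <algorithm>
#include <cmath>
#include <vector>

struct P { double x, y; };

void layout_iteration(std::vector<P>& pos,
                      const std::vector<std::vector<double>>& sim, // s_ij
                      double repulsion_step) {
    size_t n = pos.size();
    // Phase 1: move each cluster, in turn, to the weighted average of its
    // four most similar clusters, with weights w_ij = exp(10 * s_ij).
    for (size_t i = 0; i < n; ++i) {
        std::vector<size_t> idx;
        for (size_t j = 0; j < n; ++j) if (j != i) idx.push_back(j);
        std::partial_sort(idx.begin(), idx.begin() + 4, idx.end(),
            [&](size_t a, size_t b) { return sim[i][a] > sim[i][b]; });
        double wx = 0, wy = 0, wsum = 0;
        for (int t = 0; t < 4; ++t) {
            double w = std::exp(10.0 * sim[i][idx[t]]);
            wx += w * pos[idx[t]].x; wy += w * pos[idx[t]].y; wsum += w;
        }
        pos[i] = {wx / wsum, wy / wsum};
    }
    // Phase 2: repulsion falling off as the inverse cube of the 2D distance.
    std::vector<P> delta(n, {0.0, 0.0});
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            if (i == j) continue;
            double dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y;
            double d = std::sqrt(dx * dx + dy * dy) + 1e-9;
            double f = repulsion_step / (d * d * d);
            delta[i].x += f * dx / d;
            delta[i].y += f * dy / d;
        }
    for (size_t i = 0; i < n; ++i) {
        pos[i].x += delta[i].x;
        pos[i].y += delta[i].y;
    }
    // Phase 3: scale and shift the coordinates into a fixed box.
    double xmin = 1e300, xmax = -1e300, ymin = 1e300, ymax = -1e300;
    for (const P& p : pos) {
        xmin = std::min(xmin, p.x); xmax = std::max(xmax, p.x);
        ymin = std::min(ymin, p.y); ymax = std::max(ymax, p.y);
    }
    for (P& p : pos) {
        p.x = -2.0 + 4.0 * (p.x - xmin) / (xmax - xmin + 1e-9);
        p.y = -3.0 + 6.0 * (p.y - ymin) / (ymax - ymin + 1e-9);
    }
}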
The visualization balances between competing requirements: items similar according to $s_{ij}$ should be close in the visualization, but items should not be too close to each other, and the resulting set of points should be diverse, not too monotonous (a visualization is more appealing, and also more useful, when it consists of varied, easy to distinguish shapes rather than repetitive ones).
A downside of the chosen visualization method is that unrelated clusters are often close to each other; for example, the cluster including “Dawn of the Dead” is close to “Friends” and “Pretty Woman”. This can be partially fixed by adding a penalty to the cost function for the closeness of non-similar clusters; the resulting more complex cost function could be optimized with simulated annealing, which is capable of escaping local minima. Another way to avoid local minima is to initialize the cluster positions with the result of another algorithm that properly spreads out large groups of similar clusters (for example, we could create 10-20 groups of clusters using an additional run of the k-means algorithm).
[Plot omitted: “2D visualization of 2000 most popular movies, grouped in 487 clusters” – clusters drawn as dots on X-Y axes, labelled with representative movie and TV titles.]
Figure 35: 2D visualization of movie clusters.
6.2.3 2D recommender system
We can look for alternative ways of presenting recommendations beyond displaying a simple list. One possibility is presenting recommendations by displaying points or marking areas on a plane. The intention is that it can be easier for a user to remember a point on a map than to traverse a complicated structure of catalogues, lists, and tags, or even to use a text search. In a good 2D visualization, points close on the map should roughly correspond to similar movies. It can be expected that user ratings in groups of close movies will be similar, making it easy for a user to notice the regions of the 2D map where the movies he likes lie. An additional advantage for recommendations is the potential capacity of 2D visualizations – the possibility of presenting many items in a small space. For example, the Flash application described here displays 487 clusters simultaneously.
In the Flash sub-application “2D recommender system”, the visualization of clusters (described in section 6.2.2 above) was augmented with recommendations. Recommendations are presented in two ways: by coloring all point-clusters according to the predicted rating, and, after clicking a button, by explicitly marking on the map the clusters from the list of top recommendations, in groups of 10 clusters. Figure 36 shows a screenshot of the “2D recommender system”.
The recommendations are server-less, that is, user preferences and recommendations are calculated inside the application, without queries to a specialized server. The recommender system is SVD-based: the regularized SVD model is trained once with Variational Bayes (see section 4.3.3), one feature at a time, and the resulting features are stored in the Flash application, one normalized averaged vector per cluster (see the previous section 6.2.1 “Clustering”). Inside the Flash application, user preferences are calculated with a regularized SVD with constant-linear regularization, learning one feature at a time by a jump to the marginal minimum of the cost function (the method described in section 4.3.2). To produce recommendations, items are sorted according to the formula $c_i + d_j + u_i^T v_j - 10\hat{s}_j - C_{ij}$. The parameter $\hat{s}_j$ is the estimated standard deviation of predictions for cluster $j$ (or rather the difference between the standard deviations in different clusters), crudely approximated as $\sqrt{\alpha_j}$, where $\alpha_j$ is the average of the values $(N_{j_2} + 10)^{-1}$ over the movies $j_2$ in cluster $j$, $N_{j_2}$ being the number of ratings given to movie $j_2$. The $C_{ij}$ term is a heuristic correction for the missing data structure, giving a larger penalty to predictions of ratings that are more likely to be missing: $C_{ij} = 0.7 - \max(0, \rho_{ij})$, where $\rho_{ij}$ is the cosine similarity between the features of cluster $j$ and the nearest cluster rated 4 or 5 by user $i$. User preferences and recommendations are recalculated and shown instantly at the moment the user rates an item. Ratings given by the user do not influence the item features; that is, the item features remain fixed at all times.
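A minimal sketch of this scoring (all inputs assumed precomputed as described above):

#include <algorithm>
#include <vector>

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t k = 0; k < a.size(); ++k) s += a[k] * b[k];
    return s;
}

// score_ij = c_i + d_j + u_i^T v_j - 10 * s_j - C_ij, where
// C_ij = 0.7 - max(0, rho_ij) penalizes clusters far from anything the user
// rated highly.
double score(double c_i, double d_j,
             const std::vector<double>& u_i, // user preferences
             const std::vector<double>& v_j, // normalized cluster features
             double s_j,                     // ~ sqrt of avg 1/(N+10)
             double rho_ij)                  // cosine similarity to nearest
                                             // cluster rated 4 or 5
{
    double C_ij = 0.7 - std::max(0.0, rho_ij);
    return c_i + d_j + dot(u_i, v_j) - 10.0 * s_j - C_ij;
}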
Other corrections for the missing data structure are possible. An untested heuristic correction is using the Mahalanobis distance to the set of vectors $v_{j_2}$ rated 4 or 5 by the user: $C_{ij} = \alpha \sqrt{(v_j - \mu_i)^T S_i^{-1} (v_j - \mu_i)}$, where $\mu_i, S_i$ are the parameters of the estimated multidimensional Gaussian distribution approximating the distribution of the points $v_{j_2}$. The constant $\alpha$ can be hand-picked to produce the best recommendations, as evaluated empirically. Instead of using the $C_{ij}$ adjustment term, another way to correct the method may be adding a small number of low ratings for each user, for items drawn uniformly at random [Ste10b] (a disadvantage of this method is that it penalizes unpopular items), or for items drawn according to their distribution in the data [Dro11], which still leaves the risk of adapting too much to the structure of missing data, which may not be entirely related to the user preferences. Ideally, recommendations should be tested on random items that users have to rate or mark as unknown (the Netflix dataset does not contain such data).
The accuracy of recommendations in the application could be further improved, to a small extent, by predicting user preferences using the missing data structure (see the algorithm SVD++, section 4.3.5) and by using the time information (section 4.3.6). Because movies are grouped into clusters and whole clusters are recommended, the algorithm could be improved by incorporating the clustering within the model and learning the features directly for whole clusters, instead of postprocessing the regularized SVD results and averaging the SVD features over groups of movies.
Observing visualizations of predicted ratings in the application, I noticed that sometimes giving an extreme rating (low or high) does not influence the predictions for similar items enough, even when no ratings are given for similar items. This indicates that the regularized SVD algorithm used, and in particular the one-dimensional ridge regression used to infer a user preference for a feature, may be too robust to outliers. A remedy could be optimizing a different loss function than MSE, with a higher penalty for large errors, for example $|\hat{r}_{ij} - r_{ij}|^p$ with $p = 3$ or $p = 4$, but I have not tested this idea.
Several additional features were implemented in the application: search by title, filtering by genres (including the new genres defined by the six largest SVD features, see section 4.3.4), recommendations shared between two persons, calculated with a heuristic that combines two sets of ratings, and the option of importing ratings from the services Netflix and IMDb.
A similar idea of visualizing recommendations on a map for the 1000 most popular TV shows was described in [Gan09]: the map is generated dynamically based on user preferences (inferred from data about watched TV shows) and on which TV shows are available. The visualization in [Gan09] was based on force-directed drawing of the graph of the 10 closest items according to similarities taken from a factorization model built on users' preferences for TV shows. In [Gan09] the following modes of recommendation were proposed: a user-driven mode, where a user can explore a heat map coloured according to the user's preferences, and a user-passive mode, where recommendations proposed by the system are marked on a map, together with an explanation indicating the watched similar TV shows. The Flash application described here differs from the approach in [Gan09] in the method of visualization (here movies are clustered by k-means, clusters are displayed as single points, and the 2D map is heuristically generated), in the method of generating and visualizing recommendations (server-less instant recommendations, coloring points according to user preferences, explicit recommendations by marking ten point-clusters), and in the user interface and available features.
6.3 Beyond recommendations
The task of predicting missing values in a sparse matrix also appears in domains other than recommender systems, and it is justified to expect that for some of those datasets and tasks, collaborative filtering algorithms similar to those analyzed in this work will be effective. Particularly interesting is the possibility of using dimensionality reduction algorithms. Examples of successful applications include using collaborative filtering for medical data [Has10], educational data [Tos10], image processing, and text analysis.
I implemented collaborative filtering algorithms for two datasets outside the field of recommender systems: outcomes of soccer matches, and outcomes of chess games. Neither attempt gave better predictions than the known models, but the experiments are worth mentioning, because such negative results with overly flexible models to some extent support the hypothesis that the best known models are close to optimal.
The first dataset was soccer matches from the years 1994-2009 from the English Premier Division and English Divisions I and II. As a baseline served the model proposed in [Hav97]: $H_i \sim \mathrm{Poisson}(b + a_i - d_j)$ and $A_j \sim \mathrm{Poisson}(a_j - d_i)$, where $H_i$ are the goals of team $i$ playing at home, and $A_j$ are the goals of team $j$ playing away. The parameter $b$ models the home field advantage, the parameters $a_i$ can be understood as the attack skill of team $i$, and the parameters $d_i$ as its defense skill. The attack and defense parameters change in time, and the change is well modelled by exponential smoothing. In my implementation the parameters were learned by alternating point estimation using exponentially smoothed, regularized Poisson regression. The resulting accuracy was slightly worse than the accuracy of bookmakers' odds (included in the dataset), with the remark that the implemented model used only the goals variable, without any additional variables such as shots on goal.
The collaborative filtering model tried out for soccer matches had the form $H_i \sim \mathrm{Poisson}(b + u_i^T v_j)$, $A_j \sim \mathrm{Poisson}(u_j^T v_i)$, learned by alternating regularized Poisson regression, with exponential smoothing in time. A simplification made here was using a constant regularization parameter (as the Netflix Prize experiments show, when hidden variables are combined by multiplication, it is better to use constant-linear regularization or an approximate Bayesian approach such as Variational Bayes or MCMC; also, treating the hidden lambda parameter as a random variable with Gaussian noise can be more accurate). In the experiments, using more than two features did not improve accuracy, and the two-feature model had about the same accuracy as the baseline model of two biases. This is an argument supporting the hypothesis that the model of two biases is close to the unknown optimal model, and no additional multiplicative components are needed (an interpretation is that soccer teams cannot be divided into groups with largely different styles of play, as we can divide movies by distinguishing genres). Of course, when modelling real-life data, we can never entirely exclude the possibility that we have overlooked some important pattern in the data.
The second modelled dataset was the Kaggle dataset of 65,053 chess games by 8,631 players, selected among the world's highest rated 13,000 chess players. Several algorithms can be regarded as baselines for this task: ELO, Glicko [Gli99], Chessmetrics, or TrueSkill [Her07]. The model I tried out had the form $p(Y_{ij} = 1) = 1/(1 + \exp(b + c_i - c_j + u_i v_j - u_j v_i))$, where $Y_{ij}$ is the game result, $b$ is a global parameter accounting for the white player's advantage, $c_i$ is a bias variable corresponding to the one-dimensional playing strength of player $i$, and $u_i, v_i$ create the “collaborative filtering” term, added in order to see if an additional dimension of playing strength helps. The parameters were learned with alternating regularized logistic regression, with a special prior depending on the number of games played and on a weighted average of opponents' ratings, exponentially smoothed in time. It turned out that the features $u_i, v_i$ do not improve the accuracy of the model. This is an argument that the right approach is using only one variable (one ranking) to model the playing strength of a player. The data used were released in a prediction contest organized by Kaggle, which was followed by another contest with 1.84 million games of 54,000 chess players. The best models in both contests contained one ranking variable per player (though with additional parameters and effects).
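For concreteness, the predicted probability in the tried-out model is a direct transcription of the formula above (the sign convention follows the text):

#include <cmath>

// p(Y_ij = 1) = 1 / (1 + exp(b + c_i - c_j + u_i*v_j - u_j*v_i)), where b is
// the white-advantage bias, c the one-dimensional strengths, and (u, v) the
// extra "collaborative filtering" term that did not improve accuracy.
double p_game_result(double b, double c_i, double c_j,
                     double u_i, double v_i, double u_j, double v_j) {
    double z = b + c_i - c_j + u_i * v_j - u_j * v_i;
    return 1.0 / (1.0 + std::exp(z));
}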
In [Sta11] models with 1-15 factors were used to predict the results of games of go, with inconclusive results as to whether increasing the number of factors improves prediction.
A question remains open: whether in any popular sport collaborative filtering models can be useful for modelling playing strength (perhaps in poker?). In such sports, cyclic relations should occur, such as: team A is better than team B, team B is better than team C, and C is better than A. If we do not observe such situations, it is likely that the optimal modelling of results will be limited to models of biases, as in chess (one main variable per player – the player's skill) or in soccer (two main variables per team – the attack skill and the defense skill).
It would be interesting to check whether models with dimensionality reduction are useful for modelling stock market or currency data.
“Science and technology provide the most important
examples of surrogate activities”
Theodore Kaczynski
7 Summary
I discussed, in the context of recommender systems, the most accurate collaborative filtering methods for predicting ratings, and I summarized my own experiments, and the experiments of others, on the Netflix Prize dataset. The focus of the experiments was on obtaining the best possible accuracy in the associated prediction task of minimizing RMSE. The large size of the dataset (100M ratings) makes it possible to compare the accuracy of a large number of methods, allows methods with more parameters, and gives a larger capability to identify the underlying probabilistic structure that generated the data than is possible for typical publicly available datasets, most of which have 100-10,000 observations. Since the Netflix Prize dataset was made available, many people have worked independently on the same prediction task (the contest website states that over 5000 teams submitted solutions), and as a result the Netflix Prize task has been exceptionally well analysed relative to other prediction tasks on other datasets. A thorough analysis of one prediction task gave methods and insights that are useful not only for collaborative filtering recommender systems, but should generalize well to other prediction tasks in other domains. Needless to say, the need for accurate and simple methods and time-efficient approaches for data analysts appears in a huge number of applications, and it would be good if those large needs were matched with an in-depth understanding of the field of prediction.
The task proposed by Netflix was to minimize the RMSE between predicted ratings and the real observed ratings on a hold-out set. The task was a compromise between simplicity and relevance to developing recommender systems. The choice of RMSE as the evaluation measure made our task of analysing a real-life dataset relatively “clean”, while remaining a good intermediate step towards the application of serving lists of recommendations to users. It turned out that there is depth in the simply formulated task of predicting ratings, and the challenge we set ourselves, of developing the most accurate methods possible, contains a certain necessary complexity, which I attempted to explain in this work.
Summarizing shortly the state of understanding of the Netflix task: the most accurate methods combined modelling the structure of the data on three scales, divided according to the number of parameters: the global scale, including global effects such as biases; the middle-level scale, with dimensionality reduction structure (here the most accurate models were SVD, RBM and KRR); and the most parameterized scale, explaining direct item-item relations by neighborhood models and also by kernel methods not using dimensionality reduction, such as NPCA. In the most accurate methods, each of the three scales included time-dependent variables, which captured the model's variability over time. In addition, a significant improvement in accuracy was obtained by using the structure of missing data, which allowed, to some degree, predicting the user preferences before seeing the user's ratings.
In this work I studied more closely the middle-level scale, containing the most visible development coming from the Netflix contest – the structures with dimensionality reduction, on which the most accurate rating prediction methods were based. In particular, the developed matrix factorization methods are typically more accurate and faster than the earlier applied [Mon03] nearest-neighbor-based collaborative filtering recommender systems, and have already resulted in numerous applications. It should be noted that datasets in different domains do not always contain a structure with reduced dimensionality, but it seems that in datasets of gathered users' preferences for items, such a structure with a low number of dimensions (but more than just global effects) usually exists, intuitively understood as a division of items into groups that a user can like or dislike, for example, the division of movies into genres.
Among the most accurate models (MF/SVD, RBM, KRR, NPCA), the most commonly used ones were based on matrix factorization, that is, models containing sums of multiplied variables: a continuous variable representing an item feature (an automatically learned movie genre) multiplied by a continuous variable expressing a user preference for that feature. In this study I examined more closely the variants of MF that learn one variable at a time, called here regularized SVD. Two basic ways of learning the parameters were described: a neural networks approach with special regularization, and approximate Bayesian approaches, such as Variational Bayes or MCMC. An observation important for the whole domain of neural networks is that multiplying hidden parameters necessitates a different regularization than the standard, constant weight decay – necessary is an amount of regularization growing linearly with the number of observations [Fun06, Lim07, Rai07].
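A one-dimensional sketch of this constant-linear regularization (illustrative): the effective ridge penalty on a hidden parameter grows linearly with the number n of observations it enters, instead of staying constant as in standard weight decay.

#include <vector>

// One-dimensional ridge estimate of a hidden parameter (e.g. a user
// preference for one feature), with penalty lambda1 + lambda2 * n.
double ridge_estimate(const std::vector<double>& x, // fixed factors
                      const std::vector<double>& y, // targets (residuals)
                      double lambda1, double lambda2) {
    double sxy = 0.0, sxx = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        sxy += x[i] * y[i];
        sxx += x[i] * x[i];
    }
    double n = (double)x.size();
    return sxy / (sxx + lambda1 + lambda2 * n);
}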
It can be disputed whether the usual probabilistic matrix factorization with Gaussian priors is the proper hidden structure for modelling movie ratings (or any other data). For example, the VB approximation gives posterior Gaussian distributions, meaning that a rating is approximated by a sum of products of two Gaussian variables, which seems a quite unnatural choice. Experiments with nonparametric priors were performed, which dispute the choice of a Gaussian prior for the hidden parameters of the matrix factorization. Experiments were also done with learning a nonparametric relationship predicting the rating, which dispute the choice of multiplication as the operation connecting the hidden parameters (this is redundant with changing the priors). The experiments confirmed inaccuracies of the probabilistic model of matrix factorization, but have not led to changes that visibly improve the accuracy.
We can say of each real-life dataset that it was generated by a certain unknown model. Data analysis tasks, in particular prediction tasks, rely heavily on the identification of the unknown model, that is, on discovering the right structure of parameters, dependencies, patterns, probability distributions, and effects in the data – identifying all the elements relevant for the purpose of the analysis (in our case, obtaining the best prediction accuracy). A fully Bayesian approach would be to impose a prior distribution specifying the probability of every possible model, and to apply Bayes' rule to infer the posterior probability distribution over models. Such two-step attempts at defining all of the data analyst's prior beliefs are impractical (even assuming that we could perform the Bayesian inference exactly, without computational difficulties). In practice, experience shows that the right approach is to iteratively refine the model based on how well it fits the data (or on its predictive accuracy); hence the data is used multiple times to exclude implausible models, and the effort is concentrated on exploring plausible models supported by the data. We can say that we need to optimize, on the meta-level, the amount of the analyst's time and effort needed to identify well enough the model that generated the data, or to find a method accurate enough for the given purpose (the decision criterion in prediction can be optimized directly, without the intermediate step of building a generative or discriminative probabilistic model). The experience from the Netflix contest was that a convenient framework for searching the set of possible models was to blend many models (in the Netflix task, blending with linear regression or ridge regression worked well), which gives a combination of methods with accuracy much better than that of the individual models. Blending allows assessing the contribution of each method, removing unnecessary methods, and focusing on those that improve the accuracy of the ensemble. When a method improves the blend, it suggests that there exists a way to integrate this method with one or more of the remaining methods, creating another method with better accuracy. The best accuracy in the Netflix Prize task was reached by ensembles containing hundreds of models.
In the process of searching for the best models, automation was useful, mostly for automatic parameter tuning by minimizing the hold-out test set error (I used the Praxis procedure from the Netlib library for tuning). A well performing technique, unused in this work, was choosing the parameters to optimize the accuracy of the blend, instead of the accuracy of individual methods [Tos09, Pio09]. An open issue is how to automatically search (test) large classes of models (for example, the class of multitask models with one layer of hidden variables, like MF, RBM, PLSA, and similar), or how to test a large set of possible effects. The larger the dataset, the greater the possibilities of automation, and for a dataset of the size of Netflix's, automatic testing of plausible classes of models and effects should be realizable to some degree. This is a topic for future research. Is fully automated model identification possible? I am skeptical about it – in real-life tasks domain knowledge will always be needed to limit the range of tested models. Interesting was the absence of methods based on decision trees in the ensembles for the Netflix Prize (except for their use in blending predictions). Decision trees seem to work better as a black-box method for small datasets with lots of predefined features, when a quick analysis is needed, without precisely identifying the underlying model.
We can hypothesize to what degree (thanks to the effort of many people) the optimal model has been discovered for the Netflix dataset and task. Looking at the best combinations of methods in the greedy feature selection in table 41, it can be supposed that, if no important method remained undiscovered, the unknown optimal model, of which we can say that it generated the data (or rather the part of the data $p(\mathrm{rating} \mid \mathrm{user}, \mathrm{movie}, \mathrm{time})$ that we want to predict), is some combination of SVD, RBM, KRR, and neighborhood models, with global effects, improved user preferences, and effects explaining the model's variability in time.
In most situations of estimation or inference encountered in the Netflix Prize task, as well as in other prediction tasks, appropriate regularization is necessary. The methods of classical statistics, such as point estimation with uniform priors (the maximum likelihood method), are inexact and result in non-optimal prediction. What is more, the experiments with approximate Bayesian approaches, such as MCMC and VB, show that the whole idea of point estimation of posterior distributions is sometimes largely inaccurate, and corrections are necessary: in the models with hidden variables for the Netflix Prize, we needed to use an amount of regularization growing linearly with the number of observations.
An advantage of the choice of RMSE as the evaluation measure was the simplicity of the resulting algorithms optimizing it; for example, blending methods with linear regression was efficient, as was modelling ratings by sums of biases, linear, and bilinear terms. Having a fixed criterion of accuracy allowed comparing results objectively, which certainly helped the effort put in by many people result in an in-depth analysis of the task. As argued in this work, predicting ratings by minimizing RMSE is a good intermediate step towards a useful real-life goal: calculating good quality lists of personalized recommendations.
Recommender systems differ, and multiple issues may be important to the goal of recommendation: the type of data collected, diversity within a recommendation list, diversity in time, balancing popularity vs. novelty, dealing with cold-start situations, resistance to shilling attacks, integration with search, categorization, tags, and social navigation, and explaining the recommendations. Looking at this vast list, it is evident that many aspects of calculating recommendations have to be simplified. The crux of a recommendation algorithm is always to guess in some way the preferences of a user for items, and thus (although not always directly) to predict ratings accurately, but accurately estimating the expected rating is not enough. Calculating relevant recommendations is rather a combination of several machine learning tasks: estimating, besides the expected rating, the uncertainty of predictions, the probability that a rating is missing, the probability that the user watched the movie, and the influence of the missing data structure on the expected rating and on the uncertainty of predictions; predicting the user's short-term intent from entered search queries, clicked tags, the currently visited item page, etc.; estimating similarity between items or users; detecting duplicate items; and matching items across different databases.
To generate lists of recommendations in an easy way using algorithms such as those developed in this work, a well performing method was to sort items by specially adjusted expected ratings. If the items in a recommender system have very different support (this is usually the case), it is necessary to adjust the expected rating by the estimated standard deviation of predictions. The adjustment is the more important for users with few ratings, and results in recommending more popular content to those users. If recommendations are chosen from a large set of items (more than thousands of items), another adjustment, depending on the probability of missing data, may be needed, because algorithms minimizing RMSE on the training set distribution give too large rating estimates for ratings which are likely missing (usually this bias affects items with a low predicted rating, but because of the uncertainty of the predicted rating, the more items are in the system, the more often items with an inflated prediction will enter the recommendation lists – a situation we want to avoid; additionally, the uncertainty of predictions is higher for data that is likely missing). A heuristic solution proposed for adjusting for the missing data bias was to appropriately define and use a distance to the set of items highly rated by the user. More research and experience is needed to determine the appropriate criterion for evaluating and comparing personalized rankings (lists of recommendations), and to determine whether it is worth optimizing such a ranking-based criterion directly. Accurate algorithms optimizing a proper ranking-based criterion would probably be similar to the most accurate methods developed for the Netflix task (see, for example, [Jah11b]). To verify this, more experiments are needed on specially gathered new data.
The recommendation method advocated in this work is two-step. First, the regularized SVD model $c_i + d_j + u_i^T v_j$ is learned to predict ratings from the observed data distribution. I preferred to learn the parameters with the Variational Bayesian approach, or with its neural-networks-like simplification with a "constant-linear" regularization term, and I preferred to fully learn one feature at a time (with the learning of all features repeated 2-3 times) to concentrate the learned variability of ratings in the first features. Informative priors can be used for the user preference parameters $u_{ik}$ (for example, based on the implicit feedback or on demographics) and for the item features $v_{jk}$ (based on the movie metadata – this improves RMSE accuracy only for items with few ratings). In the second step, having the model predicting ratings, recommendations for user i are calculated by sorting the considered items j – all items, or a context-based subset, filtered by search, a selected category, or tags (partial sorting to select the top items is used) – by the following score:

$$\hat{r}_{ij} - \alpha s_j - C_{ij}, \qquad \hat{r}_{ij} = c_i + d_j + u_i^T v_j.$$

The adjustment $\alpha s_j$ penalizes the uncertainty of the prediction. The adjustment $C_{ij}$ corrects predictions for the missing data structure, penalizing those items that are unlikely to be known and rated by the user. Heuristics for $s_j$ and $C_{ij}$ were proposed in Sections 6.1 and 6.2.3. The obtained list of top items can be postprocessed to ensure more diverse recommendations, if needed.
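A compact sketch of this second, scoring step; the uncertainty proxy s_j below, decreasing with the support of an item, is only a stand-in for the heuristics of Sections 6.1 and 6.2.3, and all names are my illustration:

    import numpy as np

    def recommend(i, c, d, U, V, n_ratings, C, alpha=0.1, top=10):
        # c, d: user and item biases; U, V: user and item factor matrices;
        # n_ratings: number of ratings per item (its support);
        # C: missing-data corrections C[i, j] for user i and item j.
        r_hat = c[i] + d + V @ U[i]            # predicted ratings, all items
        s = 1.0 / np.sqrt(n_ratings + 1.0)     # illustrative uncertainty proxy
        score = r_hat - alpha * s - C[i]
        top_idx = np.argpartition(-score, top)[:top]   # partial sort: top items
        return top_idx[np.argsort(-score[top_idx])]    # order only the top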
I will end with a recommendation on how best to divide effort, when faced with similar tasks of real-life data analysis, to make the best use of the available data:
• Choosing the right task, 60% of the effort – researching what the proper task is, what the right problem to solve is, how to collect data (various observational studies or experimental designs are possible), how to discover good candidate variables, and how to evaluate results. Helpful in concretizing the task are: good domain understanding, observing users' interactions with the developed tool or service, and conducting surveys among users. In the task chosen for analysis in this work, Netflix did a good job of gathering the right kind of data – ratings, which allow a user to express positive and negative preferences. Ratings are a better type of data, closer to what the user thinks, than passively gathered information such as clickstream data or links between pages, which are the basis of commonly used search algorithms. Netflix also proposed a good evaluation criterion, relevant to the real-life goal of preparing good quality lists of personalized recommendations. Of course, measuring accuracy by hold-out test error is one of the easiest ways to formulate a prediction task. In practice, it is important to consider how the situation in which the data is gathered relates to the situation in which the developed solution will be used. When calculating lists of recommendations, items are evaluated within a different missing data structure than that of the training data. To what extent algorithms optimizing hold-out RMSE have to be corrected for the missing data structure to produce recommendations is a subject for further research.
• Model identification, 30% of the effort – understanding the data, guessing the underlying probabilistic structure, discovering meaningful effects, patterns, and dependencies in the data, and discovering, to a sufficient degree, the model that could plausibly have generated the data. For example, guessing the distribution of a variable is a non-obvious task even for a one-dimensional, directly observed variable. All the more difficult is guessing the distribution of hidden variables, where, no matter how much data we gather, substantial uncertainty always remains as to whether we model the hidden structure properly. We can use techniques such as starting from a nonparametric or overparameterized model and then, based on the learned shape, choosing a parametric model that overfits the data less than the more flexible, nonparametric one. Here a good idea is to exploit emerging situations of multi-task learning, where the similarity between tasks is captured by a common prior that can be assumed in a nonparametric form. The present state of the art is that model identification is largely a trial-and-error methodology, in which methods are chosen mainly on the basis of experimental experience, and rarely by systematic search. We should use domain knowledge in the process, but intuitions about what is or is not important for the task are often misleading, and we should always verify whether the emerging hypotheses agree with what is observed in the data. An open issue is to what extent the process of model identification can be automated (automatic methods exist for the simplest situations, such as automatic feature selection in linear models with few predictors – a sketch of such a greedy search follows this list). In the Netflix contest automatic parameter tuning was used extensively by many teams, but it remains an open question whether it is possible to automatically test a large group of effects, or a wide class of possible hidden structures. In the machine learning community some advocate approaches closer to the fully Bayesian one, imposing a prior on all possible models, but such approaches are rather impractical. It is better, as experiments are performed and the data becomes better understood, to gradually reject implausible models and to limit the class of considered models. A more time-efficient framework for initial model identification than working with fully specified probabilistic models seems to be neural networks (even after almost three years of the Netflix contest, a large majority of methods in the most accurate ensembles were based on neural-networks-like gradient descent optimization of regularized cost functions).
• Implementation, 10% of the effort – choosing the right method for optimizing the parameters with respect to the determined evaluation criterion. An example is the matrix factorization model in the Netflix task. Knowing the general plausible form of the model $c_i + d_j + u_i^T v_j$, questions appear: whether to use an approximate Bayesian method, such as MCMC with Gibbs sampling or Variational Bayes, or to use simple MAP estimation (which has not worked well for the variables $u_{ik}$ multiplied by $v_{jk}$), or, instead of defining a probabilistic model, to decide on simpler algorithms in a neural networks approach, with proper regularization (a nonstandard, linear regularization is needed for the SVD model in the Netflix task). Other questions are whether to learn the parameters one at a time or in whole groups, and whether to optimize the resulting numerical criterion with a first-order method like gradient descent (stochastic or batch learning, the momentum method, or perhaps conjugate gradients), or with a second-order method, jumping directly to the minimum (of one parameter or of a group of parameters) in the case of a quadratic cost function – a sketch of such an exact one-parameter update follows this list. Similar questions appear when implementing models for other datasets and tasks.
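Two small sketches referenced in the list above. First, greedy forward selection of methods for a linear blend, in the spirit of the greedy feature selection behind Table 41; the interface and names are my illustration, not the exact procedure used in the experiments:

    import numpy as np

    def greedy_blend(P, y, max_methods=10):
        # P: (n, m) hold-out predictions of m candidate methods (columns);
        # y: (n,) true hold-out ratings.
        # At each step, add the method that most reduces the blended RMSE;
        # stop when no remaining method improves the blend.
        chosen, best_rmse = [], np.inf
        while len(chosen) < max_methods:
            best_k = None
            for k in range(P.shape[1]):
                if k in chosen:
                    continue
                cols = P[:, chosen + [k]]
                w, *_ = np.linalg.lstsq(cols, y, rcond=None)
                rmse = np.sqrt(np.mean((cols @ w - y) ** 2))
                if rmse < best_rmse:
                    best_rmse, best_k = rmse, k
            if best_k is None:
                break
            chosen.append(best_k)
        return chosen, best_rmse

Second, the option of jumping directly to the minimum instead of taking gradient steps: when the cost is quadratic in a single parameter, its exact minimizer is available in closed form. Below, a sketch for one user preference coefficient of one SVD feature, assuming the contributions of all other features have already been subtracted out of the residuals; the names and the scalar regularization lam are illustrative:

    def update_user_coefficient(res, v_k, lam):
        # Cost in the single unknown u: sum_j (res[j] - u * v_k[j])^2 + lam * u^2.
        # It is quadratic in u, so we jump straight to the minimum:
        #   u* = (v_k . res) / (v_k . v_k + lam)
        # res: residual ratings of one user; v_k: feature-k values of rated items.
        return float(v_k @ res) / (float(v_k @ v_k) + lam)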
The three phases listed may alternate. Findings from later phases may help in earlier ones, perhaps leading to redefining the task and the evaluation method. The whole process of data analysis may need to be iterated repeatedly.
The prediction contest ended by crossing the arbitrary level of 10% accuracy improvement over the reference algorithm, but questions remain: what is the best attainable accuracy? Which probabilistic model generated the data? Which dimensionality reduction method is the right one (inside the unknown, optimal model)? Despite the remaining questions, the overall conclusions should hold. To model various aspects of the real world on the basis of gathered data, it is useful, and often inevitable, to use methods and approaches similar to those described in this work. Experiences with tasks like the Netflix Prize are part of the journey towards the optimal process of data analysis.
References
[Ada11] Panagiotis Adamopoulos, Alexander Tuzhilin. On Unexpectedness in Recom-
mender Systems: Or How to Expect the Unexpected, RecSys, 2011.
[Ado11] Gediminas Adomavicius, Jesse Bockstedt, Shawn Curley, Jingjing Zhang. Recom-
mender Systems, Consumer Preferences, and Anchoring Effects, Workshop on Human
Decision Making in Recommender Systems, RecSys, 2011.
[Aga09] Deepak Agarwal, Bee-Chung Chen. Regression-based Latent Factor Models, KDD,
2009.
[Aga10] Deepak Agarwal, Bee-Chung Chen. fLDA: Matrix Factorization through Latent
Dirichlet Allocation, WSDM, 2010.
[Ali04] Kamal Ali, Wijnand van Stam. TiVo: Making Show Recommendations Using a
Distributed Collaborative Filtering Architecture, KDD, 2004.
[Ama09] Xavier Amatriain, Nuria Oliver, Josep M. Pujol, Nava Tintarev. Rate it Again:
Increasing Recommendation Accuracy by User re-Rating, RecSys, 2009.
[Bak03] Bart Bakker, Tom Heskes. Task Clustering and Gating for Bayesian Multitask
Learning, JMLR, 2003.
[Bel07c] Robert M. Bell, Yehuda Koren, Chris Volinsky. The BellKor solution to the Net-
flix Prize, 2007.
[Bel07d] Robert M. Bell, Yehuda Koren. Scalable Collaborative Filtering with Jointly De-
rived Neighborhood Interpolation Weights, ICDM, 2007.
[Bel07e] Robert M. Bell, Yehuda Koren. Lessons from the Netflix Prize Challenge,
SIGKDD Explorations, 2008.
[Bel07f] Robert M. Bell, Yehuda Koren, Chris Volinsky. Chasing $1,000,000: How we won
the Netflix Progress Prize, Statistical Computing & Graphics, 2007.
[Bel08] Robert M. Bell, Yehuda Koren, Chris Volinsky. The BellKor 2008 Solution to the Netflix Prize, 2008.
[Ben07] James Bennett, Stan Lanning. The Netflix Prize, Proc. KDD Cup and Workshop,
2007.
[Bic07] Steffen Bickel, Michael Brückner, Tobias Scheffer. Discriminative Learning for Differing Training and Test Distributions, ICML, 2007.
[Bic09] Steffen Bickel, Michael Brückner, Tobias Scheffer. Discriminative Learning Under
Covariate Shift, JMLR 10, 2009.
[Bla92] Fischer Black, Robert Litterman. Global Portfolio Optimization, Financial Anal-
ysis Journal, 1992.
[Ble03] David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation,
JMLR 3, 2003.
[Bra02] Matthew Brand. Incremental singular value decomposition of uncertain data with
missing values, European Conference on Computer Vision, 2002.
[Bra03] Matthew Brand. Fast online SVD revisions for lightweight recommender systems,
SDM, 2003.
[Bre71] Richard P. Brent. Algorithms for Finding Zeros and Extrema of Functions Without
Calculating Derivatives, PhD thesis, 1971.
[Bry07] E. Brynjolfsson, Y.J. Hu, D. Simester. Goodbye Pareto Principle, Hello Long Tail:
The Effect of Search Costs on the Concentration of Product Sales, 2005-2007.
[Bur02] Robin Burke. Hybrid Recommender Systems: Survey and Experiments, User Modeling and User-Adapted Interaction, 2002.
[Bur05] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamil-
ton, Greg Hullender. Learning to Rank using Gradient Descent, ICML, 2005.
[Cai08] Jian-Feng Cai, Emmanuel J. Candes, Zuowei Shen. A Singular Value Thresholding
Algorithm for Matrix Completion, 2008.
[Can02] John Canny. Collaborative Filtering with Privacy via Factor Analysis, SIGIR,
2002.
[Can09] Emmanuel J. Candes, Terence Tao. The Power of Convex Relaxation: Near-
Optimal Matrix Completion, 2009.
[Car00] Bradley P. Carlin, Thomas A. Louis. Bayes and empirical Bayes methods for data analysis, Chapman and Hall, 2000.
[Cha11] Gil Chamiel. Utilising Structured Information for the Representation and Elici-
tation of User Preferences, PhD thesis, 2011.
[Che11a] Po-Lung Chen et al. A Linear Ensemble of Individual and Blended Models for Music Rating Prediction, KDD Cup and Workshop, 2011.
[Che11b] Tianqi Chen, Zhao Zheng, Qiuxia Lu, Weinan Zhang, Yong Yu. Feature-Based
Matrix Factorization, Technical report, 2011.
[Cho11] Sean Choi, Ernest Ryu, Yuekai Sun. Yelp++: 10 Times More Information per View, 2011.
[Cic09] Andrzej Cichocki, Rafał Zdunek, Anh Huy Phan, Shun-ichi Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, Wiley, 2009.
[Dau06] Hal Daumé III, Daniel Marcu. Domain Adaptation for Statistical Classifiers, Jour-
nal of Artificial Intelligence Research, 2006.
[Dau07] Hal Daumé III. Frustratingly Easy Domain Adaptation, ACL, 2007.
[Del08] Nicolas Delannay, Michel Verleysen. Collaborative filtering with interlaced gener-
alized linear models, Neurocomputing, 2008.
[Dro11] Gideon Dror, Noam Koenigstein, Yehuda Koren, Markus Weimer. The Yahoo! Music Dataset and KDD-Cup'11, KDD Cup, 2011.
[Far06] Julian Faraway. Extending the Linear Model with R, CRC Press, 2006.
[Faz01] Maryam Fazel, Haitham Hindi, Stephen P. Boyd. A Rank Minimization Heuris-
tic with Application to Minimum Order System Approximation, American Control
Conference, 2001.
[Faz03] Maryam Fazel, Haitham Hindi, Stephen P. Boyd. Log-det heuristic for matrix rank
minimization with applications to Hankel and Euclidean distance matrices, American
Control Conference, 2003.
[Fun06] Simon Funk. Netflix Update: Try This at Home, http://sifter.org/~simon/journal/20061211.html, 2006.
[Gan09] Emden Gansner, Yifan Hu, Stephen Kobourov, Chris Volinsky. Putting Recom-
mendations on the Map – Visualizing Clusters and Relations, RecSys, 2009.
[Gauss1809] Johann Carl Friedrich Gauss. Theoria Motus Corporum Coelestium in sec-
tionibus conicis solem ambientium (Theory of the motion of the heavenly bodies mov-
ing about the sun in conic sections), 1809, English transl. by Charles H. Davis 1857.
[Gel11] Andrew Gelman, Cosma Rohilla Shalizi. Philosophy and the practice of Bayesian
statistics, 2011.
[Gel95] Andrew Gelman, Donald B. Rubin. Avoiding model selection in Bayesian social
research, Sociological Methodology, 1995.
[Gil10] Nicolas Gillis, François Glineur. Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard, JMLR, 2010.
[Gle06] David Gleich, Matthew Rasmussen, Kevin Lang, Leonid Zhukov. The World of
Music: User Ratings; Spectral and Spherical Embeddings; Map Projections, 2006.
[Gli99] Mark E. Glickman. Parameter estimation in large dynamic paired comparison ex-
periments, Journal of the Royal Statistical Society: Series C (Applied Statistics) Vol.
48, Issue 3, 1999.
[Goe10] Sharad Goel, Andrei Broder, Evgeniy Gabrilovich, Bo Pang. Anatomy of the Long
Tail: Ordinary People with Extraordinary Tastes, WSDM, 2010.
[Gol01] Ken Goldberg, Theresa Roeder, Dhruv Gupta, Chris Perkins. Eigentaste: A Con-
stant Time Collaborative Filtering Algorithm, Information Retrieval, 2001.
[Gol92] David Goldberg, David Nichols, Brian M. Oki, Douglas Terry. Using collaborative filtering to weave an information tapestry, Communications of the ACM, 1992.
[Har07] Paul Harrison. How to get an RMSE of 0.8937 in the NetFlix Challenge,
http://logarithmic.net/pfh/blog/01176798503, 2007.
[Har11] Morgan Harvey, Mark J. Carman, Ian Ruthven, Fabio Crestani. Bayesian Latent
Variable Models for Collaborative Item Rating Prediction, CIKM, 2011.
[Has09] Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Second Edition, Springer, 2009.
[Has10] Shahzaib Hassan, Zeeshan Syed. From netflix to heart attacks: collaborative filter-
ing in medical datasets, ACM International Health Informatics Symposium, 2010.
[Hav97] Håvard Rue and Øyvind Salvesen. Predicting and retrospective analysis of soccer
matches in a league, 1997.
[Her07] Ralf Herbrich, Thore Graepel. TrueSkill: A Bayesian skill rating system, NIPS,
2007.
[Hid12] Balazs Hidasi, Domonkos Tikk. Enhancing Matrix Factorization Through Initial-
ization for Implicit Feedback Databases, Workshop on Context-awareness in Retrieval
and Recommendation, 2012.
[Hil95] Will Hill, Larry Stead, Mark Rosenstein, George Furnas. Recommending and Evaluating Choices in a Virtual Community of Use, SIGCHI Conference on Human Factors in Computing Systems, 1995.
[Hin95] Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, Radford M. Neal. The wake-sleep algorithm for unsupervised neural networks, Science, 1995.
[Hof03] Thomas Hofmann. Collaborative Filtering via Gaussian Probabilistic Latent Se-
mantic Analysis, SIGIR, 2003.
[Hof04] Thomas Hofmann. Latent Semantic Models for Collaborative Filtering, ACM
Transactions on Information Systems, 2004.
[Hof99a] Thomas Hofmann, Jan Puzicha, Michael I. Jordan. Learning from Dyadic Data,
NIPS, 1999.
[Hu10] Rong Hu, Pearl Pu. A Study on User Perception of Personality-Based Recom-
mender Systems, User Modeling, Adaptation, and Personalization, 2010.
[Ili08] Alexander Ilin, Tapani Raiko. Practical Approaches to Principal Component Anal-
ysis in the Presence of Missing Values, TKK Reports in Information and Computer
Science, 2008 (also JMLR 2010).
[Jag10] Martin Jaggi, Marek Sulovský. A Simple Algorithm for Nuclear Norm Regularized Problems, ICML, 2010.
[Jah10] Michael Jahrer, Andreas Töscher, Robert Legenstein. Combining Predictions for
Accurate Recommender Systems, KDD, 2010.
[Jah11a] Michael Jahrer, Andreas Toscher. Collaborative Filtering Ensemble, KDD Cup
and Workshop, 2011.
[Jah11b] Michael Jahrer, Andreas Toscher. Collaborative Filtering Ensemble for Ranking,
KDD Cup and Workshop, 2011.
[Jan10] Dietmar Jannach, Markus Zanker, Alexander Felfernig, Gerhard Friedrich. Rec-
ommender Systems: An Introduction, Cambridge University Press, 2010.
[Kag09] Martijn Kagie, Matthijs van der Loos, Michiel van Wezel. Including Item Char-
acteristics in the probabilistic Latent Semantic Analysis Model for Collaborative Fil-
tering, AI Communications, 2009.
[Kel56] J. L. Kelly. A New Interpretation of Information Rate, The Bell System Technical Journal, July 1956.
[Kon97] Joseph A. Konstan, Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee
R. Gordon, John Riedl. GroupLens: applying collaborative filtering to Usenet news,
Communications of the ACM, 1997.
[Kor06] Jacek Koronacki, Jan Mielniczuk. Statystyka dla studentów kierunków technicznych i przyrodniczych (Statistics for students of technical and natural sciences), WNT, 2006.
[Kor09a] Yehuda Koren. Collaborative filtering with temporal dynamics, KDD, 2009.
[Kor09b] Yehuda Koren. The BellKor Solution to the Netflix Grand Prize, 2009.
[Kor10] Yehuda Koren. Factor in the Neighbors: Scalable and Accurate Collaborative Fil-
tering, TKDD, 2010.
[Kor11] Yehuda Koren, Joe Sill. OrdRec: an ordinal model for predicting personalized item
rating distributions, RecSys, 2011.
[Kur07] Miklos Kurucz, Andras A. Benczur, Karoly Csalogany. Methods for large scale
SVD with missing values, Proc. KDD Cup and Workshop, 2007.
[Kwo11] YoungOk Kwon. Computational Techniques For More Accurate and Diverse Rec-
ommendations, PhD thesis, 2011.
[Lai11] Siwei Lai, Liang Xiang, Rui Diao, Yang Liu, Huxiang Gu, Liheng Xu, Hang Li,
Dong Wang, Kang Liu, Jun Zhao, Chunhong Pan. Hybrid Recommendation Models
for Binary User Preference Prediction Problem, KDD Cup and Workshop, 2011.
[Lam04] Shyong K. Lam, John Riedl. Shilling Recommender Systems for Fun and Profit,
WWW, 2004.
[Lat10] Neal Lathia, Stephen Hailes, Licia Capra, Xavier Amatriain. Temporal Diversity
in Recommender Systems, SIGIR, 2010.
[Law03] Neil D. Lawrence. Gaussian Process Latent Variable Models for Visualisation of
High Dimensional Data, NIPS, 2003.
[Law05] Neil Lawrence. Probabilistic Non-linear Principal Component Analysis with Gaus-
sian Process Latent Variable Models, Journal of Machine Learning Research, 2005.
[Law09] Neil D. Lawrence, Raquel Urtasun. Non-linear matrix factorization with gaussian
processes, ICML, 2009.
[Lee00] Daniel D. Lee, H. Sebastian Seung. Algorithms for Non-negative Matrix Factor-
ization, NIPS, 2000.
[Lee08] John Lees-Miller, Fraser Anderson, Bret Hoehn, Russell Greiner. Does Wikipedia
Information Help Netflix Predictions?, ICML, 2008.
[Lem05] Daniel Lemire, Anna Maclachlan. Slope One Predictors for Online Rating-Based
Collaborative Filtering, SDM, 2005.
[Lim07] Yew Jin Lim, Yee Whye Teh. Variational Bayesian Approach to Movie Rating
Prediction, Proc. KDD Cup and Workshop, 2007.
[Lin03] Greg Linden, Brent Smith, Jeremy York. Amazon.com recommendations: Item-to-
item collaborative filtering, IEEE Internet Computing 7, 2003.
[Liu09] Ji Liu, Przemyslaw Musialski, Peter Wonka, Jieping Ye. Tensor Completion for
Estimating Missing Values in Visual Data, Computer Vision, 2009.
[Lon10] Philip M. Long, Rocco A. Servedio. Restricted Boltzmann Machines are Hard to
Approximately Evaluate or Simulate, ICML, 2010.
[Ma08] Shiquian Ma, Donald Goldfarb, Lifeng Chen. Fixed point and Bregman iterative
methods for matrix rank minimization, 2008.
[Mac03] David J.C. MacKay. Information Theory, Inference, and Learning Algorithms,
Cambridge University Press, 2003.
[Mac10] Lester Mackey, David Weiss, Michael I. Jordan. Mixed Membership Matrix Fac-
torization, ICML, 2010.
[Mad03] M.R. Madruga, C.A.B. Pereira, J.M. Stern. Bayesian evidence test for precise
hypotheses, Journal of Statistical Planning and Inference, 2003.
[Mar05] Benjamin Marlin, Sam Roweis, Richard Zemel. Unsupervised Learning with Non-Ignorable Missing Data, Workshop on Artificial Intelligence and Statistics (AISTATS), 2005.
[Mar08] Benjamin Marlin. Missing data problems in machine learning, PhD thesis, University of Toronto, 2008.
[Mar09] Benjamin M. Marlin and Richard Zemel. Collaborative Prediction and Ranking
with Non-Random Missing Data, RecSys, 2009.
[McF12] Brian McFee, Thierry Bertin-Mahieux, Daniel P.W. Ellis, Gert R.G. Lanckriet. The Million Song Dataset Challenge, WWW, 2012.
[McK11] McKenzie et al. Novel Models and Ensemble Techniques to Discriminate Favorite Items from Unrated Ones for Personalized Music Recommendation, KDD Cup and Workshop, 2011.
[McM96] Daniel W. McMichael. Estimating Gaussian Mixture Models from Data with
Missing Features, Signal Processing and Applications, 1996.
[Mee09] Lydia Meesters, Paul Marrow, Bart Knijnenburg, Don Bouwhuis, Maxine Glancy.
ICT MyMedia Project. Deliverable 1.5. End-user recommendation evaluation metrics,
ICT MyMedia, 2009.
[Meh09] Bhaskar Mehta, Wolfgang Nejdl. Unsupervised strategies for shilling detection
and robust collaborative filtering, User Modeling and User-Adapted Interaction, 2009.
[Mey12] Frank Meyer. Recommender systems in industrial contexts, PhD thesis, 2012.
[Mil03] Bradley N. Miller, Istvan Albert, Shyong K. Lam, Joseph A. Konstan, John Riedl. MovieLens Unplugged: Experiences with an Occasionally Connected Recommender System, Intelligent User Interfaces, 2003.
[Mni10] Andriy Mnih. Learning Distributed Representations for Statistical Language Mod-
elling and Collaborative Filtering, PhD thesis, 2010.
[Mni11] Andriy Mnih. Taxonomy-informed latent factor models for implicit feedback, KDD
Cup and Workshop, 2011.
[Mon03] Miquel Montaner, Beatriz López, Josep Lluı́s de la Rosa. A Taxonomy of Rec-
ommender Agents on the Internet, Artificial Intelligence Review, 2003.
[Mos95] Klaus Mosegaard, Albert Tarantola. Monte Carlo sampling of solutions to inverse
problems, Journal of Geophysical Research, 1995.
[Nak10b] Shinichi Nakajima, Masashi Sugiyama, Ryota Tomioka. Global Analytic Solution
for Variational Bayesian Matrix Factorization, NIPS, 2010.
[Nak11a] Shinichi Nakajima, Masashi Sugiyama, Derin Babacan. On Bayesian PCA: Au-
tomatic Dimensionality Selection and Analytic Solution, ICML, 2011.
[New03] M.E.J. Newman. The structure and function of complex networks, SIAM Review,
2003.
[New10] Chris Newell, Bart Knijnenburg. ICT MyMedia Project. Deliverable 5.4. En-
hanced Internet A/V Content, ICT MyMedia, 2010.
[Paq10] Ulrich Paquet, Blaise Thomson, Ole Winther. Large-scale Ordinal Collaborative
Filtering, 1st Workshop on Mining the Future Internet, Future Internet Symposium,
2010.
[Pat07] Arkadiusz Paterek. Improving Regularized Singular Value Decomposition for Col-
laborative Filtering, Proc. KDD Cup and Workshop, 2007.
[Pea09] Judea Pearl. Causal inference in statistics: An overview, Statistics Surveys, 2009.
[Per10] Patrick O. Perry, Art B. Owen. A Rotation Test to Verify Latent Structure, JMLR
11, 2010.
[Pil09b] István Pilászy, Domonkos Tikk. Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata, RecSys, 2009.
[Pil10] István Pilászy, Dávid Zibriczky, Domonkos Tikk. Fast ALS-based matrix factorization for explicit and implicit feedback datasets, RecSys, 2010.
[Pio09] Martin Piotte, Martin Chabbert. The Pragmatic Theory solution to the Netflix
grand prize, 2009.
[Por08] Ian Porteous, Evgeniy Bart, Max Welling. Multi-HDP: A Non Parametric
Bayesian Model for Tensor Factorization, AAAI, 2008.
[Por10a] Ian Porteous, Arthur Asuncion, Max Welling. Bayesian Matrix Factorization
with Side Information and Dirichlet Process Mixtures, AAAI, 2010.
[Por10b] Ian Porteous. Mixture Block Methods for Non Parametric Bayesian Models with
Applications, PhD Thesis, 2010.
[Pot08] Gavin Potter. Putting the collaborator back into collaborative filtering, Proc. KDD
Workshop, 2008.
[Pra11] Bruno Pradel, Savaneary Sean, Julien Delporte, Sébastien Guérif, Céline Rou-
veirol, Nicolas Usunier, Françoise Fogelman-Soulié, Frédéric Dufau-Joel. A Case Study
in a Recommender System Based on Purchase Data, KDD, 2011.
[Rai07] Tapani Raiko, Alexander Ilin, Juha Karhunen. Principal Component Analysis for
Large Scale Problems with Lots of Missing Values, European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases, 2007.
[Ren05] Jason D. M. Rennie, Nathan Srebro. Fast Maximum Margin Matrix Factorization for Collaborative Prediction, ICML, 2005.
[Res94] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl.
GroupLens: An Open Architecture for Collaborative Filtering of Netnews, ACM Con-
ference on Computer Supported Cooperative Work, 1994.
[Ric10] Francesco Ricci, Lior Rokach, Bracha Shapira, Paul Kantor (eds.). Recommender Systems Handbook, Springer, 2010.
[Row98] Sam Roweis. EM Algorithms for PCA and SPCA, NIPS, 1998.
[Row99] Sam Roweis, Zoubin Ghahramani. A Unifying Review of Linear Gaussian Models,
Neural Computation, 1999.
[Sal07a] Ruslan Salakhutdinov, Andriy Mnih, Geoffrey Hinton. Restricted Boltzmann Ma-
chines for Collaborative Filtering, ICML, 2007.
[Sal09] Ruslan Salakhutdinov. Learning Deep Generative Models, PhD thesis, 2009.
[Sar00] B. M. Sarwar, G. Karypis, Joseph A. Konstan, John Riedl. Application of Dimensionality Reduction in Recommender System – A Case Study, WebKDD workshop, 2000.
[Sar01] B. M. Sarwar, G. Karypis, Joseph A. Konstan, John Riedl. Item-Based Collaborative Filtering Recommendation Algorithms, WWW, 2001.
[Sch02a] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, David M. Pennock. Meth-
ods and Metrics for Cold-Start Recommendations, SIGIR, 2002.
[Sch05] Anton Schwaighofer, Volker Tresp, Kai Yu. Learning Gaussian Process Kernels
via Hierarchical Bayes, ICML, 2005.
[Sch07] J. Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. Collaborative
Filtering Recommender Systems, The Adaptive Web, 2007.
[Sel11] Joachim Selke, Wolf-Tilo Balke. Extracting Features from Ratings: The Role of
Factor Models, 2011.
[Shi00] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by
weighting the log-likelihood function, Journal of Statistical Planning and Inference,
2000.
[Sil09] Joseph Sill, Gabor Takacs, Lester Mackey, David Lin. Feature-Weighted Linear
Stacking, 2009.
[Sin08] Ajit P. Singh, Geoffrey J. Gordon. A Unified View of Matrix Factorization Models,
Machine Learning and Knowledge Discovery in Databases, 2008.
[Sin09] Ajit Paul Singh. Efficient Matrix Models for Relational Learning, PhD thesis,
Carnegie Mellon University, 2009.
[Ski07] David Skillicorn. Understanding Complex Datasets: Data Mining with Matrix De-
compositions, Chapman and Hall, 2007.
[Sre04] Nathan Srebro. Learning with matrix factorizations, PhD thesis, 2004.
[Sta11] Marius Stanescu. Rating systems with multiple factors, MSc thesis, 2011.
[Ste09] David Stern, Ralf Herbrich, Thore Graepel. Matchbox: Large Scale Online
Bayesian Recommendations, WWW, 2009.
[Ste10a] Julie Steele, Noah Iliinsky. Beautiful Visualization: Looking at Data through the Eyes of Experts (Chapter 9 by Todd Holloway), O'Reilly Media, 2010.
[Ste10b] Harald Steck. Training and Testing of Recommender Systems on Data Missing
Not at Random, KDD, 2010.
[Sug08] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von
Bunau. Direct Importance Estimation for Covariate Shift Adaptation, Annals of the
Institute of Statistical Mathematics, 2008.
[Tak07a] Gabor Takacs, István Pilászy, Bottyan Nemeth, Domonkos Tikk. On the Gravity Recommendation System, Proc. KDD Cup and Workshop, 2007.
[Tak07b] Gabor Takacs, István Pilászy, Bottyan Nemeth, Domonkos Tikk. Major components of the Gravity Recommendation System, SIGKDD Explorations, 2007.
[Tak08a] Gabor Takacs, István Pilászy, Bottyan Nemeth, Domonkos Tikk. Investigation of Various Matrix Factorization Methods for Large Recommender Systems, Proc. KDD Workshop, 2008.
[Tak08b] Gabor Takacs, István Pilászy, Bottyan Nemeth, Domonkos Tikk. Matrix factorization and neighbor based algorithms for the Netflix Prize problem, RecSys, 2008.
[Tak08c] Gabor Takacs, István Pilászy, Bottyan Nemeth, Domonkos Tikk. A Unified Approach of Factor Models and Neighbor Based Methods for Large Recommender Systems, Applications of Digital Information and Web Technologies, 2008.
[Tak09a] Gabor Takacs, István Pilászy, Bottyan Nemeth, Domonkos Tikk. Scalable collaborative filtering approaches for large recommender systems, JMLR 10, 2009.
[Tak09b] Gábor Takács. Convex polyhedron learning and its applications, PhD Thesis,
2009.
[Tan09] Tom F. Tan, Serguei Netessine. Is Tom Cruise Threatened? Using Netflix Prize
Data to Examine the Long Tail of Electronic Commerce, Working paper, 2009.
[Tar05] Albert Tarantola. Inverse Problem Theory and Methods for Model Parameter Es-
timation, SIAM, 2005.
[Tos08b] Andreas Toscher, Michael Jahrer. The BigChaos Solution to the Netflix Prize, 2008.
[Tos09] Andreas Toscher, Michael Jahrer, Robert M. Bell. The BigChaos Solution to the Netflix Grand Prize, 2009.
[Wu07] Mingrui Wu. Collaborative Filtering via Ensembles of Matrix Factorizations, Proc.
KDD Cup and Workshop, 2007.
[Wu08] Jinlong Wu, Tiejun Li. A Modified Fuzzy C-Means Algorithm for Collaborative
Filtering, Proc. KDD Workshop, 2008.
[Wu09] Jinlong Wu. Binomial Matrix Factorization for Discrete Collaborative Filtering,
ICDM, 2009.
[Xia09] Liang Xiang, Qing Yang. Time-dependent Models in Collaborative Filtering based
Recommender System, Web Intelligence and Intelligent Agent Technologies, 2009.
[Xio09] Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, Jaime G. Carbonell. Tem-
poral Collaborative Filtering with Bayesian Probabilistic Tensor Factorization, 2009.
[Yu03] Kai Yu, Anton Schwaighofer, Volker Tresp, Wei-Ying Ma, HongJiang Zhang. Col-
laborative Ensemble Learning: Combining Collaborative and Content-Based Informa-
tion Filtering via Hierarchical Bayes, UAI, 2003.
[Yu04] K. Yu, A. Schwaighofer, V. Tresp, W.-Y. Ma, H. Zhang. A nonparametric bayesian
framework for information filtering, SIGIR, 2004.
[Yu05a] Kai Yu, Volker Tresp, Anton Schwaighofer. Learning Gaussian Processes from
Multiple Tasks, ICML, 2005.
[Yu05b] Kai Yu, Volker Tresp. Learning to Learn and Collaborative Filtering, NIPS, 2005.
[Yu06a] Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, Zhao Xu. Stochastic Relational
Models for Discriminative Link Prediction, NIPS, 2006.
[Yu06b] Shipeng Yu, Kai Yu, Volker Tresp, Hans-Peter Kriegel. Collaborative Ordinal
Regression, ICML, 2006.
[Yu07a] Kai Yu, Wei Chu. Gaussian Process Models for Link Analysis and Transfer Learn-
ing, NIPS, 2007.
[Yu07b] Shipeng Yu, Volker Tresp, Kai Yu. Robust Multi-Task Learning with t-Processes,
ICML, 2007.
[Yu09a] Kai Yu, Shenghuo Zhu, John Lafferty, Yihong Gong. Fast Nonparametric Matrix
Factorization for Large-scale Collaborative Filtering, SIGIR, 2009.
[Yu09b] Kai Yu, John Lafferty, Shenghuo Zhu, Yihong Gong. Large-scale Collaborative
Prediction Using a Nonparametric Random Effects Model, ICML, 2009.
[Yue07] Yisong Yue, Thomas Finley, Filip Radlinski, Thorsten Joachims. A Support Vector
Method for Optimizing Average Precision, SIGIR, 2007.
[Zha05] Sheng Zhang, Weihong Wang, James Ford, Fillia Makedon, Justin Pearlman.
Using Singular Value Decomposition Approximation for Collaborative Filtering, IEEE
International Conference on E-Commerce Technology, 2005.
[Zha06] Sheng Zhang, Weihong Wang, James Ford, Fillia Makedon. Learning from Incom-
plete Ratings Using Non-negative Matrix Factorization, SDM, 2006.
[Zha07a] Yi Zhang, Jonathan Koren. Efficient Bayesian Hierarchical User Modeling for
Recommendation Systems, SIGIR, 2007.
[Zha07b] Yi-Cheng Zhang, Matus Medo, Jie Ren, Tao Zhou, Tao Li, Fan Yang. Recom-
mendation model based on opinion diffusion, Europhysics Letters, 2007.
[Zha07c] Yi-Cheng Zhang, Marcel Blattner, Yi-Kuo Yu. Heat Conduction Process on Com-
munity Networks as a Recommendation Model, Physical Review Letters, 2007.
[Zha09] Yi Zhang, Jiazhong Nie. Probabilistic Latent Relational Model for Integrating Het-
erogeneous Information for Recommendation. Technical Report, 2009.
[Zho08] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan. Large-scale Par-
allel Collaborative Filtering for the Netflix Prize, Algorithmic Aspects in Information
and Management, 2008.
[Zho10a] Tao Zhou, Zoltán Kuscsik, Jian-Guo Liu, Matúš Medo, Joseph R. Wakeling,
Yi-Cheng Zhang. Solving the apparent diversity-accuracy dilemma of recommender
systems, Proceedings of the National Academy of Sciences, 2010.
[Zho10b] Mingyuan Zhou, Chunping Wang, Minhua Chen, John Paisley, David Dunson,
Lawrence Carin. Nonparametric Bayesian Matrix Completion, IEEE Sensor Array
and Multichannel Signal Processing Workshop, 2010.
[Zhu08] Shenghuo Zhu, Kai Yu, Yihong Gong. Stochastic Relational Models for Large-scale
Dyadic Data using MCMC, NIPS, 2008.
[Zie05] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen. Improving
Recommendation Lists Through Topic Diversification, WWW, 2005.
Abbreviations:
KDD - ACM SIGKDD Conference on Knowledge Discovery and Data Mining
ICDM - IEEE International Conference on Data Mining
ICML - International Conference on Machine Learning
JMLR - Journal of Machine Learning Research
NIPS - Neural Information Processing Systems Conference
RecSys - ACM Conference on Recommender Systems
SDM - SIAM International Conference on Data Mining
SIGIR - ACM Special Interest Group on Information Retrieval Conference
WSDM - ACM International Conference on Web Search and Data Mining
WWW - International World Wide Web Conference