A Survey of Evaluation Metrics of Recommendation Tasks
Abstract
Recommender systems are now popular both commercially and in the research community, where
many algorithms have been suggested for providing recommendations. These algorithms typically
perform differently in various domains and tasks. Therefore, it is important from the research
perspective, as well as from a practical view, to be able to decide on an algorithm that matches
the domain and the task of interest. The standard way to make such decisions is by comparing a
number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have
been suggested for comparing recommendation algorithms. The decision on the proper evaluation
metric is often critical, as each metric may favor a different algorithm. In this paper we review
the proper construction of offline experiments for deciding on the most appropriate algorithm. We
discuss three important tasks of recommender systems, and classify a set of appropriate well known
evaluation metrics for each task. We demonstrate how using an improper evaluation metric can lead
to the selection of an improper algorithm for the task of interest. We also discuss other important
considerations when designing offline experiments.
Keywords: recommender systems, collaborative filtering, statistical analysis, comparative studies
1. Introduction
Recommender systems can now be found in many modern applications that expose the user to a
huge collection of items. Such systems typically provide the user with a list of recommended
items they might prefer, or supply guesses of how much the user might prefer each item. These
systems help users to decide on appropriate items, and ease the task of finding preferred items in
the collection.
For example, the DVD rental provider Netflix displays predicted ratings for every displayed
movie in order to help the user decide which movie to rent. The online book retailer Amazon
provides average user ratings for displayed books, and a list of other books that are bought by
users who buy a specific book. Microsoft provides many free downloads for users, such as bug
fixes, products, and so forth. When a user downloads some software, the system presents a list
of additional items that are often downloaded together with it. All these systems are typically categorized as
recommender systems, even though they provide diverse services.
In the past decade, there has been a vast amount of research in the field of recommender sys-
tems, mostly focusing on designing new algorithms for recommendations. An application designer
who wishes to add a recommendation system to her application has a large variety of algorithms at
her disposal, and must make a decision about the most appropriate algorithm for her application.
Typically, such decisions are based on offline experiments, comparing the performance of a number
of candidate algorithms over real data. The designer can then select the best performing algorithm,
given structural constraints. Furthermore, most researchers who suggest new recommendation al-
gorithms also compare the performance of their new algorithm to a set of existing approaches. Such
evaluations are typically performed by applying some evaluation metric that provides a ranking of
the candidate algorithms (usually using numeric scores).
Many evaluation metrics have been used to rank recommendation algorithms, some measuring
similar features, but some measuring drastically different quantities. For example, the Root
Mean Squared Error (RMSE) measures the distance between predicted preferences and true
preferences over items, while Recall computes the proportion of favored items that were
suggested. Clearly, it is unlikely that a single algorithm would outperform all others under
every possible metric.
Therefore, we should expect different metrics to provide different rankings of algorithms. As
such, the choice of evaluation metric has a crucial influence on which recommender algorithm
is ultimately selected for deployment. This survey reviews existing
evaluation metrics, suggesting an approach for deciding which evaluation metric is most appropriate
for a given application.
We categorize previously suggested recommender systems into three major groups, each cor-
responding to a different task. The first obvious task is to recommend a set of good (interesting,
useful) items to the user. In this task it is assumed that all good items are interchangeable. A
second, less discussed, although highly important task is utility optimization. For example, many
e-commerce websites use a recommender system, hoping to increase their revenues. In this case,
the task is to present a set of recommendations that will optimize the retailer revenue. Finally, a
very common task is the prediction of user opinion (e.g., rating) over a set of items. While this may
not be an explicit act of recommendation, much research in recommender systems focuses on this
task, and so we address it here.
For each such task we review a family of common evaluation metrics that measure the perfor-
mance of algorithms on that task. We discuss the properties of each such metric, and why it is most
appropriate for a given task.
In some cases, applying incorrect evaluation metrics may result in selecting an inappropriate
algorithm. We demonstrate this by experimenting with a wide collection of data sets, comparing a
number of algorithms using various evaluation metrics, showing that the metrics rank the algorithms
differently.
We also discuss the proper design of an offline experiment, explaining how the data should
be split, which measurements should be taken, how to determine if differences in performance are
statistically significant, and so forth. We also describe a few common pitfalls that may produce
results that are not statistically sound.
The paper is structured as follows: we begin with some necessary background on recommender
approaches (Section 2). We categorize recommender systems into a set of three tasks in Section 3.
We then discuss evaluation protocols, including online experimentation, offline testing, and statis-
tical significance testing of results in Section 4. We proceed to review a set of existing evaluation
metrics, mapping them to the appropriate task (Section 5). We then provide (Section 6) some ex-
amples of applying different metrics to a set of algorithms, resulting in questionable rankings of
these algorithms when inappropriate measures are used. Following this, we discuss some addi-
tional relevant topics that arise (Section 7) and some related work (Section 8), and then conclude
(Section 9).
2. Algorithmic Approaches
There are two dominant approaches for computing recommendations for the active user—the user
that is currently interacting with the application and the recommender system. First, the collabora-
tive filtering approach (Breese et al., 1998) assumes that users who agreed on preferred items in the
past will tend to agree in the future too. Many such methods rely on a matrix of user-item ratings to
predict unknown matrix entries, and thus to decide which items to recommend.
A simple approach in this family (Konstan et al., 2006), commonly referred to as user based
collaborative filtering, identifies a neighborhood of users that are similar to the active user. This
set of neighbors is based on the similarity of observed preferences between these users and the
active user. Then, items that were preferred by users in the neighborhood are recommended to the
active user. Another approach (Linden et al., 2003), known as item based collaborative filtering,
recommends, for a given active item, other items that are also preferred by the users who prefer
that active item. In collaborative filtering approaches, the system only has access to the
item and user identifiers, and no additional information over items or users is used. For example,
websites that present recommendations titled “users who preferred this item also prefer” typically
use some type of collaborative filtering algorithm.
A second popular approach is the content-based recommendation. In this approach, the system
has access to a set of item features. The system then learns the user preferences over features, and
uses these computed preferences to recommend new items with similar features. Such recommen-
dations are typically titled “similar items”. User features, such as demographics (e.g., gender,
age, geographic location), can also provide valuable information when available.
Each approach has advantages and disadvantages, and a multitude of algorithms from each
family, as well as a number of hybrid approaches, have been suggested. This paper, though, makes no
distinction between the underlying recommendation algorithms when evaluating their performance.
Just as users should not need to take into account the details of the underlying algorithm when using
the resulting recommendations, it is inappropriate to select different evaluation metrics for different
recommendation approaches. In fact, doing so would make it difficult to decide which approach to
employ in a particular application.
We are interested in the proper evaluation of such algorithms, and our classification is derived from that goal. While
there may be recommender systems that do not fit well into the classes that we suggest, we believe
that the vast majority of the recommender systems attempt to achieve one of these tasks, and can
thus be classified as we suggest.
In a subscription service, where revenue comes from users paying a regular subscription, the goal may be to allow
users to easily reach items of interest. In this case, the system should suggest items such that the
user reaches items of interest with minimal effort.
The utility function to be optimized can be more complicated, and in particular, may be a func-
tion of the entire set of recommendations and their presentation to the user. For example, in pay-
per-click search advertising, the system must recommend advertisements to be displayed on search
results pages. Each advertiser bids a fixed amount that is paid only when the user clicks on their
ad. If we wish to optimize the expected system profit, both the bids and the probability that the user
will click on each ad must be taken into account. This probability depends on the relevance of each
ad to the user and the placement of the different ads on the page. Since the different ads displayed
compete for the user’s attention, the utility function depends on the entire set of ads displayed, and
is not additive over the set (Gunawardana and Meek, 2008).
In all of these cases, it may be suboptimal to suggest items based solely on their predicted
rating. While it is certainly beneficial to recommend relevant items, other considerations are also
important. For example, in the e-commerce scenario, given two items that the system perceives
as equally relevant, suggesting the item with the higher profit can further increase revenue. In the
online news agency case, recommending longer stories may be beneficial, because reading them
will keep the user in the website longer. In the subscription service, recommending items that are
harder for the user to reach without the recommender system may be beneficial.
Another common practice of recommendation systems is to suggest recommendations that pro-
vide the most “value” to the user. For example, recommending popular items can be redundant, as
the user is probably already familiar with them. A recommendation of a preferred, yet unknown
item can provide a much higher value for the user.
Such approaches can be viewed as instances of providing recommendations that maximize some
utility function that assigns a value to each recommendation. Defining the correct utility function
for a given application can be difficult (Braziunas and Boutilier, 2005), and typically system design-
ers make simplifying assumptions about the user utility function. In the e-commerce case the utility
function is typically the profit resulting from recommending an item, and in the news scenario the
utility can be the expected time for reading a news item, but these choices ignore the effect of the
resulting recommendations on long-term profits. When we are interested in novel recommenda-
tions, the utility can be the log of the inverse popularity of an item, modeling the amount of new
information in a recommended item (Shani et al., 2005), but this ignores other aspects of user-utility
such as the diversity of recommendations.
In fact, it is possible to view many recommendation tasks, such as providing novel or serendipitous
recommendations, as maximizing some utility function. Also, the “recommend good items” task of
the previous section can be considered as optimizing a utility function that assigns a value of 1 to
each successful recommendation. In this paper, due to the popularity of the former task, we choose
to keep the two tasks distinct.
search for, say, laptops that cost between $400 and $700. The system adds to some laptops in the
list an automatically computed rating, based on the laptop features.
It is arguable whether this task is indeed a recommendation task. However, many researchers in
the recommender systems community are attempting to find good algorithms for this task. Examples
include the Netflix competition, which was warmly embraced by the research community, and the
numerous papers on predicting ratings on the Netflix or MovieLens data sets.
While such systems do not provide lists of recommended items, predicting that the user will rate
an item highly can be considered an act of recommendation. Furthermore, one can view a predicted
high rating as a recommendation to use the item, and a predicted low rating as a recommendation
to avoid the item. Indeed, it is common practice to use predicted ratings to generate a list of
recommendations. Below, we present several cases where this common practice may be
undesirable.
4. Evaluation Protocols
We now discuss an experimental protocol for evaluating and choosing recommendation algorithms.
We review several requirements to ensure that the results of the experiments are statistically sound.
We also describe several common pitfalls in such experimental settings. This section reviews the
evaluation protocols in related areas such as machine learning and information retrieval, highlight-
ing practices relevant to evaluating recommendation systems. The reader is referred to publications
in these fields for more detailed discussions (Salzberg, 1997; Demšar, 2006; Voorhees, 2002a).
We begin by discussing online experiments, which can measure the real performance of the
system. We then argue that offline experiments are also crucial, because online experiments are
costly in many cases. Therefore, the bulk of the section discusses the offline experimental setting in
detail.
alternatives are fair. It is also important to single out the different aspects of the recommenders.
For example, if we care about algorithmic accuracy, it is important to keep the user interface fixed.
On the other hand, if we wish to focus on a better user interface, it is best to keep the underlying
algorithm fixed.
However, in a multitude of cases, such experiments are very costly, since creating online testing
systems may require much effort. Furthermore, we would like to evaluate our algorithms before
presenting their results to the users, in order to avoid a negative user experience for the test users.
For example, a test system that provides irrelevant recommendations may discourage the test users
from using the real system ever again. Finally, designers that wish to add a recommendation system
to their application before its deployment do not have an opportunity to run such tests.
For these reasons, it is important to be able to evaluate the performance of algorithms in an
offline setting, assuming that the results of these offline tests correlate well with the online behavior
of users.
taking into account any new data that arrives after the test time. Another alternative is to sample a
test time for each test user, and hide the test user’s items after that time, without maintaining time
consistency across users. This effectively assumes that it is the sequence in which items are selected,
and not the absolute times when they are selected, that is important. A final alternative is to ignore
time: we sample a set of test users, then for each test user $a$ sample the number $n_a$ of items to hide,
and then sample $n_a$ items to hide. This assumes that the temporal aspects of user selections are unimportant.
All three of the latter alternatives partition the data into a single training set and single test set. It is
important to select an alternative that is most appropriate for the domain and task of interest, rather
than the most convenient one.
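As a concrete illustration, the following Python sketch implements the last, time-agnostic alternative under simple assumptions (all function and variable names are our own, not from any particular library): it samples a set of test users and, for each of them, hides a randomly sized subset of their items.

import random

def split_users(user_items, test_fraction=0.2, max_hidden=10, seed=0):
    # Split per-user item sets into training items and hidden (test) items,
    # ignoring the times at which the items were selected.
    rng = random.Random(seed)
    users = list(user_items)
    test_users = set(rng.sample(users, int(test_fraction * len(users))))
    train, hidden = {}, {}
    for user, items in user_items.items():
        items = list(items)
        if user in test_users and len(items) > 1:
            # sample how many items to hide for this user, then which items
            n_a = rng.randint(1, min(max_hidden, len(items) - 1))
            held_out = set(rng.sample(items, n_a))
            hidden[user] = held_out
            train[user] = [i for i in items if i not in held_out]
        else:
            train[user] = items
    return train, hidden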
A common protocol used in many research papers is to use a fixed number of known items or a
fixed number of hidden items per test user (so called “given n” or “all but n” protocols). This pro-
tocol is useful for diagnosing algorithms and identifying in which cases they work best. However,
when we wish to make decisions on the algorithm that we will use in our application, we must ask
ourselves whether we are truly interested in presenting recommendations for users who have rated
exactly n items, or are expected to rate exactly n more items. If that is not the case, then results
computed using these protocols have biases that make them difficult to use in predicting the outcome
of using the algorithms online.
The evaluation protocol we suggest above generates a test set (Duda and Hart, 1973) which is
used to obtain held-out estimates for algorithm performance, using performance measures which
we discuss below. Another popular alternative is to use cross-validation (Stone, 1974), where the
data is divided into a number of partitions, and each partition in turn is used as a test set. The
advantages of the cross-validation approach are that it allows the use of more data when ranking algorithms,
and that it takes into account the effect of training set variation. In the case of recommender systems,
the held-out approach usually yields enough data to make reliable decisions. Furthermore, in real
systems, the problem of variation in training data is avoided by evaluating systems trained on the
historical data specific to the task at hand. In addition, there is a risk that since the results on the
different data partitions are not independent of each other, pooling the results across partitions for
ranking algorithms can lead to statistically unjustified decisions (Bengio and Grandvalet, 2004).
When test users are drawn independently from some population, the performance measures of the algorithms
for each test user give us the independent comparisons we need. However, when recommendations
or predictions of multiple items are made to the same user, it is unlikely that the resulting per-item
performance metrics are independent. Therefore, it is better to compare algorithms on a per-user
basis. Approaches for use when users have not been sampled independently also exist, and attempt
to directly model these dependencies (see, e.g., Larocque et al. 2007). Care should be exercised
when using such methods, as it can be difficult to verify that the modeling assumptions that they
depend on hold in practice.
Given such paired per-user performance measures for algorithms A and B, the simplest test of
significance is the sign test (Demšar, 2006). In this test, we count the number of users for whom algorithm
A outperforms algorithm B ($n_A$) and the number of users for whom algorithm B outperforms
algorithm A ($n_B$). The probability that A is not truly better than B is estimated as the probability of
at least $n_A$ out of $n_A + n_B$ 0.5-probability Binomial trials succeeding (that is, $n_A$ out of $n_A + n_B$ fair
coin-flips coming up “heads”):

$$\mathrm{pr}(\text{successes} \geq n_A \mid A = B) = 0.5^{\,n_A + n_B} \sum_{k=n_A}^{n_A + n_B} \frac{(n_A + n_B)!}{k!\,(n_A + n_B - k)!}.$$
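For concreteness, the following Python sketch computes this one-sided sign-test p-value directly from the two win counts; the function name is our own, and ties are assumed to have been discarded before counting.

from math import comb

def sign_test_p_value(n_a, n_b):
    # Probability of at least n_a successes in n_a + n_b fair coin flips,
    # i.e., the one-sided sign-test p-value for "A is not truly better than B".
    n = n_a + n_b
    return 0.5 ** n * sum(comb(n, k) for k in range(n_a, n + 1))

# Example: A wins on 60 users and B wins on 40 users.
# print(sign_test_p_value(60, 40))  # roughly 0.028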
The sign test is an attractive choice due to its simplicity and its lack of assumptions about the
distribution of cases. Still, this test may mislabel significant results as insignificant
when the number of test points is small. In these cases, the more sophisticated Wilcoxon signed
rank test can be used (Demšar, 2006). As mentioned in Section 4.2, cross-validation can be used to
increase the amount of data, and thus the significance of results, but in this case the results obtained
on the cross-validated test sets are no longer independent, and care must be exercised to ensure that
our decisions account for this (Bengio and Grandvalet, 2004). Also, model-based approaches (e.g.,
Goutte and Gaussier, 2005) may be useful when the amount of data is small, but once again, care
must be exercised to ensure that the model assumptions are reasonable for the application at hand.
Another important consideration is the effect of evaluating multiple versions of algorithms. For
example, an experimenter might try out several variants of a novel recommender algorithm and
compare them to a baseline algorithm until they find one that passes a sign test at the p = 0.05 level
and therefore infer that their algorithm improves upon the baseline with 95% confidence. However,
this is not a valid inference. Suppose the experimenter evaluated ten different variants, all of which
are statistically the same as the baseline. If the probability that any one of these trials passes the
sign test mistakenly is p = 0.05, the probability that at least one of the ten trials passes the sign test
mistakenly is $1 - (1 - 0.05)^{10} \approx 0.40$. This risk is colloquially known as “tuning to the test set” and
can be avoided by separating the test set users into two groups—a development (or tuning) set, and
an evaluation set. The choice of algorithm is made based on the development set, and the validity
of the choice is measured by running a significance test on the evaluation set.
A similar concern exists when ranking a number of algorithms, but is more difficult to circum-
vent. Suppose the best of N + 1 algorithms is chosen on the development test set. We can have
confidence $1 - p$ that the chosen algorithm is indeed the best if it outperforms the N other
algorithms on the evaluation set with significance $1 - (1 - p)^{1/N}$. This is known as the Bonferroni
correction, and should be used when pair-wise significance tests are used multiple times. Alterna-
tively, the Friedman test for ranking can be used (Demšar, 2006).
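The following sketch (names are our own) illustrates the correction just described: with N pairwise sign tests against the chosen algorithm, each individual test must be run at the stricter level $1 - (1 - p)^{1/N}$ for the overall confidence to remain $1 - p$.

def per_comparison_level(p_overall, num_comparisons):
    # Significance level required of each pairwise test so that the
    # overall probability of any false positive stays below p_overall.
    return 1.0 - (1.0 - p_overall) ** (1.0 / num_comparisons)

def best_is_significant(pairwise_p_values, p_overall=0.05):
    # pairwise_p_values: sign-test p-values of the chosen algorithm
    # against each of the N other algorithms on the evaluation set.
    threshold = per_comparison_level(p_overall, len(pairwise_p_values))
    return all(p <= threshold for p in pairwise_p_values)

# With N = 10 competitors and an overall level of 0.05, each pairwise
# test must pass at roughly the 0.0051 level.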
5. Evaluating Tasks
An application designer that wishes to employ a recommendation system typically knows the pur-
pose of the system, and can map it into one of the tasks defined above—recommendation, utility
optimization, and ratings prediction. Given such a mapping, the designer must now decide which
evaluation metric to use in order to rank a set of candidate recommendation algorithms. It is impor-
tant that the metric match the task, to avoid an inappropriate ranking of the candidates.
Below we provide an overview of a large number of evaluation metrics that have been suggested
in the recommendation systems literature. For each such metric we identify its important properties
and explain why it is most appropriate for the given task. For each task we also explain a possible
evaluation scenario that can be used to evaluate the various algorithms.
Other variants of this family are the Mean Square Error (which is equivalent to RMSE for the purpose
of ranking algorithms), the Mean Absolute Error (MAE), and the Normalized Mean Absolute Error (NMAE) (Herlocker et al., 2004).
RMSE tends to penalize larger errors more severely than the other metrics, while NMAE normalizes
MAE by the range of the ratings for ease of comparing errors across domains.
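For reference, a minimal Python sketch of these error metrics over paired lists of predicted and true ratings (function names are our own; for example, a 1-5 star scale gives a rating range of 4 for NMAE):

from math import sqrt

def rmse(predicted, actual):
    # Root Mean Squared Error: penalizes large errors more severely.
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def mae(predicted, actual):
    # Mean Absolute Error.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def nmae(predicted, actual, r_min, r_max):
    # MAE normalized by the rating range, for comparing errors across domains.
    return mae(predicted, actual) / (r_max - r_min)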
RMSE is suitable for the prediction task, because it measures inaccuracies on all ratings, either
negative or positive. However, it is most suitable for situations where we do not differentiate be-
tween errors. For example, in the Netflix rating prediction, it may not be as important to properly
predict the difference between 1 and 2 stars as between 2 and 3 stars. If the system predicts 2 instead
of the true 1 rating, it is unlikely that the user will perceive this as a recommendation. However, a
predicted rating of 3 may seem like an encouragement to rent the movie, while a prediction of 2 is
typically considered negative. It is arguable that the space of ratings is not truly uniform, and that it
can be mapped to a uniform space to avoid such phenomena.
elsewhere. The task is to provide, given an existing list of items that were viewed, a list of additional
items that the user may want to visit.
As we have explained above, these scenarios are typically not symmetric. We are not equally
interested in good and bad items; the task of the system is to suggest good items, not to discourage
the use of bad items. We can classify the results of such recommendations using Table 1.
We can now count the number of examples that fall into each cell in the table and compute the
following quantities:
$$\text{Precision} = \frac{\#tp}{\#tp + \#fp}, \qquad
\text{Recall (True Positive Rate)} = \frac{\#tp}{\#tp + \#fn}, \qquad
\text{False Positive Rate (1 - Specificity)} = \frac{\#fp}{\#fp + \#tn}.$$
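A minimal Python sketch of these three quantities, computed from the set of recommended items and the set of (hidden) items the user is known to prefer; the catalog size is needed for the true-negative count, and all names are our own.

def confusion_counts(recommended, preferred, catalog_size):
    # recommended, preferred: sets of item ids; catalog_size: number of
    # items that could have been recommended.
    tp = len(recommended & preferred)
    fp = len(recommended - preferred)
    fn = len(preferred - recommended)
    tn = catalog_size - tp - fp - fn
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):  # true positive rate
    return tp / (tp + fn) if tp + fn else 0.0

def false_positive_rate(fp, tn):
    return fp / (fp + tn) if fp + tn else 0.0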
Typically we can expect a trade-off between these quantities: while allowing longer recommendation
lists typically improves recall, it is also likely to reduce the precision. In some applications,
where the number of recommendations that are presented to the user is not preordained, it is
therefore preferable to evaluate algorithms over a range of recommendation list lengths, rather than using
a fixed length. Thus, we can compute curves comparing precision to recall, or true positive rate to
false positive rate. Curves of the former type are known simply as precision-recall curves, while
those of the latter type are known as Receiver Operating Characteristic (ROC) curves.
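In an offline experiment, such curves can be traced by sweeping the recommendation list length and recording one operating point per length, as in the hedged sketch below (same counting scheme as above; the ranking is assumed to be a list of item ids ordered by the algorithm's score, and the names are our own).

def curve_points(ranked_items, preferred, catalog_size):
    # One (recall, precision, false positive rate) point per list length N;
    # these points trace the precision-recall and ROC curves.
    # Assumes the user has at least one preferred item.
    points = []
    for n in range(1, len(ranked_items) + 1):
        top_n = set(ranked_items[:n])
        tp = len(top_n & preferred)
        fp = len(top_n - preferred)
        fn = len(preferred - top_n)
        tn = catalog_size - tp - fp - fn
        points.append((tp / (tp + fn), tp / (tp + fp), fp / (fp + tn)))
    return points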
While both curves measure the proportion of preferred items that are actually recommended,
precision-recall curves emphasize the proportion of recommended items that are preferred while
ROC curves emphasize the proportion of items that are not preferred that end up being recom-
mended.
We should select whether to use precision-recall or ROC based on the properties of the domain
and the goal of the application; suppose, for example, that an online video rental service recom-
mends DVDs to users. The precision measure describes what proportion of their recommendations
were actually suitable for the user. Whether the unsuitable recommendations represent a small or
large fraction of the unsuitable DVDs that could have been recommended (that is, the false positive
rate) may not be as relevant.
On the other hand, consider a recommender system for an online dating site. Precision describes
what proportion of the suggested pairings for a user result in matches. The false positive rate
describes what proportion of unsuitable candidates are paired with the active user. Since presenting
unsuitable candidates can be especially undesirable in this setting, the false positive rate could be
the most important factor.
Given two algorithms, we can compute a pair of such curves, one for each algorithm. If one
curve completely dominates the other curve, the decision about the winning algorithm is easy. How-
ever, when the curves intersect, the decision is less obvious, and will depend on the application in
question. Knowledge of the application will dictate which region of the curve the decision will be
based on. For example, in the “recommend some good items” task it is likely that we will prefer a
system with a high precision, while in the “recommend all good items” task, a higher recall rate is
more important than precision.
Measures that summarize the precision-recall or ROC curve, such as the F-measure (Rijsbergen,
1979) and the area under the ROC curve (Bamber, 1975), are useful for comparing algorithms inde-
pendently of application, but when selecting an algorithm for use in a particular task, it is preferable
to make the choice based on a measure that reflects the specific needs at hand.
For example, if the system presents images of the top five recommendations prominently arranged horizontally across the top of the screen,
the user will probably observe them all and select the items of interest. However, if all the recommendations
are presented in a textual list several pages long, the user will probably scan down the
list and abandon the scan at some point. In the first case, the utility delivered by the top five recommendations
that are actually selected would be a good estimate of the expected utility, while in the second case,
we would have to model the way users scan lists.
The half-life utility score of Breese et al. (1998) suggested such a model. It postulates that the
probability that the user will select a relevant item drops exponentially down the list.
This approach evaluates an unbounded recommendation list that potentially contains all the
items in the catalog. Given such a list, we assume that the user looks at items starting from the top.
We then assume that an item at position $k$ has a probability of $\frac{1}{2^{(k-1)/(\alpha-1)}}$ of being viewed, where $\alpha$
is a half-life parameter, specifying the position in the list at which an item has a 0.5 probability of being
viewed.
In the binary case of the recommendation task the half-life utility score is computed by:

$$R_a = \sum_j \frac{1}{2^{(\mathrm{idx}(j)-1)/(\alpha-1)}}, \qquad R = \frac{\sum_a R_a}{\sum_a R_a^{\max}},$$

where the summation in the first equation is over the preferred items only, $\mathrm{idx}(j)$ is the index of
item $j$ in the recommendation list, and $R_a^{\max}$ is the score of the best possible list of recommendations
for user $a$.
More generally, we can plug any utility function $u(a, j)$ that assigns a value to a user-item pair
into the half-life utility score, obtaining the following formula:

$$R_a = \sum_j \frac{u(a, j)}{2^{(\mathrm{idx}(j)-1)/(\alpha-1)}}.$$

Now, $R_a^{\max}$ is the score for the recommendation list in which all the observed items are ordered
by decreasing utility. In applications where the probability that a user will select the item at position $\mathrm{idx}$ if it
is relevant is known, a further generalization would be to use these known probabilities instead of
the exponential decay.
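The following Python sketch implements the half-life utility score as defined above, first for a single user and then aggregated over users; the utility dictionary plays the role of $u(a, j)$ (a binary 0/1 utility recovers the recommendation-task case), and all names are our own.

def half_life_score(ranked_items, utilities, alpha=5):
    # ranked_items: recommendation list for one user, best item first.
    # utilities: dict item -> u(a, item); missing items contribute nothing.
    # alpha > 1 is the half-life parameter.
    score = 0.0
    for idx, item in enumerate(ranked_items, start=1):
        score += utilities.get(item, 0.0) / 2 ** ((idx - 1) / (alpha - 1))
    return score

def half_life_utility(per_user_rankings, per_user_utilities, alpha=5):
    # Aggregate score R = (sum_a R_a) / (sum_a R_a^max) over all test users.
    total, total_max = 0.0, 0.0
    for user, ranking in per_user_rankings.items():
        utilities = per_user_utilities[user]
        total += half_life_score(ranking, utilities, alpha)
        # Best possible list: observed items ordered by decreasing utility.
        best = sorted(utilities, key=utilities.get, reverse=True)
        total_max += half_life_score(best, utilities, alpha)
    return total / total_max if total_max else 0.0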
utility score), become less appropriate. For example, a system may get a relatively high half-life
utility score, only due to items that fall outside the fixed list, while another system that selects all
the items in the list correctly, and uninteresting items elsewhere, might get a lower score. Precision-
recall curves are typically used to help us select the proper list length, where the precision and recall
reach desirable values.
Another important difference is that for a short list, the order of items in the list is less important,
as we can assume that the user looks at all the items in the list. Moreover, many of these lists
are presented horizontally, which also reduces the importance of properly ordering the
items.
In these cases, therefore, a more appropriate evaluation of the recommendation system should
focus on the first N items only. In the “recommend good items” task this can be done, for example,
by measuring the precision at N: the proportion of the N recommended items that are interesting.
In the “optimize utility” task, we can do so by measuring the aggregated utility (e.g., the sum
of utilities) of the items that are indeed interesting within the N recommendations.
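A short sketch of these two fixed-length measures, precision at N and the aggregated utility of the correct items within the top N (the names and the utility callback are our own):

def precision_at_n(ranked_items, preferred, n):
    # Proportion of the top-N recommended items that are preferred.
    return sum(1 for item in ranked_items[:n] if item in preferred) / float(n)

def utility_at_n(ranked_items, preferred, utility, n):
    # Summed utility of the preferred items that appear among the top N.
    return sum(utility(item) for item in ranked_items[:n] if item in preferred)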
A final case is when we have unlimited recommendation lists in the “recommend good items”
scenario, and we wish to evaluate the entire list. In this case, one can use the half-life utility score
with a binary utility of 1 when the (hidden) item was indeed selected by the user, and 0 otherwise.
In that case, the half-life utility score prefers a recommender system that places interesting items
closer to the head of the list, but provides an evaluation for the entire list in a single score.
6. Empirical Evaluation
In some cases, two metrics may provide a different ranking of two algorithms. When one metric
is more appropriate for the task at hand, using the other metric may result in selecting the wrong
algorithm. Therefore, it is important to choose the evaluation metric that matches the task.
In this section we provide some empirical examples of the phenomenon we describe above,
that is, where different metrics rank algorithms differently. Below, we present examples where
algorithms are ranked differently by two metrics, one of which is more appropriate for the task of
interest.
our experiments, as we are working with simple algorithms, we have reduced the data set to users
who rated more than 100 movies, leaving us with 21,179 users, 17,415 movies, and 117 ratings per
user on average. Thus, our results are not comparable to results published on the online competition
scoreboard.
BookCrossing: The BookCrossing website allows a community of book readers to share their
interests in books, and to review and discuss books. Within that system users can provide ratings
on a scale of 1 to 10 stars. The specific data set that we used was collected by a 4-week crawl
during August and September 2004 (Ziegler et al., 2005). The data set contains 105,283 users and
340,556 books (we used just the subset containing explicit ratings). The average number of ratings
per user is 10. This data set is even more sparse than the Netflix data set that we used, as there are
more items and fewer ratings per user.
Both data sets share some common properties. First, people watch many movies and read many
books, compared with other domains. For example, most people have experience with only a handful of
laptop computers, and so cannot form an opinion on most laptops. Ratings are also skewed towards
positive ratings in both cases, as people are likely to watch movies that they think they will like, and
even more so in the case of books, which require a heavier investment of time.
There are also some distinctions between the data sets. Some people feel compelled to share
their opinion about books and movies without asking for compensation. However, in the Netflix
domain, providing ratings makes it easier to navigate the system and rent movies. Therefore, all
users of Netflix have an incentive to provide ratings, while only people who like to share their
views of books use the BookCrossing system. We can therefore expect the BookCrossing ratings
to be less representative of the general population of book readers than the Netflix ratings are of
the general population of DVD renters.
One instance of the “recommend good items” task is the case where, given a set of items that the
user has used (bought, viewed), we wish to recommend a set of items that are likely to be used.
Typically, data sets of usage are binary—an item was either used or wasn’t used by the user, and the
data set is not sparse, because every item is either used or not used by every user. We used here a
data set of purchases from a supermarket retailer, and a stream of articles that were viewed on a news
website.
Belgian retailer: This data set was collected from an anonymous Belgian retail supermarket
store over approximately 5 months, in three non-consecutive periods during 1999 and
2000. The data set is divided into baskets, and we cannot detect returning users. There are 88,162
baskets, 16,470 distinct items, and 10 items in an average basket. We do not have access to item
prices or profits, so we cannot optimize the retailer revenue. Therefore the task is to recommend
more items that the user may want to add to the basket.
News click stream: This is a log of click-stream data from a Hungarian online news portal
(Bodon, 2003). The data contains 990,002 sessions, 41,270 news stories, and an average of 8
stories per session. The task is, given the news items that a user has read so far, to recommend more
news items that the user will likely read.
Perhaps the most popular method for computing the weights $w(a, i)$ is by using the Pearson
correlation coefficient (Resnick and Varian, 1997):

$$w(a, i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}},$$

where the summations are only over the items that both $a$ and $i$ have rated. To reduce the computational
overhead, we use in Equation 1 a neighborhood of size $N$.
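A straightforward Python sketch of this weight computation over the co-rated items (the dictionaries map item ids to ratings; the names are our own, the means are computed over the co-rated items, which is one common variant, and the truncation to a neighborhood of size N is not shown):

from math import sqrt

def pearson_weight(ratings_a, ratings_i):
    # ratings_a, ratings_i: dicts item -> rating for users a and i.
    # Returns w(a, i), computed over the items both users have rated.
    common = set(ratings_a) & set(ratings_i)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[j] for j in common) / len(common)
    mean_i = sum(ratings_i[j] for j in common) / len(common)
    num = sum((ratings_a[j] - mean_a) * (ratings_i[j] - mean_i) for j in common)
    den = sqrt(sum((ratings_a[j] - mean_a) ** 2 for j in common) *
               sum((ratings_i[j] - mean_i) ** 2 for j in common))
    return num / den if den else 0.0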
This method is specifically designed for the prediction task, as it computes only a predicted score
for each item of interest. However, in many cases people have used this method for the recommendation
task. This is typically done by predicting the scores for all possible items, and then ordering the
items by decreasing predicted score.
This popular usage may not be appropriate. For example, in the movie domain people may
associate ratings with quality, as opposed to enjoyment, which is dependent on external factors
such as mood, time of day, and so forth. As such, 5-star movies may be complicated, requiring a
substantial effort from the viewer. Thus, a user may rent many light, effortless romantic comedies,
which may only get a score of 3 stars, and only a few 5-star movies. While it is difficult to measure
this effect without owning a rental store, we computed the average number of ratings for movies
with different average ratings (Figure 1). This figure may suggest that movies with higher ratings
are not always watched more often than movies with lower ratings. If our assumption is true, a
system that recommends items to add to the rental queue by order of decreasing predicted rating
may not do as well as a system that predicts the probability of adding a movie to the queue directly.
Figure 1: The average number of ratings (popularity) of movies, binned by their average rating.
When computing the cosine similarity, only positive ratings have a role, and negative ratings are
discarded. Thus, $I_i$ is the set of items that user $i$ has rated positively and $I_{a,i}$ is the set of items that
both users rated positively. Also, the predicted score for a user is computed by:

$$p_{a,j} = \kappa \sum_{i=1}^{n} w(a, i)\, v_{i,j}.$$
In the case of binary data sets, such as the usage data sets that we selected for the recommendation
task, the vector similarity method becomes:

$$w(a, i) = \frac{|I_{a,i}|}{\sqrt{|I_a|} \cdot \sqrt{|I_i|}},$$
where $I_a$ is the set of items that $a$ used, and $I_{a,i}$ is the set of items that both $a$ and $i$ used. The resulting
aggregated score can be considered a non-calibrated measurement of the conditional probability
$\mathrm{pr}(j \mid a)$, the probability that user $a$ will choose item $j$.
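A minimal sketch of the binary vector similarity and the resulting non-calibrated score aggregation; kappa is left as a plain normalizing constant, and all names are our own.

from math import sqrt

def binary_cosine_weight(items_a, items_i):
    # items_a, items_i: sets of items used by users a and i.
    # w(a, i) = |I_{a,i}| / (sqrt(|I_a|) * sqrt(|I_i|)).
    if not items_a or not items_i:
        return 0.0
    return len(items_a & items_i) / (sqrt(len(items_a)) * sqrt(len(items_i)))

def predicted_score(target_item, items_a, neighbor_item_sets, kappa=1.0):
    # p_{a,j} = kappa * sum_i w(a, i) * v_{i,j}, with v_{i,j} = 1 if
    # neighbor i used item j and 0 otherwise.
    return kappa * sum(binary_cosine_weight(items_a, items_i)
                       for items_i in neighbor_item_sets
                       if target_item in items_i)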
In binary usage data sets, the Pearson correlation method would compute similarity using all
the items, as each item always has a rating. Therefore, the system would use all the negative “did
not use” scores, which typically greatly outnumber the “used” scores. We can therefore expect that
Pearson correlation in these cases will result in lower accuracy.
The above two methods focused on computing a similarity between users, but another possible
collaborative filtering alternative is to focus on the similarity between items. The simplest method
for doing so is to use the maximum likelihood estimate for the conditional probabilities of items.
Specifically, for the binary usage case, this translates to:
$$\mathrm{pr}(j_1 \mid j_2) = \frac{|J_{j_1, j_2}|}{|J_{j_2}|},$$

where $J_j$ is the set of users who used item $j$, and $J_{j_1, j_2}$ is the set of users who used both $j_1$
and $j_2$. While this seems like a very simple estimation, similar estimations are successfully used in
deployed commercial applications (Linden et al., 2003).
Typically, an algorithm is given as input a set of items, and needs to produce a list of recommendations.
In that case, we can compute the conditional probability of each target item given
each observed item, and then aggregate the results over the set of given items. In many cases, choosing
the maximal estimate has given the best results (Kadie et al., 2002), so we aggregate estimations
using a max operator in our experiments.
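The sketch below estimates the item-item conditional probabilities from binary usage data and aggregates them over the user's observed items with a max operator, as described above (all names are our own; baskets stand in for users or sessions).

from collections import defaultdict
from itertools import combinations

def conditional_probabilities(baskets):
    # baskets: iterable of sets of items (one per user or session).
    # Returns pr[j1][j2], the maximum likelihood estimate of pr(j1 | j2).
    users_of = defaultdict(int)   # |J_j|
    co_users = defaultdict(int)   # |J_{j1,j2}|, stored for both orderings
    for basket in baskets:
        for j in basket:
            users_of[j] += 1
        for j1, j2 in combinations(basket, 2):
            co_users[(j1, j2)] += 1
            co_users[(j2, j1)] += 1
    pr = defaultdict(dict)
    for (j1, j2), count in co_users.items():
        pr[j1][j2] = count / users_of[j2]
    return pr

def recommend(observed_items, pr, top_k=10):
    # Score each candidate by its maximum conditional probability given
    # any observed item, and return the top-k candidates.
    scores = {}
    for j2 in observed_items:
        for j1, p in pr.get(j2, {}).items():
            if j1 not in observed_items:
                scores[j1] = max(scores.get(j1, 0.0), p)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]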
                Netflix    BookCrossing
    Pearson     1.07       3.58
    Cosine      1.90       4.5

Table 2: RMSE scores for Pearson correlation and Cosine similarity on the Netflix domain (ratings from 1 to 5) and the BookCrossing domain (ratings from 1 to 10).
section. It may well be that different algorithms that were trained over different data sets (ratings
vs. rentals) rank differently on different tasks. Deciding on the best recommendation engine
based solely on RMSE in the prediction task may lead to worse recommendation lists in the two
other cases.
Figure 3: Comparing recommendations generated by the item-item recommender and the expected
profit recommender on the Ta-Feng data set.
We then measured a half-life utility score where the utility of a correct recommendation was
the profit from selling the correctly recommended item to the user. The results are shown in Table 3.
                   Score
    Item-Item      0.01
    Exp. Profit    0.05

Table 3: Comparing item-item vs. expected utility recommendations on the Ta-Feng data set with the half-life utility score. The utility of a correct recommendation was the profit from selling that item to that user, while the half-life was 5. The trends were similar for other choices of the half-life parameter.
7. Discussion
Above, we discussed the major considerations that one should make when deciding on the proper
evaluation metric for a given task. We now add some discussion, illustrating other conclusions that
can be derived, and illuminating some other relevant topics.
This survey focuses on the evaluation of recommendation algorithms. However, the success of a
recommendation system does not depend solely on the quality of the recommendation algorithm.
Such systems typically attempt to modify user behavior, which is influenced by many other
factors, most notably the user interface. The success of the deployed system in influencing users
can be measured through the change in user behavior, such as the number of recommendations that
are followed, or the change in revenue.
Decisions about the interface by which users view recommendations are critical to the success
of the system. For example, recommendations can be located in different places in the page, can be
displayed horizontally or vertically, can be presented through images or text, and so forth. These
decisions can have an impact on the success of a system that is no smaller than that of the quality
of the underlying algorithm.
When the application is centered around the recommendation system, it is important to select
the user interface together with the recommendation algorithm. In other cases, the recommendation
system is only a supporting system for the application. For example, an e-commerce website is
centered around the item purchases, and a news website is centered around the delivery of news
stories. In both cases, a recommender system may be employed to help the users navigate, or to
increase sales. It is likely that the recommendations are not the major method for browsing the
collection of items.
When the recommender system is only a supporting system, the designer of the application will
probably make the decision about the user interface without focusing on positioning recommenda-
tions where they have the most influence. In such cases, which we believe to be very common, the
developer of the recommender system is constrained by the pre-designed interface, and in many
cases can therefore only decide on the best recommendation algorithm, and in some cases perhaps
the length of the recommendation list. This paper is targeted at researchers and developers who are
making decisions about algorithms, not about the user interface. Designing a good user interface is
an interesting and challenging problem, but it is outside the scope of this survey (see, e.g., Pu and
Chen 2006).
For example, a user may have a positive opinion of many laptop computers, and may rate
many laptops highly. However, most people buy only one laptop. In that case, recommending more
laptops, based on the co-occurring high ratings, will be inappropriate. However, if we predict the
probability of buying a laptop given that another laptop has already been bought, we can expect this
probability to be low, and the other laptop will not be recommended.
8. Related Work
In the past, different researchers discussed various topics relevant to the evaluation of recommender
systems.
Breese et al. (1998) were probably the first to provide a sound evaluation of a number of
recommendation approaches over a collection of different data sets, setting the general framework of
evaluating algorithms on more than a single real-world data set and of comparing several algorithms
in identical experiments. The practices that were illustrated in that paper are used in many
modern publications.
Herlocker et al. (2004) provide an extensive survey of possible metrics for evaluation. They
then compare a set of metrics, concluding that for some pairs of metrics, using both together will
give very little additional information compared to using just one.
Another interesting contribution of that paper is a classification of recommendation engines
from the user task perspective, namely, the reasons and motivations a user has when
interacting with a recommender system. As they are interested in user tasks, and we are interested
in the system tasks, our classification is different, yet we share some similar tasks, such as the
“recommend some good items” and “recommend all good items” tasks.
Finally, while their survey attempted to cover as many evaluation metrics and user task variations as
possible, we focus here on the appropriate metrics for the most popular recommendation tasks only.
Mcnee et al. (2003) explain why accuracy metrics alone are insufficient for selecting the correct
recommendation algorithm. For example, users may be interested in the serendipity of the rec-
ommended items. One way to model serendipity is through a utility function that assigns higher
values to “unexpected” suggestions. They also discuss the “usefulness” of recommendations. Many
utility functions, such as the inverse log of popularity (Shani et al., 2005) attempt to capture this
“usefulness”.
Ziegler et al. (2005) focus on another aspect of evaluation—considering the entire list together.
This would allow us to consider aspects of a set of recommendations, such as diversification between
items in the same list. Our suggested metrics consider only single items, and thus could not be
used to evaluate entire lists. It would be interesting to see more evaluation metrics that provide
observations over complete lists.
Celma and Herrera (2008) suggest looking at topological properties of the recommendation
graph—the graph that connects recommended items. They explain how by looking at the recom-
mendation graph one may understand properties such as the novelty of recommendations. It is still
unclear how these properties correlate with the true goal of the recommender system, be it to
optimize revenue or to recommend useful items.
McLaughlin and Herlocker (2004) argue, as we do, that MAE is not appropriate for evaluating
recommendation tasks, and that ratings are not necessarily indicative of whether a user is likely to
watch a movie. The last claim can be explained by the way we view implicit and explicit ratings.
Some researchers have suggested taking a more holistic approach, and considering the recom-
mendation algorithm within the complete recommendation system. For example, del Olmo and
Gaudioso (2008) suggest that systems be evaluated only after deployment, by counting the
number of successful recommendations. As we argue above, even in these cases, one is likely to
evaluate algorithms offline, to avoid presenting poor-quality recommendations to users and thus
losing their trust.
9. Conclusion
In this paper we discussed how recommendation algorithms should be evaluated in order to select
the best algorithm for a specific task from a set of candidates. This is an important step in the
research attempt to find better algorithms, as well as in application design where a designer chooses
an existing algorithm for their application. As such, many evaluation metrics have been used for
algorithm selection in the past.
We review three core tasks of recommendation systems—the prediction task, the recommenda-
tion task, and the utility maximization task. Most evaluation metrics are naturally appropriate for
one task, but not for the others. We discuss for each task a set of metrics that are most appropriate
for selecting the best of the candidate algorithms.
We empirically demonstrate that in some cases two algorithms can be ranked differently by two
metrics over the same data set, emphasizing the importance of choosing the appropriate metric for
the task, so as not to choose an inferior algorithm.
We also describe the concerns that need to be addressed when designing offline and online
experiments. We outline a few important measurements that one must take in addition to the score
that the metric provides, as well as other considerations that should be taken into account when
designing experiments for recommendation algorithms.
References
D. Bamber. The area above the ordinal dominance graph and the area below the receiver operating
characteristic graph. Journal of Mathematical Psychology, 12:387–415, 1975.
F. Bodon. A fast APRIORI implementation. In The IEEE ICDM Workshop on Frequent Itemset
Mining Implementations, 2003.
D. Braziunas and C. Boutilier. Local utility elicitation in GAI models. In Proceedings of the
Twenty-first Conference on Uncertainty in Artificial Intelligence, pages 42–49, Edinburgh, 2005.
Ò. Celma and P. Herrera. A new approach to evaluating novel recommendations. In RecSys ’08:
Proceedings of the 2008 ACM Conference on Recommender Systems, 2008.
M. Claypool, P. Le, M. Waseda, and D. Brown. Implicit interest indicators. In Intelligent User
Interfaces, pages 33–40. ACM Press, 2001.
F. Hernández del Olmo and E. Gaudioso. Evaluation of recommender systems: A new approach.
Expert Systems Applications, 35(3), 2008.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learn-
ing Research, 7, 2006.
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
C. Goutte and E. Gaussier. A probabilistic interpretation of precision, recall, and F-score, with
implication for evaluation. In ECIR ’05: Proceedings of the 27th European Conference on Infor-
mation Retrieval, pages 345–359, 2005.
A. Gunawardana and C. Meek. Aggregators and contextual effects in search ad markets. In WWW
Workshop on Targeting and Ranking for Online Advertising, 2008.
C. N. Hsu, H. H. Chung, and H. S. Huang. Mining skewed and sparse transaction data for person-
alized shopping recommendation. Machine Learning, 57(1-2), 2004.
R. Hu and P. Pu. A comparative user study on rating vs. personality quiz based preference elicitation
methods. In IUI ’09: Proceedings of the 13th International Conference on Intelligent User
Interfaces, 2009.
C. Kadie, C. Meek, and D. Heckerman. CFW: A collaborative filtering system using posteriors over
weights of evidence. In Proceedings of the 18th Annual Conference on Uncertainty in Artificial
Intelligence (UAI-02), pages 242–250, San Francisco, CA, 2002. Morgan Kaufmann.
D. Larocque, J. Nevalainen, and H. Oja. A weighted multivariate sign test for cluster-correlated
data. Biometrika, 94:267–283, 2007.
M. R. McLaughlin and J. L. Herlocker. A collaborative filtering algorithm and evaluation metric that
accurately model the user experience. In SIGIR ’04: Proceedings of the 27th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.
S. Mcnee, S. K. Lam, C. Guetzlaff, J. A. Konstan, and J. Riedl. Confidence displays and training in
recommender systems. In Proceedings of the 9th IFIP TC13 International Conference on Human
Computer Interaction INTERACT, pages 176–183. IOS Press, 2003.
S. M. McNee, J. Riedl, and J. K. Konstan. Making recommendations better: an analytic model for
human-recommender interaction. In CHI ’06 Extended Abstracts on Human Factors in Comput-
ing Systems, 2006.
D. Oard and J. Kim. Implicit feedback for recommender systems. In The AAAI Workshop on
Recommender Systems, pages 81–83, 1998.
B. Price and P. Messinger. Optimal recommendation sets: Covering uncertainty over user prefer-
ences. In National Conference on Artificial Intelligence (AAAI), pages 541–548. AAAI Press / The MIT Press, 2005.
P. Pu and L. Chen. Trust building with explanation interfaces. In IUI ’06: Proceedings of the 11th
International Conference on Intelligent User Interfaces, 2006.
P. Resnick and H. R. Varian. Recommender systems. Communications of the ACM, 40(3), 1997.
A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start
recommendations. In SIGIR ’02: Proceedings of the 25th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, 2002.
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal
Statistical Society B, 36(1):111–147, 1974.
E. M. Voorhees. The philosophy of information retrieval evaluation. In CLEF ’01: Revised Papers
from the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-
Language Information Retrieval Systems, 2002a.
E. M. Voorhees. Overview of TREC 2002. In The 11th Text Retrieval Conference (TREC 2002), NIST
Special Publication 500-251, pages 1–15, 2002b.