WWW 2005 Preprint
Improving Recommendation Lists Through Topic Diversification
Cai-Nicolas Ziegler 1, Sean M. McNee 2, Joseph A. Konstan 2, Georg Lausen 1

1 Institut für Informatik, Universität Freiburg
Georges-Köhler-Allee, Gebäude Nr. 51
79110 Freiburg i.Br., Germany
{cziegler, lausen}@informatik.uni-freiburg.de

2 GroupLens Research, University of Minnesota
{mcnee, konstan}@cs.umn.edu

ABSTRACT

General Terms
Algorithms, Experimentation, Human Factors, Measurement

Keywords
Collaborative filtering, diversification, accuracy, recommender systems, metrics

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2005, May 10-14, 2005, Chiba, Japan.
ACM 1-59593-046-9/05/0005.

1. INTRODUCTION
Many recommender systems build upon techniques of collaborative filtering [26, 8, 11], and numerous commercial systems, e.g., Amazon.com's recommender [15], exploit these techniques to offer personalized recommendation lists to their customers.
Though the accuracy of state-of-the-art collaborative filtering systems, i.e., the probability that the active user will appreciate the products recommended, is excellent, some implications affecting user satisfaction have been observed in practice. On Amazon.com (http://www.amazon.com), for instance, many recommendations seem to be similar with respect to content. Customers who have purchased much of Hermann Hesse's prose may obtain recommendation lists whose top-5 entries contain books by that author only. When considering pure accuracy, all these recommendations appear excellent, since the active user clearly appreciates books written by Hermann Hesse. On the other hand, assuming that the active user has several interests other than Hermann Hesse, e.g., historical novels in general and books about world travel, the recommended set of items appears poor, owing to its lack of diversity.
Traditionally, recommender system projects have focused
on optimizing accuracy using metrics such as precision/recall
or mean absolute error. Now research has reached the point
where going beyond pure accuracy and toward real user experience becomes indispensable for further advances [10].
This work looks specifically at impacts of recommendation
lists, regarding them as entities in their own right rather
than mere aggregations of single and independent suggestions.
1.1 Contributions
1.2 Organization
Our paper is organized as follows. We discuss collaborative filtering and its two most prominent implementations
in Section 2. The subsequent section then briefly reports on
common evaluation metrics and the new intra-list similarity
metric. In Section 4, we present our method for diversifying lists, describing its primary motivation and algorithmic
clockwork. Section 5 reports on our offline and online experiments with topic diversification and provides ample discussion of results obtained.
2. ON COLLABORATIVE FILTERING
Collaborative filtering (CF) still represents the most commonly adopted technique in crafting academic and commercial [15] recommender systems. Its basic idea is to make recommendations based upon ratings that users have assigned to products. Ratings can either be explicit, i.e., the user states his opinion about a given product, or implicit, when the mere act of purchasing or mentioning an item counts as an expression of appreciation. While implicit ratings are generally easier to collect, their usage implies adding noise to the collected information [19].
2.1 User-based Collaborative Filtering

A proximity measure, commonly Pearson correlation or cosine similarity, is used for computing c(a_i, a_j). The top-M most similar users a_j become members of a_i's neighborhood, clique(a_i) ⊆ A.
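As an illustration of the neighborhood formation step named above, the following Python fragment computes c(a_i, a_j) as a Pearson correlation over co-rated products and keeps the top-M most similar users as clique(a_i). This is only a minimal sketch: the choice of Pearson correlation, the dictionary-based data layout, and all identifiers are our own assumptions for illustration, not part of the original system.

    from math import sqrt

    def pearson(ratings_i, ratings_j):
        """Pearson correlation c(a_i, a_j) over products rated by both users."""
        common = set(ratings_i) & set(ratings_j)
        if len(common) < 2:
            return 0.0
        mean_i = sum(ratings_i[b] for b in common) / len(common)
        mean_j = sum(ratings_j[b] for b in common) / len(common)
        num = sum((ratings_i[b] - mean_i) * (ratings_j[b] - mean_j) for b in common)
        den = sqrt(sum((ratings_i[b] - mean_i) ** 2 for b in common)) * \
              sqrt(sum((ratings_j[b] - mean_j) ** 2 for b in common))
        return num / den if den else 0.0

    def clique(a_i, all_ratings, M=20):
        """Top-M most similar users a_j, i.e., a_i's neighborhood clique(a_i)."""
        others = [a_j for a_j in all_ratings if a_j != a_i]
        return sorted(others,
                      key=lambda a_j: pearson(all_ratings[a_i], all_ratings[a_j]),
                      reverse=True)[:M]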
2.2 Item-based Collaborative Filtering

The predicted weight w_i(b_k) that the active user a_i is expected to assign to an unknown product b_k is computed from the items similar to b_k that a_i has already rated:

w_i(b_k) = \frac{\sum_{b_e \in B'_k} c(b_k, b_e) \cdot r_i(b_e)}{\sum_{b_e \in B'_k} |c(b_k, b_e)|}    (1)

where

B'_k := \{ b_e \mid b_e \in \mathrm{clique}(b_k) \wedge r_i(b_e) \neq \perp \}
Intuitively, the approach tries to mimic real user behavior, having user ai judge the value of an unknown product
bk by comparing the latter to known, similar items be and
considering how much ai appreciated these be .
The eventual computation of a top-N recommendation list P_{w_i} follows the user-based CF's process, arranging recommendations according to w_i in descending order.
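To make the prediction step of Equation 1 concrete, the following sketch assumes ratings stored in plain dictionaries, a precomputed item neighborhood clique(b_k), and a caller-supplied similarity function c; it is a minimal illustration under these assumptions, and all identifiers are hypothetical rather than taken from the paper.

    def predict_item_based(ratings_of_user, clique_of_item, item_sim, b_k):
        """Predicted weight w_i(b_k) for an unseen product b_k (Equation 1).

        ratings_of_user : dict mapping product -> rating r_i(b)
        clique_of_item  : dict mapping product -> list of its top-M most similar products
        item_sim        : function c(b_k, b_e) -> similarity in [-1, +1]
        """
        # B'_k: neighbors of b_k that the active user has actually rated
        rated_neighbors = [b_e for b_e in clique_of_item[b_k] if b_e in ratings_of_user]
        if not rated_neighbors:
            return 0.0
        numerator = sum(item_sim(b_k, b_e) * ratings_of_user[b_e] for b_e in rated_neighbors)
        denominator = sum(abs(item_sim(b_k, b_e)) for b_e in rated_neighbors)
        return numerator / denominator if denominator else 0.0

    def top_n_list(candidates, ratings_of_user, clique_of_item, item_sim, n=10):
        """Rank unseen candidate products by w_i in descending order."""
        scored = {b: predict_item_based(ratings_of_user, clique_of_item, item_sim, b)
                  for b in candidates if b not in ratings_of_user}
        return sorted(scored, key=scored.get, reverse=True)[:n]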
3. EVALUATION METRICS
Evaluation metrics are essential in order to judge the quality and performance of recommender systems, even though they are still in their infancy. Most evaluations concentrate on accuracy measurements only and neglect other factors, e.g., novelty and serendipity of recommendations, and the diversity of the recommended list's items.
The following sections give an outline of popular metrics.
An extensive survey of accuracy metrics is provided in [12].
3.1 Accuracy Metrics

3.1.1 Predictive Accuracy Metrics
The most prominent predictive accuracy metric is the mean absolute error (MAE), averaging the absolute deviation of predicted weights w_i(b_k) from actual ratings r_i(b_k) over all test products b_k ∈ B_i:

\mathrm{MAE} = \frac{1}{|B_i|} \sum_{b_k \in B_i} \left| w_i(b_k) - r_i(b_k) \right|    (2)
3.1.2 Decision-Support Metrics
Recall represents the percentage of test set products b ∈ T_{x_i} occurring in the recommendation list P_{x_i} with respect to the size of the test set:

\mathrm{Recall} = 100 \cdot \frac{|T_{x_i} \cap \Im P_{x_i}|}{|T_{x_i}|}    (3)
Symbol \Im P_{x_i} denotes the image of map P_{x_i}, i.e., all items part of the recommendation list.
Accordingly, precision represents the percentage of test set products b ∈ T_{x_i} occurring in P_{x_i} with respect to the size of the recommendation list:

\mathrm{Precision} = 100 \cdot \frac{|T_{x_i} \cap \Im P_{x_i}|}{|\Im P_{x_i}|}    (4)
Breese et al. [3] introduce an interesting extension to recall, known as weighted recall or Breese score. The approach takes into account the order of the top-N list, penalizing hits that occur toward the end of the list by means of an exponentially decaying weight.
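As a small worked example of Equations 3 and 4, the following sketch computes recall and precision for a single user, assuming the withheld test set T_{x_i} and the top-N list are available as plain Python collections; the function name and data layout are our own.

    def recall_precision(test_set, recommendation_list):
        """Recall and precision (Equations 3 and 4) for a single user, in percent.

        test_set            : set of withheld products T_{x_i}
        recommendation_list : ordered top-N list, i.e., the image of P_{x_i}
        """
        hits = test_set & set(recommendation_list)
        recall = 100.0 * len(hits) / len(test_set) if test_set else 0.0
        precision = 100.0 * len(hits) / len(recommendation_list) if recommendation_list else 0.0
        return recall, precision

    # Example: 2 of 4 withheld items appear in a top-10 list
    # -> recall = 50.0, precision = 20.0
    print(recall_precision({"b1", "b2", "b3", "b4"},
                           ["b1", "b9", "b2", "b7", "b8", "b5", "b6", "b0", "bx", "by"]))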
3.2 Beyond Accuracy
Though accuracy metrics are an important facet of usefulness, there are traits of user satisfaction they are unable to
capture. However, non-accuracy metrics have largely been
denied major research interest so far.
3.2.1 Coverage
3.2.2 Novelty and Serendipity
3.3 Intra-List Similarity
We present a new metric that intends to capture the diversity of a list. Hereby, diversity may refer to all kinds of features, e.g., genre, author, and other discerning characteristics. Based upon an arbitrary function c : B × B → [−1, +1] measuring the similarity c(b_k, b_e) between products b_k, b_e according to some custom-defined criterion, we define intra-list similarity for a_i's list P_{w_i} as follows:
\mathrm{ILS}(P_{w_i}) = \frac{\sum_{b_k \in \Im P_{w_i}} \sum_{b_e \in \Im P_{w_i},\, b_k \neq b_e} c(b_k, b_e)}{2}    (5)

Higher scores denote lower diversity. An interesting mathematical feature of ILS(P_{w_i}) we are referring to in later sections is permutation-insensitivity, i.e., let S_N be the symmetric group of all permutations on N = |P_{w_i}| symbols:

\forall\, \sigma_i, \sigma_j \in S_N : \mathrm{ILS}(P_{w_i} \circ \sigma_i) = \mathrm{ILS}(P_{w_i} \circ \sigma_j)    (6)
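A direct transcription of Equation 5 might look as follows, assuming a symmetric, caller-supplied pairwise similarity function c, so that summing over unordered pairs equals the halved double sum. The snippet also hints at the permutation-insensitivity of Equation 6: reordering the list leaves the value unchanged. Names and data layout are our own illustration.

    from itertools import combinations

    def intra_list_similarity(recommendation_list, sim):
        """Intra-list similarity ILS (Equation 5); higher scores denote lower diversity.

        Assumes sim(b_k, b_e) is symmetric, so the sum over unordered pairs equals
        the double sum over ordered pairs divided by two.
        """
        return sum(sim(b_k, b_e) for b_k, b_e in combinations(recommendation_list, 2))

    # Permutation-insensitivity (Equation 6): shuffling the list does not change ILS.
    # import random
    # shuffled = random.sample(top_n, len(top_n))
    # intra_list_similarity(shuffled, sim) == intra_list_similarity(top_n, sim)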
4. TOPIC DIVERSIFICATION
4.1 Taxonomy-based Similarity
Function c* : 2^B × 2^B → [−1, +1], quantifying the similarity between two product sets, forms an essential part of topic diversification. We instantiate c* with our metric for taxonomy-driven filtering [32], though other content-based similarity measures may appear likewise suitable. Our metric computes the similarity between product sets based upon their classification. Each product belongs to one or more classes that are hierarchically arranged in classification taxonomies, describing the products in machine-readable ways.
Classification taxonomies exist for various domains. Amazon.com crafts very large taxonomies for books, DVDs, CDs, electronic goods, and apparel. See Figure 1 for one sample taxonomy. Moreover, all products on Amazon.com bear content descriptions relating to these domain taxonomies. Featured topics could include author, genre, and audience.
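The precise taxonomy-driven metric is defined in [32] and not restated here. Purely to illustrate how a classification taxonomy like the one in Figure 1 can drive a set-based similarity c*, the following toy sketch scores two product sets by the Jaccard overlap of their ancestor topics; this simplification and all names in it are our own assumptions, not the metric from [32].

    def ancestors(topic, parent):
        """All taxonomy nodes on the path from a topic up to the root."""
        path = []
        while topic is not None:
            path.append(topic)
            topic = parent.get(topic)
        return set(path)

    def taxonomy_profile(products, product_topics, parent):
        """Union of the ancestor topics of all classes the given products belong to."""
        profile = set()
        for b in products:
            for topic in product_topics[b]:
                profile |= ancestors(topic, parent)
        return profile

    def set_similarity(set_a, set_b, product_topics, parent):
        """Toy stand-in for c*: Jaccard overlap of the two taxonomy profiles,
        rescaled from [0, 1] to [-1, +1]."""
        pa = taxonomy_profile(set_a, product_topics, parent)
        pb = taxonomy_profile(set_b, product_topics, parent)
        if not pa or not pb:
            return 0.0
        jaccard = len(pa & pb) / len(pa | pb)
        return 2.0 * jaccard - 1.0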
4.2 Topic Diversification Algorithm
Algorithm 1 shows the complete topic diversification algorithm; a brief textual sketch is given in the next paragraphs.
Function P*_{w_i} denotes the new recommendation list resulting from applying topic diversification. For every list entry z ∈ [2, N], we collect those products b from the candidate products set B_i that do not occur in positions o < z in P*_{w_i} and compute their similarity with the set {P*_{w_i}(k) | k ∈ [1, z[ }, which contains all new recommendations preceding rank z. Sorting all products b according to c*(b) in reverse order, we obtain the dissimilarity rank P^{rev}_{c*}. This rank is then merged with the original recommendation rank P_{w_i} according to diversification factor F, yielding the final rank P*_{w_i}. Factor F defines the impact that dissimilarity rank P^{rev}_{c*} exerts on the eventual overall output. Large F ∈ [0.5, 1] favors diversification over a_i's original relevance order, while low F ∈ [0, 0.5[ produces recommendation lists closer to the original rank P_{w_i}. For experimental analysis, we used diversification factors F ∈ [0, 0.9].
procedure diversify (P_{w_i}, F) {
    P*_{w_i}(1) ← P_{w_i}(1);
    for z ← 2 to N do
        B'_i ← B_i \ {P*_{w_i}(k) | k ∈ [1, z[ };
        ∀ b ∈ B'_i : compute c*(b, {P*_{w_i}(k) | k ∈ [1, z[ });
        compute similarity rank P_{c*}, ordering all b ∈ B'_i by descending c*;
        for all b ∈ B'_i do
            P^{rev}_{c*}(b) ← |B'_i| − P^{-1}_{c*}(b);
            w*_i(b) ← P^{-1}_{w_i}(b) · (1 − F) + P^{rev}_{c*}(b) · F;
        end do
        P*_{w_i}(z) ← arg min {w*_i(b) | b ∈ B'_i};
    end do
    return P*_{w_i};
}
Algorithm 1: Sequential topic diversification

Note that ordered input lists P_{w_i} must be considerably larger than the final top-N list. For our later experiments, we used top-50 input lists for eventual top-10 recommendations.
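For readers who prefer executable code, the following Python sketch mirrors our reading of Algorithm 1: ranks are represented as ordered lists, dissimilarity to the partially built result is obtained from a caller-supplied c*, and positions are merged with weight F. Identifiers and data layout are assumptions made for illustration, not the paper's reference implementation.

    def diversify(original_list, set_sim, F, N):
        """Sequential topic diversification (our reading of Algorithm 1).

        original_list : accuracy-ranked candidate list P_{w_i} (e.g., a top-50 list)
        set_sim       : function c*(b, chosen_so_far) -> similarity of b to chosen items
        F             : diversification factor in [0, 1]
        N             : length of the final recommendation list (e.g., 10)
        """
        original_rank = {b: pos for pos, b in enumerate(original_list, start=1)}
        result = [original_list[0]]                      # position 1 is kept as-is
        for _ in range(2, N + 1):
            candidates = [b for b in original_list if b not in result]
            # dissimilarity rank: items most similar to the chosen prefix come last
            by_similarity = sorted(candidates, key=lambda b: set_sim(b, result))
            dissim_rank = {b: pos for pos, b in enumerate(by_similarity, start=1)}
            # merge original relevance rank and dissimilarity rank
            merged = {b: original_rank[b] * (1 - F) + dissim_rank[b] * F
                      for b in candidates}
            result.append(min(candidates, key=merged.get))
        return result

For F = 0 the function reproduces the original top-N list, while F close to 1 orders almost purely by dissimilarity to the items already picked.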
4.3 Recommendation Dependency

4.4 Osmotic Pressure Analogy
The effect of dissimilarity bears traits similar to those of osmotic pressure and selective permeability known from molecular biology [30]. Steady insertion of products b_o, taken from one specific area of interest d_o, into the recommendation list equates to the passing of molecules from one specific substance through the cell membrane into the cytoplasm. With increasing concentration of d_o, owing to the membrane's selective permeability, the pressure for molecules b from other substances d rises. When pressure gets sufficiently high for one given topic d_p, its best products b_p may diffuse into the recommendation list, even though their original rank P^{-1}_{w_i}(b) might be inferior to candidates from the prevailing domain d_o. Consequently, pressure for d_p decreases, paving the way for another domain for which pressure peaks.
Topic diversification hence resembles the membrane's selective permeability, which allows cells to maintain their internal composition of substances at the required levels.
5. EMPIRICAL ANALYSIS
We conducted offline evaluations to understand the ramifications of topic diversification on accuracy metrics, and online analysis to investigate how our method affects actual user satisfaction.
Figure 1: Fragment of a book classification taxonomy, with topics including Science, Archaeology, Astronomy, Nonfiction, Reference, Medicine, Mathematics, Applied, Pure, Discrete, Algebra, Sports, and History
5.1 Dataset Design

5.1.1 Data Collection

5.1.2 Condensation Steps
5.2 Offline Experiments
We performed offline experiments comparing precision, recall, and intra-list similarity scores for 20 different recommendation list setups. Half these recommendation lists were
based upon user-based CF with different degrees of diversification, the others on item-based CF. Note that we did
not compute MAE metric values since we are dealing with
implicit rather than explicit ratings.
5.2.1

5.2.2 Result Analysis

5.2.2.1 Precision and Recall.
Figure 2: Precision (a) and recall (b) for item-based and user-based CF under increasing diversification factor F (in percent)

                  Precision    Recall
Item-based CF       3.64        7.32
User-based CF       3.69        5.76

Table 1: Precision and recall for the non-diversified base cases (F = 0)

2 Visit http://www-users.cs.umn.edu/~karypis/suggest/.

First, we analyzed precision and recall scores for both non-diversified base cases, i.e., when F = 0. Table 1 states that
user-based and item-based CF exhibit almost identical accuracy, indicated by precision values. Their recall values differ
considerably, hinting at deviating behavior with respect to
the types of users they are scoring for.
Next, we analyzed the behavior of user-based and item-based CF when steadily increasing F by increments of 10%,
depicted in Figure 2. The two charts reveal that diversification has detrimental effects on both metrics and on both CF
algorithms. Interestingly, corresponding precision and recall
curves have almost identical shape.
The loss in accuracy is more pronounced for item-based than for user-based CF. Furthermore, for either metric and either CF algorithm, the drop is most distinctive for F ∈ [0.2, 0.4]. For lower F, negative impacts on accuracy are marginal. We believe this last observation is due to the fact that precision and recall are permutation-insensitive, i.e., the mere order of recommendations within a top-N list does not influence the metric value, as opposed to the Breese score [3, 12]. However, for low F, the pressure that the dissimilarity rank exerts on the top-N list's makeup is still too weak to make many new items diffuse into the top-N list. Hence, we conjecture that rather the positions of current top-N items change, which does not affect either precision or recall.
5.2.2.2 Intra-List Similarity.
Knowing that our diversification method exerts a significant, negative impact on accuracy metrics, we wanted to know how our approach affected the intra-list similarity measure. Similar to the precision and recall experiments, we computed metric values for user-based and item-based CF with F ∈ [0, 0.9] each. Hereby, we instantiated the intra-list similarity metric's function c with our taxonomy-driven metric c*. Results obtained are provided in Figure 3(a).
The topic diversification method considerably lowers the pairwise similarity between list items, thus making top-N recommendation lists more diverse. Diversification appears to affect item-based CF more strongly than its user-based counterpart, in line with our findings about precision and recall. For lower F, curves are less steep than for F ∈ [0.2, 0.4], which also aligns well with the precision and recall analysis. Again, the latter phenomenon can be explained by one of the metric's inherent features, i.e., like precision and recall, intra-list similarity is permutation-insensitive.
5.2.2.3 Overlap with Original List.
Figure 3(b) shows the number of recommended items staying the same when increasing F with respect to the original list's content. Both curves exhibit roughly linear shapes, being less steep for low F, though. Interestingly, for factors F ≤ 0.4, at most 3 recommendations change on average.
5.2.2.4 Conclusion.
5.3 Online Experiments
Offline experiments helped us in understanding the implications of topic diversification on both CF algorithms. We
could also observe that the effects of our approach are different on different algorithms. However, knowing about the
deficiencies of accuracy metrics, we wanted to assess actual
user satisfaction for various degrees of diversification, thus
necessitating an online survey.
For the online study, we computed each recommendation list setup anew for every user participating in the survey.
Figure 3: Intra-list similarity behavior (a) and overlap with original list (b) for increasing F
5.3.1

5.3.2 Result Analysis

5.3.2.1 Single-Vote Averages.

5.3.2.2 Covered Range.
Figure 4: Results for single-vote averages (a), covered range of interests (b), and overall satisfaction (c)
illustrated before through the measurement of intra-list similarity. Users' reactions to steadily incrementing F are illustrated in Figure 4(b). First, between both algorithms on corresponding F levels, only the difference of means at F = 0.3 shows statistical significance.
Studying the trend of user-based CF for increasing F, we notice that the perceived range of reading interests covered by users' recommendation lists also increases. Hereby, the curve's first derivative maintains an approximately constant level, exhibiting slight peaks for F ∈ [0.4, 0.5]. Statistical significance holds for user-based CF between means at F = 0 and F > 0.5, and between F = 0.3 and F = 0.9.
On the contrary, the item-based curve exhibits a drastically different behavior. While soaring at F = 0.3 to 3.186, reaching a score almost identical to the user-based CF's peak at F = 0.9, the curve barely rises for F ∈ [0.4, 0.9], remaining rather stable and showing a slight, though insignificant, upward trend. Statistical significance was shown for F = 0 with respect to all other samples taken from F ∈ [0.3, 0.9]. Hence, our online results do not perfectly
align with findings obtained from offline analysis. While the
intra-list similarity chart in Figure 3 indicates that diversity
increases when increasing F , the item-based CF chart defies this trend, first soaring then flattening. We conjecture
that the following three factors account for these peculiarities:
- Diversification factor impact. Our offline analysis of the intra-list similarity already suggested that the effect of topic diversification on item-based CF is much stronger than on user-based CF. Thus, the item-based CF's user-perceived interest coverage is significantly higher at F = 0.3 than the user-based CF's.

- Human perception. We believe that human perception can capture the level of diversification inherent to a list only to some extent. Beyond that point, increasing diversity remains unnoticed. For the application scenario at hand, Figure 4 suggests this point lies around score value 3.2, reached by user-based CF only at F = 0.9, and approximated by item-based CF already at F = 0.3.

- Interaction with accuracy. When analyzing the results obtained, bear in mind that covered range scores are not fully independent from single-vote averages. When accuracy is poor, i.e., the user feels unable to identify recommendations that are interesting to him, chances are high that his discontentment will also negatively affect his diversity rating. For F ∈ [0.5, 0.9], single-vote averages are remarkably low, which might explain why perceived coverage scores do not improve for increasing F.
However, we may conclude that users do perceive the application of topic diversification as an overall positive effect on reading interest coverage.
5.3.2.3 Overall Satisfaction.

5.4 Multiple Linear Regression
Results obtained from analyzing user feedback along various feature axes already indicated that users' overall satisfaction with recommendation lists not only depends on accuracy, but also on the range of reading interests covered. In order to assess that indication more rigorously by means of statistical methods, we applied multiple linear regression to our survey results, choosing the overall list value as the dependent variable. As independent input variables, we provided single-vote averages and covered range, both appearing as first-order and second-order polynomials, i.e., SVA and CR, and SVA² and CR², respectively. We also tried several other, more complex models, without achieving significantly better model fit.
            Estimate    Error    t-Value    Pr(>|t|)
(const)       3.27      0.023    139.56     < 2e-16
SVA          12.42      0.973     12.78     < 2e-16
SVA²         -6.11      0.976     -6.26     4.76e-10
CR           19.19      0.982     19.54     < 2e-16
CR²          -3.27      0.966     -3.39     0.000727

Table 2: Multiple linear regression results for the overall list value
Analyzing the multiple linear regression results, shown in Table 2, the confidence values Pr(>|t|) clearly indicate that statistically significant correlations of both accuracy and covered range with user satisfaction exist. Since statistical significance also holds for the respective second-order polynomials, i.e., CR² and SVA², we conclude that these relationships are non-linear and more complex in nature.
As a matter of fact, linear regression delivers a strong indication that the intrinsic utility of a list of recommended
items is more than just the average value of accuracy votes
for all single items, but also depends on the perceived diversity.
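To indicate how such a fit can be reproduced, the sketch below regresses the overall list value on SVA, SVA², CR, and CR² using ordinary least squares from statsmodels. The survey data itself is not reproduced here, so the arrays below are synthetic placeholders and the fitted coefficients will not match Table 2.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n_lists = 500                              # hypothetical number of evaluated lists
    sva = rng.uniform(1, 5, n_lists)           # single-vote averages (SVA), placeholder data
    cr = rng.uniform(1, 5, n_lists)            # covered range (CR), placeholder data
    overall = (3.3 + 0.4 * sva - 0.05 * sva**2
               + 0.3 * cr - 0.04 * cr**2
               + rng.normal(0, 0.3, n_lists))  # placeholder dependent variable

    # Design matrix with first- and second-order terms, as in Table 2
    X = sm.add_constant(np.column_stack([sva, sva**2, cr, cr**2]))
    fit = sm.OLS(overall, X).fit()
    print(fit.summary())   # estimates, standard errors, t-values, Pr(>|t|)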
6. RELATED WORK
7. CONCLUSION
We presented topic diversification, an algorithmic framework to increase the diversity of a top-N list of recommended products. In order to show its effectiveness in diversifying, we also introduced our new intra-list similarity metric.
Contrasting precision and recall metrics, computed both for user-based and item-based CF and featuring different levels of diversification, with results obtained from a large-scale user survey, we showed that users' overall liking of recommendation lists goes beyond accuracy and involves other factors, e.g., the perceived list diversity. We were thus able to provide empirical evidence that lists are more than mere aggregations of single recommendations, but bear an intrinsic, added value.
Though the effects of diversification were largely marginal on user-based CF, item-based CF performance improved significantly, an indication that there are some behavioral differences between both CF classes. Moreover, while pure item-based CF appeared slightly inferior to pure user-based CF in overall satisfaction, diversifying item-based CF with factors F ∈ [0.3, 0.4] made item-based CF outperform user-based CF. Interestingly, for F ≤ 0.4, no more than three items tend to change with respect to the original list, as shown in Figure 3. Small changes thus have high impact.
We believe our findings are especially valuable for practical application scenarios, since many commercial recommender systems, e.g., Amazon.com [15] and TiVo [1], are item-based, owing to the algorithm's computational efficiency.
8. FUTURE WORK
The problem of finding the right mix for sequential consumption-based recommenders takes us to another future direction worth exploring, namely individually adjusting the right tradeoff between diversification and accuracy. One approach could be to have the user himself define the degree of diversification he likes. Another approach might involve learning the right parameter from the user's behavior, e.g., by observing which recommended items he inspects and devotes more time to.
Finally, we are also thinking about diversity metrics other
than intra-list similarity. For instance, we envision a metric
that measures the extent to which the top-N list actually
reflects the user's profile.
9. ACKNOWLEDGEMENTS
10. REFERENCES
[1] Ali, K., and van Stam, W. TiVo: Making show recommendations using a distributed collaborative filtering architecture. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA, 2004), ACM Press, pp. 394-401.
[2] Balabanović, M., and Shoham, Y. Fab: Content-based, collaborative recommendation. Communications of the ACM 40, 3 (March 1997), 66-72.
[3] Breese, J., Heckerman, D., and Kadie, C. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (Madison, WI, USA, July 1998), Morgan Kaufmann, pp. 43-52.
[4] Cosley, D., Lawrence, S., and Pennock, D. REFEREE: An open framework for practical testing of recommender systems using ResearchIndex. In 28th International Conference on Very Large Databases (Hong Kong, China, August 2002), Morgan Kaufmann, pp. 35-46.
[5] Deshpande, M., and Karypis, G. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems 22, 1 (2004), 143-177.
[6] Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. Rank aggregation methods for the Web. In Proceedings of the Tenth International Conference on World Wide Web (Hong Kong, China, 2001), ACM Press, pp. 613-622.
[7] Fagin, R., Kumar, R., and Sivakumar, D. Comparing top-k lists. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (Baltimore, MD, USA, 2003), SIAM, pp. 28-36.
[8] Goldberg, D., Nichols, D., Oki, B., and Terry, D. Using collaborative filtering to weave an information tapestry. Communications of the ACM 35, 12 (1992), 61-70.
[9] Good, N., Schafer, B., Konstan, J., Borchers, A., Sarwar, B., Herlocker, J., and Riedl, J. Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the 16th National Conference on Artificial Intelligence and Innovative Applications of Artificial Intelligence (Orlando, FL, USA, 1999), American Association for Artificial Intelligence, pp. 439-446.
[10] Hayes, C., Massa, P., Avesani, P., and Cunningham, P. An online evaluation framework for recommender systems.
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]