An Alternative Ranking Problem for Search Engines
Corinna Cortes (1), Mehryar Mohri (2,1), and Ashish Rastogi (2)

(1) Google Research, 76 Ninth Avenue, New York, NY 10011
(2) Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012
Abstract. This paper examines in detail an alternative ranking problem for search engines, movie recommendation, and other similar ranking systems, motivated by the requirement to not just accurately predict pairwise ordering but also preserve the magnitude of the preferences or the difference between ratings. We describe and analyze several cost functions for this learning problem and give stability bounds for their generalization error, extending previously known stability results to non-bipartite ranking and magnitude of preference-preserving algorithms. We present algorithms optimizing these cost functions, and, in one instance, detail both a batch and an on-line version. For this algorithm, we also show how the leave-one-out error can be computed and approximated efficiently, which can be used to determine the optimal values of the trade-off parameter in the cost function. We report the results of experiments comparing these algorithms on several datasets and contrast them with those obtained using an AUC-maximization algorithm. We also compare training times and performance results for the on-line and batch versions, demonstrating that our on-line algorithm scales to relatively large datasets with no significant loss in accuracy.
1 Motivation
In most previous research studies, the problem of ranking has been formulated as that of learning from a labeled sample of pairwise preferences a scoring
function with small pairwise misranking error (Freund et al., 1998; Herbrich
et al., 2000; Crammer & Singer, 2001; Joachims, 2002; Rudin et al., 2005; Agarwal & Niyogi, 2005). But this formulation suffers from some shortcomings.
Firstly, most users inspect only the top results. Thus, it would be natural
to enforce that the results returned near the top be particularly relevant and
correctly ordered. The quality and ordering of the results further down the list
matter less. An average pairwise misranking error directly penalizes errors at
both extremes of a list more heavily than errors towards the middle of the list,
since errors at the extremes result in more misranked pairs. However, one may
wish to explicitly encode the requirement of ranking quality at the top in the cost
function. One common solution is to weight examples differently during training
so that more important or higher-quality results are assigned larger weights. This
imposes higher accuracy on these examples, but does not ensure a high-quality
ordering at the top. A good formulation of this problem leading to a convex
optimization problem with a unique minimum is still an open question.
Another shortcoming of the pairwise misranking error is that this formulation
of the problem and thus the scoring function learned ignore the magnitude of
the preferences. In many applications, it is not sufficient to determine if one
example is preferred to another. One may further request an assessment of how
large that preference is. Taking this magnitude of preference into consideration is
critical, for example in the design of search engines, which originally motivated
our study, but also in other recommendation systems. For a recommendation
system, one may choose to truncate the ordered list returned where a large gap
in predicted preference is found. For a search engine, this may trigger a search
in parallel corpora to display more relevant results.
This motivated our study of the problem of ranking while preserving the
magnitude of preferences, which we will refer to in short as magnitude-preserving
ranking. The problem that we are studying bears some resemblance to that
of ordinal regression (McCullagh, 1980; McCullagh & Nelder, 1983; Shashua &
Levin, 2003; Chu & Keerthi, 2005). It is however distinct from ordinal regression
since in ordinal regression the magnitude of the difference in target values is
not taken into consideration in the formulation of the problem or the solutions
proposed. The algorithm of Chu and Keerthi (2005) does take into account
the ordering of the classes by imposing that the thresholds be monotonically
increasing, but this still ignores the difference of target values and thus does not
follow the same objective. A crucial aspect of the algorithms we propose is that
they penalize misranking errors more heavily in the case of larger magnitudes of
preferences.
We describe and analyze several cost functions for this learning problem and
give stability bounds for their generalization error, extending previously known
stability results to non-bipartite ranking and magnitude of preference-preserving
algorithms. In particular, our bounds extend the framework of (Bousquet &
Elisseeff, 2000; Bousquet & Elisseeff, 2002) to the case of cost functions over
pairs of examples, and extend the bounds of Agarwal and Niyogi (2005) beyond
the bi-partite ranking problem. Our bounds also apply to algorithms optimizing
the so-called hinge rank loss.
We present several algorithms optimizing these cost functions, and in one instance detail both a batch and an on-line version. For this algorithm, MPRank,
we also show how the leave-one-out error can be computed and approximated
efficiently, which can be used to determine the optimal values of the trade-off parameter in the cost function. We also report the results of experiments comparing
these algorithms on several datasets and contrast them with those obtained using RankBoost (Freund et al., 1998; Rudin et al., 2005), an algorithm designed
to minimize the exponentiated loss associated with the Area Under the ROC
Curve (AUC), or pairwise misranking. We also compare training times and performance results for the on-line and batch versions of MPRank, demonstrating
that our on-line algorithm scales to relatively large datasets with no significant
loss in accuracy.
The remainder of the paper is organized as follows. Section 2 describes and
analyzes our algorithms in detail. Section 3 presents stability-based generalization bounds for a family of magnitude-preserving algorithms. Section 4 presents
the results of our experiments with these algorithms on several datasets.
2 Algorithms

Let

$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \mathbb{R})^m \tag{1}$$

be a sample of $m$ labeled examples drawn i.i.d. from a set $X$ according to some distribution $D$. For any $i \in [1, m]$, we denote by $S^{-i}$ the sample derived from $S$ by omitting example $(x_i, y_i)$, and by $S^i$ the sample derived from $S$ by replacing example $(x_i, y_i)$ with another example $(x_i', y_i')$ drawn i.i.d. from $X$ according to $D$. For convenience, we will sometimes denote by $y_x = y_i$ the label of a point $x = x_i \in X$.

The quality of the ranking algorithms we consider is measured with respect to pairs of examples. Thus, a cost function $c$ takes as arguments two sample points. For a fixed cost function $c$, the empirical error $\widehat{R}(h, S)$ of a hypothesis $h \colon X \to \mathbb{R}$ on a sample $S$ is defined by:

$$\widehat{R}(h, S) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} c(h, x_i, x_j). \tag{2}$$

The true error $R(h)$ is defined by

$$R(h) = \mathbb{E}_{x, x' \sim D}\big[ c(h, x, x') \big]. \tag{3}$$
2.1 Cost functions

We introduce several cost functions related to the problem of magnitude-preserving ranking. The first one, the hinge rank loss, penalizes only misranked pairs, by the magnitude of the score difference ($n = 1, 2$):

$$c_{HR}^n(h, x, x') = \begin{cases} 0, & \text{if } (h(x') - h(x))\,(y_{x'} - y_x) \ge 0 \\ |h(x') - h(x)|^n, & \text{otherwise.} \end{cases} \tag{4}$$

However, $c_{HR}^n$ does not take into consideration the true magnitude of preference $y_{x'} - y_x$ for each pair $(x, x')$. The following cost function has this property and penalizes deviations of the predicted magnitude with respect to the true one. Thus, it matches our objective of magnitude-preserving ranking ($n = 1, 2$):

$$c_{MP}^n(h, x, x') = \big| (h(x') - h(x)) - (y_{x'} - y_x) \big|^n. \tag{5}$$

A one-sided version of that cost function penalizing only misranked pairs is given by ($n = 1, 2$):

$$c_{HMP}^n(h, x, x') = \begin{cases} 0, & \text{if } (h(x') - h(x))\,(y_{x'} - y_x) \ge 0 \\ \big| (h(x') - h(x)) - (y_{x'} - y_x) \big|^n, & \text{otherwise.} \end{cases} \tag{6}$$

Finally, we will consider the following cost function derived from the $\epsilon$-insensitive cost function used in SVM regression (SVR) (Vapnik, 1998) ($n = 1, 2$):

$$c_{SVR}^n(h, x, x') = \begin{cases} 0, & \text{if } \big| (h(x') - h(x)) - (y_{x'} - y_x) \big| \le \epsilon \\ \big( \big| (h(x') - h(x)) - (y_{x'} - y_x) \big| - \epsilon \big)^n, & \text{otherwise.} \end{cases} \tag{7}$$
Note that all of these cost functions are convex functions of $h(x)$ and $h(x')$.
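Concretely, the four cost functions above can be written as the following Python sketch (function and argument names are ours; `hx`, `hx2` stand for $h(x)$, $h(x')$, and `yx`, `yx2` for the labels $y_x$, $y_{x'}$):

```python
def c_hr(hx, hx2, yx, yx2, n=1):
    """Hinge rank loss: penalizes only misranked pairs, by |h(x') - h(x)|^n."""
    if (hx2 - hx) * (yx2 - yx) >= 0:
        return 0.0
    return abs(hx2 - hx) ** n

def c_mp(hx, hx2, yx, yx2, n=1):
    """Magnitude-preserving loss: |(h(x') - h(x)) - (y_x' - y_x)|^n."""
    return abs((hx2 - hx) - (yx2 - yx)) ** n

def c_hmp(hx, hx2, yx, yx2, n=1):
    """One-sided version of c_mp: zero on correctly ranked pairs."""
    if (hx2 - hx) * (yx2 - yx) >= 0:
        return 0.0
    return abs((hx2 - hx) - (yx2 - yx)) ** n

def c_svr(hx, hx2, yx, yx2, n=1, eps=0.1):
    """epsilon-insensitive version of c_mp, as in SVM regression."""
    d = abs((hx2 - hx) - (yx2 - yx))
    return 0.0 if d <= eps else (d - eps) ** n
```

All four are convex in the pair of predicted scores, as noted above, which is what makes the regularized problems of the next section tractable.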
2.2 Objective functions

The regularization algorithms based on the cost functions $c_{MP}^n$ and $c_{SVR}^n$ correspond closely to the idea of preserving the magnitude of preferences, since these cost functions penalize deviations of a predicted difference of scores from the target preferences. We will refer to the algorithm minimizing the regularization-based objective function based on $c_{MP}^n$ as MPRank:

$$F(h, S) = \|h\|_K^2 + C\, \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} c_{MP}^n(h, x_i, x_j), \tag{8}$$

and to the one minimizing the objective function based on $c_{SVR}^n$ as SVRank:

$$F(h, S) = \|h\|_K^2 + C\, \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} c_{SVR}^n(h, x_i, x_j). \tag{9}$$

For a fixed $n$, $n = 1, 2$, the same stability bounds hold for both algorithms, as seen in the following section. However, their time complexity is significantly different.
2.3 MPRank

We will first examine the algorithm obtained with the cost function $c_{MP}^2$. The hypothesis set $H$ we consider is that of linear functions $h$, that is $\forall x \in X$, $h(x) = w \cdot \Phi(x)$, where $\Phi$ is the mapping from $X$ to the feature space associated with the kernel $K$. Since the pairwise sum defining the empirical error of $c_{MP}^2$ can be rewritten in terms of a single sum, the objective function can be expressed in matrix form. Let $M_X = [\Phi(x_1) \ldots \Phi(x_m)]$ denote the matrix whose columns are the feature vectors of the sample points, $\bar{M}_X = [\bar{\Phi} \ldots \bar{\Phi}]$ the matrix whose $m$ columns all equal the mean vector $\bar{\Phi} = \frac{1}{m}\sum_{k=1}^{m}\Phi(x_k)$, $M_Y$ the vector of labels, and $\bar{M}_Y$ the vector whose components all equal the mean label $\bar{y}$. Then,

$$F(h, S) = \|w\|^2 + \frac{2C}{m} \big\| M_X^\top W - M_Y \big\|^2 - \frac{2C}{m} \big\| \bar{M}_X^\top W - \bar{M}_Y \big\|^2. \tag{10}$$

The gradient of $F$ is $\nabla F = 2W + \frac{4C}{m}(M_X - \bar{M}_X)\big( (M_X - \bar{M}_X)^\top W - (M_Y - \bar{M}_Y) \big)$. Setting $\nabla F = 0$ yields the unique closed-form solution of the convex optimization problem:

$$W = \bar{C} \big( I + \bar{C}(M_X - \bar{M}_X)(M_X - \bar{M}_X)^\top \big)^{-1} (M_X - \bar{M}_X)(M_Y - \bar{M}_Y), \tag{11}$$

where $\bar{C} = \frac{2C}{m}$. Here, we are using the identity $M_X M_X^\top - \bar{M}_X \bar{M}_X^\top = (M_X - \bar{M}_X)(M_X - \bar{M}_X)^\top$, which is not hard to verify. This provides the solution of the primal problem. Using the identity $\big(I + \bar{C} A A^\top\big)^{-1} A = A \big(I + \bar{C} A^\top A\big)^{-1}$, with $A = M_X - \bar{M}_X$, leads to:

$$W = \bar{C}\, (M_X - \bar{M}_X) \big( I + \bar{C}(M_X - \bar{M}_X)^\top (M_X - \bar{M}_X) \big)^{-1} (M_Y - \bar{M}_Y). \tag{12}$$

This helps derive the solution of the dual problem. For any $x' \in X$,

$$h(x') = \bar{C}\, \bar{K}'^\top \big( I + \bar{C} \bar{K} \big)^{-1} (M_Y - \bar{M}_Y), \tag{13}$$

where $\bar{K}'$ is the column vector with components $\tilde{K}(x', x_j)$, $j \in [1, m]$, and $\bar{K}$ the $m \times m$ matrix with entries $\bar{K}_{ij} = \tilde{K}(x_i, x_j)$, with the centered kernel $\tilde{K}$ defined by

$$\tilde{K}(x', x_j) = K(x', x_j) - \frac{1}{m} \sum_{k=1}^{m} \big( K(x', x_k) + K(x_j, x_k) \big) + \frac{1}{m^2} \sum_{k=1}^{m} \sum_{l=1}^{m} K(x_k, x_l), \tag{14}$$

for all $i, j \in [1, m]$. The solution of the optimization problem for MPRank is close to that of a kernel ridge regression problem, but the presence of the additional centering terms makes it distinct, a fact that can also be confirmed experimentally. However, remarkably, it has the same computational complexity, due to the fact that the optimization problem can be written in terms of a single sum, as already pointed out above. The main computational cost of the algorithm is that of the matrix inversion, which can be computed in time $O(N^3)$ in the primal, where $N$ is the dimension of the feature space, and $O(m^3)$ in the dual case, or $O(N^{2+\omega})$ and $O(m^{2+\omega})$, with $\omega \approx .376$, using faster matrix inversion methods such as that of Coppersmith and Winograd.
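As an illustration of Equation 13, a small Python/NumPy sketch of the dual solution might look as follows. This is our reconstruction, not code from the paper: it forms the centered Gram matrix, solves the regularized linear system, and applies the same centering to the test kernel values; all names are ours.

```python
import numpy as np

def mprank_fit(Kmat, y, C):
    """Dual MPRank: returns dual coefficients for the centered system.
    Kmat: (m, m) Gram matrix on the training sample; y: (m,) labels."""
    m = len(y)
    Cbar = 2.0 * C / m
    # Centered Gram matrix: K_tilde(xi, xj) = (Phi(xi) - mean) . (Phi(xj) - mean)
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    Kc = H @ Kmat @ H
    yc = y - y.mean()
    # alpha solves (I + Cbar * Kc) alpha = yc; predictions use Cbar * ktilde . alpha
    alpha = np.linalg.solve(np.eye(m) + Cbar * Kc, yc)
    return Cbar * alpha

def mprank_predict(Kcross, Ktrain, coef):
    """Kcross: (p, m) kernel values K(x', xj) between test and training points."""
    # Center the cross-kernel consistently with the training Gram matrix (Eq. 14).
    Kt = (Kcross
          - Kcross.mean(axis=1, keepdims=True)
          - Ktrain.mean(axis=0, keepdims=True)
          + Ktrain.mean())
    return Kt @ coef
```

Since the solution involves a single $m \times m$ linear system, the cost is dominated by the $O(m^3)$ solve, as noted above; also, because of the centering, predictions are defined up to an additive constant, which does not affect the induced ranking.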
2.4 SVRank

We will examine the algorithm in the case $n = 1$. As with MPRank, the hypothesis set $H$ that we are considering here is that of linear functions $h$, that is $\forall x \in X$, $h(x) = w \cdot \Phi(x)$. The constrained optimization problem associated with SVRank can thus be rewritten as

$$\text{minimize } F(h, S) = \|w\|^2 + C\, \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} (\xi_{ij} + \xi_{ij}^*)$$

$$\text{subject to } \big( w \cdot (\Phi(x_j) - \Phi(x_i)) \big) - (y_{x_j} - y_{x_i}) \le \epsilon + \xi_{ij},$$
$$(y_{x_j} - y_{x_i}) - \big( w \cdot (\Phi(x_j) - \Phi(x_i)) \big) \le \epsilon + \xi_{ij}^*,$$
$$\xi_{ij}, \xi_{ij}^* \ge 0,$$

for all $i, j \in [1, m]$. Note that the number of constraints is quadratic in the number of examples. Thus, in general, this results in a problem that is more costly to solve than that of MPRank.
Introducing Lagrange multipliers $\alpha_{ij} \ge 0$ and $\alpha_{ij}^* \ge 0$, corresponding to the first two sets of constraints, and multipliers for the constraints $\xi_{ij}, \xi_{ij}^* \ge 0$, leads to the Lagrange function associated with this problem. Taking the gradients, setting them to zero, and applying the Karush-Kuhn-Tucker conditions leads to the following dual maximization problem:

$$\text{maximize } -\frac{1}{2} \sum_{i,j=1}^{m} \sum_{k,l=1}^{m} (\alpha_{ij}^* - \alpha_{ij})(\alpha_{kl}^* - \alpha_{kl}) K_{ij,kl} - \epsilon \sum_{i,j=1}^{m} (\alpha_{ij}^* + \alpha_{ij}) + \sum_{i,j=1}^{m} (\alpha_{ij}^* - \alpha_{ij})(y_j - y_i)$$

$$\text{subject to } 0 \le \alpha_{ij}, \alpha_{ij}^* \le \frac{C}{m^2}, \text{ for all } i, j \in [1, m],$$

where $K_{ij,kl} = \big( \Phi(x_j) - \Phi(x_i) \big) \cdot \big( \Phi(x_l) - \Phi(x_k) \big)$. This is a quadratic programming problem of the same form as the dual of SVM regression, but over pairs of points.
2.5 On-line version of MPRank

Recall from Section 2.3 that the objective function for MPRank can be written as

$$F(h, S) = \|w\|^2 + \frac{2C}{m} \sum_{i=1}^{m} \big[ (w \cdot \Phi(x_i) - y_i)^2 - (w \cdot \bar{\Phi} - \bar{y})^2 \big]. \tag{15}$$

This expression suggests that the solution $w$ can be found by solving the following optimization problem:

$$\text{minimize } F = \|w\|^2 + \frac{2C}{m} \sum_{i=1}^{m} \epsilon_i^2$$
$$\text{subject to } (w \cdot \Phi(x_i) - y_i) - (w \cdot \bar{\Phi} - \bar{y}) = \epsilon_i \text{ for } i = 1, \ldots, m.$$

Introducing the Lagrange multipliers $\beta_i$ corresponding to the $i$th equality constraint leads to the following Lagrange function:

$$L(w, \epsilon, \beta) = \|w\|^2 + \frac{2C}{m} \sum_{i=1}^{m} \epsilon_i^2 + \sum_{i=1}^{m} \beta_i \big( (w \cdot \Phi(x_i) - y_i) - (w \cdot \bar{\Phi} - \bar{y}) - \epsilon_i \big).$$

Setting $\partial L / \partial w = 0$, we obtain $w = -\frac{1}{2} \sum_{i=1}^{m} \beta_i \big( \Phi(x_i) - \bar{\Phi} \big)$, and setting $\partial L / \partial \epsilon_i = 0$ leads to $\epsilon_i = \frac{m \beta_i}{4C}$. Substituting these expressions back in and letting $\alpha_i = -\beta_i / 2$ results in the optimization problem

$$\max_{\alpha} \; -\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \tilde{K}(x_i, x_j) - \frac{m}{2C} \sum_{i=1}^{m} \alpha_i^2 + 2 \sum_{i=1}^{m} \alpha_i \tilde{y}_i, \tag{16}$$

where $\tilde{K}(x_i, x_j) = K(x_i, x_j) - \frac{1}{m} \sum_{k=1}^{m} \big( K(x_i, x_k) + K(x_j, x_k) \big) + \frac{1}{m^2} \sum_{k,l=1}^{m} K(x_k, x_l)$ and $\tilde{y}_i = y_i - \bar{y}$.
Based on the expressions for the partial derivatives of the Lagrange function, we can now describe a gradient descent algorithm that avoids the prohibitive complexity of MPRank that is associated with matrix inversion ($\eta > 0$ denotes a learning rate):

1  for $i \leftarrow 1$ to $m$ do $\alpha_i \leftarrow 0$
2  repeat
3      for $i \leftarrow 1$ to $m$
4          do $\alpha_i \leftarrow \alpha_i + \eta \big( 2\tilde{y}_i - 2 \sum_{j=1}^{m} \alpha_j \tilde{K}(x_i, x_j) - \frac{m}{C} \alpha_i \big)$
5  until convergence
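A direct transcription of the pseudocode above in Python/NumPy might look as follows. This is a sketch under our reconstruction of the dual of Equation 16; the learning rate `eta`, the tolerance, and the round cap are assumptions, not values from the paper:

```python
import numpy as np

def mprank_online(Ktilde, y_tilde, C, eta=0.01, tol=1e-8, max_rounds=10000):
    """Gradient ascent on the dual of Equation 16.
    Ktilde: (m, m) centered Gram matrix; y_tilde: centered labels y_i - mean(y)."""
    m = len(y_tilde)
    alpha = np.zeros(m)
    for _ in range(max_rounds):
        # Gradient of the dual objective with respect to alpha:
        # 2*y_tilde_i - 2*sum_j alpha_j Ktilde_ij - (m/C)*alpha_i
        grad = 2.0 * y_tilde - 2.0 * (Ktilde @ alpha) - (m / C) * alpha
        alpha += eta * grad
        if np.max(np.abs(grad)) < tol:
            break
    return alpha
```

The learning rate must be small enough for the iteration to contract; in practice it can be chosen by monitoring the dual objective across rounds.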
2.6 Leave-one-out analysis for MPRank

Since the cost functions considered here are defined over pairs of points, the usual definition of the leave-one-out error needs to be extended: for $i \ne j$, let $h_{ij}$ denote the hypothesis returned by a learning algorithm $L$ when trained on the sample obtained from $S$ by removing $(x_i, y_i)$ and $(x_j, y_j)$. We define the leave-one-out error of $L$ over $S$ by

$$\mathrm{LOO}(L, S) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j=1, j \ne i}^{m} c(h_{ij}, x_i, x_j). \tag{17}$$

The following proposition shows that with our new definition, the fundamental property of the leave-one-out error is preserved.
Proposition 1. Let $m \ge 2$ and let $h'$ be the hypothesis returned by $L$ when trained over a sample $S'$ of size $m - 2$. Then, the leave-one-out error over a sample $S$ of size $m$ is an unbiased estimate of the true error over a sample of size $m - 2$:

$$\mathbb{E}_{S \sim D}\big[ \mathrm{LOO}(L, S) \big] = R(h'). \tag{18}$$

Proof. Since all points of $S$ are drawn i.i.d. according to the same distribution $D$,

$$\mathbb{E}_{S \sim D}\big[ \mathrm{LOO}(L, S) \big] = \frac{1}{m(m-1)} \sum_{i,j=1, i \ne j}^{m} \mathbb{E}_{S \sim D}\big[ c(h_{ij}, x_i, x_j) \big] \tag{19}$$
$$= \frac{1}{m(m-1)} \sum_{x, x' \in S, x \ne x'} \mathbb{E}_{S \sim D}\big[ c(h_{xx'}, x, x') \big] \tag{20}$$
$$= \mathbb{E}_{S \sim D,\, x, x' \in S}\big[ c(h_{xx'}, x, x') \big]. \tag{21}$$

This last term coincides with $\mathbb{E}_{S' \sim D, |S'| = m-2,\, x, x' \sim D}\big[ c(h_{xx'}, x, x') \big] = R(h')$. □
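A brute-force check of the pair-based leave-one-out error of Equation 17 for a tiny sample (retraining for every held-out pair) can be sketched as follows; the solver mirrors our reconstruction of MPRank's dual solution, and all names are ours:

```python
import numpy as np

def mprank_train_predict(Ktrain, y, Kcross, C):
    """Fit MPRank on (Ktrain, y) and predict at points with cross-kernel Kcross."""
    m = len(y)
    Cbar = 2.0 * C / m
    H = np.eye(m) - np.ones((m, m)) / m
    Kc = H @ Ktrain @ H
    alpha = np.linalg.solve(np.eye(m) + Cbar * Kc, y - y.mean())
    Kt = (Kcross - Kcross.mean(axis=1, keepdims=True)
          - Ktrain.mean(axis=0, keepdims=True) + Ktrain.mean())
    return Cbar * (Kt @ alpha)

def loo_error(K, y, C, cost):
    """Leave-one-out error over pairs, Equation 17."""
    m = len(y)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            keep = [k for k in range(m) if k not in (i, j)]
            preds = mprank_train_predict(K[np.ix_(keep, keep)], y[keep],
                                         K[np.ix_([i, j], keep)], C)
            total += cost(preds[0], preds[1], y[i], y[j])
    return total / (m * (m - 1))
```

This brute-force version retrains $O(m^2)$ times and is only meant as a correctness check for small samples; the closed-form expressions derived next avoid the retraining entirely.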
In Section 2.3, it was shown that the hypothesis returned by MPRank for a sample $S$ is given by $h(x') = \bar{C}\, \bar{K}'^\top (I + \bar{C}\bar{K})^{-1} (M_Y - \bar{M}_Y)$ for all $x' \in X$. Let $K_c$ be the matrix derived from $K$ by replacing each entry $K_{ij}$ of $K$ by the sum of the entries in the same column, $\sum_{i=1}^{m} K_{ij}$. Similarly, let $K_r$ be the matrix derived from $K$ by replacing each entry of $K$ by the sum of the entries in the same row, and let $K_{rc}$ be the matrix whose entries all are equal to the sum of all entries of $K$. Note that the matrix $\bar{K}$ can be written as:

$$\bar{K} = K - \frac{1}{m}(K_r + K_c) + \frac{1}{m^2} K_{rc}. \tag{22}$$

In particular, on the sample points,

$$h(x_i) = \sum_{k=1}^{m} U_{ik}(y_k - \bar{y}), \tag{23}$$

with $U = \bar{C}\bar{K}(I + \bar{C}\bar{K})^{-1}$.
Proposition 2. For $i \ne j$, let $h''$ be the hypothesis obtained by leaving out the pair $(x_i, y_i)$, $(x_j, y_j)$, let $h$ be the hypothesis obtained by training on the full sample $S$, and let $V$ be the matrix of coefficients relating $h''$ to $h$ derived in the proof below. Then $c_{MP}^2(h'', x_i, x_j)$ satisfies

$$\big[ (1 - V_{jj})(1 - V_{ii}) - V_{ij} V_{ji} \big]^2 c_{MP}^2(h'', x_i, x_j) = \Big[ (1 - V_{ii} - V_{ij})(h(x_j) - y_j) - (1 - V_{ji} - V_{jj})(h(x_i) - y_i) - \big[ (1 - V_{ii} - V_{ij})(V_{jj} + V_{ji}) - (1 - V_{ji} - V_{jj})(V_{ii} + V_{ij}) \big] (\bar{h} - \bar{y}) \Big]^2. \tag{24}$$

Proof. By Equation 15, the cost function of MPRank can be written as:

$$F = \|w\|^2 + \frac{2C}{m} \sum_{k=1}^{m} \big[ (h(x_k) - y_k)^2 - (\bar{h} - \bar{y})^2 \big]. \tag{25}$$
The hypothesis $h''$ obtained by leaving out the pair $(x_i, x_j)$ can be viewed as the result of training on the full sample $S$ where we keep all the terms of this sum but select new values $y_i''$ and $y_j''$ for $y_i$ and $y_j$ to ensure that the terms corresponding to $i$ and $j$ are zero. Proceeding this way, the new values $y_i''$ and $y_j''$ must verify the following:

$$h''(x_i) - y_i'' = h''(x_j) - y_j'' = \bar{h}'' - \bar{y}'', \tag{26}$$

with $\bar{y}'' = \frac{1}{m} \big[ y_i'' + y_j'' + \sum_{k \notin \{i,j\}} y_k \big]$. Thus, by Equation 23, $h''(x_i)$ is given by $h''(x_i) = \sum_{k=1}^{m} U_{ik}(y_k'' - \bar{y}'')$. Expanding $h''(x_i) - y_i$ in terms of the original labels and of the quantities $(h''(x_i) - y_i)$ and $(h''(x_j) - y_j)$, and collecting terms, leads to

$$(1 - V_{ii})(h''(x_i) - y_i) - V_{ij}(h''(x_j) - y_j) = (h(x_i) - y_i) - (V_{ii} + V_{ij})(\bar{h} - \bar{y}).$$

Similarly, we have

$$-V_{ji}(h''(x_i) - y_i) + (1 - V_{jj})(h''(x_j) - y_j) = (h(x_j) - y_j) - (V_{jj} + V_{ji})(\bar{h} - \bar{y}).$$

Solving the linear system formed by these two equations with unknown variables $(h''(x_i) - y_i)$ and $(h''(x_j) - y_j)$ gives:

$$\big[ (1 - V_{jj})(1 - V_{ii}) - V_{ij} V_{ji} \big] (h''(x_i) - y_i) = (1 - V_{jj})(h(x_i) - y_i) + V_{ij}(h(x_j) - y_j) - \big[ (V_{ii} + V_{ij})(1 - V_{jj}) + (V_{jj} + V_{ji}) V_{ij} \big] (\bar{h} - \bar{y}).$$

Similarly, we obtain:

$$\big[ (1 - V_{jj})(1 - V_{ii}) - V_{ij} V_{ji} \big] (h''(x_j) - y_j) = V_{ji}(h(x_i) - y_i) + (1 - V_{ii})(h(x_j) - y_j) - \big[ (V_{ii} + V_{ij}) V_{ji} + (V_{jj} + V_{ji})(1 - V_{ii}) \big] (\bar{h} - \bar{y}).$$

Taking the difference of these last two equations and squaring both sides yields the expression of $c_{MP}^2(h'', x_i, x_j)$ given in the statement of the proposition. □
Given $h''$, Proposition 2 and Equation 17 can be used to compute the leave-one-out error of $h''$ efficiently, since the coefficients $U_{ij}$ can be obtained in time $O(m^2)$ from the matrix $(I + \bar{C}\bar{K})^{-1}$ already computed to determine $h$.

Note that by the results of Section 2.3 and the strict convexity of the objective function, $h''$ is uniquely determined and has a closed form. Thus, unless the points $x_i$ and $x_j$ coincide, the factor $\big[ (1 - V_{jj})(1 - V_{ii}) - V_{ij} V_{ji} \big]$ of $c_{MP}^2(h'', x_i, x_j)$ cannot be null. Otherwise, the system of linear equations found in the proof is reduced to a single equation and $h''(x_i)$ (or $h''(x_j)$) is not uniquely specified.
For larger values of $m$, the average value of $h''$ over the sample $S$ should not be much different from that of $h$; thus, we can approximate $\bar{h}''$ by $\bar{h}$. Using this approximation, for a sample with distinct points, we can write for $L = \text{MPRank}$:

$$\mathrm{LOO}(L, S) \approx \frac{1}{m(m-1)} \sum_{i \ne j} \Bigg[ \frac{(1 - V_{ii} - V_{ij})(h(x_j) - y_j) - (1 - V_{ji} - V_{jj})(h(x_i) - y_i)}{(1 - V_{jj})(1 - V_{ii}) - V_{ij} V_{ji}} - \frac{(1 - V_{ii} - V_{ij})(V_{jj} + V_{ji}) - (1 - V_{ji} - V_{jj})(V_{ii} + V_{ij})}{(1 - V_{jj})(1 - V_{ii}) - V_{ij} V_{ji}}\, (\bar{h} - \bar{y}) \Bigg]^2.$$

This can be used to determine efficiently the best value of the trade-off parameter $C$ based on the leave-one-out error.
Observe that the sum of the entries of each row (or each column) of $\bar{K}$ is zero. Let $M_1 \in \mathbb{R}^{m \times 1}$ be the column matrix with all entries equal to 1. In view of this observation, $\bar{K} M_1 = 0$, thus $(I + \bar{C}\bar{K}) M_1 = M_1$, $(I + \bar{C}\bar{K})^{-1} M_1 = M_1$, and

$$\sum_{k=1}^{m} V_{ik}(y_k - \bar{y}) = \sum_{k=1}^{m} V_{ik}\, y_k = \frac{m-1}{m-2}\, h(x_i). \tag{28}$$

These identities further simplify the expression of the matrix $V$ and its relationship with $h$.
3 Stability bounds

Bousquet and Elisseeff (2000) and Bousquet and Elisseeff (2002) gave stability bounds for several regression and classification algorithms. This section shows similar stability bounds for ranking and magnitude-preserving ranking algorithms. This also generalizes the results of Agarwal and Niyogi (2005), which were given in the specific case of bi-partite ranking.

The following definitions are natural extensions to the case of cost functions over pairs of those given by Bousquet and Elisseeff (2002).

Definition 1. A learning algorithm $L$ is said to be uniformly $\beta$-stable with respect to the sample $S$ and cost function $c$ if there exists $\beta \ge 0$ such that for all $S \in (X \times \mathbb{R})^m$ and $i \in [1, m]$,

$$\forall x, x' \in X, \quad \big| c(h_S, x, x') - c(h_{S^{-i}}, x, x') \big| \le \beta. \tag{29}$$

Definition 2. A cost function $c$ is said to be $\sigma$-admissible with respect to a hypothesis set $H$ if there exists $\sigma \ge 0$ such that for all $h, h' \in H$ and all $x, x' \in X$,

$$\big| c(h', x, x') - c(h, x, x') \big| \le \sigma \big( |\Delta h(x')| + |\Delta h(x)| \big), \tag{30}$$

with $\Delta h = h' - h$.
3.1 Magnitude-preserving regularization algorithms

For a cost function $c$ such as those just defined and a regularization function $N$, a regularization-based algorithm can be defined as one minimizing the following objective function:

$$F(h, S) = N(h) + C\, \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} c(h, x_i, x_j), \tag{31}$$

where $C \ge 0$ is a trade-off parameter. In what follows, we take $N(h) = \|h\|_K^2$, the norm in the reproducing kernel Hilbert space associated with a kernel $K$. Assuming that for all $x \in X$, $K(x, x) \le \kappa^2$ for some constant $\kappa \ge 0$, the reproducing property gives the inequality:

$$\forall x \in X, \quad |h(x)| \le \kappa \|h\|_K. \tag{32}$$

With the cost functions previously discussed, the objective function $F$ is then strictly convex and the optimization problem admits a unique solution. In what follows, we will refer to the algorithms minimizing the objective function $F$ with a cost function defined in the previous section as magnitude-preserving regularization algorithms.
Lemma 1. Assume that the hypotheses in $H$ are bounded, that is for all $h \in H$ and $x \in S$, $|h(x) - y_x| \le M$. Then, the cost functions $c_{HR}^n$, $c_{MP}^n$, $c_{HMP}^n$, and $c_{SVR}^n$ are all $\sigma_n$-admissible with $\sigma_1 = 1$, $\sigma_2 = 4M$.

Proof. We will give the proof in the case of $c_{MP}^n$, $n = 1, 2$; the other cases can be treated similarly.

By definition of $c_{MP}^1$, for all $x, x' \in X$,

$$\big| c_{MP}^1(h', x, x') - c_{MP}^1(h, x, x') \big| = \Big| \big| (h'(x') - h'(x)) - (y_{x'} - y_x) \big| - \big| (h(x') - h(x)) - (y_{x'} - y_x) \big| \Big|. \tag{33}$$

Using the identity $\big| |X' - Y| - |X - Y| \big| \le |X' - X|$, valid for all $X, X', Y \in \mathbb{R}$, it follows that

$$\big| c_{MP}^1(h', x, x') - c_{MP}^1(h, x, x') \big| \le \big| \Delta h(x') - \Delta h(x) \big| \tag{34}$$
$$\le |\Delta h(x')| + |\Delta h(x)|, \tag{35}$$

which shows the $\sigma$-admissibility of $c_{MP}^1$ with $\sigma = 1$. For $c_{MP}^2$, for all $x, x' \in X$,

$$\big| c_{MP}^2(h', x, x') - c_{MP}^2(h, x, x') \big| = \Big| \big| (h'(x') - h'(x)) - (y_{x'} - y_x) \big|^2 - \big| (h(x') - h(x)) - (y_{x'} - y_x) \big|^2 \Big| \tag{36}$$
$$\le \big| \Delta h(x') - \Delta h(x) \big| \big( |h'(x') - y_{x'}| + |h'(x) - y_x| + |h(x') - y_{x'}| + |h(x) - y_x| \big) \tag{37}$$
$$\le 4M \big( |\Delta h(x')| + |\Delta h(x)| \big), \tag{38}$$

which shows the $\sigma$-admissibility of $c_{MP}^2$ with $\sigma = 4M$. □
Proposition 3. Assume that the hypotheses in $H$ are bounded, that is for all $h \in H$ and $x \in S$, $|h(x) - y_x| \le M$. Then, a magnitude-preserving regularization algorithm as defined above is $\beta$-stable with $\beta = \frac{4 C \sigma_n^2 \kappa^2}{m}$.

Proof. Fix the cost function to be $c$, one of the $\sigma_n$-admissible cost functions previously discussed. Let $h_S$ denote the function minimizing $F(h, S)$ and $h_{S^k}$ the one minimizing $F(h, S^k)$. We denote by $\Delta h_S = h_{S^k} - h_S$.

Since the cost function $c$ is convex with respect to $h(x)$ and $h(x')$, $\widehat{R}(h, S)$ is also convex with respect to $h$ and for $t \in [0, 1]$,

$$\widehat{R}(h_S + t \Delta h_S, S^k) - \widehat{R}(h_S, S^k) \le t \big[ \widehat{R}(h_{S^k}, S^k) - \widehat{R}(h_S, S^k) \big]. \tag{39}$$

Similarly,

$$\widehat{R}(h_{S^k} - t \Delta h_S, S^k) - \widehat{R}(h_{S^k}, S^k) \le t \big[ \widehat{R}(h_S, S^k) - \widehat{R}(h_{S^k}, S^k) \big]. \tag{40}$$

Summing these two inequalities yields

$$\widehat{R}(h_S + t \Delta h_S, S^k) - \widehat{R}(h_S, S^k) + \widehat{R}(h_{S^k} - t \Delta h_S, S^k) - \widehat{R}(h_{S^k}, S^k) \le 0. \tag{41}$$

By the definition of $h_S$ and $h_{S^k}$ as minimizers of $F(\cdot, S)$ and $F(\cdot, S^k)$,

$$F(h_S + t \Delta h_S, S) - F(h_S, S) \ge 0 \quad \text{and} \quad F(h_{S^k} - t \Delta h_S, S^k) - F(h_{S^k}, S^k) \ge 0. \tag{42}$$

Summing these two inequalities and using Inequality 41 leads to

$$\|h_S\|_K^2 - \|h_S + t \Delta h_S\|_K^2 + \|h_{S^k}\|_K^2 - \|h_{S^k} - t \Delta h_S\|_K^2 \le A, \tag{43}$$

with $A = C \big[ \widehat{R}(h_S, S) - \widehat{R}(h_S, S^k) + \widehat{R}(h_S + t \Delta h_S, S^k) - \widehat{R}(h_S + t \Delta h_S, S) \big]$. Since $S$ and $S^k$ differ only by the $k$th point,

$$A = \frac{C}{m^2} \Big[ \sum_{i \ne k} \big( c(h_S, x_i, x_k) - c(h_S + t \Delta h_S, x_i, x_k) \big) + \sum_{i \ne k} \big( c(h_S, x_k, x_i) - c(h_S + t \Delta h_S, x_k, x_i) \big) \Big], \tag{44}$$

and by the $\sigma_n$-admissibility of $c$,

$$|A| \le \frac{2 C t \sigma_n}{m^2} \sum_{i \ne k} \big( |\Delta h_S(x_k)| + |\Delta h_S(x_i)| \big) \le \frac{4 C t \sigma_n \kappa}{m} \|\Delta h_S\|_K.$$

Using the fact that $\|h\|_K^2 = \langle h, h \rangle$ for any $h$, it is not hard to show that

$$\|h_S\|_K^2 - \|h_S + t \Delta h_S\|_K^2 + \|h_{S^k}\|_K^2 - \|h_{S^k} - t \Delta h_S\|_K^2 = 2 t (1 - t) \|\Delta h_S\|_K^2.$$

In view of this and the inequality for $|A|$, Inequality 43 implies $2 t (1 - t) \|\Delta h_S\|_K^2 \le \frac{4 C t \sigma_n \kappa}{m} \|\Delta h_S\|_K$, that is, after dividing by $t$ and taking $t \to 0$,

$$\|\Delta h_S\|_K \le \frac{2 C \sigma_n \kappa}{m}. \tag{45}$$

By the $\sigma_n$-admissibility of $c$ and Inequality 32, for all $x, x' \in X$,

$$\big| c(h_{S^k}, x, x') - c(h_S, x, x') \big| \le \sigma_n \big( |\Delta h_S(x')| + |\Delta h_S(x)| \big) \tag{46}$$
$$\le 2 \sigma_n \kappa \|\Delta h_S\|_K \tag{47}$$
$$\le \frac{4 C \sigma_n^2 \kappa^2}{m}, \tag{48}$$

which shows the $\beta$-stability of the algorithm with $\beta = \frac{4 C \sigma_n^2 \kappa^2}{m}$. □
Theorem 1. Let $c$ be any of the cost functions defined in Section 2.1. Let $L$ be a uniformly $\beta$-stable algorithm with respect to the sample $S$ and cost function $c$ and let $h_S$ be the hypothesis returned by $L$. Assume that the hypotheses in $H$ are bounded, that is for all $h \in H$, sample $S$, and $x \in S$, $|h(x) - y_x| \le M$. Then, for any $\epsilon > 0$,

$$\Pr_{S \sim D} \Big[ \big| R(h_S) - \widehat{R}(h_S, S) \big| > \epsilon + 2\beta \Big] \le 2 e^{-\frac{m \epsilon^2}{2 (m \beta + (2M)^n)^2}}. \tag{49}$$
Proof. The proof is based on the application of McDiarmid's inequality (McDiarmid, 1998) to the random variable $\Phi(S) = R(h_S) - \widehat{R}(h_S, S)$. We first bound its expectation. Since the sample points in $S$ are drawn in an i.i.d. fashion, for all $i, j \in [1, m]$,

$$\mathbb{E}_S\big[ \widehat{R}(h_S, S) \big] = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathbb{E}_S\big[ c(h_S, x_i, x_j) \big] \tag{50}$$
$$= \mathbb{E}_{S \sim D}\big[ c(h_S, x_i, x_j) \big] \tag{51}$$
$$= \mathbb{E}_{S^{i,j} \sim D}\big[ c(h_{S^{i,j}}, x_i, x_j) \big], \tag{52}$$

where $S^{i,j}$ denotes the sample obtained from $S$ by replacing $(x_i, y_i)$ and $(x_j, y_j)$ with two other points drawn i.i.d. according to $D$. Note that by definition of $R(h_S)$,

$$\mathbb{E}_S\big[ R(h_S) \big] = \mathbb{E}_{S, x, x' \sim D}\big[ c(h_S, x, x') \big]. \tag{53}$$

Thus, $\mathbb{E}_S[\Phi(S)] = \mathbb{E}_{S, x, x'}\big[ c(h_S, x_i, x_j) - c(h_{S^{i,j}}, x_i, x_j) \big]$, and by $\beta$-stability (Proposition 3),

$$\big| \mathbb{E}_S[\Phi(S)] \big| \le \mathbb{E}_{S, x, x'}\big[ |c(h_S, x_i, x_j) - c(h_{S^i}, x_i, x_j)| \big] \tag{54}$$
$$\quad + \mathbb{E}_{S, x, x'}\big[ |c(h_{S^i}, x_i, x_j) - c(h_{S^{i,j}}, x_i, x_j)| \big] \tag{55}$$
$$\le 2\beta. \tag{56}$$
Now, by $\beta$-stability,

$$\big| R(h_S) - R(h_{S^k}) \big| = \Big| \mathbb{E}_{x, x'}\big[ c(h_S, x, x') - c(h_{S^k}, x, x') \big] \Big| \le \beta. \tag{57}$$

For the empirical errors, separating out the terms that involve the replaced point,

$$\big| \widehat{R}(h_S, S) - \widehat{R}(h_{S^k}, S^k) \big| \le \frac{1}{m^2} \sum_{i \ne k} \sum_{j \ne k} \big| c(h_S, x_i, x_j) - c(h_{S^k}, x_i, x_j) \big| \tag{60}$$
$$\quad + \frac{1}{m^2} \sum_{j=1}^{m} \big| c(h_S, x_k, x_j) - c(h_{S^k}, x_k', x_j) \big| \tag{61}$$
$$\quad + \frac{1}{m^2} \sum_{i=1}^{m} \big| c(h_S, x_i, x_k) - c(h_{S^k}, x_i, x_k') \big| \tag{62}$$
$$\le \frac{1}{m^2} (m^2 \beta) + \frac{2}{m} (2M)^n = \beta + 2 (2M)^n / m, \tag{63}$$

where the first sum is bounded using $\beta$-stability and each term of the last two sums is bounded by $(2M)^n$, since $0 \le c \le (2M)^n$ for all the cost functions considered. Thus,

$$\big| \Phi(S) - \Phi(S^k) \big| \le 2 \big( \beta + (2M)^n / m \big), \tag{64}$$

and the statement follows by the application of McDiarmid's inequality to $\Phi(S)$. □
The following corollary gives stability bounds for the generalization error of magnitude-preserving regularization algorithms.

Corollary 1. Let $L$ be a magnitude-preserving regularization algorithm, let $c$ be the corresponding cost function, and assume that for all $x \in X$, $K(x, x) \le \kappa^2$. Assume that the hypothesis set $H$ is bounded, that is for all $h \in H$, sample $S$, and $x \in S$, $|h(x) - y_x| \le M$. Then, with probability at least $1 - \delta$,

for $n = 1$:
$$R(h_S) \le \widehat{R}(h_S, S) + \frac{8 \kappa^2 C}{m} + 2 \big( 2 \kappa^2 C + M \big) \sqrt{\frac{2}{m} \log \frac{2}{\delta}}; \tag{65}$$

for $n = 2$:
$$R(h_S) \le \widehat{R}(h_S, S) + \frac{128 \kappa^2 C M^2}{m} + 4 M^2 \big( 16 \kappa^2 C + 1 \big) \sqrt{\frac{2}{m} \log \frac{2}{\delta}}. \tag{66}$$

Proof. By Proposition 3, these algorithms are $\beta$-stable with $\beta = \frac{4 C \sigma_n^2 \kappa^2}{m}$, where $\sigma_1 = 1$ and $\sigma_2 = 4M$ by Lemma 1. Plugging these values into the bound of Theorem 1 yields the statement. □

These bounds are informative for values of $C \ll \sqrt{m}$.
4 Experiments
In this section, we report the results of experiments with two of our magnitude-preserving algorithms, MPRank and SVRank.
The algorithms were tested on four publicly available data sets, three of which
are commonly used for collaborative filtering: MovieLens, Book-Crossings, and
Jester Joke. The fourth data set is the Netflix data. The first three datasets are
available from the following URL:
http://www.grouplens.org/taxonomy/term/14.
The Netflix data set is available at
http://www.netflixprize.com/download.
4.1
MovieLens Dataset
Table 1. Mean squared difference (MSD) and mean 1-norm difference (M1D) for MPRank, SVRank, and RankBoost (RBoost); each entry reports mean ± standard deviation over the repeated experiments.

Data set              MSD MPRank    MSD SVRank    MSD RBoost     M1D MPRank    M1D SVRank    M1D RBoost
MovieLens 20-40       2.01 ± 0.02   2.43 ± 0.13   12.88 ± 2.15   1.04 ± 0.05   1.17 ± 0.03   2.59 ± 0.04
MovieLens 40-60       2.02 ± 0.06   2.36 ± 0.16   20.06 ± 2.76   1.04 ± 0.02   1.15 ± 0.07   2.99 ± 0.12
MovieLens 60-80       2.07 ± 0.05   2.66 ± 0.09   21.35 ± 2.71   1.06 ± 0.01   1.24 ± 0.02   3.82 ± 0.23
Jester 20-40          51.34 ± 2.90  55.00 ± 5.14  77.08 ± 17.1   5.08 ± 0.15   5.40 ± 0.20   5.97 ± 0.16
Jester 40-60          46.77 ± 2.03  57.75 ± 5.14  80.00 ± 18.2   4.98 ± 0.13   5.27 ± 0.20   6.18 ± 0.11
Jester 60-80          49.33 ± 3.11  56.06 ± 4.26  88.61 ± 18.6   4.88 ± 0.14   5.25 ± 0.19   6.46 ± 0.20
Netflix Density:32%   1.58 ± 0.04   1.80 ± 0.05   57.5 ± 7.8     0.92 ± 0.01   0.95 ± 0.02   6.48 ± 0.55
Netflix Density:46%   1.55 ± 0.03   1.90 ± 0.06   23.9 ± 2.9     0.95 ± 0.01   1.02 ± 0.02   4.10 ± 0.23
Netflix Density:58%   1.49 ± 0.03   1.93 ± 0.06   12.33 ± 1.47   0.94 ± 0.01   1.06 ± 0.02   3.01 ± 0.15
Books                 4.00 ± 3.12   3.64 ± 3.04   7.58 ± 9.95    1.38 ± 0.60   1.32 ± 0.56   1.72 ± 1.05
process was then repeated ten times with a different set of 300 reviewers selected
at random. We report mean values and standard deviation for these ten repeated
experiments for each of the three groups. Missing review values in the input
features were populated with the median review score of the given reference
reviewer.
4.2 Jester Joke Dataset
The Jester Joke Recommender System dataset contains 4.1M continuous ratings
in the range -10.00 to +10.00 of 100 jokes from 73,496 users. The experiments
were set up in the same way as for the MovieLens dataset.
4.3
Netflix Dataset
The Netflix dataset contains more than 100M ratings by 480,000 users for 17,700
movies. Ratings are integers in the range of 1 to 5. We constructed three subsets
of the data with different user densities. Subsets were obtained by thresholding
against two parameters: the minimum number of movies rated by a user and
the minimum number of ratings received by a movie. Thus, in choosing users for the training
and testing set, we only consider those users who have reviewed more than 150,
500, or 1500 movies respectively. Analogously, in selecting the movies that would
appear in the subset data, we only consider those movies that have received at
least 360, 1200, or 1800 reviews. The experiments were then set-up in the same
way as for the MovieLens dataset. The mean densities of the three subsets (across
the ten repetitions) were 32%, 46% and 58% respectively. Finally, the test raters
were selected from a mixture of the three densities.
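The subset construction described above amounts to a simple thresholding pass over the ratings; a schematic Python sketch (the tuple format and function name are ours, and the default cutoffs are just the smallest values quoted above):

```python
from collections import Counter

def filter_ratings(ratings, min_user_ratings=150, min_movie_ratings=360):
    """Keep only ratings by sufficiently active users of sufficiently rated movies.
    ratings: iterable of (user_id, movie_id, score) tuples."""
    ratings = list(ratings)
    user_counts = Counter(u for u, _, _ in ratings)
    movie_counts = Counter(m for _, m, _ in ratings)
    return [(u, m, s) for u, m, s in ratings
            if user_counts[u] >= min_user_ratings
            and movie_counts[m] >= min_movie_ratings]
```

Note that a single pass like this does not enforce the thresholds jointly; one could iterate the filtering to a fixed point if the thresholds must hold in the final subset.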
4.4
Book-Crossing Dataset
The Book-Crossing dataset contains 1,149,780 ratings for 271,379 books from a
group of 278,858 users. The low density of ratings makes predictions very noisy
in this task. Thus, we required users to have reviewed at least 200 books, and
then only kept books with at least 10 reviews. This left us with a dataset of 89
books and 131 reviewers. For this dataset, each of the 131 reviewers was in turn
selected as a test reviewer, and the other 130 reviewers served as input features.
The results reported are mean values and standard deviations over these 131
leave-one-out experiments.
4.5 Performance Measures and Results

The cost function of MPRank minimizes the squared difference of the predicted and true magnitudes of preference over all pairs of examples; hence we report the mean squared difference, MSD:

$$\frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \big( (h(x_j) - h(x_i)) - (y_j - y_i) \big)^2. \tag{67}$$
The cost function of SVRank minimizes the absolute value of the difference between all pairs of examples; hence we report the mean 1-norm difference, M1D:

$$\frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \big| (h(x_j) - h(x_i)) - (y_j - y_i) \big|. \tag{68}$$
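Both measures can be computed directly from the vectors of predicted scores and true labels; the following NumPy sketch (names ours) makes the pairwise definitions concrete:

```python
import numpy as np

def msd(h, y):
    """Mean squared difference over all pairs (Equation 67)."""
    dh = h[None, :] - h[:, None]     # dh[i, j] = h(x_j) - h(x_i)
    dy = y[None, :] - y[:, None]
    return np.mean((dh - dy) ** 2)

def m1d(h, y):
    """Mean 1-norm difference over all pairs (Equation 68)."""
    dh = h[None, :] - h[:, None]
    dy = y[None, :] - y[:, None]
    return np.mean(np.abs(dh - dy))
```

The broadcast over the pair matrix keeps the computation $O(m^2)$, matching the double sums in the definitions.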
The results for MPRank and SVRank are obtained using Gaussian kernels. The
width of the kernel and the other cost function parameters were first optimized
on a held-out sample. The performance on their respective cost functions was
optimized and the parameters fixed at these values.
The results are reported in Table 1. They demonstrate that the magnitude-preserving algorithms are both successful at minimizing their respective objective. MPRank obtains the best MSD values and the two algorithms obtain comparable M1D values. However, overall, in view of these results and the lower computational cost of MPRank already pointed out in Section 2.4, MPRank appears to be the preferable of the two magnitude-preserving algorithms.

To compare with an algorithm designed for pairwise misranking, we also measured the average pairwise misranking error of MPRank and RankBoost, that is the fraction of pairs whose relative order is predicted incorrectly:

$$\frac{\sum_{i,j=1}^{m} 1_{\{(y_j - y_i)(h(x_j) - h(x_i)) < 0\}}}{\sum_{i,j=1}^{m} 1_{\{y_j \ne y_i\}}}. \tag{69}$$

The results are reported in Table 2.

Table 2. Pairwise misranking error for MPRank and RankBoost (RBoost); mean ± standard deviation.

Data set              MPRank          RBoost
MovieLens 40-60       0.471 ± 0.005   0.476 ± 0.007
MovieLens 60-80       0.442 ± 0.005   0.463 ± 0.011
Jester 20-40          0.414 ± 0.005   0.479 ± 0.008
Jester 40-60          0.418 ± 0.007   0.432 ± 0.005
Netflix Density:32%   0.433 ± 0.018   0.447 ± 0.027
Netflix Density:46%   0.368 ± 0.014   0.327 ± 0.008
Netflix Density:58%   0.295 ± 0.006   0.318 ± 0.008
The results show that the pairwise misranking error of MPRank is comparable to
that of RankBoost. This further increases the benefits of MPRank as a ranking
algorithm.
Fig. 1. (a) Convergence of the on-line learning algorithm towards the batch solution.
Rounding errors give rise to slightly different solutions. (b) Training time in seconds
for the on-line and the batch algorithm. For small training set sizes the batch version
is fastest, but for larger training set sizes the on-line version is faster. Eventually the
batch version becomes infeasible.
4.6 On-line Version of MPRank
Using the Netflix data we also experimented with the on-line version of MPRank
described in Section 2.5. The main questions we wished to investigate were the
convergence rate and CPU time savings of the on-line version with respect to the
batch algorithm MPRank (Equation 13). The batch solution requires a matrix
inversion and becomes infeasible for large training sets.
Figure 1(a) illustrates the convergence rate for a typical reviewer. In this
instance, the training and test sets each consisted of about 700 movies. As can
be seen from the plot, the on-line version converges to the batch solution in
about 120 rounds, where one round is a full cycle through the training set.
Based on monitoring several convergence plots, we decided to terminate learning in the on-line version of MPRank when consecutive rounds of iterations over the full training set change the cost function by less than 0.01%.
Figure 1(b) compares the CPU time for the on-line version of MPRank with the
batch solution. For both computations of the CPU times, the time to construct
the Gram matrix is excluded. The figure shows that the on-line version is significantly faster for large datasets, which extends the applicability of our algorithms
beyond the limits of intractable matrix inversion.
5 Conclusion

We presented a detailed study of magnitude-preserving ranking algorithms: several cost functions for the problem, stability-based generalization bounds for the corresponding regularization algorithms, an efficient computation and approximation of the leave-one-out error for MPRank, and experiments on several datasets. The results show that MPRank both preserves the magnitude of preferences and achieves a pairwise misranking error comparable to that of RankBoost, and that its on-line version scales to relatively large datasets with no significant loss in accuracy.
Acknowledgments
The work of Mehryar Mohri and Ashish Rastogi was partially funded by the
New York State Office of Science, Technology and Academic Research (NYSTAR). This project was also sponsored in part by the Department of the Army
Award Number W81XWH-04-1-0307. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick MD 21702-5014 is the awarding and administering acquisition office. The content of this material does not
necessarily reflect the position or the policy of the Government and no official
endorsement should be inferred.
Bibliography
Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking
algorithms. Proceedings of COLT 2005.
Bousquet, O., & Elisseeff, A. (2000). Algorithmic stability and generalization
performance. Advances in Neural Information Processing Systems (NIPS
2000).
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. J. Mach.
Learn. Res., 2, 499-526.
Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression. Proceedings of the 22nd International Conference on Machine
Learning (pp. 145-152). New York, NY, USA: ACM Press.
Cortes, C., & Mohri, M. (2004). AUC Optimization vs. Error Rate Minimization.
Advances in Neural Information Processing Systems (NIPS 2003). Vancouver,
Canada: MIT Press.
Cortes, C., Mohri, M., & Rastogi, A. (2007). Magnitude-preserving ranking algorithms (Technical Report TR-2007-887). Courant Institute of Mathematical
Sciences, New York University.
Crammer, K., & Singer, Y. (2001). Pranking with ranking. Advances in Neural
Information Processing Systems (NIPS 2001).
Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (1998). An efficient boosting
algorithm for combining preferences. Proceedings of the 15th International
Conference on Machine Learning (pp. 170-178). Madison, US: Morgan Kaufmann Publishers, San Francisco, US.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Smola, Bartlett, Schoelkopf and Schuurmans
(Eds.), Advances in large margin classifiers, 115-132. MIT Press, Cambridge,
MA.
Joachims, T. (2002). Evaluating retrieval performance using clickthrough data.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal
Statistical Society B, 42.
McCullagh, P., & Nelder, J. A. (1983). Generalized linear models. Chapman &
Hall, London.
McDiarmid, C. (1998). Concentration. Probabilistic Methods for Algorithmic
Discrete Mathematics (pp. 195-248).
Netflix (2006). Netflix prize. http://www.netflixprize.com/.
Rudin, C., Cortes, C., Mohri, M., & Schapire, R. E. (2005). Margin-Based
Ranking Meets Boosting in the Middle. Proceedings of COLT 2005 (pp. 63-78). Springer, Heidelberg, Germany.
Shashua, A., & Levin, A. (2003). Ranking with large margin principle: Two
approaches. Advances in Neural Information Processing Systems (NIPS 2003).
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley-Interscience.
Wahba, G. (1990). Spline models for observational data. SIAM [Society for
Industrial and Applied Mathematics].