
Learning to Rank With Bregman Divergences and Monotone Retargeting

Sreangsu Acharyya
Dept. of Electrical Engineering, University of Texas at Austin

Oluwasanmi Koyejo
Dept. of Electrical Engineering, University of Texas at Austin

Joydeep Ghosh
Dept. of Electrical Engineering, University of Texas at Austin

* Both authors contributed equally.

Abstract
This paper introduces a novel approach for learning to rank (LETOR) based on the notion of monotone retargeting. It involves minimizing a divergence between all monotonic increasing transformations of the training scores and a parameterized prediction function. The minimization is both over the transformations as well as over the parameters. It is applied to Bregman divergences, a large class of distance-like functions that were recently shown to be the unique class that is statistically consistent with the normalized discounted cumulative gain (NDCG) criterion [19]. The algorithm uses alternating projection style updates, in which one set of simultaneous projections can be computed independent of the Bregman divergence and the other reduces to parameter estimation of a generalized linear model. This results in an easily implemented, efficiently parallelizable algorithm for the LETOR task that enjoys global optimum guarantees under mild conditions. We present empirical results on benchmark datasets showing that this approach can outperform the state of the art NDCG consistent techniques.

1 Introduction

Structured output space models [1] have dominated the task of learning to rank (LETOR). Regression based models have been justifiably superseded by pairwise models [11], which in turn are being gradually displaced by list-wise approaches [6, 17]. This trend has on one hand greatly improved the quality of the predictions obtained, but on the other hand has come at the cost of additional complexity and computation. The cost functions of structured models are often defined directly on the combinatorial space of permutations, which significantly increases the difficulty of learning and optimization compared to regression based approaches.

We propose an approach to the LETOR task that retains the simplicity of the regression based models, is simple to implement, is embarrassingly parallelizable, and yet is a function of ordering alone. Furthermore, MR enjoys strong guarantees of convergence, statistical consistency under uncertainty, and a global minimum under mild conditions. Our experiments on benchmark datasets show that the proposed approach outperforms state of the art models in terms of several common LETOR metrics.

We adapt regression to the LETOR task by using Bregman divergences and monotone retargeting (MR). MR is a novel technique that we introduce in this paper, and Bregman divergences [5] are a family of distance-like functions well studied in optimization [8], statistics and machine learning [2] due to their one to one connections with modeling uncertainty using exponential family distributions. Bregman divergences are the unique class of strongly statistically consistent surrogate cost functions for the NDCG criterion [19], a de facto standard of ranking quality. In addition to these statistical properties, Bregman divergences have several properties useful for optimization and specifically useful for ranking. The LETOR task decomposes into subproblems that are equivalent to estimating (unconstrained as well as constrained) generalized linear models. The Bregman divergence machinery provides easy to implement, scalable algorithms for them, with a user chosen level of granularity of parallelism. We hope the reader will appreciate the flexibility of choosing an appropriate divergence to encode desirable properties on the rankings while enjoying the strong guarantees that come with the family.

We introduce MR by first discussing direct regression of rank scores and highlighting its primary deficiency: its attempt to fit the scores exactly. An exact fit is unnecessary, since any score that induces the correct ordering is sufficient. MR addresses this problem by searching for an order preserving transformation of the target scores that may be easier for the regressor to fit: hence the name retargeting.

Let us briefly sketch our line of attack. In section 2 we present a method to reduce the optimization over the infinite class of all monotonic increasing functions to that of alternating projection over a finite dimensional vector space. In section 3.2.3 we show when that optimization problem is jointly convex by resolving the question of joint convexity of the Fenchel-Young gap; this result is important in its own right. We introduce Bregman divergences in section 3 and discuss properties that make them particularly suited to the ranking task. We show (i) that one set of the alternating projections can be computed in a Bregman divergence independent fashion (section 3.2.1), and (ii) that separable Bregman divergences allow us to use sorting (section 3.2.2), which would otherwise have required exhaustive combinatorial enumeration or repeatedly solving a linear assignment problem.

Notation: Vectors are denoted by bold lower case letters; matrices are capitalized. x′ denotes the transpose of the vector x and ||x|| denotes the L2 norm. Diag(x) denotes a diagonal matrix with its diagonal set to the vector x. Adj-Diff(x) denotes the vector obtained by taking adjacent differences of consecutive components of [x; 0]; thus Cum-Sum(Adj-Diff(x)) = x. A vector x is defined to be in descending order if x_i ≥ x_j whenever i < j; the set of such vectors is denoted by R↓. A vector x is isotonic with y if x_i ≥ x_j implies y_i ≥ y_j. The unit simplex is denoted by Δ and the positive orthant by R_+^d. φ*(·) denotes the Legendre dual of the function φ(·). Partitions of sets are denoted by Π and P.
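As a concrete illustration of the notation, here is a minimal NumPy sketch; the function names below are ours, mirroring Adj-Diff and Cum-Sum:

    import numpy as np

    def adj_diff(x):
        # Adjacent differences of consecutive components of [x; 0]:
        # (x1 - x2, ..., x_{d-1} - x_d, x_d).
        return np.append(x[:-1] - x[1:], x[-1])

    def cum_sum(x):
        # Cumulative sums taken from the right, so cum_sum(adj_diff(x)) recovers x.
        return np.cumsum(x[::-1])[::-1]

    x = np.array([5.0, 3.0, 2.0, -1.0])            # a vector in descending order
    assert np.allclose(cum_sum(adj_diff(x)), x)    # Cum-Sum(Adj-Diff(x)) = x
    assert np.all(adj_diff(x)[:-1] >= 0)           # descending iff the leading differences are >= 0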

2 Monotone Retargeting

We introduce our formulation of learning to rank. It consists of a set of queries Q = {q_1, ..., q_|Q|} and a set of items V that are to be ranked in the context of the queries. For every query q_i there is a subset V_i ⊆ V whose elements have been ordered based on their relevance to the query. This ordering is customarily expressed via a rank score vector r̂_i ∈ R^{d_i}, d_i = |V_i|, whose components r̂_{ij} correspond to items in V_i. Beyond establishing an order over the set V_i, the actual values of r̂_{ij} are of no significance. For a query q_i the index j of r̂_{ij} is local to the set V_i, hence r̂_{ij} and r̂_{kj} need not correspond to the same object. We shall further assume, with no loss of generality, that the subscript j is assigned such that r̂_i is in descending order for every V_i. Note that r̂_i induces a partial order if the number of unique values k_i in the vector is less than d_i.

For every query-object pair {q_i, v_{ij}} a feature vector a_{ij} = F(q_i, v_{ij}) ∈ R^n is pre-computed. The subset of training data pertinent to any query q_i is the pair {r̂_i, A_i} and is called its qset. Thus the column vector r̂_i consists of the rank-scores r̂_{ij}, and A_i is a matrix whose jth row is a_{ij}. Given a loss function D_i : R^{|V_i|} × R^{|V_i|} → R_+ and a fixed parametric form f : R^{|V_i| × n} × R^n → R^{|V_i|}, we may define the regression problem

min_w Σ_i D_i( r̂_i, f(A_i, w) )

with the parameter w. As discussed, this is unnecessarily stringent for ranking. A better alternative is

min_{w, {ψ_i ∈ M}} Σ_i D_i( ψ_i(r̂_i), f(A_i, w) ),

where ψ_i : R^{|V_i|} → R^{|V_i|} transforms the components of its argument by a fixed monotonic increasing function ψ_i, and M is the class of all such functions. Now f(A_i, w) no longer needs to equal r̂_i point-wise to incur zero loss; it is sufficient for some monotonic increasing transform of f(A_i, w) to do so. With no loss in generality of modeling, we may apply the monotonic transform to r̂_i instead. This avoids the minimization over the function composition, but the need for minimizing over the set of all monotone functions remains. One possible way to eliminate the minimization over the function space is to restrict attention to some parametric family in M, at the expense of generality. Instead, with no loss in generality, the optimization over the infinite space of functions M can be converted into one over the finite dimensional vector spaces R^{|V_i|}, provided we have a finite characterization of the constraint set R↓_i:

min_{w, r_i ∈ R↓_i} Σ_i D_i( r_i, f(A_i, w) )   s.t.   R↓_i = { r | ∃ψ ∈ M : ψ(r̂_i) = r }.   (1)

The Set R↓_i: The convex combination r = αr_1 + (1 − α)r_2 of two isotonic vectors r_1 and r_2 preserves isotonicity, as does the scaling αr_1 for any α ∈ R_+. Hence the set R↓_i is a convex cone. This makes the problem computationally tractable because the set can be described entirely by its extreme rays, or by the extreme rays of its polar. We claim that the set R↓_i can be expressed as the image of the set {R_+}^{|V_i|−1} × R under a linear transformation by a particular upper triangular matrix U with positive entries:

R↓_i = { U x   s.t.   x ∈ {R_+}^{|V_i|−1} × R }.

The matrix U is not unique and can be generated from any vector v ∈ R_+^{|V_i|}; but, as we shall see, any member of the allowed class of U is sufficient for an exhaustive representation of R↓_i.

Lemma 1. The set of all vectors in R^d that are sorted in descending order is given by { U x  s.t.  x ∈ {R_+}^{d−1} × R }, where U is a triangular matrix generated from a vector v ∈ R_+^d such that its ith row is U(i, :) = [ {0}^{i−1}, v(i:) ].

Proof. Consider solving U x = r̂_i for any vector r̂_i sorted in descending order. We have x = Diag(v)^{−1} Adj-Diff(r̂_i), which is in {R_+}^{|V_i|−1} × R.

For regression functions capable of fitting an arbitrary additive offset, no generality is lost by constraining the last component to be non-negative. In addition to the set R↓_i we shall make frequent use of the set of all discrete probability distributions that are in descending order, i.e. R↓_i ∩ Δ_i, which we represent by Δ↓_i.

We give a similar representation of this set by generating an upper triangular matrix T from the vector v = {1, 1/2, ..., 1/d} and considering x ∈ Δ.

Lemma 2. The set Δ↓ of all discrete probability distributions of dimension d that are in descending order is the image { T x  s.t.  x ∈ Δ }, where T is an upper triangular matrix generated from the vector v = {1, 1/2, ..., 1/d} such that T(i, :) = [ {0}^{i−1}, v(i:) ].

Proof. The proof follows Lemma 1. T x is in the simplex because it is a convex combination of vectors in Δ.
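Lemma 1 is easy to check numerically. The following sketch uses an arbitrary generating vector v and helper names of our choosing:

    import numpy as np

    def make_U(v):
        # Upper triangular U with U(i, :) = [0, ..., 0, v_i, v_{i+1}, ..., v_d].
        d = len(v)
        return np.triu(np.tile(v, (d, 1)))

    rng = np.random.default_rng(0)
    d = 6
    v = rng.uniform(0.5, 2.0, size=d)              # any vector in the positive orthant
    U = make_U(v)

    # Image direction: x in R_+^{d-1} x R maps to a descending vector U x.
    x = np.append(rng.uniform(0.0, 1.0, size=d - 1), rng.normal())
    r = U @ x
    assert np.all(np.diff(r) <= 1e-12)             # r is in descending order

    # Pre-image direction: x = Diag(v)^{-1} Adj-Diff(r) for any descending r.
    r2 = np.sort(rng.normal(size=d))[::-1]
    adj = np.append(r2[:-1] - r2[1:], r2[-1])
    x2 = adj / v
    assert np.allclose(U @ x2, r2)
    assert np.all(x2[:-1] >= -1e-12)               # the first d-1 coordinates are non-negative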
With appropriate choices of the distance-like function D_i(·,·) and the curve fitting function f(·,·) we can transform (1) into a bi-convex¹ optimization problem over a product of convex sets. We choose D_i(·,·) to be a Bregman divergence D_φ, defined in Section 3.1, and f(A_i, w) to be (∇φ)^{−1}(A_i w), leading to the formulation:

min_{w ∈ W, r_i ∈ R↓_i}  Σ_{i=1}^{|Q|}  (1/|V_i|) D_φ( r_i ‖ (∇φ)^{−1}(A_i w) ).   (2)

¹A biconvex function is a function of two arguments such that, with any one of its arguments fixed, the function is convex in the other argument.

Coordinate-wise updates of (2) are equivalent to learning canonical GLMs under linear constraints, for which scalable techniques are known [9]. The LETOR task has additional structure that allows more efficient solutions.

3 Background

We make heavy use of identities and algorithms associated with Bregman divergences, some of which, to the best of our knowledge, are new, e.g. Theorem 2, Lemmata 3 and 4, and an independent proof of Theorem 1. Theorem 2 and Lemmata 3 and 4 are particularly relevant to ranking. The purpose of this section is to collect these results in a single place.

3.1 Definitions

Bregman Divergence: Let φ : Θ → R, Θ = dom φ ⊆ R^d, be a strictly convex, closed function, differentiable on int Θ. The corresponding Bregman divergence D_φ : dom(φ) × int(dom(φ)) → R_+ is defined as

D_φ( x ‖ y ) ≜ φ(x) − φ(y) − ⟨ x − y, ∇φ(y) ⟩.

From strict convexity it follows that D_φ( x ‖ y ) ≥ 0 and D_φ( x ‖ y ) = 0 iff x = y. Bregman divergences are (strictly) convex in their first argument, but not necessarily convex in their second.

In this paper we only consider functions of the form φ(·) : x ↦ Σ_i w_i ϕ(x_i) that are weighted sums of identical scalar convex functions applied to each component. We refer to this class as weighted, identically separable (WIS), or simply IS if the weights are equal. This class has properties particularly suited to ranking. The Mahalanobis distance with diagonal W, the weighted KL divergence wKL( x ‖ y ) and the weighted and shifted generalized I-divergence wGI( x ‖ y ) are in this family (Table 1).

    φ(x)                                                      D_φ( x ‖ y )
    ½ ||x||²_W                                                ½ ||x − y||²_W
    Σ_i w_i x_i log x_i,   x ∈ Δ                              wKL( x ‖ y ) = Σ_i w_i x_i log( x_i / y_i )
    Σ_i w_i ( (x_i − 1) log(x_i − 1) − x_i ),  x ∈ 1 + R_+^d  wGI( x ‖ y ) = Σ_i w_i [ (x_i − 1) log( (x_i − 1)/(y_i − 1) ) − x_i + y_i ]

Table 1: Examples of WIS Bregman divergences.

Bregman Projection: Given a closed convex set S, the Bregman projection of q on S is Proj_φ( q, S ) ≜ Argmin_{p ∈ S} D_φ( p ‖ q ). A function φ(·) has modulus of strong convexity s if

φ( λx + (1 − λ)y ) ≤ λφ(x) + (1 − λ)φ(y) − (s/2) λ(1 − λ) ||x − y||².

For a twice differentiable φ(x) this means that the eigenvalues of its Hessian are lower bounded by s. The Legendre conjugate φ*(·) of the function φ(·) is defined as φ*(x) ≜ sup_θ ( ⟨θ, x⟩ − φ(θ) ). If φ(·) is a convex function of Legendre type [21], as will always be the case in this paper, (φ*)*(·) = φ(·) and ∇φ*( ∇φ(x) ) = x, i.e. ∇φ* = (∇φ)^{−1} is a one to one mapping. The Fenchel-Young inequality (3) is fundamental to convex analysis and plays an important role in our analysis:

φ*(y) + φ(x) − ⟨ y, x ⟩ ≥ 0.   (3)
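For concreteness, the divergences of Table 1 can be written out as follows (a sketch; the weights and the unit shift in wGI follow the table, and validity of the inputs, e.g. positivity, is assumed rather than checked):

    import numpy as np

    def sq_mahalanobis(x, y, w):
        # 0.5 * ||x - y||_W^2 with W = Diag(w).
        return 0.5 * np.sum(w * (x - y) ** 2)

    def wKL(x, y, w):
        # Weighted KL divergence: sum_i w_i x_i log(x_i / y_i), for x, y on the simplex.
        return np.sum(w * x * np.log(x / y))

    def wGI(x, y, w):
        # Weighted, shifted generalized I-divergence on 1 + R_+^d:
        # sum_i w_i [ (x_i - 1) log((x_i - 1)/(y_i - 1)) - x_i + y_i ].
        return np.sum(w * ((x - 1) * np.log((x - 1) / (y - 1)) - x + y))

    # Each divergence is non-negative and vanishes iff its arguments coincide.
    w = np.ones(3)
    x = np.array([0.5, 0.3, 0.2])
    assert wKL(x, x, w) < 1e-12 and wKL(x, np.array([0.4, 0.4, 0.2]), w) > 0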

3.2 Properties

The convexity of (2) in r and in w (separately) can be proven by verifying the following identity (by evaluating its LHS):

D_{φ*}( ∇φ(y) ‖ ∇φ(x) ) = D_φ( x ‖ y ).   (4)

3.2.1 Universality of Minimizers

A mean-variance like decomposition (described in the appendix, Theorem 3) holds for all Bregman divergences. It plays a critical role in Theorem 1, which has significant impact in facilitating the solution of the LETOR problem.

Theorem 1. For R↓ ⊆ R^d, the set of all vectors with components in descending order, the minimizer y* = Argmin_{y ∈ R↓} D_φ( x ‖ y ) is independent of φ(·) if φ(·) is WIS.

Proof: sketched in the appendix. Following our independent proof of Theorem 1, we have since come across an older proof [20], in the context of maximum likelihood estimators of exponential family models under conic constraints, developed prior to the popularity of Bregman divergences. Whereas the older proof uses Moreau's cone decomposition [21], ours uses Theorem 3 (in the appendix) and yields a much shorter proof.

Corollary 1. If dom φ*(·) = R^d, where φ*(·) is the Legendre conjugate of the WIS convex function φ(·), then

Argmin_{y ∈ R↓ ∩ dom φ} D_φ( y ‖ (∇φ)^{−1}(x) ) = (∇φ)^{−1}( ŷ ),   where   ŷ = Argmin_{y ∈ R↓} ||x − y||².

Corollary 1 implies that, for the choices of convex function φ(·) indicated, the minimization over r_i ∈ R↓ ∩ dom φ can be obtained by transforming the equivalent squared loss minimizer by (∇φ)^{−1}(·). The squared loss minimization is not only simpler, but its implementation can now be shared across all the different φ(·)'s for which Corollary 1 applies. This class of convex functions is the same as the essentially smooth class [21]. Three such functions are listed in Table 1.
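The practical content of Corollary 1 is a two-step recipe: project in the squared loss sense, then map through (∇φ)^{−1}. The sketch below assumes the corollary's conditions hold; pav_descending is a plain pool-of-adjacent-violators routine for the descending order, and the exponential map plays the role of (∇φ)^{−1} for the generalized I-divergence:

    import numpy as np

    def pav_descending(x, w=None):
        # Weighted least squares projection of x onto the descending cone:
        # pool adjacent violators, scanning left to right.
        x = np.asarray(x, dtype=float)
        w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
        vals, wts, sizes = [], [], []
        for xi, wi in zip(x, w):
            vals.append(xi); wts.append(wi); sizes.append(1)
            # A violation of descending order: the new block exceeds the previous one.
            while len(vals) > 1 and vals[-2] < vals[-1]:
                v2, w2, s2 = vals.pop(), wts.pop(), sizes.pop()
                vals[-1] = (wts[-1] * vals[-1] + w2 * v2) / (wts[-1] + w2)
                wts[-1] += w2; sizes[-1] += s2
        return np.repeat(vals, sizes)

    def bregman_project_descending(x, grad_phi_inv, w=None):
        # Corollary 1: project in the squared loss sense, then map through (grad phi)^{-1}.
        return grad_phi_inv(pav_descending(x, w))

    # Example with the generalized I-divergence, where (grad phi)^{-1} is exp.
    x = np.array([0.2, 1.1, 0.4, 0.3, 0.9])
    r = bregman_project_descending(x, np.exp)
    assert np.all(np.diff(r) <= 1e-12)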

3.2.2 Optimality of Sorting

For any sorted vector x, finding the permutation of y that minimizes D_φ( x ‖ y ) shows up as a subproblem in our formulation that needs to be solved in an inner loop. Solving it efficiently is therefore critical, and this is yet another instance where Bregman divergences are very useful. For an arbitrary divergence function, the search over the optimal permutation is a non-linear assignment problem that can be solved only by exhaustive enumeration. For an arbitrary separable divergence the optimal permutation may be found by solving a linear assignment problem, which is an integer linear program and hence also expensive to solve (especially in an inner loop, as required in our algorithm). On the other hand, if φ(·) is IS, the solution is remarkably simple, as shown in Lemma 3, where (x_1; x_2) denotes a vector in R² with components x_1 and x_2.

Lemma 3. If x_1 ≥ x_2 and y_1 ≥ y_2 and φ(·) is IS, then

D_φ( (x_1; x_2) ‖ (y_2; y_1) ) ≥ D_φ( (x_1; x_2) ‖ (y_1; y_2) )   and   D_φ( (y_2; y_1) ‖ (x_1; x_2) ) ≥ D_φ( (y_1; y_2) ‖ (x_1; x_2) ).

Proof.

D_φ( (x_1; x_2) ‖ (y_2; y_1) ) − D_φ( (x_1; x_2) ‖ (y_1; y_2) ) = ( ϕ′(y_1) − ϕ′(y_2) ) ( x_1 − x_2 ) ≥ 0,

since there exists c ≥ 0 such that ϕ′(y_1) − ϕ′(y_2) = c (y_1 − y_2); this follows from the monotonicity of ∇φ, ensured by the convexity of φ. We can exchange the order of the arguments using (4). Using induction over d for y ∈ R^d, the optimal permutation is obtained by sorting.

Not only is Lemma 3 extremely helpful in generating descent updates, it has fundamental consequences in relation to the local and global optima of our formulation (see Lemma 4).
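Lemma 3 and the induction argument can be sanity-checked by brute force on a small instance (the exhaustive enumeration below is only for the check; it is exactly what the lemma lets the algorithm avoid):

    import itertools
    import numpy as np

    def gen_i_div(x, y):
        # A separable (IS) Bregman divergence: the generalized I-divergence.
        return np.sum(x * np.log(x / y) - x + y)

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0.5, 2.0, size=5))[::-1]     # x sorted in descending order
    y = rng.uniform(0.5, 2.0, size=5)

    best = min(itertools.permutations(y), key=lambda p: gen_i_div(x, np.array(p)))
    assert np.allclose(best, np.sort(y)[::-1])           # the minimizer is y sorted descending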

3.2.3 Joint Convexity and Global Minimum

Although this clarifies the issue of separate convexity in w and r_i, the conditions under which joint convexity is obtained are not obvious. Joint convexity, if ensured, guarantees a global minimum even for coordinate-wise minimization, because our constraint set is a product of convex sets. We resolve this important question in Theorem 2.

Theorem 2. The gap in the Fenchel-Young inequality, φ*(y) + φ(x) − ⟨x, y⟩, for any twice differentiable strictly convex φ(·) with a differentiable conjugate φ*(·), is jointly convex in (x, y) if and only if φ(x) = c ||x||² for some c > 0.

Proof: sketched in the appendix.

3.3 Algorithms

We now discuss Bregman's algorithm, associated with Bregman divergences. The original motivation for introducing Bregman divergences [5] was to generalize alternating orthogonal projection. A significant advantage of the algorithm is its scalability and suitability for parallelization. It solves the following (Bregman projection) problem:

min_x D_φ( x ‖ y )   s.t.   A x ≥ b.   (5)

Bregman's algorithm:
  Initialize: λ⁰ ∈ R_+^m and z⁰ such that ∇φ(z⁰) = A′λ⁰ + ∇φ(y).
  Repeat till convergence:
    Update: apply the Sequential or the Parallel update.
    Solve: ∇φ(z^{t+1}) = A′λ^{t+1} + ∇φ(y).

Sequential Bregman update:
  Select i; let H_i = { z | ⟨a_i, z⟩ ≥ b_i }.
  If H_i is violated: compute Proj_φ( z^t, H_i ), i.e. find c_i^t such that ∇φ( Proj_φ( z^t, H_i ) ) = ∇φ(z^t) + c_i^t a_i.
  Update: λ^{t+1} = λ^t + c_i^t 1_i.

Parallel Bregman update:
  For all i in parallel: compute Proj_φ( z^t, H_i ) and c_i^t.
  Update: λ_i^{t+1} = λ_i^t + c_i^t.
  Synchronize: z^{t+1} = (∇φ)^{−1}( Σ_i α_i ∇φ( z_i^{t+1} ) ), a convex combination taken in the gradient space.
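For concreteness, here is a sketch of the sequential update in the squared Euclidean case, where each half-space projection is available in closed form (Hildreth's variant of the scheme above, written for constraints Az ≥ b; it is an illustration, not the general bookkeeping for an arbitrary φ):

    import numpy as np

    def bregman_sequential_sq(y, A, b, n_sweeps=500):
        # min 0.5 * ||z - y||^2  s.t.  A z >= b, by cyclic corrections on the
        # half-spaces H_i = {z : <a_i, z> >= b_i}, keeping the duals non-negative.
        z = np.asarray(y, dtype=float).copy()
        lam = np.zeros(len(b))                    # one dual variable per half-space
        for _ in range(n_sweeps):
            for i, (a_i, b_i) in enumerate(zip(A, b)):
                step = (b_i - a_i @ z) / (a_i @ a_i)
                step = max(step, -lam[i])         # never drive lambda_i below zero
                z += step * a_i
                lam[i] += step
        return z

    # Project a score vector onto the descending cone written as the pairwise
    # constraints r_j - r_{j+1} >= 0.
    y = np.array([0.1, 0.7, 0.5, 0.2])
    d = len(y)
    A = np.array([np.eye(d)[j] - np.eye(d)[j + 1] for j in range(d - 1)])
    r = bregman_sequential_sq(y, A, np.zeros(d - 1))
    assert np.all(np.diff(r) <= 1e-6)             # r is (numerically) descending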

4 LETOR with Monotone Retargeting

Using Legendre duality one recognizes that equation (2) quantifies the gap in the Fenchel-Young inequality (3):

D_φ( r_i ‖ (∇φ)^{−1}(A_i w) ) = φ*( A_i w ) + φ( r_i ) − ⟨ r_i, A_i w ⟩.

Our cost function is an instantiation of (2) with a WIS Bregman divergence. In addition, we include regularization and a query specific offset. Note that the cost function (2) is not invariant to scale. For example, the squared Euclidean distance, the KL divergence and the generalized I-divergence are homogeneous functions of degree 2, 1 and 1 respectively. Thus the cost can be reduced just by scaling its arguments down, without actually learning the task. To remedy this, we restrict the r_i's from shrinking below a pre-prescribed size. This is accomplished by constraining the r_i's to lie in an appropriate closed convex set separated from the origin, for example a unit simplex or a shifted positive orthant. This yields:

min_{θ_i, w, r_i ∈ R↓_i ∩ S_i}  Σ_{i=1}^{|Q|}  (1/|V_i|) D_φ( r_i ‖ (∇φ)^{−1}( A_i w + θ_i 1 ) ) + (C/2) ||w||²,   (6)

or equivalently

min_{θ_i, w, r_i ∈ R↓_i ∩ S_i}  Σ_{i=1}^{|Q|}  (1/|V_i|) D_{φ*}( A_i w + θ_i 1 ‖ ∇φ( r_i ) ) + (C/2) ||w||²,   (7)

where the S_i are bounded sets excluding 0, chosen to suit the divergence, and C is the regularization parameter. In non-transductive settings the query specific offsets θ_i will not be available for the test queries. This causes no difficulty because θ_i does not affect the relative ranks over the documents. We update the r_i's and {w, {θ_i}} alternately; note that each update is a Bregman projection. If S_i = dom φ and dom φ* = R^d, the optimization over r_i reduces to an order constrained least squares problem (Corollary 1). Examples of such matched pairs are (i) wKL(· ‖ ·) and Δ_i, and (ii) the shifted wGI(· ‖ ·) and 1 + R_+^d. A well studied, scalable algorithm for the order constrained least squares problem is the pool of adjacent violators (PAV) algorithm [3]. One can verify that PAV, like Bregman's algorithm (5), is a dual feasible method. One may also use Lemma 1 to solve it as a non-negative least squares problem, for which several scalable algorithms exist [15].
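To make the alternating scheme concrete, the following is a minimal end-to-end sketch for the squared Euclidean case on synthetic data. It is an illustration under simplifications of our own choosing, not the authors' implementation: scikit-learn's isotonic regression stands in for PAV, a plain ridge solve stands in for the general GLM fit, the per-query offsets θ_i and the 1/|V_i| weights are dropped, and the renormalization step is a crude surrogate for the exact projection onto R↓_i ∩ S_i.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def project_descending(s):
        # Order constrained least squares: fit s by a non-increasing sequence (PAV).
        return IsotonicRegression(increasing=False).fit_transform(np.arange(len(s)), s)

    def renormalize(t):
        # Keep the retargeted scores in the unit simplex, away from the origin.
        t = t - t.min() + 1.0 / len(t)
        return t / t.sum()

    def mr_train_squared(A_list, r_hat_list, C=1.0, n_iters=50):
        # Items of each query are assumed indexed so that r_hat_i is descending.
        n_features = A_list[0].shape[1]
        targets = [renormalize(np.asarray(r, dtype=float)) for r in r_hat_list]
        w = np.zeros(n_features)
        for _ in range(n_iters):
            # Refit w: ridge regression against the stacked retargeted scores.
            X, y = np.vstack(A_list), np.concatenate(targets)
            w = np.linalg.solve(X.T @ X + C * np.eye(n_features), X.T @ y)
            # Retarget each query: project the predictions onto the descending cone.
            targets = [renormalize(project_descending(A_i @ w)) for A_i in A_list]
        return w

    # Tiny synthetic example: two queries with three-dimensional features.
    rng = np.random.default_rng(0)
    A_list = [rng.normal(size=(5, 3)), rng.normal(size=(4, 3))]
    r_hat_list = [np.array([2.0, 2.0, 1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0, 0.0])]
    w = mr_train_squared(A_list, r_hat_list)
    print(A_list[0] @ w)     # predicted scores; only their ordering matters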

To be able to use Bregman's algorithm it is essential that R↓_i be available as an intersection of linear constraints; this is readily obtained for any prescribed total order, as shown:

R↓_i = { r_{i,j} − r_{i,j+1} ≥ 0 }_{j ∈ J_i},   Δ↓_i = R↓_i ∩ { Σ_j r_{ij} = 1 } ∩ { r_{i,d_i} ≥ 0 }.   (8)

Partial orders are discussed in Section 4.1. The advantages of the Bregman updates (Section 3.3) are that they are easy to implement (more so when Proj_φ(·, ·) is available in closed form, e.g. for the squared Euclidean distance), have minimal memory requirements, and hence scale readily and allow an easy switch from a sequential to a parallel update. The parallel Bregman updates applied to (2), (8) clearly expose massive amounts of fine grained parallelism at the level of individual inequalities in R↓_i or Δ↓_i, and are well suited for implementation on a GPGPU [18]. We note further that the optimization over r_i is independent for each query, and thus can be embarrassingly parallelized.

For optimizing over w one may use several techniques available for parallelizing a sum of convex functions, for example parallelizing the gradient computation across the terms, or a more specialized technique such as the alternating direction method of multipliers [4]. Further, {w, {θ_i}} can be solved for jointly simply by augmenting the feature matrix A_i with 1. We hope the reader will appreciate this flexibility of being able to exploit parallelism at the level of granularity of choice.

4.1 Partial Order

Recall that a partial order is induced if the number of unique rank scores k_i in r̂_i is less than d_i. In this case our convention of indexing V_i in descending order is ambiguous. To resolve this, we break ties arbitrarily. Consider a subset of V_i whose elements have the same training rank-score. We distinguish between two modeling choices: (a) the items in that subset are not really equivalent, but the training set used a resolution that could not make fine distinctions between the items²; we call this the hidden order case; and (b) the items in the subset are indeed equivalent and the targets are constrained to reflect the same block structure; we call this case block equivalent, and it can be modeled appropriately. Although we have removed the discussion of the latter in the interest of space, it too can be modeled efficiently by MR.

²Or that we only care to reduce the error of predicting r_{ij} > r_{ik} when r̂_{ij} < r̂_{ik}; note the strict inequality.

4.1.1 Partially Hidden Order

In this model we assume that the items are totally ordered, though the finer ordering between similar items is not visible to the ranking algorithm. Let P_i = {P_{ik}}_{k=1}^{k_i} be a partition of the index set of V_i such that all items in P_{ik} have the same training rank-score. We denote their sizes by d_{ik} = |P_{ik}|. The sets V_i effectively get partitioned further into {P_{ik}}_{k=1}^{k_i} by the k_i < d_i unique scores given to their members. Though such a score specifies an order between items from any two different sets P_{ij} and P_{il}, the order within any set P_{ik} remains unknown. This is very common in practice and is usually an artifact of the high cost of acquiring training data in a totally ordered form. The optimization problem may be solved using either an inner or an outer representation of the constraint sets.

Outer representation: Recall that Bregman's algorithm (Section 3.3) is better suited for the outer representation (8). Denote the set of rank-score vectors having the same partially ordered structure as r̂_i by R↓_i. For a partial order we may describe R↓_i by linear inequalities as follows:

{ r_{im} > r_{in} }_{j=1}^{k_i − 1},   i ∈ [1, |Q|],  m ∈ P_{ij},  n ∈ P_{i,j+1},

with each j generating d_{ij} · d_{i,j+1} inequalities, which is very high. The proliferation of inequalities may be reduced by introducing auxiliary variables {r̄_{i,l}}_{l=1}^{k_i − 1} and the following inequalities:

{ r̄_{i,j+1} > r_{i, P_{ij}} > r̄_{i,j} }_{i ∈ [1, |Q|]}.   (9)

However, since Bregman's algorithms are essentially coordinate-wise ascent methods, their convergence may be slow unless fine grained parallelism can be exploited. For commodity hardware, an alternative to these exterior point methods are interior point methods that use an inner representation of the convex constraint set.

Inner representation: For our experiments we use the updates in Figure 1. In particular, we use the method of proximal gradients for (10), where the proximal term is a Bregman divergence defined by a convex function whose domain is the required constraint set [13], [22], [16]. This automatically enforces the required constraints. To handle the partial order we introduce a block-diagonally restricted permutation matrix P_i that can permute indices within each P_{ij} independently. Since the items in P_{ij} are not equivalent, they are available for re-ordering as long as that minimizes the cost (6). Block weighted IS Bregman divergences have the special property that sorting minimizes the divergence over all such permutations (Lemma 3). Thus update (11) can be accomplished by sorting.

x_i^{t+1} = Argmin_x D_φ( T x ‖ (∇φ)^{−1}( P_i^t{}′ ( A_i w^t + θ_i^t 1 ) ) )   (10)

P_i^{t+1} = Argmin_{P_i} D_φ( P_i T x_i^{t+1} ‖ (∇φ)^{−1}( A_i w^t + θ_i^t 1 ) )   (11)

( w^{t+1}, {θ_i^{t+1}} ) = Argmin_{w, {θ_i}} Σ_{i=1}^{|Q|} D_φ( P_i^{t+1} T x_i^{t+1} ‖ (∇φ)^{−1}( A_i w + θ_i 1 ) ) + (C/2) ||w||²   (12)

Figure 1: Algorithm for the partially hidden order case.

The updates (10), (11) and (12) each reduce the lower bounded cost (6), and therefore the algorithm described in Figure 1 converges. However, the vital question of whether the updates converge to a stationary point remains.

Convergence to a stationary point: If repeated application of (10) and (11) (sorting) for a fixed w^{t+1}, {θ_i^{t+1}} achieves the minimum, then convergence to a stationary point is guaranteed. Thus we explore the question of whether (10) and (11) together achieve a local minimum. The tri-factored form r_i = P_i U x_i is a cause for concern. Somewhat reassuring is the fact that the range of P_i U x_i is R↓_i, which again is a convex cone, and that the tri-factored representation of any point in that cone is unique. This, however, is not sufficient to ensure that a minimum is achieved by (10) and (11), because though the constraint set is convex, the cost function (6) is not convex in the tri-factored parameterization. Worse still, the parameterization is discontinuous because of the discrete nature of P. While one may address the discreteness problem via a real relaxation of P to doubly stochastic matrices, the local minima attained in such a case will be in the interior of the Birkhoff polytope and not at the vertices that the sorting in (11) would have obtained. Therefore such a convex relaxation cannot answer the question of whether (10) and sorting achieve the local minimum. Thus it is surprising that sorting followed by the x_i updates does achieve a local minimum of (6) on the cone R↓_i, as a consequence of the following lemma.

Lemma 4. Let the vectors (y_1; y_2) and (x_1; x_2) be conformally partitioned, and let

( ŷ_1; ŷ_2 ) = Argmin_{y_i ∈ π(y_i)} D_φ( (y_1; y_2) ‖ (x_1; x_2) ),

where π(y_i) is the set of all permutations of the vector y_i. If the Bregman divergence D_φ is conformally separable, then ŷ_i is isotonic with x_i for i = 1, 2.

Proof. The proof is by contradiction. Assume ŷ_i is a minimizer that is not isotonic with x_i; then one may permute ŷ_i to match the order of x_i and obtain a reduced cost, yielding a contradiction.

The utility of Lemma 4 is that, in spite of the caveats mentioned, it correctly identifies the internal ordering of the components of the left hand side that achieves the minimum for a fixed w^{t+1}, {θ_i^{t+1}}, given a fixed right hand side. With the knowledge of the order obtained, one may then compute the actual values with relative ease via (10).
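Update (11) amounts to sorting the current predictions within each block of tied training scores. A small sketch of that reordering step alone (the helper name and the toy data are ours):

    import numpy as np

    def blockwise_sort_permutation(r_hat, scores):
        # Items with equal training rank-score r_hat form a block; within each block
        # the items may be reordered freely (Section 4.1.1).  By Lemma 3, the
        # cost-minimizing reordering sorts the predicted scores within the block in
        # descending order.  Returns the permuted item indices.
        r_hat = np.asarray(r_hat)
        scores = np.asarray(scores)
        order = []
        for v in np.sort(np.unique(r_hat))[::-1]:           # blocks by decreasing relevance
            block = np.flatnonzero(r_hat == v)
            order.extend(block[np.argsort(-scores[block])]) # sort the block by prediction
        return np.array(order)

    r_hat  = np.array([2, 2, 1, 1, 1, 0])         # graded relevance with ties
    scores = np.array([0.3, 0.9, 0.2, 0.5, 0.1, 0.4])
    perm = blockwise_sort_permutation(r_hat, scores)
    # Within each tied block the predictions are now descending:
    # [1, 0] for the 2's, [3, 2, 4] for the 1's, [5] for the 0.
    assert perm.tolist() == [1, 0, 3, 2, 4, 5]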

5 Experiments

We evaluated the ranking performance of the proposed monotone retargeting approach on the benchmark LETOR 4.0 datasets (MQ2007, MQ2008) [23] as well as on the OHSUMED dataset [12]. Each of these datasets is pre-partitioned into five-fold validation sets for easy comparison across algorithms. For OHSUMED we used the QueryLevelNorm partition. Each dataset contains a set of queries, where each document is assigned a relevance score from irrelevant (r = 0) to relevant (r = 2). All algorithms were trained using a regularized linear ranking function, with a regularization parameter chosen from the set C ∈ {10^{−50}, 10^{−20}, 10^{−10}, 10^{−5}, 10^{0}, 10^{1}}. The best model was identified as the model with the highest mean average precision (MAP) on the validation set. All presented results are average performance on the test set.

As the baseline, we implemented the NDCG consistent re-normalization approach of [19] (using the NDCG_m normalization) for the squared loss and the I-divergence (generalized KL divergence). ListNet [7] was implemented as the KL divergence baseline, since that normalization has no effect on the KL divergence. MR was implemented using the partially hidden order monotone retargeting approach (Section 4.1). We compared the performance of MR (Normalized MR) to the MR method with the 1/|V_i| normalization removed (Unnormalized MR). The algorithms were implemented in Python and executed on a 2.4 GHz quad-core Intel Xeon processor without paying particular attention to writing optimized code; ample room for improvement remains. The squared loss was the fastest with respect to average execution time per iteration at 0.58 seconds, whereas KL took 1.01 seconds per iteration and I-div 1.14 seconds per iteration. We found that although MQ2007 is more than 4 times larger than MQ2008, MQ2007 only required about twice the execution time on average, highlighting the scalability of MR. On average SQ, KL and I-div took 99, 90 and 65 iterations.

Table 2 compares the algorithms in terms of expected reciprocal rank (ERR) [10], mean average precision (MAP) and NDCG. The unnormalized KL divergence cost function led to the best performance across datasets. The most significant gains over the baseline were for the I-divergence cost function. Monotone retargeting showed consistent performance gains over the baseline across metrics (NDCG, ERR, Precision), suggesting the effectiveness of MR for improving the overall ranking performance. Figure 2 shows a subset of performance comparisons using the NDCG@N and Precision@N metrics. Our experiments show a significant improvement in performance across the range of datasets and cost functions. Across datasets, the difference between the baseline and our results was most significant with the I-divergence (generalized KL divergence) cost function.

                         MQ2007                     MQ2008                      OHSUMED
                   I-div    SQ      KL         I-div    SQ       KL        I-div    SQ      KL
  ERR
  Unnormalized MR  0.3698  0.3703  0.3737     0.4137  0.41559  0.4238     0.5657  0.5410  0.5410
  Normalized MR    0.3702  0.3601  0.3731     0.4144  0.41392  0.4085     0.5796  0.5093  0.5093
  Baseline         0.1953  0.3639  0.3643     0.2724  0.40978  0.4132     0.2255  0.5450  0.5467
  MAP
  Unnormalized MR  0.5379  0.5361  0.5398     0.6439  0.6532   0.6571     0.4537  0.4417  0.4531
  Normalized MR    0.5358  0.5282  0.5399     0.6449  0.6549   0.6461     0.4463  0.4394  0.4506
  Baseline         0.3611  0.5330  0.5380     0.4513  0.6428   0.6530     0.3421  0.4465  0.4524
  NDCG
  Unnormalized MR  0.6961  0.7398  0.6978     0.7339  0.7398   0.7451     0.7000  0.6878  0.6997
  Normalized MR    0.6954  0.6953  0.6981     0.7346  0.7396   0.7330     0.6935  0.6798  0.6916
  Baseline         0.5512  0.6927  0.6952     0.5892  0.7344   0.7399     0.5805  0.6892  0.6947

Table 2: Test ERR, MAP and NDCG on the different datasets. The best results are in bold.

[Figure 2: MR performance vs. the NDCG consistent baseline, measured using NDCG@N and Precision@N. Datasets: MQ2007 (left), MQ2008 (middle), OHSUMED (right). Each panel compares the I-div, KL and squared loss baselines with the proposed normalized and unnormalized MR variants.]

There are two things worth taking special note of: (i) although the baseline algorithms were proposed specifically for improving NDCG performance, MR improves the ranking accuracy further, even in terms of NDCG; (ii) MR seems to be consistently peaking early. This property is particularly desirable and is encoded specifically in cost functions such as NDCG and ERR. In our initial formulation we used a WIS Bregman divergence so that the weights could be tuned to obtain this early peaking behavior; however, that proved unnecessary because even the unweighted model produced satisfactory performance. The effect of query length normalization was, however, inconsistent: some of our results were insensitive to it, whereas others were adversely affected. We conjecture that this is an artifact of using the same amount of regularization as in the unnormalized case.

6 Conclusion and Related Work

One technique that shares our motivation of learning to rank is ordinal regression [14], which optimizes parameters


of a regression function as well as thresholds. Unfortunately we do not have space for a full literature survey and only mention a few key differences. The log-likelihoods of the classic ordinal regression methods are sums of logarithms of differences of monotonic functions and are much more cumbersome to optimize over. To our knowledge they do not share the strong guarantees that MR with Bregman divergences enjoys. The prevalent technique there seems to require fixing a finite number of thresholds up-front, an arbitrary choice that MR does not make; however, it is not clear whether that is a restriction of ordinal regression or merely prevalent practice.

In this paper we introduced a family of new cost functions for ranking. The cost function takes into account all possible monotonic transforms of the target scores, and we show how such a cost function can be optimized efficiently. Because the sole objective of learning to rank is to output good permutations on unseen data, it is desirable that the cost function be a function of such permutations. Though several permutation dependent cost functions have been proposed, they are extremely difficult to optimize over, and one has to resort to surrogates and/or cut other corners. We show that with monotone retargeting with Bregman divergences such contortions are unnecessary. In addition, the proposed cost functions and algorithms have very favorable statistical, optimization theoretic, as well as empirically observed properties. Other advantages include extensive parallelizability due to simple simultaneous projection updates that optimize a cost function that is convex not only in each of the arguments separately but also jointly, given a proper choice of the cost function from the family.

Appendix: Proof Sketches

Theorem 1. Proof. Let the components of y* take k unique values. Partition the set indexing the components into Π = {Π_i}_{i=1}^{k} such that y*_j = c_i ∀ j ∈ Π_i, i ∈ [1, k]. Let the scalar mean of x on Π_i be μ_i. By (14), Σ_{j ∈ Π_i} D_φ( x_j ‖ y*_j ) = Σ_{j ∈ Π_i} D_φ( x_j ‖ μ_i ) + D_φ( μ_i ‖ c_i ). First we prove by contradiction that y*_j = μ_i ∀ j ∈ Π_i: otherwise there is a y′ ∈ R↓^d with y′_l = y*_l ∀ l ∉ Π_i and c_{i+1} ≤ y′_j = c′ ≤ c_{i−1} ∀ j ∈ Π_i such that D_φ( μ_i ‖ c′ ) < D_φ( μ_i ‖ c_i ); thus D_φ( x ‖ y′ ) < D_φ( x ‖ y* ), clearly a contradiction. Let Argmin_{y ∈ R↓} D_φ( x ‖ y ) = z for φ(·) = ||·||², and let z induce the partition P = {P_l}_{l=1}^{m}. If Π = P always, then y*_j = z_j, completing the proof. Shift the indexing of the partitions to the first j where Π_j and P_j differ. With the new index, assume WLOG³ that Π_1 ⊂ P_1 and P_1 ⊆ Π_1 ∪ Π_2. Define Π′_2 = Π_2 ∩ P_1 and Π″_2 = Π_2 \ Π′_2 ≠ ∅, else P_1 can be refined into Π_1, Π′_2, obtaining a lower cost (by Corollary 3). Let the means of Π′_2, Π″_2 be ȳ′_2, ȳ″_2. By definition y_2 is their convex combination. Now ȳ′_2 ≥ ȳ″_2, else one can reduce the cost by refining Π_2 into Π′_2, Π″_2; therefore y_2 ≤ ȳ′_2. Note y_1 ≥ ȳ′_2, or else we can refine P_1 into Π_1, Π′_2. Thus we have y_1 ≥ ȳ′_2 ≥ y_2, which is in contradiction with y_1 < y_2.

³We encourage the reader to draw a picture for clarity.

Theorem 2. Proof. For succinctness we use the abbreviations x(λ) = λx_1 + (1−λ)x_2, y(λ) = λy_1 + (1−λ)y_2, φ_i = φ(x_i), ψ_i = ψ(y_i), φ(λ) = λφ_1 + (1−λ)φ_2 and ψ(λ) = λψ_1 + (1−λ)ψ_2, where ψ = φ*. Joint convexity is equivalent to

φ( x(λ) ) + ψ( y(λ) ) − ⟨ x(λ), y(λ) ⟩ ≤ φ(λ) + ψ(λ) − λ⟨ x_1, y_1 ⟩ − (1−λ)⟨ x_2, y_2 ⟩   ∀ x_1, x_2 ∈ dom φ,  y_1, y_2 ∈ dom ψ.

Writing B ≜ ⟨ x_1 − x_2, y_1 − y_2 ⟩, we thus have to show

φ( x(λ) ) + ψ( y(λ) ) ≤ φ(λ) + ψ(λ) + λ(1−λ) B   (13)

for all arguments in the domain. Assume with no loss of generality that φ(·) and ψ(·) are strongly convex with moduli of strong convexity (1 + s1) and (1 − s2), with s1 > −1, s2 < 1, respectively. The reciprocal of the modulus of strong convexity of a convex function is the Lipschitz constant of the gradient of its Legendre dual [21]; therefore the eigenvalues of the Hessian of φ(·) are bounded below by (1 + s1) and above by 1/(1 − s2). Simplifying expression (13) using the strong convexity assumptions and the positivity of λ(1 − λ), we obtain that we have to show (1 + s1)||x_1 − x_2||² + (1 − s2)||y_1 − y_2||² − 2B ≥ 0, or ||(x_1 − x_2) − (y_1 − y_2)||² + s1 ||x_1 − x_2||² − s2 ||y_1 − y_2||² ≥ 0. Let p = x_1 − x_2 and q = y_1 − y_2. By choosing (1 + s1)p = q we obtain that s1 ≥ s2 + s1·s2 is necessary, or equivalently (1 + s1) ≥ 1/(1 − s2). Thus the lower and upper bounds on the eigenvalues of the Hessian of φ(·) must coincide.

Appendix: Optimality of Means

Theorem 3 ([2]). Let ν be a distribution over x ∈ dom φ and let μ = E_ν[x]. Then the expected divergence about any s is

E_ν[ D_φ( x ‖ s ) ] = E_ν[ D_φ( x ‖ μ ) ] + D_φ( μ ‖ s ).   (14)

From the non-negativity of Bregman divergences it follows that

Corollary 2 ([2]). E_ν[x] = Argmin_{y ∈ dom φ} E_ν[ D_φ( x ‖ y ) ].

Corollary 3. If the random variable x takes values in X = X_1 ∪ X_2 with X_1 ∩ X_2 = ∅, then

min_μ E_{x|X}[ D_φ( x ‖ μ ) ] ≥ Pr(X_1) min_{μ_1} E_{x|X_1}[ D_φ( x ‖ μ_1 ) ] + Pr(X_2) min_{μ_2} E_{x|X_2}[ D_φ( x ‖ μ_2 ) ].

Acknowledgements: The authors acknowledge support from NSF grant IIS 1016614 and thank Cheng H. Lee for suggesting improvements over our initial submission.

References

[1] Gükhan H. Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan. Predicting Structured Data (Neural Information Processing). The MIT Press, 2007.
[2] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705-1749, 2005.
[3] Michael J. Best and Nilotpal Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming, 47:425-439, 1990.
[4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[5] L. M. Bregman. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967.
[6] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 129-136, New York, NY, USA, 2007. ACM.
[7] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 129-136, New York, NY, USA, 2007. ACM.
[8] Y. Censor and A. Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321-353, 1981.
[9] Yair Censor. Row-action methods for huge and sparse systems and their applications. SIAM Review, 23:444-466, 1981.
[10] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 621-630, New York, NY, USA, 2009. ACM.
[11] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.
[12] William Hersh, Chris Buckley, T. J. Leone, and David Hickam. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, pages 192-201, New York, NY, USA, 1994. Springer-Verlag New York, Inc.
[13] Alfredo N. Iusem. Steepest descent methods with generalized distances for constrained optimization. Acta Applicandae Mathematicae, 46:225-246, 1997.
[14] Valen E. Johnson and James H. Albert. Ordinal Data Modeling. Statistics for Social Science and Public Policy. 1999.
[15] Dongmin Kim, Suvrit Sra, and Inderjit S. Dhillon. Fast projection-based methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, 1(1):38-51, 2008.
[16] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1995.
[17] Yanyan Lan, Tie-Yan Liu, Zhiming Ma, and Hang Li. Generalization analysis of listwise learning-to-rank algorithms. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 577-584, New York, NY, USA, 2009. ACM.
[18] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40-53, March 2008.
[19] Pradeep Ravikumar, Ambuj Tewari, and Eunho Yang. On NDCG consistency of listwise ranking methods. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, AISTATS, 2011.
[20] R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140-147, 1972.
[21] R. T. Rockafellar. Convex Analysis (Princeton Landmarks in Mathematics and Physics). Princeton University Press, December 1996.
[22] Y. Censor and S. Zenios. The proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73:451-464, 1992.
[23] Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, and Hang Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007.
