A Short Introduction to Learning to Rank
Hang Li (李航)
IEICE Trans. Inf. & Syst., Vol. E94-D, No. 10, October 2011
INVITED PAPER: Special Section on Information-Based Induction Sciences and Machine Learning
that a perfect ranking π_i^*'s NDCG score at position k is 1. In a perfect ranking, the documents with higher grades are always ranked higher. Note that there can be multiple perfect rankings for a query and its associated documents.

The gain function is normally defined as an exponential function of grade. That is to say, the satisfaction of accessing information exponentially increases when the grade of relevance of information increases:

G(j) = 2^{y_{i,j}} − 1,   (1)

where y_{i,j} is the label (grade) of document d_{i,j} in ranking list π_i. The discount function is normally defined as a logarithmic function of position. That is to say, the satisfaction of accessing information logarithmically decreases when the position of access increases:

D(π_i(j)) = 1 / log_2(1 + π_i(j)).   (2)

Here π_i(j) is the position of document d_{i,j} in ranking list π_i. Hence, DCG and NDCG at position k become

DCG(k) = Σ_{j: π_i(j) ≤ k} (2^{y_{i,j}} − 1) / log_2(1 + π_i(j)),   (3)

NDCG(k) = G_{max,i}^{−1}(k) Σ_{j: π_i(j) ≤ k} (2^{y_{i,j}} − 1) / log_2(1 + π_i(j)).   (4)

In evaluation, DCG and NDCG values are further averaged over queries.

Table 1 gives examples of calculating the NDCG values of two ranking lists. NDCG (DCG) has the effect of giving high scores to the ranking lists in which relevant documents are ranked high. For perfect rankings, the NDCG value at each position is always one, while for imperfect rankings, the NDCG values are usually less than one.
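To make Eqs. (1)-(4) concrete, the following is a minimal sketch (ours, not from the paper) that computes DCG and NDCG at position k for a list of grades given in ranked order; the function names are illustrative.

    import math

    def dcg_at_k(grades, k):
        # DCG(k): sum of (2^grade - 1) / log2(1 + position) over the top k positions.
        return sum((2 ** g - 1) / math.log2(1 + pos)
                   for pos, g in enumerate(grades[:k], start=1))

    def ndcg_at_k(grades, k):
        # NDCG(k): DCG(k) normalized by the DCG of a perfect (grade-sorted) ranking.
        ideal = dcg_at_k(sorted(grades, reverse=True), k)
        return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

    # A perfect ranking scores 1.0; swapping the top two documents lowers NDCG.
    print(ndcg_at_k([3, 2, 2, 1, 1, 1, 1], k=5))  # 1.0
    print(ndcg_at_k([2, 3, 2, 1, 1, 1, 1], k=5))  # about 0.87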
MAP is another measure widely used in IR. In MAP, it is assumed that the grades of relevance are at two levels: 1 and 0. Given query q_i, associated documents D_i, ranking list π_i on D_i, and labels y_i of D_i, Average Precision for q_i is defined as

AP = ( Σ_{j=1}^{n_i} P(j) · y_{i,j} ) / ( Σ_{j=1}^{n_i} y_{i,j} ),

where P(j) denotes the precision at position j, that is, the fraction of relevant documents among the top j documents in π_i. MAP is then the average of AP over queries.
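As a quick illustration of the AP formula (a sketch under the binary-relevance assumption above; names are ours), note that P(j) only contributes at positions where y_{i,j} = 1:

    def average_precision(labels):
        # labels: 1/0 relevance of documents in ranked order.
        # AP = sum_j P(j) * y_j / sum_j y_j, where P(j) is precision at position j.
        hits, total = 0, 0.0
        for j, y in enumerate(labels, start=1):
            if y == 1:
                hits += 1
                total += hits / j  # P(j), counted only at relevant positions
        return total / hits if hits else 0.0

    print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 = 0.8333...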
1.4 Relation with Ordinal Classification

Ordinal classification (also known as ordinal regression) is similar to ranking, but is also different. The input of ordinal classification is a feature vector x and the output is a label y representing a grade, where the grades are classes in a total order. The goal of learning is to construct a model which can assign a grade label y to a given feature vector x. The model mainly consists of a scoring function f(x). The model first assigns a real number to x using f(x) and then determines the grade y of x using a number of thresholds. Specifically, it partitions the real number axis into intervals and aligns each interval to a grade. It takes the grade of the interval that f(x) falls into as the grade of x.

In ranking, one cares more about accurate ordering of objects, while in ordinal classification, one cares more about accurate ordered categorization of objects. A typical example of ordinal classification is product rating. For example, given the features of a movie, we are to assign a number of stars (a rating) to the movie. In that case, correct assignment of the number of stars is critical. In contrast, in ranking such as document retrieval, given a query, the objective is to correctly sort the related documents, although sometimes training data and testing data are labeled at multiple grades as in ordinal classification. The number of documents to be ranked can vary from query to query. There are queries for which more relevant documents are available in the collection, and there are also queries for which only weakly relevant documents are available.
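The thresholding step can be made concrete with a small sketch (ours, not the paper's notation): sorted thresholds partition the real axis into intervals, and the predicted grade is the index of the interval that f(x) falls into.

    import bisect

    def grade_of(score, thresholds):
        # Map a score f(x) to a grade in {1, ..., len(thresholds) + 1}:
        # grade k corresponds to the k-th interval of the partitioned axis.
        return bisect.bisect_left(thresholds, score) + 1

    thresholds = [0.5, 1.5, 2.5]       # three thresholds define four grades
    print(grade_of(-0.2, thresholds))  # 1
    print(grade_of(1.7, thresholds))   # 3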
2. Formulation

We formalize learning to rank as a supervised learning task. Suppose that X is the input space (feature space) consisting of lists of feature vectors, and Y is the output space consisting of lists of grades. Further suppose that x is an element of X representing a list of feature vectors and y is an element of Y representing a list of grades. Let P(X, Y) be an unknown joint probability distribution where random variable X takes x as its value and random variable Y takes y as its value.

Assume that F(·) is a function mapping from a list of feature vectors x to a list of scores. The goal of the learning task is to automatically learn a function F̂(x) given training data (x_1, y_1), (x_2, y_2), ..., (x_m, y_m). Each training instance is comprised of feature vectors x_i and the corresponding grades y_i (i = 1, ..., m). Here m denotes the number of training instances.

F(x) and y can be further written as F(x) = (f(x_1), f(x_2), ..., f(x_n)) and y = (y_1, y_2, ..., y_n). The feature vectors represent the objects to be ranked. Here f(x) denotes the local ranking function and n denotes the number of feature vectors and grades.

A loss function L(·, ·) is utilized to evaluate the prediction result of F(·). First, the feature vectors x are ranked according to F(x); then the top n results of the ranking are evaluated using their corresponding grades y. If the feature vectors with higher grades are ranked higher, the loss will be small. Otherwise, the loss will be large. The loss function is specifically represented as L(F(x), y). Note that the loss function for ranking is slightly different from the loss functions in other statistical learning tasks, in the sense that it makes use of sorting.

We further define the risk function R(·) as the expected loss with respect to the joint distribution P(X, Y),

R(F) = ∫_{X×Y} L(F(x), y) dP(x, y).

Given training data, we calculate the empirical risk function as follows,

R̂(F) = (1/m) Σ_{i=1}^{m} L(F(x_i), y_i).

The learning task then becomes the minimization of the empirical risk function, as in other learning tasks. The minimization of the empirical risk function could be difficult due to the nature of the loss function (it is not continuous and it uses sorting). We can consider using a surrogate loss function L′(F(x), y). The corresponding empirical risk function is defined as follows.

R̂′(F) = (1/m) Σ_{i=1}^{m} L′(F(x_i), y_i).

We can also introduce a regularizer and conduct minimization of the regularized empirical risk. In such cases, the learning problem becomes minimization of the (regularized) empirical risk function based on the surrogate loss.

Note that we adopt a machine learning formulation here. In IR, the feature vectors x are derived from a query and its associated documents. The grades y represent the relevance degrees of the documents with respect to the query. We make use of a global ranking function F(·). In practice, it can be a local ranking function f(·). The possible number of feature vectors in x can be very large, even infinite. The evaluation (loss function) is, however, only concerned with n results.

In IR, the true loss functions can be those defined based on NDCG (Normalized Discounted Cumulative Gain) and MAP (Mean Average Precision). For example, we can have

L(F(x), y) = 1.0 − NDCG.

Note that the true loss functions (NDCG loss and MAP loss) make use of sorting based on F(x).

For the surrogate loss function, there are also different ways to define it, which lead to different approaches to learning to rank. For example, one can define pointwise, pairwise, and listwise loss functions.

The squared loss used in Subset Regression is a pointwise surrogate loss [6]. We call it a pointwise loss because it is defined on single objects:

L′(F(x), y) = Σ_{i=1}^{n} (f(x_i) − y_i)².

It is actually an upper bound of 1.0 − NDCG.

Pairwise losses can be the hinge loss, exponential loss, and logistic loss on pairs of objects, which are used in Ranking SVM [7], RankBoost [8], and RankNet [9], respectively. They are also upper bounds of 1.0 − NDCG [10]:

L′(F(x), y) = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} φ(sign(y_i − y_j), f(x_i) − f(x_j)),

where it is assumed that L′ = 0 when y_i = y_j, and φ is the hinge, exponential, or logistic loss function.

Listwise loss functions are defined on lists of objects, just like the true loss functions, and thus are more directly related to them. Different listwise loss functions are exploited in the listwise methods. For example, the loss function in AdaRank is a listwise loss:

L′(F(x), y) = exp(−NDCG),

where NDCG is calculated on the basis of F(x) and y. Obviously, it is also an upper bound of 1.0 − NDCG.
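The three kinds of surrogate loss can be written down directly. Below is an illustrative sketch (ours): the pairwise φ defaults to the hinge loss, and the listwise loss is AdaRank's exp(−NDCG); exponential or logistic φ would be used the same way.

    import math

    def pointwise_loss(scores, grades):
        # Squared loss on single objects, as in Subset Regression.
        return sum((f - y) ** 2 for f, y in zip(scores, grades))

    def pairwise_loss(scores, grades, phi=lambda z: max(0.0, 1.0 - z)):
        # phi evaluated on sign(y_i - y_j) * (f(x_i) - f(x_j)); zero when y_i == y_j.
        total = 0.0
        for i in range(len(scores) - 1):
            for j in range(i + 1, len(scores)):
                if grades[i] != grades[j]:
                    s = 1.0 if grades[i] > grades[j] else -1.0
                    total += phi(s * (scores[i] - scores[j]))
        return total

    def adarank_loss(scores, grades, k=10):
        # Listwise loss exp(-NDCG), with NDCG computed on the ranking induced by the scores.
        ranked = [g for _, g in sorted(zip(scores, grades), key=lambda t: -t[0])]
        dcg = lambda gs: sum((2 ** g - 1) / math.log2(1 + p)
                             for p, g in enumerate(gs[:k], start=1))
        ideal = dcg(sorted(grades, reverse=True))
        return math.exp(-(dcg(ranked) / ideal if ideal > 0 else 0.0))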
3. Pointwise Approach

In the pointwise approach, the ranking problem (ranking creation) is transformed to classification, regression, or ordinal classification, and existing methods for classification, regression, or ordinal classification are applied. Therefore, the group structure of ranking is ignored in this approach.

The pointwise approach includes Subset Ranking [6], McRank [11], Prank [12], and OC SVM [13]. We take the last one as an example and describe it in detail.

3.1 SVM for Ordinal Classification

The method proposed by Shashua & Levin [13] utilizes a number of parallel hyperplanes as a ranking model. Their method, referred to as OC SVM in this article, learns the parallel hyperplanes by the large margin principle. In one implementation, the method tries to maximize a fixed margin for all the adjacent classes (grades)†.

Suppose that X ⊆ ℜ^d and Y = {1, 2, ..., l} where there

† The other method maximizes the sum of all margins.
Grade: 3, 2, 1
Documents are represented by their grades
Example 1:
ranking-1: 2 3 2 1 1 1 1
ranking-2: 3 2 1 2 1 1 1
Example 2:
ranking for query-1: 3 2 2 1 1 1 1
ranking for query-2: 3 3 2 2 2 1 1 1 1 1
Fig. 6 Example ranking lists.
negative instances in learning, because they are redundant.

Training data is given as {((x_i^{(1)}, x_i^{(2)}), y_i)}, i = 1, ..., m, where each instance consists of two feature vectors (x_i^{(1)}, x_i^{(2)}) and a label y_i ∈ {+1, −1} denoting which feature vector should be ranked ahead.

The learning of Ranking SVM is formalized as the following QP problem:

min_{w,ξ}  (1/2) ||w||² + C Σ_{i=1}^{m} ξ_i
s.t.  y_i ⟨w, x_i^{(1)} − x_i^{(2)}⟩ ≥ 1 − ξ_i,
      ξ_i ≥ 0,  i = 1, ..., m,

where x_i^{(1)} and x_i^{(2)} denote the first and second feature vectors in a pair of feature vectors, ||·|| denotes the L2 norm, m denotes the number of training instances, and C > 0 is a coefficient.

It is equivalent to the following unconstrained optimization problem, i.e., the minimization of the regularized hinge loss function:

min_w  Σ_{i=1}^{m} [1 − y_i ⟨w, x_i^{(1)} − x_i^{(2)}⟩]_+ + λ ||w||²,   (5)

where [x]_+ denotes the function max(x, 0) and λ = 1/(2C).
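Problem (5) is convex and can be minimized with plain subgradient descent. The sketch below is an illustration (not the solver used in [7]); it takes precomputed pair differences x^{(1)} − x^{(2)} and their ±1 labels.

    import numpy as np

    def ranking_svm(diffs, labels, lam=0.01, lr=0.1, epochs=200):
        # Minimize sum_i [1 - y_i <w, d_i>]_+ + lam * ||w||^2, where d_i = x1_i - x2_i.
        m, dim = diffs.shape
        w = np.zeros(dim)
        for _ in range(epochs):
            margins = labels * (diffs @ w)
            active = margins < 1.0  # pairs whose hinge loss is nonzero
            grad = -(labels[active, None] * diffs[active]).sum(axis=0) + 2 * lam * w
            w -= lr * grad / m      # scaled subgradient step
        return w

    # Toy usage: two pairs in which the first document should rank ahead.
    diffs = np.array([[1.0, -0.5], [0.8, 0.2]])
    labels = np.array([1.0, 1.0])
    print(ranking_svm(diffs, labels))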
4.2 IR SVM

IR SVM, proposed by Cao et al. [15], is an extension of Ranking SVM for Information Retrieval (IR), whose idea can be applied to other applications as well.

Ranking SVM transforms ranking into pairwise classification, and thus it actually makes use of the 0-1 loss in the learning process. There exists a gap between this loss function and the IR evaluation measures. IR SVM attempts to bridge the gap by modifying the 0-1 loss, that is, by conducting cost-sensitive learning of Ranking SVM.

We first look at the problems caused by a straightforward application of Ranking SVM to document retrieval, using the examples in Fig. 6.

One problem with the direct application of Ranking SVM is that it treats document pairs across different grades equally. Example 1 indicates the problem. There are two rankings for the same query. The documents at positions 1 and 2 are swapped in ranking-1 from the perfect ranking, while the documents at positions 3 and 4 are swapped in ranking-2 from the perfect ranking. There is only one error for each ranking in terms of the 0-1 loss, or difference in the order of pairs. They have the same effect on the training of Ranking SVM, which is not desirable. Ranking-2 should be better than ranking-1 from the viewpoint of IR, because the result at its top is better. Note that high accuracy on top-ranked documents is crucial for an IR system, which is reflected in the IR evaluation measures.

Another issue with Ranking SVM is that it treats document pairs from different queries equally. In Example 2, there are two queries and the numbers of documents associated with them are different. For query-1 there are 2 document pairs between grades 3-2, 4 document pairs between grades 3-1, and 8 document pairs between grades 2-1, in total 14 document pairs. For query-2, there are 31 document pairs (6 between grades 3-2, 10 between grades 3-1, and 15 between grades 2-1). Ranking SVM takes 14 instances (document pairs) from query-1 and 31 instances from query-2 for training. Thus, the impact of query-2 on the ranking model will be larger than that of query-1. In other words, the model learned will be biased toward query-2. This is in contrast to the fact that in IR evaluation all queries are equally important. Note that the number of documents usually varies from query to query.

IR SVM addresses the above two problems by changing the 0-1 pairwise classification into a cost-sensitive pairwise classification. It does so by modifying the hinge loss function of Ranking SVM. Specifically, it sets different losses for document pairs across different grades and from different queries. To emphasize the importance of correct ranking at the top, the loss function heavily penalizes errors related to the top. To increase the influence of queries with fewer documents, the loss function heavily penalizes errors from such queries.

Fig. 7 Modified hinge loss functions.

Figure 7 plots the shapes of different hinge loss functions with different penalty parameters. The x-axis represents y f(x) and the y-axis represents the loss. When y f(x_i^{(1)} − x_i^{(2)}) ≥ 1, the losses are zero. When y f(x_i^{(1)} − x_i^{(2)}) < 1, the losses are represented by linearly decreasing functions with different slopes. If the slope equals −1, then the function is the normal hinge loss function. IR SVM modifies the hinge loss function, specifically the slopes, for different grade pairs and different queries. It assigns higher weights to document pairs across important grade pairs and assigns normalization weights to document pairs according to their queries.
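A sketch of the resulting cost-sensitive pair loss (ours): tau weights a pair by how important its grade pair is, and mu normalizes by the number of pairs in the pair's query. The specific weight values below are illustrative, not the ones used in [15].

    # Illustrative grade-pair weights: errors between high grades cost more.
    TAU = {(3, 2): 1.0, (3, 1): 0.7, (2, 1): 0.3}

    def ir_svm_pair_loss(margin, grade_pair, pairs_in_query):
        # margin = y * <w, x1 - x2>; dividing by pairs_in_query makes each
        # query contribute roughly equally, regardless of how many pairs it has.
        mu = 1.0 / pairs_in_query
        return TAU[grade_pair] * mu * max(0.0, 1.0 - margin)

    # Example 2 of Fig. 6: the same grade 3-2 error counts more in query-1
    # (14 pairs) than in query-2 (31 pairs).
    print(ir_svm_pair_loss(0.5, (3, 2), 14))  # about 0.036
    print(ir_svm_pair_loss(0.5, (3, 2), 31))  # about 0.016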
The learning of IR SVM is equivalent to an optimization problem of the same form as (5): the minimization of a regularized hinge loss in which each document pair's loss is scaled by its grade-pair weight and its query normalization weight.
5. Listwise Approach

The listwise approach addresses the ranking problem in a more straightforward way. Specifically, it takes ranking lists as instances in both learning and prediction. The group structure of ranking is maintained, and ranking evaluation measures can be more directly incorporated into the loss functions in learning.

The listwise approach includes ListNet [18], ListMLE [19], AdaRank [20], SVM MAP [21], and SoftRank [22]. SVM MAP and related methods are explained in this article.

5.1 SVM MAP

The algorithm SVM MAP, developed by Yue et al. [21], is designed to directly optimize MAP [2], but it can be easily extended to optimize NDCG. Xu et al. [23] further generalize it to a group of algorithms.

In ranking, for query q_i the ranking model f(x_{ij}) assigns a score to each associated document d_{ij} or feature vector x_{ij}, where x_{ij} is the feature vector derived from q_i and d_{ij}. The documents d_i (feature vectors x_i) are then sorted based on their scores, and a permutation denoted as π_i is obtained. For simplicity, suppose that the ranking model f(x_{ij}) is a linear model:

f(x_{ij}) = ⟨w, x_{ij}⟩,   (6)

where w denotes a weight vector.

Suppose that labels for the feature vectors x_i are also given as y_i. We consider using a scoring function S(x_i, π_i) to measure the goodness of a ranking π_i. S(x_i, π_i) is defined as

S(x_i, π_i) = ⟨w, σ(x_i, π_i)⟩,

where σ(x_i, π_i) is a feature mapping determined by x_i and π_i. In ranking, the permutation with the highest score is selected:

π̃_i = arg max_{π_i ∈ Π_i} S(x_i, π_i),   (7)

where Π_i denotes the set of all possible permutations for x_i.

It can be easily shown that the ranking π̃_i selected by Eq. (7) is equivalent to the ranking created by the ranking model f(x_{ij}) (when both of them are linear functions). Figure 8 gives an example. It is easy to verify that both f(x) and S(x_i, π) will output ABC as the most preferable ranking (permutation).
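The equivalence can be checked by brute force: enumerate all permutations, score each with S, and compare the argmax against simply sorting by f. In this sketch (ours), σ(x, π) is instantiated as a position-discounted sum of the feature vectors, which is one convenient choice rather than the paper's exact definition.

    from itertools import permutations
    import numpy as np

    w = np.array([1.0, -0.5])
    docs = {"A": np.array([2.0, 0.0]),   # f(A) = 2.0
            "B": np.array([1.0, 0.0]),   # f(B) = 1.0
            "C": np.array([0.5, 1.0])}   # f(C) = 0.0

    def S(order):
        # S(x, pi) = <w, sigma(x, pi)>, with sigma a position-discounted
        # sum of feature vectors: earlier positions count more.
        sigma = sum(docs[d] / np.log2(1 + pos)
                    for pos, d in enumerate(order, start=1))
        return float(w @ sigma)

    best_by_S = max(permutations(docs), key=S)
    best_by_f = sorted(docs, key=lambda d: -float(w @ docs[d]))
    print("".join(best_by_S), "".join(best_by_f))  # ABC ABC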
In learning, we would ideally create a ranking model that can maximize the accuracy in terms of a listwise evaluation measure on the training data, or equivalently, minimize the loss function defined below:

L(f) = Σ_{i=1}^{m} (E(π_i^*, y_i) − E(π_i, y_i)),   (8)

where π_i is the permutation on feature vector x_i given by ranking model f and y_i is the corresponding list of grades. E(π_i, y_i) denotes the evaluation result of π_i in terms of an evaluation measure (e.g., NDCG). Usually E(π_i^*, y_i) = 1.

We view the problem of learning a ranking model as the optimization problem in which the following loss function is minimized:

Σ_{i=1}^{m} max_{π_i^* ∈ Π_i^*; π_i ∈ Π_i \ Π_i^*} (E(π_i^*, y_i) − E(π_i, y_i)) · [[S(x_i, π_i^*) ≤ S(x_i, π_i)]],   (9)

where [[c]] is one if condition c is satisfied and zero otherwise. π_i^* ∈ Π_i^* ⊆ Π_i denotes any of the perfect permutations for q_i.

The loss function measures the loss incurred when the ranking list most preferred by the ranking model is not a perfect ranking list. One can prove that the true loss function, such as that in (8), is upper-bounded by the new loss function in (9).
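Loss (9) can also be evaluated by brute force for a small list, which makes the roles of the indicator and the max explicit. In this sketch (ours), E is NDCG and S reuses the illustrative position-discounted scoring above; note that every permutation whose grade sequence matches the sorted grades counts as perfect, reflecting that perfect rankings need not be unique.

    from itertools import permutations
    import math

    def ndcg(grades_in_rank):
        dcg = lambda gs: sum((2 ** g - 1) / math.log2(1 + p)
                             for p, g in enumerate(gs, start=1))
        return dcg(grades_in_rank) / dcg(sorted(grades_in_rank, reverse=True))

    def listwise_loss_9(scores, grades):
        # max over (perfect, non-perfect) permutation pairs of the E-difference,
        # counted only when S scores the non-perfect permutation at least as high.
        S = lambda perm: sum(scores[d] / math.log2(1 + p)
                             for p, d in enumerate(perm, start=1))
        perms = list(permutations(range(len(scores))))
        perfect = [p for p in perms if ndcg([grades[d] for d in p]) == 1.0]
        others = [p for p in perms if p not in perfect]
        return max((1.0 - ndcg([grades[d] for d in p])) * (S(ps) <= S(p))
                   for ps in perfect for p in others)

    print(listwise_loss_9([0.1, 0.9, 0.5], [2, 1, 0]))  # > 0: the scores misorder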