
IEICE TRANS. INF. & SYST., VOL.E94–D, NO.10, OCTOBER 2011

INVITED PAPER Special Section on Information-Based Induction Sciences and Machine Learning

A Short Introduction to Learning to Rank

Hang LI†a), Nonmember

SUMMARY Learning to rank refers to machine learning techniques for training the model in a ranking task. Learning to rank is useful for many applications in Information Retrieval, Natural Language Processing, and Data Mining. Intensive studies have been conducted on the problem and significant progress has been made [1], [2]. This short paper gives an introduction to learning to rank, and it specifically explains the fundamental problems, existing approaches, and future work of learning to rank. Several learning to rank methods using SVM techniques are described in detail.
key words: learning to rank, information retrieval, natural language processing, SVM

Manuscript received December 31, 2010. Manuscript revised April 15, 2011.
† The author is with Microsoft Research Asia, No.5 Dan Ling St., Haidian, Beijing, 100080, China.
a) E-mail: [email protected]
DOI: 10.1587/transinf.E94.D.1854
Copyright © 2011 The Institute of Electronics, Information and Communication Engineers

1. Ranking Problem

Learning to rank can be employed in a wide variety of applications in Information Retrieval (IR), Natural Language Processing (NLP), and Data Mining (DM). Typical applications are document retrieval, expert search, definition search, collaborative filtering, question answering, keyphrase extraction, document summarization, and machine translation [2]. Without loss of generality, we take document retrieval as an example in this article.

Document retrieval is a task as follows (Fig. 1). The system maintains a collection of documents. Given a query, the system retrieves documents containing the query words from the collection, ranks the documents, and returns the top ranked documents. The ranking task is performed by using a ranking model f(q, d) to sort the documents, where q denotes a query and d denotes a document.

Fig. 1  Document retrieval.

Traditionally, the ranking model f(q, d) is created without training. In the BM25 model, for example, it is assumed that f(q, d) is represented by a conditional probability distribution P(r|q, d), where r takes on 1 or 0 as a value and denotes being relevant or irrelevant, and q and d denote a query and a document respectively. In the Language Model for IR (LMIR), f(q, d) is represented as a conditional probability distribution P(q|d). The probability models can be calculated with the words appearing in the query and document, and thus no training is needed (only tuning of a small number of parameters is necessary) [3].

A new trend has recently arisen in document retrieval, particularly in web search, that is, to employ machine learning techniques to automatically construct the ranking model f(q, d). This is motivated by a number of facts. In web search, there are many signals which can represent relevance, for example, the anchor texts and PageRank score of a web page. Incorporating such information into the ranking model and automatically constructing the ranking model using machine learning techniques becomes a natural choice. In web search engines, a large amount of search log data, such as click-through data, is accumulated. This makes it possible to derive training data from search log data and automatically create the ranking model. In fact, learning to rank has become one of the key technologies for modern web search.

We describe a number of issues in learning for ranking, including training and testing, data labeling, feature construction, evaluation, and relations with ordinal classification.

1.1 Training and Testing

Learning to rank is a supervised learning task and thus has training and testing phases (see Fig. 2).

Fig. 2  Learning to rank for document retrieval.

The training data consists of queries and documents.

Each query is associated with a number of documents. The relevance of the documents with respect to the query is also given. The relevance information can be represented in several ways. Here, we take the most widely used approach and assume that the relevance of a document with respect to a query is represented by a label, while the labels denote several grades (levels). The higher grade a document has, the more relevant the document is.

Suppose that Q is the query set and D is the document set. Suppose that Y = {1, 2, · · · , l} is the label set, where labels represent grades. There exists a total order between the grades l ≻ l−1 ≻ · · · ≻ 1, where ≻ denotes the order relation. Further suppose that {q1, q2, · · · , qm} is the set of queries for training and qi is the i-th query. Di = {di,1, di,2, · · · , di,ni} is the set of documents associated with query qi and yi = {yi,1, yi,2, · · · , yi,ni} is the set of labels associated with query qi, where ni denotes the sizes of Di and yi; di,j denotes the j-th document in Di; and yi,j ∈ Y denotes the j-th grade label in yi, representing the relevance degree of di,j with respect to qi. The original training set is denoted as S = {(qi, Di), yi}, i = 1, · · · , m.

A feature vector xi,j = φ(qi, di,j) is created from each query-document pair (qi, di,j), i = 1, 2, · · · , m; j = 1, 2, · · · , ni, where φ denotes the feature functions. That is to say, features are defined as functions of a query-document pair. For example, BM25 and PageRank are typical features [2]. Letting xi = {xi,1, xi,2, · · · , xi,ni}, we represent the training data set as S′ = {(xi, yi)}, i = 1, · · · , m. Here x ∈ X and X ⊆ ℜd.
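To make the notation concrete, the following minimal sketch (ours, not from the paper; Python is used only for illustration) shows one way the grouped training set S′ could be laid out in code. The queries, feature values, and grade labels are invented.

# Hypothetical layout of the training set S' = {(x_i, y_i)}: one group per query,
# each group holding the feature vectors of its documents and their grade labels.
training_set = [
    {   # group for query q_1 (n_1 = 3 documents)
        "query": "q1",
        "X": [[0.8, 0.3], [0.5, 0.9], [0.1, 0.2]],   # x_{1,j} = phi(q_1, d_{1,j})
        "y": [3, 2, 1],                               # grade labels from Y = {1, 2, 3}
    },
    {   # group for query q_2 (n_2 = 2 documents)
        "query": "q2",
        "X": [[0.4, 0.7], [0.6, 0.1]],
        "y": [2, 1],
    },
]

A ranking model f(x) scores each feature vector in a group; sorting a group by these scores produces the permutation used at test time.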
We aim to train a (local) ranking model f(q, d) = f(x) that can assign a score to a given query-document pair q and d, or equivalently to a given feature vector x. More generally, we can also consider training a global ranking model F(q, D) = F(x). The local ranking model outputs a single score, while the global ranking model outputs a list of scores.

Let the documents in Di be identified by the integers {1, 2, · · · , ni}. We define a permutation (ranking list) πi on Di as a bijection from {1, 2, · · · , ni} to itself. We use Πi to denote the set of all possible permutations on Di, and use πi(j) to denote the rank (or position) of the j-th document (i.e., di,j) in permutation πi. Ranking is nothing but to select a permutation πi ∈ Πi for the given query qi and the associated documents Di using the scores given by the ranking model f(qi, di).

The test data consists of a new query qm+1 and associated documents Dm+1, T = {(qm+1, Dm+1)}. We create feature vectors xm+1, use the trained ranking model to assign scores to the documents Dm+1, sort them based on the scores, and give the ranking list of documents as output πm+1.

The training and testing data is similar to, but different from, the data in conventional supervised learning such as classification and regression. A query and its associated documents form a group. The groups are i.i.d. data, while the instances within a group are not i.i.d. data. A local ranking model is a function of a query and a document, or equivalently, a function of a feature vector derived from a query and a document.

1.2 Data Labeling

Currently there are two ways to create training data. The first one is by human judgments and the second one is by derivation from search log data. We explain the first approach here. Explanations on the second approach can be found in [2]. In the first approach, a set of queries is randomly selected from the query log of a search system. Suppose that there are multiple search systems. Then the queries are submitted to the search systems and all the top ranked documents are collected. As a result, each query is associated with multiple documents. Human judges are then asked to make relevance judgments on all the query-document pairs. Relevance judgments are usually conducted at five levels, for example, perfect, excellent, good, fair, and bad. Human judges make relevance judgments from the viewpoint of average users. For example, if the query is ‘Microsoft’, and the web page is microsoft.com, then the label is ‘perfect’. Furthermore, the Wikipedia page about Microsoft is ‘excellent’, and so on. Labels representing relevance are then assigned to the query-document pairs. Relevance judgment on a query-document pair can be performed by multiple judges and then majority voting can be conducted. Benchmark data sets on learning to rank have also been released [4].

1.3 Evaluation

The evaluation on the performance of a ranking model is carried out by comparison between the ranking lists output by the model and the ranking lists given as the ground truth. Several evaluation measures are widely used in IR and other fields. These include NDCG (Normalized Discounted Cumulative Gain), DCG (Discounted Cumulative Gain), MAP (Mean Average Precision), and Kendall’s Tau.

Given query qi and associated documents Di, suppose that πi is the ranking list (permutation) on Di and yi is the set of labels (grades) of Di. DCG [5] measures the goodness of the ranking list with the labels. Specifically, DCG at position k is defined as:

DCG(k) = Σ_{j: πi(j) ≤ k} G(j) D(πi(j)),

where G(·) is a gain function, D(·) is a position discount function, and πi(j) is the position of di,j in πi. The summation is taken over the top k positions in the ranking list πi. DCG represents the cumulative gain of accessing the information from position one to position k with discounts on the positions. NDCG is normalized DCG, and NDCG at position k is defined as:

NDCG(k) = G⁻¹_{max,i}(k) Σ_{j: πi(j) ≤ k} G(j) D(πi(j)),

where Gmax,i(k) is the normalizing factor and is chosen such that a perfect ranking π*i’s NDCG score at position k is 1.

In a perfect ranking, the documents with higher grades are always ranked higher. Note that there can be multiple perfect rankings for a query and associated documents.

The gain function is normally defined as an exponential function of grade. That is to say, the satisfaction of accessing information exponentially increases when the grade of relevance of information increases:

G(j) = 2^{yi,j} − 1,   (1)

where yi,j is the label (grade) of document di,j in ranking list πi. The discount function is normally defined as a logarithmic function of position. That is to say, the satisfaction of accessing information logarithmically decreases when the position of access increases:

D(πi(j)) = 1 / log2(1 + πi(j)),   (2)

where πi(j) is the position of document di,j in ranking list πi. Hence, DCG and NDCG at position k become

DCG(k) = Σ_{j: πi(j) ≤ k} (2^{yi,j} − 1) / log2(1 + πi(j)),   (3)

NDCG(k) = G⁻¹_{max,i}(k) Σ_{j: πi(j) ≤ k} (2^{yi,j} − 1) / log2(1 + πi(j)).   (4)

In evaluation, DCG and NDCG values are further averaged over queries.

Table 1  Examples of NDCG calculation (grades: 3, 2, 1).

Perfect ranking
  grades:              (3, 3, 2, 2, 1, 1, 1)
  gains, Eq. (1):      (7, 7, 3, 3, 1, 1, 1)
  discounts, Eq. (2):  (1, 0.63, 0.5, · · ·)
  DCG, Eq. (3):        (7, 11.41, 12.91, · · ·)
  normalizers:         (1/7, 1/11.41, 1/12.91, · · ·)
  NDCG, Eq. (4):       (1, 1, 1, · · ·)

Imperfect ranking
  grades:              (2, 3, 2, 3, 1, 1, 1)
  gains, Eq. (1):      (3, 7, 3, 7, 1, 1, 1)
  discounts, Eq. (2):  (1, 0.63, 0.5, · · ·)
  DCG, Eq. (3):        (3, 7.41, 8.91, · · ·)
  normalizers:         (1/7, 1/11.41, 1/12.91, · · ·)
  NDCG, Eq. (4):       (0.43, 0.65, 0.69, · · ·)

Table 1 gives examples of calculating NDCG values of two ranking lists. NDCG (DCG) has the effect of giving high scores to the ranking lists in which relevant documents are ranked high. For perfect rankings, the NDCG value at each position is always one, while for imperfect rankings, the NDCG values are usually less than one.
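The calculation can be stated compactly in code. The sketch below is our illustration (the function names are not from the paper), assuming the exponential gain of Eq. (1) and the logarithmic discount of Eq. (2); run on the imperfect ranking of Table 1 it reproduces the values 0.43, 0.65, and 0.69.

import math

def dcg_at_k(grades, k):
    # DCG(k) with gain 2^y - 1 and discount 1 / log2(1 + position); grades are
    # listed in the order given by the ranking (position 1 first).
    return sum((2 ** y - 1) / math.log2(1 + pos)
               for pos, y in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k):
    # NDCG(k): DCG of the given ranking divided by the DCG of a perfect ranking.
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

imperfect = [2, 3, 2, 3, 1, 1, 1]   # the imperfect ranking of Table 1
print([round(ndcg_at_k(imperfect, k), 2) for k in (1, 2, 3)])   # [0.43, 0.65, 0.69]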
MAP is another measure widely used in IR. In MAP, it is assumed that the grades of relevance are at two levels: 1 and 0. Given query qi, associated documents Di, ranking list πi on Di, and labels yi of Di, Average Precision for qi is defined as:

AP = ( Σ_{j=1}^{ni} P(j) · yi,j ) / ( Σ_{j=1}^{ni} yi,j ),

where yi,j is the label (grade) of di,j and takes on 1 or 0 as a value, representing being relevant or irrelevant. P(j) for query qi is defined as:

P(j) = ( Σ_{k: πi(k) ≤ πi(j)} yi,k ) / πi(j),

where πi(j) is the position of di,j in πi. P(j) represents the precision until the position of di,j for qi. Note that labels are either 1 or 0, and thus ‘precision’ can be defined. Average Precision represents averaged precision over all the positions of documents with label 1 for query qi.

Average Precision values are further averaged over queries to become Mean Average Precision (MAP).
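The same can be done for MAP. The following sketch is ours (not the paper's code); it takes binary relevance labels in ranked order, computes Average Precision per query, and averages over queries.

def average_precision(labels):
    # labels are 1 (relevant) or 0 (irrelevant), listed in ranked order.
    hits, precisions = 0, []
    for pos, label in enumerate(labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / pos)    # P(j) at the position of this relevant document
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_query_labels):
    # MAP: Average Precision averaged over queries.
    return sum(average_precision(l) for l in per_query_labels) / len(per_query_labels)

print(average_precision([1, 0, 1, 0]))                    # (1/1 + 2/3) / 2 = 0.8333...
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1]]))  # mean of the two per-query AP values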
1.4 Relation with Ordinal Classification

Ordinal classification (also known as ordinal regression) is similar to ranking, but is also different. The input of ordinal classification is a feature vector x and the output is a label y representing a grade, where the grades are classes in a total order. The goal of learning is to construct a model which can assign a grade label y to a given feature vector x. The model mainly consists of a scoring function f(x). The model first assigns a real number to x using f(x) and then determines the grade y of x using a number of thresholds. Specifically, it partitions the real number axis into intervals and aligns each interval to a grade. It takes the grade of the interval that f(x) falls into as the grade of x.

In ranking, one cares more about accurate ordering of objects, while in ordinal classification, one cares more about accurate ordered-categorization of objects. A typical example of ordinal classification is product rating. For example, given the features of a movie, we are to assign a number of stars (ratings) to the movie. In that case, correct assignment of the number of stars is critical. In contrast, in ranking such as document retrieval, given a query, the objective is to correctly sort related documents, although sometimes training data and testing data are labeled at multiple grades as in ordinal classification. The number of documents to be ranked can vary from query to query. There are queries for which more relevant documents are available in the collection, and there are also queries for which only weakly relevant documents are available.

2. Formulation

We formalize learning to rank as a supervised learning task. Suppose that X is the input space (feature space) consisting of lists of feature vectors, and Y is the output space consisting of lists of grades. Further suppose that x is an element of X representing a list of feature vectors and y is an element of Y representing a list of grades. Let P(X, Y) be an unknown joint probability distribution where random variable X takes x as its value and random variable Y takes y as its value.

Assume that F(·) is a function mapping from a list of feature vectors x to a list of scores. The goal of the learning task is to automatically learn a function F̂(x) given training data (x1, y1), (x2, y2), . . . , (xm, ym).

Each training instance is comprised of feature vectors xi and the corresponding grades yi (i = 1, · · · , m). Here m denotes the number of training instances.

F(x) and y can be further written as F(x) = (f(x1), f(x2), · · · , f(xn)) and y = (y1, y2, · · · , yn). The feature vectors represent objects to be ranked. Here f(x) denotes the local ranking function and n denotes the number of feature vectors and grades.

A loss function L(·, ·) is utilized to evaluate the prediction result of F(·). First, feature vectors x are ranked according to F(x), then the top n results of the ranking are evaluated using their corresponding grades y. If the feature vectors with higher grades are ranked higher, then the loss will be small. Otherwise, the loss will be large. The loss function is specifically represented as L(F(x), y). Note that the loss function for ranking is slightly different from the loss functions in other statistical learning tasks, in the sense that it makes use of sorting.

We further define the risk function R(·) as the expected loss function with respect to the joint distribution P(X, Y),

R(F) = ∫_{X×Y} L(F(x), y) dP(x, y).

Given training data, we calculate the empirical risk function as follows,

R̂(F) = (1/m) Σ_{i=1}^{m} L(F(xi), yi).

The learning task then becomes the minimization of the empirical risk function, as in other learning tasks. The minimization of the empirical risk function could be difficult due to the nature of the loss function (it is not continuous and it uses sorting). We can consider using a surrogate loss function L′(F(x), y). The corresponding empirical risk function is defined as follows.

R̂′(F) = (1/m) Σ_{i=1}^{m} L′(F(xi), yi).

We can also introduce a regularizer to conduct minimization of the regularized empirical risk. In such cases, the learning problem becomes minimization of the (regularized) empirical risk function based on the surrogate loss.

Note that we adopt a machine learning formulation here. In IR, the feature vectors x are derived from a query and its associated documents. The grades y represent the relevance degrees of the documents with respect to the query. We make use of a global ranking function F(·). In practice, it can be a local ranking function f(·). The possible number of feature vectors in x can be very large, even infinite. The evaluation (loss function) is, however, only concerned with n results.

In IR, the true loss functions can be those defined based on NDCG (Normalized Discounted Cumulative Gain) and MAP (Mean Average Precision). For example, we can have

L(F(x), y) = 1.0 − NDCG.

Note that the true loss functions (NDCG loss and MAP loss) make use of sorting based on F(x).

For the surrogate loss function, there are also different ways to define it, which lead to different approaches to learning to rank. For example, one can define pointwise loss, pairwise loss, and listwise loss functions.

The squared loss used in Subset Regression is a pointwise surrogate loss [6]. We call it pointwise loss, because it is defined on single objects:

L′(F(x), y) = Σ_{i=1}^{n} (f(xi) − yi)².

It is actually an upper bound of 1.0 − NDCG.

Pairwise losses can be the hinge loss, exponential loss, and logistic loss on pairs of objects, which are used in Ranking SVM [7], RankBoost [8], and RankNet [9], respectively. They are also upper bounds of 1.0 − NDCG [10]:

L′(F(x), y) = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} φ(sign(yi − yj), f(xi) − f(xj)),

where it is assumed that L′ = 0 when yi = yj and φ is the hinge loss, exponential loss, or logistic loss function.

Listwise loss functions are defined on lists of objects, just like the true loss functions, and thus are more directly related to the true loss functions. Different listwise loss functions are exploited in the listwise methods. For example, the loss function in AdaRank is a listwise loss:

L′(F(x), y) = exp(−NDCG),

where NDCG is calculated on the basis of F(x) and y. Obviously, it is also an upper bound of 1.0 − NDCG.
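To make the three families tangible, here is a small sketch (ours, not the paper's code) of a pointwise squared loss, a pairwise hinge loss, and a listwise exp(−NDCG) loss computed from a list of scores F(x) and grades y; ndcg_at_k is assumed to be the function from the earlier NDCG sketch.

import math

def pointwise_squared_loss(scores, grades):
    # Sum of squared differences between individual scores and grades.
    return sum((f - y) ** 2 for f, y in zip(scores, grades))

def pairwise_hinge_loss(scores, grades):
    # Hinge loss over all pairs with different grades; zero when y_i = y_j.
    loss = 0.0
    for i in range(len(scores) - 1):
        for j in range(i + 1, len(scores)):
            if grades[i] == grades[j]:
                continue
            sign = 1.0 if grades[i] > grades[j] else -1.0
            loss += max(0.0, 1.0 - sign * (scores[i] - scores[j]))
    return loss

def listwise_exp_loss(scores, grades, k):
    # AdaRank-style listwise loss: exp(-NDCG) of the ranking induced by the scores.
    ranked_grades = [y for _, y in sorted(zip(scores, grades), key=lambda t: -t[0])]
    return math.exp(-ndcg_at_k(ranked_grades, k))

scores, grades = [2.0, 1.2, 0.3], [3, 1, 2]
print(pointwise_squared_loss(scores, grades), pairwise_hinge_loss(scores, grades))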
3. Pointwise Approach

In the pointwise approach, the ranking problem (ranking creation) is transformed to classification, regression, or ordinal classification, and existing methods for classification, regression, or ordinal classification are applied. Therefore, the group structure of ranking is ignored in this approach.

The pointwise approach includes Subset Ranking [6], McRank [11], Prank [12], and OC SVM [13]. We take the last one as an example and describe it in detail.

3.1 SVM for Ordinal Classification

The method proposed by Shashua & Levin [13] utilizes a number of parallel hyperplanes as a ranking model. Their method, referred to as OC SVM in this article, learns the parallel hyperplanes by the large margin principle. In one implementation, the method tries to maximize a fixed margin for all the adjacent classes (grades)†.

† The other method maximizes the sum of all margins.

Suppose that X ⊆ ℜd and Y = {1, 2, · · · , l}, where there exists a total order on Y.

x ∈ X is an object (feature vector) and y ∈ Y is a label representing a grade. Given object x, we aim to predict its label (grade) y. That is to say, this is an ordinal classification problem. We employ a number of linear models (parallel hyperplanes) ⟨w, x⟩ − br, (r = 1, · · · , l−1) to make the prediction, where w ∈ ℜd is a weight vector and br ∈ ℜ, (r = 1, · · · , l) are biases satisfying b1 ≤ · · · ≤ bl−1 ≤ bl = +∞. The models correspond to parallel hyperplanes ⟨w, x⟩ − br = 0 separating grades r and r+1, (r = 1, · · · , l−1). Figure 3 illustrates the model. If x satisfies ⟨w, x⟩ − br−1 ≥ 0 and ⟨w, x⟩ − br < 0, then y = r, (r = 1, · · · , l). We can write it as min_{r ∈ {1,···,l}} {r | ⟨w, x⟩ − br < 0}.

Fig. 3  SVM for ordinal classification.

Suppose that the training data is given as follows. For each grade r = 1, · · · , l, there are mr instances: xr,i, i = 1, · · · , mr. The learning task is formalized as the following Quadratic Programming (QP) problem.

min_{w,b,ξ}  (1/2)||w||² + C Σ_{r=1}^{l−1} Σ_{i=1}^{mr} (ξr,i + ξ*r+1,i)
s.t.  ⟨w, xr,i⟩ − br ≤ −1 + ξr,i
      ⟨w, xr+1,i⟩ − br ≥ 1 − ξ*r+1,i
      ξr,i ≥ 0,  ξ*r+1,i ≥ 0
      i = 1, · · · , mr,  r = 1, · · · , l − 1
      m = m1 + · · · + ml,

where xr,i denotes the i-th instance in the r-th grade, ξr,i and ξ*r+1,i denote the corresponding slack variables, || · || denotes the L2 norm, m denotes the number of training instances, and C > 0 is a coefficient. The method tries to separate the instances in the neighboring grades with the same margin.
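Prediction with the learned model is just a sequence of threshold tests. The sketch below is ours (the weights and thresholds are made-up numbers, and solving the QP itself is omitted); it applies the rule y = min_{r} {r | ⟨w, x⟩ − b_r < 0}.

def predict_grade(w, x, b):
    # b holds the finite thresholds b_1 <= ... <= b_{l-1}; b_l is treated as +infinity,
    # so an object whose score clears every threshold receives the top grade l.
    score = sum(wi * xi for wi, xi in zip(w, x))
    for r, br in enumerate(b, start=1):
        if score - br < 0:
            return r
    return len(b) + 1

w = [1.0, 0.5]            # illustrative weight vector
b = [0.5, 1.5]            # thresholds for l = 3 grades
print(predict_grade(w, [0.2, 0.2], b))   # low score  -> grade 1
print(predict_grade(w, [2.0, 1.0], b))   # high score -> grade 3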
4. Pairwise Approach

In the pairwise approach, ranking is transformed into pairwise classification or pairwise regression. In the former case, a classifier for classifying the ranking orders of document pairs is created and is employed in the ranking of documents. In the pairwise approach, the group structure of ranking is also ignored.

The pairwise approach includes Ranking SVM [7], RankBoost [8], RankNet [9], GBRank [14], IR SVM [15], LambdaRank [16], and LambdaMART [17]. We introduce Ranking SVM and IR SVM in this article.

4.1 Ranking SVM

We can learn a classifier, such as SVM, for classifying the order of pairs of objects and utilize the classifier in the ranking task. This is the idea behind the Ranking SVM method proposed by Herbrich et al. [7].

Figure 4 shows an example of the ranking problem. Suppose that there are two groups of objects (documents associated with two queries) in the feature space. Further suppose that there are three grades (levels). For example, objects x1, x2, and x3 in the first group are at three different grades. The weight vector w corresponds to the linear function f(x) = ⟨w, x⟩ which can score and rank the objects. Ranking objects with the function is equivalent to projecting the objects onto the vector and sorting the objects according to the projections on the vector. If the ranking function is ‘good’, then there should be an effect that objects at grade 3 are ranked ahead of objects at grade 2, etc. Note that objects belonging to different groups are incomparable.

Fig. 4  Example of ranking problem.

Figure 5 shows that the ranking problem in Fig. 4 can be transformed to Linear SVM classification. The differences between two feature vectors at different grades in the same group are treated as new feature vectors, e.g., x1 − x2, x1 − x3, and x2 − x3. Furthermore, labels are also assigned to the new feature vectors. For example, x1 − x2, x1 − x3, and x2 − x3 are positive. Note that feature vectors at the same grade or feature vectors from different groups are not utilized to create new feature vectors. One can train a Linear SVM classifier which separates the new feature vectors as shown in Fig. 5. Geometrically, the margin in the SVM model represents the closest distance between the projections of object pairs in two grades. Note that the hyperplane of the SVM classifier passes through the origin, and the positive and negative instances form corresponding pairs. For example, x1 − x2 and x2 − x1 are positive and negative instances respectively. The weight vector w of the SVM classifier corresponds to the ranking function. In fact, we can discard the negative instances in learning, because they are redundant.

Fig. 5  Transformation to pairwise classification.

Training data is given as {((xi(1), xi(2)), yi)}, i = 1, · · · , m, where each instance consists of two feature vectors (xi(1), xi(2)) and a label yi ∈ {+1, −1} denoting which feature vector should be ranked ahead.

The learning of Ranking SVM is formalized as the following QP problem.

min_{w,ξ}  (1/2)||w||² + C Σ_{i=1}^{m} ξi
s.t.  yi ⟨w, xi(1) − xi(2)⟩ ≥ 1 − ξi
      ξi ≥ 0
      i = 1, . . . , m,

where xi(1) and xi(2) denote the first and second feature vectors in a pair of feature vectors, || · || denotes the L2 norm, m denotes the number of training instances, and C > 0 is a coefficient.

It is equivalent to the following non-constrained optimization problem, i.e., the minimization of the regularized hinge loss function:

min_w  Σ_{i=1}^{m} [1 − yi ⟨w, xi(1) − xi(2)⟩]+ + λ||w||²,   (5)

where [x]+ denotes the function max(x, 0) and λ = 1/(2C).
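A small self-contained sketch of this procedure is given below (ours, not the authors' implementation): it builds the difference vectors within each query group, keeps only the positive instances, and minimizes the regularized hinge loss of Eq. (5) by plain subgradient descent. The data and hyperparameters are made up.

import numpy as np

def make_pairs(groups):
    # groups: list of (X, y) with X an (n_i, d) array of feature vectors and y the grades.
    # Only pairs from the same query with different grades are used; the better document
    # comes first, so every difference vector carries the label +1 and the redundant
    # negative instances are dropped.
    diffs = []
    for X, y in groups:
        for a in range(len(y)):
            for b in range(len(y)):
                if y[a] > y[b]:
                    diffs.append(X[a] - X[b])
    return np.array(diffs)

def train_ranking_svm(groups, lam=0.01, lr=0.1, epochs=200):
    # Minimize sum_i [1 - <w, x_i(1) - x_i(2)>]_+ + lam * ||w||^2 by subgradient descent.
    P = make_pairs(groups)
    w = np.zeros(P.shape[1])
    for _ in range(epochs):
        violated = P[P @ w < 1.0]               # pairs inside the margin
        grad = -violated.sum(axis=0) + 2 * lam * w
        w -= lr * grad
    return w

# Two toy query groups; higher grades have a larger first feature (invented data).
g1 = (np.array([[1.0, 0.2], [0.6, 0.4], [0.1, 0.9]]), [3, 2, 1])
g2 = (np.array([[0.9, 0.1], [0.2, 0.8]]), [2, 1])
w = train_ranking_svm([g1, g2])
print(g1[0] @ w)   # scores should decrease from the grade-3 to the grade-1 document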
4.2 IR SVM

IR SVM, proposed by Cao et al. [15], is an extension of Ranking SVM for Information Retrieval (IR), whose idea can be applied to other applications as well.

Ranking SVM transforms ranking into pairwise classification, and thus it actually makes use of the 0-1 loss in the learning process. There exists a gap between the loss function and the IR evaluation measures. IR SVM attempts to bridge the gap by modifying the 0-1 loss, that is, conducting cost sensitive learning of Ranking SVM.

We first look at the problems caused by straightforward application of Ranking SVM to document retrieval, using the examples in Fig. 6.

Fig. 6  Example ranking lists (grades: 3, 2, 1; documents are represented by their grades).
Example 1:
  ranking-1: 2 3 2 1 1 1 1
  ranking-2: 3 2 1 2 1 1 1
Example 2:
  ranking for query-1: 3 2 2 1 1 1 1
  ranking for query-2: 3 3 2 2 2 1 1 1 1 1

One problem with the direct application of Ranking SVM is that Ranking SVM equally treats document pairs across different grades. Example 1 indicates the problem. There are two rankings for the same query. The documents at positions 1 and 2 are swapped in ranking-1 from the perfect ranking, while the documents at positions 3 and 4 are swapped in ranking-2 from the perfect ranking. There is only one error for each ranking in terms of the 0-1 loss, or difference in order of pairs. They have the same effect on the training of Ranking SVM, which is not desirable. Ranking-2 should be better than ranking-1, from the viewpoint of IR, because the result on its top is better. Note that to have high accuracy on top-ranked documents is crucial for an IR system, which is reflected in the IR evaluation measures.

Another issue with Ranking SVM is that it equally treats document pairs from different queries. In Example 2, there are two queries and the numbers of documents associated with them are different. For query-1 there are 2 document pairs between grades 3-2, 4 document pairs between grades 3-1, 8 document pairs between grades 2-1, and in total 14 document pairs. For query-2, there are 31 document pairs. Ranking SVM takes 14 instances (document pairs) from query-1 and 31 instances (document pairs) from query-2 for training. Thus, the impact on the ranking model from query-2 will be larger than the impact from query-1. In other words, the model learned will be biased toward query-2. This is in contrast to the fact that in IR evaluation queries are evenly important. Note that the numbers of documents usually vary from query to query.

IR SVM addresses the above two problems by changing the 0-1 pairwise classification into a cost sensitive pairwise classification. It does so by modifying the hinge loss function of Ranking SVM. Specifically, it sets different losses for document pairs across different grades and from different queries. To emphasize the importance of correct ranking on the top, the loss function heavily penalizes errors related to the top. To increase the influence of queries with fewer documents, the loss function heavily penalizes errors from such queries.

Figure 7 plots the shapes of different hinge loss functions with different penalty parameters. The x-axis represents y f(x) and the y-axis represents the loss. When y f(xi(1) − xi(2)) ≥ 1, the losses are zero. When y f(xi(1) − xi(2)) < 1, the losses are represented by linearly decreasing functions with different slopes. If the slope equals −1, then the function is the normal hinge loss function. IR SVM modifies the hinge loss function, specifically it modifies the slopes for different grade pairs and different queries. It assigns higher weights to document pairs across important grade pairs and assigns normalization weights to document pairs according to queries.

Fig. 7  Modified hinge loss functions.

The learning of IR SVM is equivalent to the following optimization problem, specifically, the minimization of the modified regularized hinge loss function:

min_w  Σ_{i=1}^{m} τ_{k(i)} μ_{q(i)} [1 − yi ⟨w, xi(1) − xi(2)⟩]+ + λ||w||²,

where [x]+ denotes max(x, 0), λ = 1/(2C), and τ_{k(i)} and μ_{q(i)} are weights. See the loss function of Ranking SVM (5).

Here τ_{k(i)} represents the weight of instance (document pair) i whose label pair belongs to the k-th type. Xu et al. propose a heuristic method to determine the value of τk. The method takes the average drop in NDCG@1 when randomly changing the positions of documents belonging to the grade pair as the value of a grade pair τk. Moreover, μ_{q(i)} represents the weight of instance (document pair) i which is from query q. The value of μ_{q(i)} is simply determined by 1/|nq|, where nq is the number of document pairs for query q.

The equivalent QP problem is as below.

min_{w,ξ}  (1/2)||w||² + Σ_{i=1}^{m} Ci ξi
s.t.  yi ⟨w, xi(1) − xi(2)⟩ ≥ 1 − ξi,
      Ci = τ_{k(i)} μ_{q(i)} / (2λ)
      ξi ≥ 0,
      i = 1, . . . , m.
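As an illustration of the cost-sensitive objective, the sketch below (ours) evaluates the weighted hinge loss for a few document pairs; the τ and μ values here are invented, whereas in IR SVM τ comes from the NDCG@1 heuristic above and μ = 1/|nq|.

import numpy as np

def ir_svm_objective(w, pairs, lam=0.01):
    # pairs: list of (diff, tau, mu) where diff = x_i(1) - x_i(2) with the better
    # document first, tau weights the grade pair, and mu normalizes by query size.
    w = np.asarray(w, dtype=float)
    hinge = sum(tau * mu * max(0.0, 1.0 - float(np.dot(w, diff)))
                for diff, tau, mu in pairs)
    return hinge + lam * float(np.dot(w, w))

pairs = [
    (np.array([0.9, -0.7]), 0.8, 0.5),   # grade pair 3-1, from a query with 2 pairs
    (np.array([0.4, -0.2]), 0.3, 0.5),   # grade pair 2-1, same query
    (np.array([0.7, -0.7]), 0.3, 1.0),   # grade pair 2-1, from a query with 1 pair
]
print(ir_svm_objective([1.0, -1.0], pairs))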

5. Listwise Approach

The listwise approach addresses the ranking problem in a more straightforward way. Specifically, it takes ranking lists as instances in both learning and prediction. The group structure of ranking is maintained and ranking evaluation measures can be more directly incorporated into the loss functions in learning.

The listwise approach includes ListNet [18], ListMLE [19], AdaRank [20], SVM MAP [21], and SoftRank [22]. SVM MAP and related methods are explained in this article.

5.1 SVM MAP

The algorithm SVM MAP developed by Yue et al. [21] is designed to directly optimize MAP [2], but it can be easily extended to optimize NDCG. Xu et al. [23] further generalize it to a group of algorithms.

In ranking, for query qi the ranking model f(xij) assigns a score to each associated document dij or feature vector xij, where xij is the feature vector derived from qi and dij. The documents di (feature vectors xi) are then sorted based on their scores and a permutation denoted as πi is obtained. For simplicity, suppose that the ranking model f(xij) is a linear model:

f(xij) = ⟨w, xij⟩,   (6)

where w denotes a weight vector.

Suppose that labels for the feature vectors xi are also given as yi. We consider using a scoring function S(xi, πi) to measure the goodness of ranking πi. S(xi, πi) is defined as

S(xi, πi) = ⟨w, σ(xi, πi)⟩,

where w is still the weight vector and the vector σ(xi, πi) is defined as

σ(xi, πi) = (2 / (ni(ni − 1))) Σ_{k,l: k<l} [zkl (xik − xil)],

where zkl = +1 if πi(k) < πi(l) (xik is ranked ahead of xil in πi), and zkl = −1 otherwise. Recall that ni is the number of documents associated with query qi.

For query qi, we can calculate S(xi, πi) for each permutation πi and select the permutation π̃i with the largest score:

π̃i = arg max_{πi ∈ Πi} S(xi, πi),   (7)

where Πi denotes the set of all possible permutations for xi.

It can be easily shown that the ranking π̃i selected by Eq. (7) is equivalent to the ranking created by the ranking model f(xij) (when both of them are linear functions). Figure 8 gives an example. It is easy to verify that both f(x) and S(xi, π) will output ABC as the most preferable ranking (permutation).

Fig. 8  Example of scoring function.
  Objects: A, B, C
  fA = ⟨w, xA⟩, fB = ⟨w, xB⟩, fC = ⟨w, xC⟩.  Suppose fA > fB > fC.
  For example, Permutation 1: ABC; Permutation 2: ACB.
  S_ABC = (1/6) ⟨w, (xA − xB) + (xB − xC) + (xA − xC)⟩
  S_ACB = (1/6) ⟨w, (xA − xC) + (xC − xB) + (xA − xB)⟩
  S_ABC > S_ACB
In learning, we would ideally create a ranking model that can maximize the accuracy in terms of a listwise evaluation measure on training data, or equivalently, minimizes the loss function defined below,

L(f) = Σ_{i=1}^{m} (E(π*i, yi) − E(πi, yi)),   (8)

where πi is the permutation on feature vector xi by ranking model f and yi is the corresponding list of grades. E(πi, yi) denotes the evaluation result of πi in terms of an evaluation measure (e.g., NDCG). Usually E(π*i, yi) = 1.

We view the problem of learning a ranking model as the following optimization problem in which the following loss function is minimized:

Σ_{i=1}^{m} max_{π*i ∈ Π*i; πi ∈ Πi\Π*i} (E(π*i, yi) − E(πi, yi)) · [[S(xi, π*i) ≤ S(xi, πi)]],   (9)

where [[c]] is one if condition c is satisfied, otherwise it is zero. π*i ∈ Π*i ⊆ Πi denotes any of the perfect permutations for qi.

The loss function measures the loss when the most preferred ranking list by the ranking model is not the perfect ranking list. One can prove that the true loss function such as that in (8) is upper-bounded by the new loss function in (9).

The loss function (9) is still not continuous and differentiable. We can consider using continuous, differentiable, and even convex upper bounds of the loss function (9).

1) The 0-1 function in (9) can be replaced with its upper bounds, for example, hinge functions, yielding

Σ_{i=1}^{m} max_{π*i ∈ Π*i, πi ∈ Πi\Π*i} (E(π*i, yi) − E(πi, yi)) · [1 − (S(xi, π*i) − S(xi, πi))]+

or

Σ_{i=1}^{m} max_{π*i ∈ Π*i, πi ∈ Πi\Π*i} [(E(π*i, yi) − E(πi, yi)) − (S(xi, π*i) − S(xi, πi))]+ .

2) The max function can also be replaced with its upper bound, the sum function. This is because Σ_i xi ≥ max_i xi if xi ≥ 0 holds for all i.

3) Relaxations 1 and 2 can be applied simultaneously.

For example, using the hinge function and taking the true loss as 1.0 − MAP, we obtain SVM MAP. More precisely, SVM MAP solves the following QP problem:

min_{w; ξ≥0}  (1/2)||w||² + (C/m) Σ_{i=1}^{m} ξi
s.t.  ∀i, ∀π*i ∈ Π*i, ∀πi ∈ Πi \ Π*i:   (10)
      S(xi, π*i) − S(xi, πi) ≥ E(π*i, yi) − E(πi, yi) − ξi,

where C is a coefficient and ξi is the maximum loss among all the losses for permutations of query qi.

Equivalently, SVM MAP minimizes the following regularized hinge loss function:

Σ_{i=1}^{m} [max_{π*i ∈ Π*i; πi ∈ Πi\Π*i} ((E(π*i, yi) − E(πi, yi)) − (S(xi, π*i) − S(xi, πi)))]+ + λ||w||².   (11)

Intuitively, the first term calculates the total maximum loss when selecting the best permutation for each of the queries. Specifically, if the difference between the permutations S(xi, π*i) − S(xi, πi) is less than the difference between the corresponding evaluation measures E(π*i, yi) − E(πi, yi), then there will be a loss, otherwise not. Next, the maximum loss is selected for each query and they are summed up over all the queries.

Since c · [[x ≤ 0]] ≤ [c − x]+ holds for all c ∈ ℜ+ and x ∈ ℜ, it is easy to see that the loss in (11) also bounds the true loss function in (8).

6. Ongoing and Future Work

It is still necessary to develop more advanced technologies for learning to rank. There are also many open questions with regard to the theory and applications of learning to rank [2], [24]. Current and future research directions include

• training data creation
• semi-supervised learning and active learning
• feature learning
• scalable and efficient training
• domain adaptation and multi-task learning
• ranking by ensemble learning
• global ranking
• ranking of nodes in a graph.

References

[1] T.Y. Liu, “Learning to rank for information retrieval,” Foundations and Trends in Information Retrieval, vol.3, no.3, pp.225–331, 2009.
[2] H. Li, “Learning to rank for information retrieval and natural language processing,” Synthesis Lectures on Human Language Technologies, Morgan & Claypool, 2011.
[3] W.B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Pearson Education, 2009.
[4] T.Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li, “LETOR: Benchmark dataset for research on learning to rank for information retrieval,” Proc. SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007.
[5] K. Järvelin and J. Kekäläinen, “IR evaluation methods for retrieving highly relevant documents,” Proc. 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’00), pp.41–48, New York, NY, USA, 2000.
[6] D. Cossock and T. Zhang, “Subset ranking using regression,” COLT ’06: Proc. 19th Annual Conference on Learning Theory, pp.605–619, 2006.
[7] R. Herbrich, T. Graepel, and K. Obermayer, Large Margin Rank Boundaries for Ordinal Regression, MIT Press, Cambridge, MA, 2000.
[8] Y. Freund, R.D. Iyer, R.E. Schapire, and Y. Singer, “An efficient boosting algorithm for combining preferences,” J. Machine Learning Research, vol.4, pp.933–969, 2003.
[9] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” ICML ’05: Proc. 22nd International Conference on Machine Learning, pp.89–96, 2005.
[10] W. Chen, T.Y. Liu, Y. Lan, Z.M. Ma, and H. Li, “Ranking measures and loss functions in learning to rank,” NIPS ’09, 2009.
[11] P. Li, C. Burges, and Q. Wu, “McRank: Learning to rank using multiple classification and gradient boosting,” in Advances in Neural Information Processing Systems 20, ed. J. Platt, D. Koller, Y. Singer, and S. Roweis, pp.897–904, MIT Press, Cambridge, MA, 2008.
[12] K. Crammer and Y. Singer, “Pranking with ranking,” NIPS, pp.641–647, 2001.
[13] A. Shashua and A. Levin, “Ranking with large margin principle: Two approaches,” in Advances in Neural Information Processing Systems 15, ed. S.T.S. Becker and K. Obermayer, MIT Press, 2003.
[14] Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, and G. Sun, “A general boosting method and its application to learning ranking functions for web search,” in Advances in Neural Information Processing Systems 20, ed. J. Platt, D. Koller, Y. Singer, and S. Roweis, pp.1697–1704, MIT Press, Cambridge, MA, 2008.
[15] Y. Cao, J. Xu, T.Y. Liu, H. Li, Y. Huang, and H.W. Hon, “Adapting ranking SVM to document retrieval,” SIGIR ’06, pp.186–193, 2006.
[16] C. Burges, R. Ragno, and Q. Le, “Learning to rank with nonsmooth cost functions,” in Advances in Neural Information Processing Systems 18, pp.395–402, MIT Press, Cambridge, MA, 2006.
[17] Q. Wu, C.J.C. Burges, K.M. Svore, and J. Gao, “Adapting boosting for information retrieval measures,” Inf. Retr., vol.13, no.3, pp.254–270, 2010.
[18] Z. Cao, T. Qin, T.Y. Liu, M.F. Tsai, and H. Li, “Learning to rank: From pairwise approach to listwise approach,” ICML ’07: Proc. 24th International Conference on Machine Learning, pp.129–136, 2007.
[19] F. Xia, T.Y. Liu, J. Wang, W. Zhang, and H. Li, “Listwise approach to learning to rank: Theory and algorithm,” ICML ’08: Proc. 25th International Conference on Machine Learning, pp.1192–1199, New York, NY, USA, 2008.
[20] J. Xu and H. Li, “AdaRank: A boosting algorithm for information retrieval,” SIGIR ’07: Proc. 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.391–398, New York, NY, USA, 2007.

[21] Y. Yue, T. Finley, F. Radlinski, and T. Joachims, “A support vector method for optimizing average precision,” Proc. 30th Annual International ACM SIGIR Conference, pp.271–278, 2007.
[22] M. Taylor, J. Guiver, S. Robertson, and T. Minka, “SoftRank: Optimizing non-smooth rank metrics,” WSDM ’08: Proc. International Conference on Web Search and Web Data Mining, pp.77–86, New York, NY, USA, 2008.
[23] J. Xu, T.Y. Liu, M. Lu, H. Li, and W.Y. Ma, “Directly optimizing evaluation measures in learning to rank,” SIGIR ’08: Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.107–114, New York, NY, USA, 2008.
[24] O. Chapelle, Y. Chang, and T.Y. Liu, “Future directions in learning to rank,” J. Machine Learning Research - Proceedings Track, vol.14, pp.91–100, 2011.

Hang Li is a senior researcher and research manager in the Web Search and Mining Group at Microsoft Research Asia. He joined Microsoft Research in June 2001. Prior to that, he worked at the Research Laboratories of NEC Corporation. He obtained a B.S. in Electrical Engineering from Kyoto University in 1988 and an M.S. in Computer Science from Kyoto University in 1990. He earned his Ph.D. in Computer Science from the University of Tokyo in 1998. He is interested in statistical learning, information retrieval, data mining, and natural language processing.
