Three Approaches to Ordinal Classification

Krzysztof Dembczyński, Wojciech Kotłowski

Institute of Computing Science


Poznań University of Technology

EURO 2009, Bonn, July 8, 2009


1 Three Approaches to Ordinal Classification

2 Boosting-like Approach

3 Ordinal Matrix Factorization

4 Conclusions
Ordinal classification consists in predicting a label taken from a finite and ordered set for an object described by some attributes.

This problem shares characteristics of both multi-class classification and regression, but:
• the order between class labels cannot be neglected,
• the scale of the decision attribute is not cardinal.
Examples:
• A recommender system predicting the rating of a movie for a given user.
• Email filtering into ordered groups such as: important, normal, later, or spam.
Nature of ordinal classification:
• Classification with ordered class labels?
• Degenerate ranking problem?
1 Three Approaches to Ordinal Classification

2 Boosting-like Approach

3 Ordinal Matrix Factorization

4 Conclusions
Notation:
• K – number of classes
• y – actual label
• ŷ – predicted label
• x – attributes
• f (x) – prediction (ranking or utility) function
• L(·) – loss function
• J·K – Boolean test
Ordinal Classification – Probability Estimation:
• Prediction risk is defined by a loss matrix

$L(y, \hat{y}) = (l_{y,\hat{y}})_{K \times K}$

with V-shaped rows and zeros on the diagonal, e.g., for K = 4:

$$L = \begin{pmatrix}
0 & 1 & 2 & 3 \\
1 & 0 & 1 & 2 \\
2 & 1 & 0 & 1 \\
3 & 2 & 1 & 0
\end{pmatrix}$$
Ordinal Classification – Probability Estimation:
• Bayes decision for the loss matrix L(y, ŷ) is given by:

$$\hat{y}^* = \arg\min_{\hat{y}} \sum_{k=1}^{K} \Pr(y = k \mid x)\, L(k, \hat{y}).$$

• To solve the problem, we need to estimate the conditional probabilities Pr(y = k|x) – a lot of algorithms . . .
• We can decompose the problem into K − 1 binary problems by utilizing the order of the labels y: the results are then estimates of Pr(y > k|x), k = 1, . . . , K − 1.
• To satisfy monotonicity of Pr(y > k|x), k = 1, . . . , K − 1, we use isotonic regression (see the sketch below).
• Other possibilities are allowed . . .
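
A minimal sketch of the decomposition-plus-isotonization step described above. All names are hypothetical, the pool-adjacent-violators routine is written from scratch, and labels are assumed to be 1, . . . , K:

```python
import numpy as np

def pava_decreasing(v):
    """Pool Adjacent Violators: L2-project v onto non-increasing sequences.
    Used to monotonize the K-1 estimates Pr(y > k | x) over k."""
    vals, wts, sizes = [], [], []
    for x in v:
        vals.append(x); wts.append(1.0); sizes.append(1)
        # merge adjacent blocks while the non-increasing order is violated
        while len(vals) > 1 and vals[-2] < vals[-1]:
            v2, w2, s2 = vals.pop(), wts.pop(), sizes.pop()
            v1, w1, s1 = vals.pop(), wts.pop(), sizes.pop()
            w = w1 + w2
            vals.append((w1 * v1 + w2 * v2) / w); wts.append(w); sizes.append(s1 + s2)
    return np.concatenate([np.full(s, val) for val, s in zip(vals, sizes)])

# Raw estimates of Pr(y > k | x), k = 1..K-1, from K-1 independent binary
# models; they need not be monotone before isotonization.
raw = np.array([0.9, 0.7, 0.75, 0.2])
iso = pava_decreasing(raw)              # -> [0.9, 0.725, 0.725, 0.2]

# Recover the class distribution: Pr(y = k) = Pr(y > k-1) - Pr(y > k),
# with Pr(y > 0) = 1 and Pr(y > K) = 0.
exceed = np.concatenate(([1.0], iso, [0.0]))
probs = exceed[:-1] - exceed[1:]
```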
Ordinal Classification – Probability Estimation:
• Given Pr(y = k|x), k = 1, . . . , K, the optimal prediction is:


$$\hat{y}^* = \begin{cases}
\arg\max_k \Pr(y = k \mid x), & \text{for } l_{y\hat{y}} = [\![\, y \ne \hat{y} \,]\!], \\
\operatorname{median}(y \mid x), & \text{for } l_{y\hat{y}} = |y - \hat{y}|, \\
\mathbb{E}(y \mid x), & \text{for } l_{y\hat{y}} = (y - \hat{y})^2.
\end{cases}$$

• Absolute-error loss seems the most natural, since its Bayes decision is the median, which does not depend on the scale of the labels.
• Any function of the probability distribution can be used for
object ranking.
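
The case analysis above translates directly into code. A small sketch, assuming a vector p of estimated probabilities Pr(y = k|x) for labels 1..K (the function name is hypothetical):

```python
import numpy as np

def bayes_prediction(p, loss="absolute"):
    """p[k-1] = estimated Pr(y = k | x) for labels k = 1..K."""
    labels = np.arange(1, len(p) + 1)
    if loss == "zero_one":          # l = [y != yhat]   -> mode of the distribution
        return labels[np.argmax(p)]
    if loss == "absolute":          # l = |y - yhat|    -> median of the distribution
        return labels[np.searchsorted(np.cumsum(p), 0.5)]
    if loss == "squared":           # l = (y - yhat)^2  -> mean (not a label in general)
        return float(labels @ p)
    raise ValueError(loss)

p = np.array([0.1, 0.175, 0.0, 0.525, 0.2])
print(bayes_prediction(p, "zero_one"), bayes_prediction(p, "absolute"))  # 4 4
```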
Ordinal Classification – Degenerate Ranking:
• Prediction risk is defined by a rank loss computed over pairs
of objects:

$$L(y_{\circ\bullet}, f(x_\circ), f(x_\bullet)) = [\![\, y_{\circ\bullet} \, (f(x_\circ) - f(x_\bullet)) \le 0 \,]\!],$$

where $y_{\circ\bullet} = \operatorname{sgn}(y_\circ - y_\bullet)$, and $f(x)$ is a ranking (or utility) function.

$$y_{i_1} > y_{i_2} > y_{i_3} > \ldots > y_{i_{N-1}} > y_{i_N}$$
$$f(x_{i_1}) > f(x_{i_3}) > f(x_{i_2}) > \ldots > f(x_{i_{N-1}}) > f(x_{i_N})$$

(the swapped pair $i_2$, $i_3$ is a misranked pair penalized by the rank loss)
Ordinal Classification – Degenerate Ranking:
• This approach ranks the objects.
• To assign class labels, one has to compute thresholds on the range of the ranking function with respect to a given loss matrix.
• Rank loss minimization is strictly connected with maximization of the AUC criterion used in binary classification.
• Minimization of the rank loss on the training set has quadratic complexity with respect to the number of objects; however, in the case of K ordered classes, the algorithm can work in linear time, as in the sketch below.
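
A sketch of the linear-time idea for K ordered classes: count misranked pairs in one sweep over the objects sorted by score. Names are hypothetical, classes are assumed to be 1..K, and ties in f are glossed over:

```python
import numpy as np

def rank_loss(y, f, K):
    """Count pairs with y_i > y_j but f(x_i) <= f(x_j).
    Naively O(N^2); here O(N log N) for sorting plus O(N K) for one sweep.
    Ties in f are treated only approximately in this sketch."""
    seen = np.zeros(K + 1, dtype=np.int64)   # seen[c] = #processed objects of class c
    loss = 0
    for i in np.argsort(f, kind="stable"):   # ascending scores
        loss += int(seen[y[i] + 1:].sum())   # earlier (lower-score) objects of higher class
        seen[y[i]] += 1
    return loss

y = np.array([3, 1, 2, 3, 1]); f = np.array([0.2, 0.5, 0.1, 0.9, -1.0])
print(rank_loss(y, f, K=3))   # 2 misranked pairs
```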
Ordinal Classification – Threshold Loss:
• Prediction risk is defined by the threshold loss:

$$L(y, f(x), \theta) = \sum_{k=1}^{K-1} [\![\, y_k \, (f(x) - \theta_k) \le 0 \,]\!],$$

where $\theta = (\theta_0, \ldots, \theta_K)$ are consecutive thresholds to be computed simultaneously with $f(x)$, and $y_k = 1$ if $y > k$, $y_k = -1$ if $y \le k$.

[Figure: the real line of f(x) values from −5 to 5, partitioned by the consecutive thresholds θ0 = −∞, θ1 = −3.5, θ2 = −1.2, . . . , θK−2 = 1.2, θK−1 = 3.8, θK = ∞ into K intervals, one per class.]
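
Once f(x) and the inner thresholds θ1 < · · · < θK−1 are fixed, labeling reduces to a search. A sketch using the illustrative threshold values from the figure (the function name is hypothetical):

```python
import numpy as np

def predict_label(f_x, theta_inner):
    """theta_inner = (theta_1, ..., theta_{K-1}), increasing;
    theta_0 = -inf and theta_K = +inf are implicit.
    Predicts yhat = 1 + #{k : theta_k < f(x)}."""
    return int(np.searchsorted(theta_inner, f_x, side="left")) + 1

theta = [-3.5, -1.2, 1.2, 3.8]            # K = 5 classes
print(predict_label(0.3, theta))          # 3: between theta_2 and theta_3
print(predict_label(-4.0, theta))         # 1: below theta_1
```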
Ordinal Classification – Threshold Loss:
• This approach shares characteristics of the previous two.
• An object is compared to the thresholds instead of to all other training objects – lower complexity, although linear algorithms exist for rank loss minimization in the ordinal classification setting.
• Joint solution of all K − 1 binary problems – no need for isotonization of conditional probabilities, but the result is a single value.
• A weighted threshold loss can approximate any loss matrix.
1 Three Approaches to Ordinal Classification

2 Boosting-like Approach

3 Ordinal Matrix Factorization

4 Conclusions
Boosting-like Algorithms for Three Approaches:
• Prediction function is an ensemble of decision rules:
$$f(x) = \alpha_0 + \sum_{m=1}^{M} r_m(x).$$

• We used a boosting approach to learn f(x): in each iteration, a single rule is generated by concentrating on the examples that were hardest to classify correctly by the previous rules with respect to a given loss function. A toy version of this loop is sketched below.
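
A toy sketch of such a loop. This is not the authors' ENDER/RankRules/ORDER code: for brevity it uses squared-error residuals in place of the gradients of the actual (exponential rank / threshold / absolute-error) losses, and all names are hypothetical.

```python
import numpy as np

class Rule:
    """A single decision rule: a condition on one attribute plus a response
    alpha that is added to f(x) whenever the condition fires."""
    def __init__(self, feat, thr, side, alpha):
        self.feat, self.thr, self.side, self.alpha = feat, thr, side, alpha
    def covers(self, X):
        col = X[:, self.feat]
        return col <= self.thr if self.side == "le" else col > self.thr

def fit_rule(X, residuals):
    """Greedily pick the condition concentrating on the largest residuals,
    i.e., the examples the current ensemble gets most wrong."""
    best, best_score = None, -np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for side in ("le", "gt"):
                mask = X[:, j] <= thr if side == "le" else X[:, j] > thr
                if not mask.any():
                    continue
                score = abs(residuals[mask].sum()) / np.sqrt(mask.sum())
                if score > best_score:
                    best_score = score
                    best = Rule(j, thr, side, residuals[mask].mean())
    return best

def boost(X, y, M=100, shrinkage=0.1):
    alpha0 = float(np.median(y))             # constant term of the ensemble
    f = np.full(len(y), alpha0)
    rules = []
    for _ in range(M):
        rule = fit_rule(X, y - f)            # residuals: hardest examples dominate
        rule.alpha *= shrinkage
        f += rule.alpha * rule.covers(X)
        rules.append(rule)
    return alpha0, rules

def predict(x_row, alpha0, rules):           # f(x) = alpha_0 + sum_m r_m(x)
    X = x_row.reshape(1, -1)
    return alpha0 + sum(r.alpha * float(r.covers(X)[0]) for r in rules)
```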
Boosting-like Algorithms for Three Approaches:
• Ordinal ENDER – decomposes the problem into a sequence of binary problems for estimating Pr(y > k|x); uses isotonic regression for isotonization of the estimates; the final prediction is the median of the computed class distribution.
• RankRules – minimizes (exponential) rank loss;
parameterized to minimize absolute-error.
• ORDER – minimizes (exponential) threshold loss;
parameterized to minimize absolute-error.

• ENDER-Abs – a reference algorithm constructing an ensemble of decision rules by direct minimization of the absolute error.

All the algorithms work in linear time with respect to the number of training examples (plus log-linear time for sorting, used once in the preprocessing phase).
Experimental Results:
• Comparison of Ordinal ENDER, RankRules, ORDER, and ENDER-Abs.
• 19 benchmark sets taken from Luis Torgo's repository – transformed from regression to ordinal classification settings.
• Average ranks are computed with respect to the mean absolute error obtained on each data set.
• Critical difference in average ranks is CD = 1.076.

[Critical difference diagram: average ranks of ENDER−Abs, RankRules, ORDER, and Ordinal ENDER on a scale from 4 to 1; CD = 1.076.]
Experimental Results:
• There is almost no quantitative difference in performance or time consumption: RankRules is slightly slower.
• Qualitative differences: Ordinal ENDER is related to probability estimation, while RankRules is related to AUC maximization.
• Ensembles of decision rules are competitive with RankBoost-AE, ORBoost-All, and SVM-IMC.
1 Three Approaches to Ordinal Classification

2 Boosting-like Approach

3 Ordinal Matrix Factorization

4 Conclusions
Ordinal Matrix Factorization:
• Given a sparse matrix Y of observed values, build a model based on matrix factorization:

$$Y \approx \hat{Y} = U V^T,$$

where $U$ is an $I \times M$ matrix and $V^T$ is an $M \times J$ matrix.


• The prediction is then defined by:

$$\hat{y}_{ij} = \sum_{m=1}^{M} u_{im} v_{jm}.$$

• Example: I is the number of users and J is the number of movies in a movie recommender system, and M is the number of features describing users and movies.
• For learning, we use gradient descent applied alternately to the U and V matrices with respect to a given loss function, as in the sketch below.
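
A minimal sketch of this alternating scheme under a squared-error loss (the slides consider ordinal losses instead); all names and hyperparameter values are hypothetical:

```python
import numpy as np

def factorize(entries, I, J, M=10, lr=0.02, reg=0.05, epochs=1000, seed=0):
    """entries: iterable of (i, j, y_ij) observed cells of the sparse I x J matrix Y.
    Learns Y ~ U V^T by gradient descent applied alternately to U and V."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((I, M))
    V = 0.1 * rng.standard_normal((J, M))
    for _ in range(epochs):
        for i, j, y in entries:              # update U with V fixed
            err = U[i] @ V[j] - y            # gradient of the squared error
            U[i] -= lr * (err * V[j] + reg * U[i])
        for i, j, y in entries:              # update V with U fixed
            err = U[i] @ V[j] - y
            V[j] -= lr * (err * U[i] + reg * V[j])
    return U, V

# yhat_ij = sum_m u_im * v_jm, i.e. U[i] @ V[j]
entries = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
U, V = factorize(entries, I=3, J=3, M=2)
print(U[0] @ V[0])   # should be close to the observed rating 5.0
```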
Ordinal Matrix Factorization for Three Approaches:
• Decomposition schema for probability estimation.
• Minimization of rank loss.
• Minimization of threshold loss.
• Hypothesis: all the approaches perform similarly.
• For all three approaches, linear-time algorithms exist; minimization of the (exponential) rank loss, however, is the most demanding.
• No satisfactory results yet :(
• Work in progress . . .
1 Three Approaches to Ordinal Classification

2 Boosting-like Approach

3 Ordinal Matrix Factorization

4 Conclusions
Conclusions:
• Nature of ordinal classification?
• Three approaches to ordinal classification.
• Boosting-like algorithms: qualitative rather than quantitative differences between these approaches.
• Ordinal matrix factorization: work in progress . . .
