Fathony 2017 Adversarial
Abstract
Ordinal regression seeks class label predictions when the penalty incurred for
mistakes increases according to an ordering over the labels. The absolute error
is a canonical example. Many existing methods for this task reduce to binary
classification problems and employ surrogate losses, such as the hinge loss. We
instead derive uniquely defined surrogate ordinal regression loss functions by
seeking the predictor that is robust to the worst-case approximations of training
data labels, subject to matching certain provided training data statistics. We
demonstrate the advantages of our approach over other surrogate losses based on
hinge loss approximations using UCI ordinal prediction tasks.
1 Introduction
For many classification tasks, the discrete class labels being predicted have an inherent order (e.g.,
poor, fair, good, very good, and excellent labels). Confusing two classes that are distant from one
another (e.g., poor instead of excellent) is more detrimental than confusing two classes that are nearby.
The absolute error, |ŷ − y|, between the label prediction (ŷ ∈ Y) and actual label (y ∈ Y), is a canonical
ordinal regression loss function. The ordinal regression task seeks class label predictions for new
datapoints that minimize losses of this kind.
Many prevalent methods reduce the ordinal regression task to subtasks solved using existing super-
vised learning techniques. Some view the task from the regression perspective and learn both a linear
regression function and a set of thresholds that define class boundaries [1–5]. Other methods take a
classification perspective and use tools from cost-sensitive classification [6–8]. However, since the
absolute error of a predictor on training data is typically a non-convex (and non-continuous) function
of the predictor's parameters for each of these formulations, surrogate losses that approximate the
absolute error must be optimized instead. Under both perspectives, surrogate losses for ordinal
regression are constructed by transforming the surrogate losses for binary zero-one loss problems,
such as the hinge loss, the logistic loss, and the exponential loss, to take into account the different penalties
of the ordinal regression problem. Empirical evaluations have compared the appropriateness of
different surrogate losses, but these still leave the possibility of undiscovered surrogates that align
better with the ordinal regression loss.
To address these limitations, we seek the most robust [9] ordinal regression predictions by focusing
on the following adversarial formulation of the ordinal regression task: what predictor best minimizes
absolute error in the worst case given partial knowledge of the conditional label distribution? We
answer this question by considering the Nash equilibrium for a game defined by combining the loss
function with Lagrangian potential functions [10]. We derive a surrogate loss function for empirical
risk minimization that realizes this same adversarial predictor. We show that different types of
available knowledge about the conditional label distribution lead to thresholded regression-based
predictions or classification-based predictions. In both cases, the surrogate loss is novel compared to
existing surrogate losses. We also show that our surrogate losses enjoy Fisher consistency, a desirable
theoretical property guaranteeing that minimizing the surrogate loss produces Bayes optimal decisions
for the original loss in the limit. We develop two different approaches for optimizing the loss: a
stochastic optimization of the primal objective and a quadratic program formulation of the dual
objective. The second approach enables us to efficiently employ the kernel trick to provide a richer
feature representation without an overly burdensome time complexity. We demonstrate the benefits
of our adversarial formulation over previous ordinal regression methods based on hinge loss for a
range of prediction tasks using UCI datasets.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Ordinal regression is a discrete label prediction problem characterized by an ordered penalty for
making mistakes: loss(ŷ1, y) < loss(ŷ2, y) if y < ŷ1 < ŷ2 or y > ŷ1 > ŷ2. Though many loss
functions possess this property, the absolute error |ŷ − y| is the most widely studied, and we similarly
restrict our consideration to this loss function in this paper. The full loss matrix L for absolute error
with four labels is shown in Table 1.

Table 1: Ordinal regression loss matrix.
        y=1  y=2  y=3  y=4
ŷ=1      0    1    2    3
ŷ=2      1    0    1    2
ŷ=3      2    1    0    1
ŷ=4      3    2    1    0

The expected loss incurred using a probabilistic predictor P̂(ŷ|x) evaluated on the true data
distribution P(x, y) is: E_{X,Y∼P; Ŷ|X∼P̂}[L_{Ŷ,Y}] = Σ_{x,ŷ,y} P(x, y) P̂(ŷ|x) L_{ŷ,y}. The supervised
learning objective for this problem setting is to construct a probabilistic predictor P̂(ŷ|x) in a way
that minimizes this expected loss using training samples distributed according to the empirical
distribution P̃(x, y), which are drawn from the unknown true data generating distribution, P(x, y).
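As a concrete illustration of this expected loss (the predictor and label distributions below are hypothetical values chosen only for the example), the computation for a single x under the Table 1 loss matrix can be sketched as:

```python
import numpy as np

# Absolute-error loss matrix L[yhat, y] = |yhat - y| for four labels (Table 1)
L = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :])

def expected_loss(p_hat, p_true, L):
    """E[L] = sum over (yhat, y) of P_hat(yhat|x) * P(y|x) * L[yhat, y] for one x."""
    return p_hat @ L @ p_true

# Hypothetical predictor and true conditional label distributions for one x
p_hat = np.array([0.1, 0.7, 0.1, 0.1])
p_true = np.array([0.0, 1.0, 0.0, 0.0])
print(expected_loss(p_hat, p_true, L))  # mass off the true label pays its absolute error
```

Any probability mass the predictor places away from the true label is penalized in proportion to its distance from that label, which is what distinguishes ordinal regression from zero-one loss classification.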
A naïve ordinal regression approach relaxes the task to a continuous prediction problem, minimizes
the least absolute deviation [11], and then rounds predictions to the nearest integral label [12]. More
sophisticated methods range from using a cumulative link model [13] that assumes the cumulative
conditional probability P(Y ≤ j | x) follows a link function, to Bayesian non-parametric approaches
[14] and many others [15–22]. We narrow our focus over this broad range of methods found in the
related work to those that can be viewed as empirical risk minimization methods with piece-wise
convex surrogates, which are more closely related to our approach.
Threshold methods are one popular family of techniques that treat the ordinal response variable,
f ≜ w · x, as a continuous real-valued variable and introduce |Y| − 1 thresholds θ_1, θ_2, ..., θ_{|Y|−1}
that partition the real line into |Y| segments: θ_0 = −∞ < θ_1 < θ_2 < ... < θ_{|Y|−1} < θ_{|Y|} = ∞
[4]. Each segment corresponds to a label, with x_i assigned label j if θ_{j−1} < f ≤ θ_j. There are two
different approaches for constructing surrogate losses based on the threshold methods to optimize the
choice of w and θ_1, ..., θ_{|Y|−1}: one is based on penalizing all thresholds involved when a mistake is
made and one is based on only penalizing the most immediate thresholds.
All thresholds methods penalize every erroneous threshold using a surrogate loss, δ, for sets of binary
classification problems: loss_AT(f, y) = Σ_{k=1}^{y−1} δ(−(θ_k − f)) + Σ_{k=y}^{|Y|−1} δ(θ_k − f). Shashua and Levin
[1] studied the hinge loss under the name of support vector machines with a sum-of-margins strategy,
while Chu and Keerthi [2] proposed a similar approach under the name of support vector ordinal
regression with implicit constraints (SVORIM). Lin and Li [3] proposed ordinal regression boosting,
an all thresholds method using the exponential loss as a surrogate. Finally, Rennie and Srebro [4]
proposed a unifying approach for all threshold methods under a variety of surrogate losses.
Rather than penalizing all erroneous thresholds when an error is made, immediate thresholds methods
only penalize the threshold of the true label and the threshold immediately beneath the true label:
loss_IT(f, y) = δ(−(θ_{y−1} − f)) + δ(θ_y − f).¹ Similar to the all thresholds methods, immediate
threshold methods have also been studied in the literature under different names. For hinge loss surrogates,
Shashua and Levin [1] called the model support vector with fixed-margin strategy while Chu and
Keerthi [2] use the term support vector ordinal regression with explicit constraints (SVOREX). For
¹ For the boundary labels, the method defines δ(−(θ_0 − f)) = δ(θ_{|Y|} − f) = 0.
the exponential loss, Lin and Li [3] introduced ordinal regression boosting with left-right margins.
Rennie and Srebro [4] also proposed a unifying framework for immediate threshold methods.
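Both families of threshold surrogates can be sketched in a few lines of Python; here the base binary surrogate δ is instantiated as the hinge loss, and the thresholds are hypothetical values used only for illustration:

```python
def hinge(z):
    # Base binary surrogate: delta(z) = max(0, 1 - z)
    return max(0.0, 1.0 - z)

def loss_at(f, y, theta):
    """All thresholds: penalize every threshold below and above the true label y.
    theta holds theta_1..theta_{|Y|-1}; labels run from 1 to |Y|."""
    below = sum(hinge(-(theta[k] - f)) for k in range(y - 1))           # k = 1..y-1
    above = sum(hinge(theta[k] - f) for k in range(y - 1, len(theta)))  # k = y..|Y|-1
    return below + above

def loss_it(f, y, theta):
    """Immediate thresholds: only the two thresholds adjacent to the true label,
    with the boundary terms defined as zero."""
    lo = hinge(-(theta[y - 2] - f)) if y >= 2 else 0.0
    hi = hinge(theta[y - 1] - f) if y <= len(theta) else 0.0
    return lo + hi

theta = [-1.0, 0.0, 1.0]  # hypothetical thresholds for |Y| = 4
print(loss_at(2.5, 2, theta), loss_it(2.5, 2, theta))
```

Since the immediate thresholds sum is a subset of the all thresholds sum, loss_IT never exceeds loss_AT for the same f, y, and thresholds.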
Li and Lin [5] proposed a reduction framework to convert ordinal regression problems to binary
classification problems by extending training examples. For each training sample (x, y), the reduction
framework creates |Y| − 1 extended samples (x^(j), y^(j)) and assigns weight w_{y,j} to each extended
sample. The binary label associated with the extended sample is equivalent to the answer of the
question: is the rank of x greater than j? The reduction framework allows a choice for how extended
samples x(j) are constructed from original samples x and how to perform binary classification. If
the threshold method is used to construct the extended sample and SVM is used as the binary
classification algorithm, the classifier can be obtained by solving a family of quadratic optimization
problems that includes SVORIM and SVOREX as special instances.
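A minimal sketch of the extended-example construction follows; the pairing of x with j and the unit weights are illustrative assumptions (for absolute error, the standard cost-difference weighting reduces to equal weights):

```python
def extend(x, y, n_labels):
    """Reduction framework sketch: one sample (x, y) becomes |Y|-1 weighted binary
    samples answering 'is the rank of x greater than j?' for j = 1..|Y|-1."""
    extended = []
    for j in range(1, n_labels):
        x_j = (x, j)             # e.g., x augmented with an encoding of threshold j
        label = +1 if y > j else -1
        weight = 1.0             # assumed equal weights, as for absolute error
        extended.append((x_j, label, weight))
    return extended

samples = extend(x=[0.3, 0.7], y=3, n_labels=4)
print([label for (_, label, _) in samples])  # binary answers for j = 1, 2, 3
```

A rank-3 example among four labels answers "greater than j?" positively for j = 1 and j = 2 and negatively for j = 3.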
Rather than using thresholding or the reduction framework, ordinal regression can also be cast as a
special case of cost-sensitive multiclass classification. Two of the most popular classification-based
ordinal regression techniques are extensions of one-versus-one (OVO) and one-versus-all (OVA) cost-
sensitive classification [6, 7]. Both algorithms leverage a transformation that converts a cost-sensitive
classification problem to a set of weighted binary classification problems. Rather than reducing
to binary classification, Tu and Lin [8] reduce cost-sensitive classification to one-sided regression
(OSR), which can be viewed as an extension of the one-versus-all (OVA) technique.
Foundational results establish a duality between adversarial logarithmic loss minimization and
constrained maximization of the entropy [23]. This takes the form of a zero-sum game between
a predictor seeking to minimize expected logarithmic loss and an adversary seeking to maximize
this same loss. Additionally, the adversary is constrained to choose a distribution that matches
certain sample statistics. Ultimately, through the duality to maximum entropy, this is equivalent
to maximum likelihood estimation of probability distributions that are members of the exponential
family [23]. Grünwald and Dawid [9] emphasize this formulation as a justification for the principle of
maximum entropy [24] and generalize the adversarial formulation to other loss functions. Extensions
to multivariate performance measures [25] and non-IID settings [26] have demonstrated the versatility
of this perspective.
Recent analysis [27, 28] has shown that for the special case of zero-one loss classification, this
adversarial formulation is equivalent to empirical risk minimization with a surrogate loss function:
AL^{0-1}_f(x_i, y_i) = max_{S ⊆ {1,...,|Y|}, S ≠ ∅} ( Σ_{j∈S} ψ_{j,y_i}(x_i) + |S| − 1 ) / |S|,   (1)

where ψ_{j,y_i}(x_i) is the potential difference ψ_{j,y_i}(x_i) = f_j(x_i) − f_{y_i}(x_i). This surrogate loss function
provides a key theoretical advantage compared to the Crammer-Singer hinge loss surrogate for
multiclass classification [29]: it guarantees Fisher consistency [27], while Crammer-Singer, despite
its popularity in many applications such as Structured SVM [30, 31], does not [32, 33]. We extend
this type of analysis to the ordinal regression setting with the absolute error as the loss function in
this paper, producing novel surrogate loss functions that provide better predictions than other convex,
piece-wise linear surrogates.
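For concreteness, Eq. (1) can be evaluated by brute force over the non-empty label subsets S; the potential differences below are hypothetical values for a three-class problem:

```python
from itertools import combinations

def al_01(psi):
    """AL^{0-1}: max over non-empty S of (sum_{j in S} psi_j + |S| - 1) / |S|,
    where psi_j = f_j(x) - f_y(x) are the potential differences."""
    n = len(psi)
    best = float("-inf")
    for size in range(1, n + 1):
        for S in combinations(range(n), size):
            value = (sum(psi[j] for j in S) + size - 1) / size
            best = max(best, value)
    return best

# Hypothetical potential differences; the true label's entry is psi_y = 0
psi = [0.0, -0.4, -0.9]
print(al_01(psi))
```

This exhaustive form is exponential in |Y| and serves only to make the definition concrete; practical evaluations exploit the structure of the maximization.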
We seek the ordinal regression predictor that is the most robust to uncertainty given partial knowledge
of the evaluating distribution's characteristics. This takes the form of a zero-sum game between a
predictor player choosing a predicted label distribution P̂(ŷ|x) that minimizes loss and an adversarial
player choosing an evaluation distribution P̌(y̌|x) that maximizes loss while closely matching the
feature-based statistics of the training data:

min_{P̂(ŷ|x)} max_{P̌(y̌|x)} E_{X∼P̃; Ŷ|X∼P̂; Y̌|X∼P̌}[ |Ŷ − Y̌| ]  such that:  E_{X∼P̃; Y̌|X∼P̌}[φ(X, Y̌)] = φ̃.   (2)

The vector of feature moments, φ̃ = E_{X,Y∼P̃}[φ(X, Y)], is measured from sample training data
distributed according to the empirical distribution P̃(x, y).
An ordinal regression problem can be viewed as a cost-sensitive loss with the entries of the cost
matrix defined by the absolute loss between the row and column labels (an example of the cost
matrix for the case of a problem with four labels is shown in Table 1). Following the construction of
adversarial prediction games for cost-sensitive classification [10], the optimization of Eq. (2) reduces
to minimizing the equilibrium game values of a new set of zero-sum games characterized by matrix
L′_{x_i,w}. We consider two feature representations for φ(x, y): a threshold-based representation, φ_th,
and a multiclass representation, φ_mc:

φ_th(x, y) = [ y x,  I(y ≤ 1),  I(y ≤ 2),  ...,  I(y ≤ |Y| − 1) ]ᵀ ;  and
φ_mc(x, y) = [ I(y = 1) x,  I(y = 2) x,  I(y = 3) x,  ...,  I(y = |Y|) x ]ᵀ.   (4)
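A sketch of the two feature representations in Eq. (4), assuming labels run from 1 to |Y| and x is a dense feature vector:

```python
import numpy as np

def phi_th(x, y, n_labels):
    """Threshold-based features: y*x stacked with threshold indicators I(y <= k)."""
    indicators = np.array([1.0 if y <= k else 0.0 for k in range(1, n_labels)])
    return np.concatenate([y * x, indicators])

def phi_mc(x, y, n_labels):
    """Multiclass features: x placed in the block for class y, zeros elsewhere."""
    out = np.zeros(n_labels * len(x))
    out[(y - 1) * len(x): y * len(x)] = x
    return out

x = np.array([0.5, -1.0])
print(phi_th(x, 2, 4))  # [1.0, -2.0, 0.0, 1.0, 1.0]
print(phi_mc(x, 2, 4))
```

The threshold representation shares one weight vector across labels plus |Y| − 1 threshold terms, while the multiclass representation gives each label its own block of weights, matching the parameter counts discussed in the experiments (m + |Y| − 1 versus m|Y|).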
3.3 Adversarial Loss from the Nash Equilibrium
We now present the main technical contribution of our paper: a surrogate loss function that, when
minimized, produces a solution to the adversarial ordinal regression problem of Eq. (3).2
Theorem 1. An adversarial ordinal regression predictor is obtained by choosing parameters w that
minimize the empirical risk of the surrogate loss function:
AL^ord_w(x_i, y_i) = max_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2 − f_{y_i}
                   = max_j (f_j + j)/2 + max_l (f_l − l)/2 − f_{y_i},   (5)

where f_j = w · φ(x_i, j) for all j ∈ {1, . . . , |Y|}.
Proof sketch. Let j*, l* be the solution of argmax_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2. We show that the Nash
equilibrium value of a game matrix that contains only rows j* and l* and columns j* and l* from
matrix L′_{x_i,w} is exactly (f_{j*} + f_{l*} + j* − l*)/2 − f_{y_i}. We then show that adding the other rows and
columns of L′_{x_i,w} to the game matrix does not change the game value. Given the resulting closed-form
solution of the game (instead of a minimax), we can recast the adversarial framework for ordinal
regression as an empirical risk minimization with the proposed loss.
We note that the AL^ord_w surrogate is the maximization over pairs of different potential functions
associated with each class (including pairs of identical class labels) added to the distance between the
pair. For both of our feature representations, we make use of the fact that maximization over each
element of the pair can be independently realized, as shown on the right-hand side of Eq. (5).
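The pairwise form of Eq. (5) and its decomposed form can be computed and compared directly; the potentials below are hypothetical values for a three-class problem:

```python
def al_ord(f, y):
    """AL^ord from Eq. (5): max over label pairs (j, l) of (f_j + f_l + j - l)/2,
    minus the true-label potential. Labels are 1..|Y|; f is 0-indexed."""
    n = len(f)
    pairwise = max((f[j - 1] + f[l - 1] + j - l) / 2
                   for j in range(1, n + 1) for l in range(1, n + 1))
    return pairwise - f[y - 1]

def al_ord_decomposed(f, y):
    """Equivalent O(|Y|) form: max_j (f_j + j)/2 + max_l (f_l - l)/2 - f_y."""
    n = len(f)
    left = max((f[j - 1] + j) / 2 for j in range(1, n + 1))
    right = max((f[l - 1] - l) / 2 for l in range(1, n + 1))
    return left + right - f[y - 1]

f = [0.2, -0.1, 0.4]  # hypothetical potentials for a 3-class problem
print(al_ord(f, 2), al_ord_decomposed(f, 2))
```

The decomposition replaces the O(|Y|²) search over pairs with two independent O(|Y|) maximizations, which is what makes the surrogate cheap to evaluate.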
Figure 3: Loss function contour plots of AL^ord over the space of potential differences ψ_j ≜ f_j − f_{y_i}
for the prediction task with three classes when the true label is y_i = 1 (a), y_i = 2 (b), and y_i = 3 (c).
We can also view this as the maximization over |Y|(|Y| + 1)/2 linear hyperplanes. For an ordinal
regression problem with three classes, the loss has six facets with different shapes for each true label
value, as shown in Figure 3. In contrast with AL^ord-th, the class label potentials for AL^ord-mc may
differ from one another in more-or-less arbitrary ways. Thus, searching for the maximal j and l class
labels requires O(|Y|) time.
The behavior of a prediction method in ideal learning settings (i.e., trained on the true evaluation
distribution and given an arbitrarily rich feature representation, or, equivalently, considering the space
of all measurable functions) provides a useful theoretical validation. Fisher consistency requires that
the prediction model yields the Bayes optimal decision boundary [32, 33, 35] in this setting. Given
the true label conditional probability P_j(x) ≜ P(Y = j | x), a surrogate loss function δ_f is said to
be Fisher consistent with respect to the loss ℓ if the minimizer f* of the expected surrogate loss achieves the
Bayes optimal risk, i.e.:

f* = argmin_f E_{Y|X∼P}[δ_f(X, Y) | X = x]   (8)
  ⇒  E_{Y|X∼P}[ℓ_{f*}(X, Y) | X = x] = min_f E_{Y|X∼P}[ℓ_f(X, Y) | X = x].
Ramaswamy and Agarwal [36] provide a necessary and sufficient condition for a surrogate loss to be
Fisher consistent with respect to general multiclass losses, which includes ordinal regression losses.
A recent analysis by Pedregosa et al. [35] shows that the all thresholds and the immediate thresholds
methods are Fisher consistent provided that the base binary surrogate losses they use are convex
with a negative derivative at zero.
For our proposed approach, the condition for Fisher consistency above is equivalent to:

f* = argmin_f Σ_y P_y [ max_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2 − f_y ]
  ⇒  argmax_j f*_j(x) ⊆ argmin_j Σ_y P_y |j − y|.   (9)

Since adding a constant to all f_j changes the value of neither AL^ord_f nor argmax_j f_j(x), we
employ the constraint max_j f_j(x) = 0 to remove redundant solutions for the consistency analysis.
We establish an important property of the minimizer for AL^ord_f in the following theorem.

Theorem 2. The minimizer vector f* of E_{Y|X∼P}[AL^ord_f(X, Y) | X = x] satisfies the loss reflective
property, i.e., it complements the absolute error by starting with a negative integer value,
increasing by one until reaching zero, and then incrementally decreasing again.
Proof sketch. We show that for any f⁰ that does not satisfy the loss reflective property, we can
construct, in several steps, an f¹ that satisfies the loss reflective property and has an expected loss
value less than the expected loss of f⁰.
Example vectors f* that satisfy Theorem 2 are [0, −1, −2]ᵀ, [−1, 0, −1]ᵀ, and [−2, −1, 0]ᵀ for
three-class problems, and [−3, −2, −1, 0, −1]ᵀ for five-class problems. Using this key property of the
minimizer, we establish the consistency of our loss functions in the following theorem.
Theorem 3. The adversarial ordinal regression surrogate loss AL^ord from Eq. (5) is Fisher consistent.
Proof sketch. We only consider |Y| possible values of f that satisfy the loss reflective property. For
the f that corresponds to class j, the value of the expected loss is equal to the Bayes loss if we predict
j as the label. Therefore minimizing over f that satisfy the loss reflective property is equivalent to
finding the Bayes optimal response.
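Theorems 2 and 3 can be checked numerically: for each class j, the expected surrogate of the class-j loss reflective vector equals the Bayes risk of predicting j (the label distribution below is illustrative):

```python
def al_ord(f, y):
    """AL^ord from Eq. (5) for labels 1..|Y| (f is 0-indexed)."""
    n = len(f)
    left = max((f[j - 1] + j) / 2 for j in range(1, n + 1))
    right = max((f[l - 1] - l) / 2 for l in range(1, n + 1))
    return left + right - f[y - 1]

def expected_surrogate(f, p):
    # Expectation of the surrogate over the true label Y for one x
    return sum(p_y * al_ord(f, y) for y, p_y in enumerate(p, start=1))

p = [0.2, 0.5, 0.3]  # an illustrative conditional label distribution
# Loss reflective vectors for three classes (Theorem 2)
reflective = {1: [0, -1, -2], 2: [-1, 0, -1], 3: [-2, -1, 0]}
for j, f in reflective.items():
    bayes = sum(p_y * abs(j - y) for y, p_y in enumerate(p, start=1))
    # The expected surrogate of the class-j reflective vector equals its Bayes risk
    assert abs(expected_surrogate(f, p) - bayes) < 1e-9
print("loss reflective vectors match Bayes risks")
```

Since each reflective vector's expected surrogate matches the expected absolute loss of predicting the corresponding class, minimizing over these vectors recovers the Bayes optimal response, as the proof sketch argues.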
3.5 Optimization
Note that our dual formulation only depends on the dot product of the features. Therefore, we can
also easily apply the kernel trick to our algorithm. Appendix D describes the derivation from the
primal optimization to the dual optimization above.
Table 3: The average of the mean absolute error (MAE) for each model. Bold numbers in each case
indicate that the result is the best or not significantly worse than the best (paired t-test with α = 0.05).
Threshold-based models: AL^ord-th, RED^th, AT, IT. Multiclass-based models: AL^ord-mc, RED^mc, CSOSR, CSOVO, CSOVA.
Dataset       AL^ord-th  RED^th  AT     IT     AL^ord-mc  RED^mc  CSOSR  CSOVO  CSOVA
diabetes 0.696 0.715 0.731 0.827 0.629 0.700 0.715 0.738 0.762
pyrimidines 0.654 0.678 0.615 0.626 0.509 0.565 0.520 0.576 0.526
triazines 0.607 0.683 0.649 0.654 0.670 0.673 0.677 0.738 0.732
wisconsin 1.077 1.067 1.097 1.175 1.136 1.141 1.208 1.275 1.338
machinecpu 0.449 0.456 0.458 0.467 0.518 0.515 0.646 0.602 0.702
autompg 0.551 0.550 0.550 0.617 0.599 0.602 0.741 0.598 0.731
boston 0.316 0.304 0.306 0.298 0.311 0.311 0.353 0.294 0.363
stocks 0.324 0.317 0.315 0.324 0.168 0.175 0.204 0.147 0.213
abalone 0.551 0.547 0.546 0.571 0.521 0.520 0.545 0.558 0.556
bank 0.461 0.460 0.461 0.461 0.445 0.446 0.732 0.448 0.989
computer 0.640 0.635 0.633 0.683 0.625 0.624 0.889 0.649 1.055
calhousing 1.190 1.183 1.182 1.225 1.164 1.144 1.237 1.202 1.601
average 0.626 0.633 0.629 0.661 0.613 0.618 0.706 0.652 0.797
# bold 5 5 4 2 5 5 2 2 1
The baselines we use for the threshold-based models include an SVM-based reduction framework
algorithm (RED^th) [5], an all thresholds method with hinge loss (AT) [1, 2], and an immediate thresholds
method with hinge loss (IT) [1, 2]. For the multiclass-based models, we compare our method with an
SVM-based reduction algorithm using multiclass features (RED^mc) [5], with cost-sensitive one-sided
support vector regression (CSOSR) [8], with cost-sensitive one-versus-one SVM (CSOVO) [7], and
with cost-sensitive one-versus-all SVM (CSOVA) [6]. For our Gaussian kernel experiment, we
compare our threshold-based model (AL^ord-th) with SVORIM and SVOREX [2].
In our experiments, we first make 20 random splits of each dataset into training and testing sets. We
perform two stages of five-fold cross validation on the first split's training set for tuning each model's
regularization constant λ. In the first stage, the possible values for λ are 2^i, i ∈ {1, 3, 5, 7, 9, 11, 13}.
Using the best λ from the first stage, we set the possible values for λ in the second stage as 2^{i/2} λ_0,
i ∈ {−3, −2, −1, 0, 1, 2, 3}, where λ_0 is the best parameter obtained in the first stage. Using the selected
parameter from the second stage, we train each model on the 20 training sets and evaluate the MAE
performance on the corresponding testing sets. We then perform a statistical test to determine whether the
performance of a model differs with statistical significance from that of the other models. We perform the
Gaussian kernel experiment similarly, with the model parameter C set to 2^i, i ∈ {0, 3, 6, 9, 12}, and the
kernel parameter γ set to 2^i, i ∈ {−12, −9, −6, −3, 0}, in the first stage. In the second stage, we
set C to 2^i C_0, i ∈ {−2, −1, 0, 1, 2}, and γ to 2^i γ_0, i ∈ {−2, −1, 0, 1, 2}, where C_0
and γ_0 are the best parameters obtained in the first stage.
4.2 Results
We report the mean absolute error (MAE) averaged over the dataset splits, as shown in Table 3 and
Table 4. We highlight in boldface the result that is either the best or not worse than the best with
statistical significance (under a paired t-test with α = 0.05). At the bottom of each table, we also
provide a summary for each model: the MAE averaged over all datasets and the number of datasets
for which the model is marked in boldface.
As we can see from Table 3, in the experiment with the original feature space, threshold-based
models perform well on relatively small datasets, whereas multiclass-based models perform well on
relatively large datasets. A possible explanation for this result is that multiclass-based models have
more flexibility in creating decision boundaries, hence they perform better if the training data size is
sufficient. However, since multiclass-based models have many more parameters than threshold-based
models (m|Y| parameters rather than m + |Y| − 1 parameters), multiclass methods may need more
data, and hence, may not perform well on relatively small datasets.
In the threshold-based models comparison, AL^ord-th, RED^th, and AT perform competitively on
relatively small datasets like triazines, wisconsin, machinecpu, and autompg. AL^ord-th has a
slight advantage over RED^th on the overall accuracy, and a slight advantage over AT on the number
of indistinguishably best performances across all datasets. We can also see that AT is superior to IT in
the experiments under the original feature space.
Among the multiclass-based models, AL^ord-mc and RED^mc perform competitively on datasets
like abalone, bank, and computer, with a slight advantage of the AL^ord-mc model on the overall
accuracy. In general, the cost-sensitive models perform poorly compared with AL^ord-mc and
RED^mc. A notable exception is the CSOVO model, which performs very well on the stocks
and the boston datasets.

Table 4: The average of MAE for models with Gaussian kernel.
Dataset       AL^ord-th  SVORIM  SVOREX
diabetes      0.696      0.665   0.688
pyrimidines   0.478      0.539   0.550
triazines     0.609      0.612   0.604
wisconsin     1.090      1.113   1.049
machinecpu    0.452      0.652   0.628
autompg       0.529      0.589   0.593
boston        0.278      0.324   0.316
stocks        0.103      0.099   0.100
average       0.531      0.574   0.566
# bold        7          3       4

In the Gaussian kernel experiment, we can see from Table 4 that the kernelized version of
AL^ord-th performs significantly better than the threshold-based models SVORIM and SVOREX
in terms of both the overall accuracy and the number of indistinguishably best performances
across all datasets. We also note that the immediate-threshold-based model (SVOREX) performs better than the
all-threshold-based model (SVORIM) in our experiment using the Gaussian kernel. We can conclude
that our proposed adversarial losses for ordinal regression perform competitively compared to the
state-of-the-art ordinal regression models using both original feature spaces and kernel feature spaces
with a significant performance improvement in the Gaussian kernel experiments.
Acknowledgments
This research was supported as part of the Future of Life Institute (futureoflife.org) FLI-RFP-AI1
program, grant #2016-158710, and by NSF grant RI-#1526379.
References
[1] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems 15, pages 961–968. MIT Press, 2003.
[2] Wei Chu and S. Sathiya Keerthi. New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, pages 145–152. ACM, 2005.
[3] Hsuan-Tien Lin and Ling Li. Large-margin thresholded ensembles for ordinal regression: Theory and practice. In International Conference on Algorithmic Learning Theory, pages 319–333. Springer, 2006.
[4] Jason D. M. Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186, 2005.
[5] Ling Li and Hsuan-Tien Lin. Ordinal regression by extended binary classification. Advances in Neural Information Processing Systems, 19:865, 2007.
[6] Hsuan-Tien Lin. From ordinal ranking to binary classification. PhD thesis, California Institute of Technology, 2008.
[7] Hsuan-Tien Lin. Reduction from cost-sensitive multiclass classification to one-versus-one binary classification. In Proceedings of the Sixth Asian Conference on Machine Learning, pages 371–386, 2014.
[8] Han-Hsing Tu and Hsuan-Tien Lin. One-sided support vector regression for multiclass cost-sensitive classification. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1095–1102, 2010.
[9] Peter D. Grünwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2004.
[10] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.
[11] Subhash C. Narula and John F. Wellington. The minimum sum of absolute errors regression: A state of the art survey. International Statistical Review / Revue Internationale de Statistique, pages 317–326, 1982.
[12] Koby Crammer and Yoram Singer. Pranking with ranking. In Advances in Neural Information Processing Systems 14, 2001.
[13] Peter McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society, Series B (Methodological), pages 109–142, 1980.
[14] Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6(Jul):1019–1041, 2005.
[15] Krzysztof Dembczyński, Wojciech Kotłowski, and Roman Słowiński. Ordinal classification with decision rules. In International Workshop on Mining Complex Data, pages 169–181. Springer, 2007.
[16] Mark J. Mathieson. Ordinal models for neural networks. Neural Networks in Financial Engineering, pages 523–536, 1996.
[17] Shipeng Yu, Kai Yu, Volker Tresp, and Hans-Peter Kriegel. Collaborative ordinal regression. In Proceedings of the 23rd International Conference on Machine Learning, pages 1089–1096. ACM, 2006.
[18] Jianlin Cheng, Zheng Wang, and Gianluca Pollastri. A neural network approach to ordinal regression. In IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), pages 1279–1284. IEEE, 2008.
[19] Wan-Yu Deng, Qing-Hua Zheng, Shiguo Lian, Lin Chen, and Xin Wang. Ordinal extreme learning machine. Neurocomputing, 74(1):447–456, 2010.
[20] Bing-Yu Sun, Jiuyong Li, Desheng Dash Wu, Xiao-Ming Zhang, and Wen-Bo Li. Kernel discriminant learning for ordinal regression. IEEE Transactions on Knowledge and Data Engineering, 22(6):906–910, 2010.
[21] Jaime S. Cardoso and Joaquim F. Costa. Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8(Jul):1393–1429, 2007.
[22] Yang Liu, Yan Liu, and Keith C. C. Chan. Ordinal regression via manifold learning. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 398–403. AAAI Press, 2011.
[23] Flemming Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.
[24] Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
[25] Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, pages 2710–2718, 2015.
[26] Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems, pages 37–45, 2014.
[27] Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective. In Advances in Neural Information Processing Systems 29, pages 559–567, 2016.
[28] Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4233–4241, 2016.
[29] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.
[30] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. In JMLR, pages 1453–1484, 2005.
[31] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the International Conference on Machine Learning, pages 377–384, 2005.
[32] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007.
[33] Yufeng Liu. Fisher consistency of multicategory support vector machines. In International Conference on Artificial Intelligence and Statistics, pages 291–298, 2007.
[34] Miroslav Dudík and Robert E. Schapire. Maximum entropy distribution estimation with generalized regularization. In International Conference on Computational Learning Theory, pages 123–138. Springer, 2006.
[35] Fabian Pedregosa, Francis Bach, and Alexandre Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18(55):1–35, 2017.
[36] Harish G. Ramaswamy and Shivani Agarwal. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems, pages 2078–2086, 2012.
[37] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
[38] Mark Schmidt, Reza Babanezhad, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. 2015.
[39] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
Supplementary Materials
A Proof for the Adversarial Ordinal Regression Loss (Theorem 1)
Before proving Theorem 1, we review the game matrix L′_{x_i,w} for ordinal regression problems. Below
is the matrix when the number of classes is four:

L′_{x_i,w} =
[ f_1 − f_{y_i}      f_2 − f_{y_i} + 1   f_3 − f_{y_i} + 2   f_4 − f_{y_i} + 3
  f_1 − f_{y_i} + 1  f_2 − f_{y_i}       f_3 − f_{y_i} + 1   f_4 − f_{y_i} + 2
  f_1 − f_{y_i} + 2  f_2 − f_{y_i} + 1   f_3 − f_{y_i}       f_4 − f_{y_i} + 1
  f_1 − f_{y_i} + 3  f_2 − f_{y_i} + 2   f_3 − f_{y_i} + 1   f_4 − f_{y_i} ]   (11)

= [ f_1      f_2 + 1  f_3 + 2  f_4 + 3
    f_1 + 1  f_2      f_3 + 1  f_4 + 2
    f_1 + 2  f_2 + 1  f_3      f_4 + 1
    f_1 + 3  f_2 + 2  f_3 + 1  f_4 ] − f_{y_i}   (12)

= L″_{x_i,w} − f_{y_i}.   (13)
Proof. Our proof strategy is to use the inequalities implied by the definition of AL^ord_w and then show
that the value of AL^ord_w is equal to the game value of sub-matrices of L′_{x_i,w}. We start by showing
the equality for a small 2-by-2 sub-matrix and build up until we show that the value of AL^ord_w is
indeed equal to the game value of the whole game matrix L′_{x_i,w}. Empirically minimizing AL^ord_w
will then conclude the theorem.
Let us begin the proof by denoting v(G) as the Nash equilibrium value of a game characterized
by game matrix G. We would like to prove that for a zero-sum game characterized by L′_{x_i,w} as
described in Eq. (3), v(L′_{x_i,w}) = max_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2 − f_{y_i}.

Note that for any game matrix G and any constant c, v(G + c) = v(G) + c. We denote
L″_{x_i,w} = L′_{x_i,w} + f_{y_i}. Thus, proving the theorem is equivalent to proving v(L″_{x_i,w}) =
max_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2. The matrix L″_{x_i,w} is similar to the matrix in Eq. (3), but without
the f_{y_i} term in each of its cells, i.e.:
$$L''_{x_i,w} = \begin{bmatrix}
f_1 & f_2 + 1 & \cdots & f_{|\mathcal{Y}|-1} + |\mathcal{Y}| - 2 & f_{|\mathcal{Y}|} + |\mathcal{Y}| - 1 \\
f_1 + 1 & f_2 & \cdots & f_{|\mathcal{Y}|-1} + |\mathcal{Y}| - 3 & f_{|\mathcal{Y}|} + |\mathcal{Y}| - 2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
f_1 + |\mathcal{Y}| - 2 & f_2 + |\mathcal{Y}| - 3 & \cdots & f_{|\mathcal{Y}|-1} & f_{|\mathcal{Y}|} + 1 \\
f_1 + |\mathcal{Y}| - 1 & f_2 + |\mathcal{Y}| - 2 & \cdots & f_{|\mathcal{Y}|-1} + 1 & f_{|\mathcal{Y}|}
\end{bmatrix} \qquad (15)$$
Let $j^*$ and $l^*$ be the solution of $\operatorname{argmax}_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2}$ (if there are ties, pick any of them), and let
$$u^* = \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} = \frac{f_{j^*} + f_{l^*} + j^* - l^*}{2}.$$
We know the following inequalities hold:
$$f_{j^*} + f_{l^*} + j^* - l^* \ge f_j + f_l + j - l, \quad \forall j, l \in \{1,\dots,|\mathcal{Y}|\} \qquad (16)$$
$$f_{j^*} + j^* \ge f_j + j, \quad \forall j \in \{1,\dots,|\mathcal{Y}|\} \qquad (17)$$
$$f_{l^*} - l^* \ge f_l - l, \quad \forall l \in \{1,\dots,|\mathcal{Y}|\}. \qquad (18)$$
We also know that $j^* \ge l^*$; otherwise, we could swap them to obtain a larger value.

We first focus on the case where $j^* \ne l^*$. We analyze three different games characterized by sub-matrices of $L''_{x_i,w}$ and show that the value of each of those games is $u^*$.
Case 1: Let $G_1$ be a game characterized by a 2-by-2 matrix whose values are taken from rows and columns $j^*$ and $l^*$ of matrix $L''_{x_i,w}$, i.e.,
$$G_1 = \begin{bmatrix} f_{l^*} & f_{j^*} + j^* - l^* \\ f_{l^*} + j^* - l^* & f_{j^*} \end{bmatrix}. \qquad (19)$$
We will show that $v(G_1) = u^*$. Let $p$ be the vector of the adversary's mixed strategy; then finding $v(G_1)$ is equivalent to solving the following optimization:
$$\max_{p, V} \; V \qquad (20)$$
$$\text{s.t. } V \le p_{l^*} f_{l^*} + p_{j^*}(f_{j^*} + j^* - l^*) = p_{l^*} f_{l^*} + p_{j^*} f_{j^*} + p_{j^*}(j^* - l^*)$$
$$\phantom{\text{s.t. }} V \le p_{l^*}(f_{l^*} + j^* - l^*) + p_{j^*} f_{j^*} = p_{l^*} f_{l^*} + p_{j^*} f_{j^*} + p_{l^*}(j^* - l^*).$$
We now analyze the optimization above. Let $p_{l^*} = 0.5 - \epsilon$ and $p_{j^*} = 0.5 + \epsilon$ for some $\epsilon$ with $-0.5 \le \epsilon \le 0.5$. The optimization above becomes:
$$\max_{\epsilon, V} \; V \qquad (21)$$
$$\text{s.t. } V \le (0.5 - \epsilon) f_{l^*} + (0.5 + \epsilon) f_{j^*} + (0.5 + \epsilon)(j^* - l^*)$$
$$\phantom{\text{s.t. } V} = 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon \left[(f_{j^*} - f_{l^*}) + (j^* - l^*)\right]$$
$$\phantom{\text{s.t. }} V \le (0.5 - \epsilon) f_{l^*} + (0.5 + \epsilon) f_{j^*} + (0.5 - \epsilon)(j^* - l^*)$$
$$\phantom{\text{s.t. } V} = 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon \left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right].$$
Since $j^* \ne l^*$, based on Eq. (16), we know that:
$$f_{j^*} + f_{l^*} + j^* - l^* \ge f_{j^*} + f_{j^*} + j^* - j^* \;\Rightarrow\; (f_{j^*} - f_{l^*}) - (j^* - l^*) \le 0, \qquad (22)$$
$$f_{j^*} + f_{l^*} + j^* - l^* \ge f_{l^*} + f_{l^*} + l^* - l^* \;\Rightarrow\; (f_{j^*} - f_{l^*}) + (j^* - l^*) \ge 0. \qquad (23)$$
Therefore, the optimal solution is to set $\epsilon = 0$, since setting $\epsilon$ nonzero would decrease the right-hand side of one of the constraints and hence decrease the value of $V$. Thus, the solution is achieved when we set $p_{l^*} = p_{j^*} = 0.5$, which results in a game value of $\frac{f_{j^*} + f_{l^*} + j^* - l^*}{2} = u^*$.³
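The $\epsilon$-analysis of Case 1 can be checked numerically. The sketch below is illustrative only: $f_{l^*}$, $f_{j^*}$, $l^*$, $j^*$ are hypothetical values chosen so that inequalities (22)-(23) hold. It scans the adversary's strategies $p = (0.5 - \epsilon,\, 0.5 + \epsilon)$ and confirms that the maximin value is attained at $\epsilon = 0$ with value $u^*$:

```python
import numpy as np

# Hypothetical values with j* > l*; chosen so (22) and (23) hold.
f_ls, f_js = -0.2, 0.1   # f_{l*}, f_{j*}
ls, js = 1, 3            # l*, j*  (so j* - l* = 2)

G1 = np.array([[f_ls,           f_js + js - ls],
               [f_ls + js - ls, f_js          ]])

u_star = (f_js + f_ls + js - ls) / 2.0

# Adversary plays p = (0.5 - eps, 0.5 + eps); the game value is the maximum
# over eps of the minimum of the two row payoffs (the constraints on V).
eps_grid = np.linspace(-0.5, 0.5, 10001)
values = [min(G1[0] @ [0.5 - e, 0.5 + e], G1[1] @ [0.5 - e, 0.5 + e])
          for e in eps_grid]
best = max(values)
assert abs(best - u_star) < 1e-3   # v(G1) = u*, attained at eps = 0
```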
Case 2: Let $G_2$ be a game characterized by a $|\mathcal{Y}|$-by-2 matrix whose values are taken from columns $j^*$ and $l^*$ of matrix $L''_{x_i,w}$, i.e.,
$$G_2 = \begin{bmatrix}
f_{l^*} + l^* - 1 & f_{j^*} + j^* - 1 \\
\vdots & \vdots \\
f_{l^*} & f_{j^*} + j^* - l^* \\
f_{l^*} + 1 & f_{j^*} + j^* - l^* - 1 \\
\vdots & \vdots \\
f_{l^*} + j^* - l^* - 1 & f_{j^*} + 1 \\
f_{l^*} + j^* - l^* & f_{j^*} \\
\vdots & \vdots \\
f_{l^*} + |\mathcal{Y}| - l^* & f_{j^*} + |\mathcal{Y}| - j^*
\end{bmatrix}. \qquad (24)$$
Finding $v(G_2)$ is equivalent to solving an optimization similar to that of Eq. (20), with $|\mathcal{Y}|$ constraints corresponding to the rows of matrix $G_2$ instead of just two. We can easily see that the solution is achieved if we set $p_{l^*} = p_{j^*} = 0.5$ as in the previous case. The right-hand side of any $m$-th constraint with $m < l^*$ or $m > j^*$ is dominated, i.e., it has value greater than or equal to $u^*$, and the right-hand side of any $m$-th constraint with $l^* \le m \le j^*$ is equal to $u^*$. Assigning other values to $p_{l^*}$ and $p_{j^*}$ would decrease the right-hand side of some of the $m$-th ($l^* \le m \le j^*$) constraints (as explained in Case 1), and hence decrease the value of $V$. Therefore, we can conclude that $v(G_2) = u^*$.
³In this analysis and the other analyses in this proof, we omit the trivial cases where the terms associated with $\epsilon$ (in the case above, $(f_{j^*} - f_{l^*}) + (j^* - l^*)$ and $(f_{j^*} - f_{l^*}) - (j^* - l^*)$) are zero. In those cases, the value of $\epsilon$ can be anything, but the game value remains the same.

Case 3: Let $G_3$ be a game characterized by a $|\mathcal{Y}|$-by-3 matrix whose values are taken from columns $j^*$, $l^*$, and any other column $m$ of matrix $L''_{x_i,w}$. We consider three variations of the game: $G_3^1$ where $m < l^*$, $G_3^2$ where $l^* < m < j^*$, and $G_3^3$ where $m > j^*$. Below is the game matrix for the first variation:
$$G_3^1 = \begin{bmatrix}
\vdots & \vdots & \vdots \\
f_m & f_{l^*} + l^* - m & f_{j^*} + j^* - m \\
\vdots & \vdots & \vdots \\
f_m + l^* - m & f_{l^*} & f_{j^*} + j^* - l^* \\
\vdots & \vdots & \vdots \\
f_m + j^* - m & f_{l^*} + j^* - l^* & f_{j^*} \\
\vdots & \vdots & \vdots
\end{bmatrix}. \qquad (25)$$
Let us analyze the optimization for finding the game value of $G_3^1$, in particular the $l^*$-th and $j^*$-th constraints:
$$\max_{p, V} \; V \qquad (26)$$
$$\text{s.t. } \;\dots$$
$$V \le p_m (f_m + l^* - m) + p_{l^*} f_{l^*} + p_{j^*}(f_{j^*} + j^* - l^*)$$
$$V \le p_m (f_m + j^* - m) + p_{l^*}(f_{l^*} + j^* - l^*) + p_{j^*} f_{j^*}$$
$$\dots$$
Let us use notation similar to Case 1. Let $p_m = \gamma$, $p_{l^*} = 0.5 - \epsilon - \gamma$, and $p_{j^*} = 0.5 + \epsilon$, where $-0.5 \le \epsilon \le 0.5$, $0 \le \gamma \le 1$, and $-0.5 \le \epsilon + \gamma \le 0.5$. We can write the constraints above as:
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) + (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{l^*} - l^*)\right]$$
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{l^*} - l^*)\right].$$
Since $(f_{j^*} - f_{l^*}) + (j^* - l^*) \ge 0$, $(f_{j^*} - f_{l^*}) - (j^* - l^*) \le 0$, and $(f_m - m) - (f_{l^*} - l^*) \le 0$, the optimal solution is to set $\epsilon = 0$ and $\gamma = 0$. Since $p_m = \gamma = 0$, we are left with the same game matrix as $G_2$. Therefore $v(G_3^1) = u^*$.
For $G_3^3$, we let $p_m = \gamma$, $p_{l^*} = 0.5 - \epsilon$, and $p_{j^*} = 0.5 + \epsilon - \gamma$, where $-0.5 \le \epsilon \le 0.5$, $0 \le \gamma \le 1$, and $-0.5 \le \epsilon - \gamma \le 0.5$. Similar to the previous case, the $l^*$-th and $j^*$-th constraints can be written as:
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) + (j^* - l^*)\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right]$$
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right].$$
For reasons similar to the previous case, and since $(f_m + m) - (f_{j^*} + j^*) \le 0$, the optimal solution is to set $\epsilon = 0$ and $\gamma = 0$, and hence $v(G_3^3) = u^*$.
For $G_3^2$, we analyze the $l^*$-th, $m$-th, and $j^*$-th constraints. Let $p_m = \gamma$, $p_{l^*} = 0.5 - \epsilon$, and $p_{j^*} = 0.5 + \epsilon - \gamma$, where $-0.5 \le \epsilon \le 0.5$, $0 \le \gamma \le 1$, and $-0.5 \le \epsilon - \gamma \le 0.5$. The constraints can be written as:
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) + (j^* - l^*)\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right]$$
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[f_{j^*} - f_{l^*} + j^* + l^* - 2m\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right]$$
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{j^*} - j^*)\right].$$
We know that $(f_{j^*} - f_{l^*}) + (j^* - l^*) \ge 0$, $(f_{j^*} - f_{l^*}) - (j^* - l^*) \le 0$, and $(f_m + m) - (f_{j^*} + j^*) \le 0$. If $f_{j^*} - f_{l^*} + j^* + l^* - 2m \le 0$, or $(f_m - m) - (f_{j^*} - j^*) \le 0$, or both, then both $\epsilon$ and $\gamma$ are forced to be 0. If both are positive, we need the following additional analysis.
We focus on the $m$-th and $j^*$-th constraints. Since we want to check whether there is a combination of $\epsilon$ and $\gamma$ values that makes the game value greater than $u^*$, $\epsilon$ and $\gamma$ have to satisfy:
$$\epsilon\left[f_{j^*} - f_{l^*} + j^* + l^* - 2m\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right] \ge 0 \qquad (27)$$
$$\Rightarrow\; \epsilon \ge \gamma\, \frac{(f_{j^*} + j^*) - (f_m + m)}{f_{j^*} - f_{l^*} + j^* + l^* - 2m} = \gamma\, \frac{(f_{j^*} + j^*) - (f_m - m) - 2m}{(f_{j^*} + j^*) - (f_{l^*} - l^*) - 2m}, \qquad (28)$$
$$\epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{j^*} - j^*)\right] \ge 0 \qquad (29)$$
$$\Rightarrow\; \gamma \ge \epsilon\, \frac{(j^* - l^*) - (f_{j^*} - f_{l^*})}{(f_m - m) - (f_{j^*} - j^*)} = \epsilon\, \frac{(f_{l^*} - l^*) - (f_{j^*} - j^*)}{(f_m - m) - (f_{j^*} - j^*)}. \qquad (30)$$
We know that $(f_{j^*} + j^*) - (f_m - m) - 2m \ge (f_{j^*} + j^*) - (f_{l^*} - l^*) - 2m$, and $(f_{l^*} - l^*) - (f_{j^*} - j^*) \ge (f_m - m) - (f_{j^*} - j^*)$. If at least one of those inequalities is strict, e.g., the first one, it is better to set $\epsilon = \gamma = 0$: to keep the right-hand side of the $m$-th constraint from decreasing, $\epsilon$ has to be strictly greater than $\gamma$, which decreases the right-hand side of the $j^*$-th constraint and thus decreases the game value. If both hold with equality, then many solutions exist, i.e., $\epsilon = \gamma$, but the game value remains the same, namely $u^*$, since in this case $\epsilon\left[f_{j^*} - f_{l^*} + j^* + l^* - 2m\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right] = \epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{j^*} - j^*)\right] = 0$. Therefore $v(G_3^2) = u^*$.
Note that we omit the analysis of the trivial cases where the terms associated with $\epsilon$ and $\gamma$ are zero. In those cases, many values of $\epsilon$ and $\gamma$ satisfy the constraints, but the game value remains the same.
Conclusion: We are now ready to analyze the game value of $L''_{x_i,w}$. Since adding any single column $m \in \{1,\dots,|\mathcal{Y}|\} \setminus \{l^*, j^*\}$ to $G_2$ does not change the game value, adding any combination of them does not change the game value either. Therefore, we can conclude that $v(L''_{x_i,w}) = u^*$.

For the case where $j^* = l^*$, we know that $\max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} = f_{j^*}$. It is clear that $f_{j^*}$ is the value of the game defined by column $j^*$ of matrix $L''_{x_i,w}$ alone. For any other column $m$, if we include it in the game, the corresponding $j^*$-th constraint becomes (letting $p_m = \gamma$ and $p_{j^*} = 1 - \gamma$):
$$V \le f_{j^*} + \gamma\left[(f_m - m) - (f_{j^*} - j^*)\right] \quad \text{if } m < j^*, \text{ or} \qquad (31)$$
$$V \le f_{j^*} + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right] \quad \text{if } m > j^*. \qquad (32)$$
Since we know that $(f_{j^*} - j^*) \ge (f_m - m)$ and $(f_{j^*} + j^*) \ge (f_m + m)$, the optimal solution is to set $\gamma = 0$, and the game value remains the same. We can also generalize this to any combination of columns $m \in \{1,\dots,|\mathcal{Y}|\} \setminus \{j^*\}$ to show that $v(L''_{x_i,w}) = f_{j^*} = u^*$.

Therefore, we can conclude that the value of the game matrix is $v(L''_{x_i,w}) = \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2}$, which proves the theorem. ∎
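As a numerical illustration (not part of the proof), the game value of $L''_{x_i,w}$ can be computed by solving the zero-sum game as a linear program and compared against the closed form. The sketch assumes SciPy is available; the potentials are random stand-ins:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
k = 6                        # |Y|; the potentials f are random stand-ins
f = rng.normal(size=k)

# L'' from Eq. (15): entry (j, l) = f_l + |j - l| (0-based here; only the
# difference j - l matters).
J, L = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
G = f[L] + np.abs(J - L)

# v(G) = max_p min_j (G p)_j as an LP over x = (p_1, ..., p_k, V):
# minimize -V subject to V - (G p)_j <= 0, sum(p) = 1, p >= 0.
c = np.zeros(k + 1)
c[-1] = -1.0
A_ub = np.hstack([-G, np.ones((k, 1))])
b_ub = np.zeros(k)
A_eq = np.append(np.ones(k), 0.0).reshape(1, -1)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * k + [(None, None)])
game_value = res.x[-1]

# Closed form from Theorem 1, with 1-based indices j, l:
closed_form = max((f[j - 1] + f[l - 1] + j - l) / 2.0
                  for j in range(1, k + 1) for l in range(1, k + 1))
assert abs(game_value - closed_form) < 1e-6
```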
B Proof of the Fisher Consistency of the Adversarial Ordinal Regression Loss (Theorem 3)

Proof. We start the proof by analyzing the minimizer $f^*$ under $P_y \triangleq P(y|x)$ as follows:
$$f^* = \operatorname{argmin}_{f} \; \mathbb{E}_{Y|X \sim P}\left[\mathrm{AL}^{\mathrm{ord}}_f(X, Y) \mid X = x\right] \qquad (33)$$
$$= \operatorname{argmin}_{f} \sum_y P_y \left[\max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - f_y\right] \qquad (34)$$
$$= \operatorname{argmin}_{f} \left[\sum_y P_y \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - \sum_y P_y f_y\right] \qquad (35)$$
$$= \operatorname{argmin}_{f} \left[\max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - \sum_y P_y f_y\right]. \qquad (36)$$
In this proof, we impose a constraint on the potential function, $\max_j f_j(x) = 0$, in order to remove redundant solutions, as adding any constant $c$ to $f$ changes neither $\operatorname{argmax}_j f_j(x)$ nor $\mathbb{E}_{Y|X \sim P}\left[\mathrm{AL}^{\mathrm{ord}}_f(X, Y) \mid X = x\right]$:
$$\max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{(f_j + c) + (f_l + c) + j - l}{2} - \sum_y P_y (f_y + c) \qquad (37)$$
$$= c + \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - c - \sum_y P_y f_y \qquad (38)$$
$$= \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - \sum_y P_y f_y. \qquad (39)$$
Let $j^*$ and $l^*$ be the solution of $\operatorname{argmax}_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2}$. We start with the first case, where $j^* = l^*$. In this case, the minimization in Eq. (36) reduces to $\operatorname{argmin}_f \left[\max_{j \in \{1,\dots,|\mathcal{Y}|\}} f_j - \sum_y P_y f_y\right]$. Since $j^* = l^*$, we know that the following inequalities hold:
$$f_{j^*} \ge f_j, \quad \forall j \in \{1,\dots,|\mathcal{Y}|\} \qquad (40)$$
$$f_{j^*} + j^* \ge f_j + j, \quad \forall j \in \{1,\dots,|\mathcal{Y}|\} \qquad (41)$$
$$f_{j^*} - j^* \ge f_j - j, \quad \forall j \in \{1,\dots,|\mathcal{Y}|\}. \qquad (42)$$
Therefore, by Eq. (40) and the constraint $\max_j f_j(x) = 0$, we have $f_{j^*} = 0$. Then by Eq. (41), for any $i > 0$, $f_{j^*+i} \le f_{j^*} - i = -i$; and by Eq. (42), for any $i > 0$, $f_{j^*-i} \le f_{j^*} - i = -i$. Since we want to minimize $f_{j^*} - \sum_y P_y f_y = -\sum_y P_y f_y$, the optimal solution is to set $f_{j^*+i} = -i$ and $f_{j^*-i} = -i$ for any $i > 0$. Therefore we get a vector $f^*$ that satisfies the loss reflective property, i.e., it complements the absolute error by starting with a negative integer value, increasing by one until reaching zero, and then decreasing by one again.
We next analyze the second case, where $j^* \ne l^*$. In this case, the following inequalities hold:
$$f_{j^*} + j^* \ge f_{j^*+i} + j^* + i \;\Rightarrow\; f_{j^*+i} \le f_{j^*} - i, \quad i \in \{1, \dots, |\mathcal{Y}| - j^*\} \qquad (43)$$
$$f_{l^*} - l^* \ge f_{l^*-i} - l^* + i \;\Rightarrow\; f_{l^*-i} \le f_{l^*} - i, \quad i \in \{1, \dots, l^* - 1\}. \qquad (44)$$
We also know that for any $m \in \{1,\dots,|\mathcal{Y}|\}$ the following hold:
$$m < l^* \;\Rightarrow\; f_m \le f_{l^*} - (l^* - m) \;\text{ and }\; f_m \le f_{j^*} + (j^* - m) \qquad (45)$$
$$m > j^* \;\Rightarrow\; f_m \le f_{j^*} - (m - j^*) \;\text{ and }\; f_m \le f_{l^*} + (m - l^*) \qquad (46)$$
$$l^* < m < j^* \;\Rightarrow\; f_m \le f_{l^*} + (m - l^*) \;\text{ and }\; f_m \le f_{j^*} + (j^* - m). \qquad (47)$$
The following relations between $f_{j^*}$ and $f_{l^*}$ also hold:
$$f_{j^*} \le f_{l^*} + j^* - l^* \qquad (48)$$
$$f_{l^*} \le f_{j^*} + j^* - l^*. \qquad (49)$$
Let $f^0$ be any potential function that falls into the second case (the solution of $(j^*, l^*) = \operatorname{argmax}_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f^0_j + f^0_l + j - l}{2}$ satisfies $j^* \ne l^*$) and does not satisfy the loss reflective property. Let us define $h(f) = \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - \sum_y P_y f_y$. We construct $f^1$ as follows. Starting from $f^1 = f^0$, we increase each value $f^1_m$ for $m \in \{1,\dots,|\mathcal{Y}|\} \setminus \{l^*, j^*\}$ until the tightest of the constraints above (the one with minimum value) holds with equality. For example, in a 7-class ordinal regression problem where $l^* = 2$ and $j^* = 6$, one possible value of $f^0$ is $[-3, -1.4, -0.8, -0.2, -0.7, 0, -1.2]^\top$, which satisfies all the constraints above. In this case $f^1$ is $[-2.4, -1.4, -0.4, 0.6, 1, 0, -1]^\top$. Since the value of $\frac{f_{j^*} + f_{l^*} + j^* - l^*}{2}$ remains the same and the value of $\sum_y P_y f_y$ increases, we know that $h(f^1) < h(f^0)$. We also know that in $f^1$, each consecutive difference $f^1_{j+1} - f^1_j$ is equal to $1$ or $-1$, except for one pair $(a, b)$, where $l^* \le a < b \le j^*$. In the example above, $a = 4$, $b = 5$, $f^1_a = 0.6$, and $f^1_b = 1$. We also know that $\frac{f^1_{j^*} + f^1_{l^*} + j^* - l^*}{2} = \frac{f^1_a + f^1_b + 1}{2}$.
We now construct $f^2$ from $f^1$ as follows. If $\sum_{y=1}^{a} P_y \le 0.5$, we set $f^2_j = f^1_j - (f^1_a - f^1_b + 1)$ for $j \in \{1,\dots,a\}$ and $f^2_j = f^1_j$ for $j \in \{b,\dots,|\mathcal{Y}|\}$; otherwise, we set $f^2_j = f^1_j$ for $j \in \{1,\dots,a\}$ and $f^2_j = f^1_j - (f^1_b - f^1_a + 1)$ for $j \in \{b,\dots,|\mathcal{Y}|\}$. For the example above, if $\sum_{y=1}^{a} P_y \le 0.5$ then $f^2 = [-3, -2, -1, 0, 1, 0, -1]^\top$; otherwise $f^2 = [-2.4, -1.4, -0.4, 0.6, -0.4, -1.4, -2.4]^\top$. We claim that $h(f^2) \le h(f^1)$, as shown below for the case $\sum_{y=1}^{a} P_y \le 0.5$ (the other case follows similarly):
$$h(f^2) = \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f^2_j + f^2_l + j - l}{2} - \sum_y P_y f^2_y = f^2_b - \sum_y P_y f^2_y \qquad (50)$$
$$= f^2_b - \sum_{y=1}^{a} P_y f^2_y - \sum_{y=b}^{|\mathcal{Y}|} P_y f^2_y \qquad (51)$$
$$= f^1_b - \sum_{y=1}^{a} P_y \left[f^1_y - (f^1_a - f^1_b + 1)\right] - \sum_{y=b}^{|\mathcal{Y}|} P_y f^1_y \qquad (52)$$
$$= f^1_b + \left(\sum_{y=1}^{a} P_y\right)\left(f^1_a - f^1_b + 1\right) - \sum_y P_y f^1_y \qquad (53)$$
$$\le f^1_b + 0.5\left(f^1_a - f^1_b + 1\right) - \sum_y P_y f^1_y = \frac{f^1_a + f^1_b + 1}{2} - \sum_y P_y f^1_y = h(f^1). \qquad (54)$$
Finally, we construct $f^3 = f^2 - \max_j f^2_j$. Since adding a constant to $f$ does not change the value of $h(f)$, we know that $h(f^3) = h(f^2)$. We also know that $f^3$ satisfies the loss reflective property described above. As an example, in the case $\sum_{y=1}^{a} P_y \le 0.5$, we have $f^3 = [-4, -3, -2, -1, 0, -1, -2]^\top$.

Thus, for any $f^0$ that falls into the second case (the solution of $(j^*, l^*) = \operatorname{argmax}_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f^0_j + f^0_l + j - l}{2}$ satisfies $j^* \ne l^*$) and does not satisfy the loss reflective property, we can construct an $f^3$ that satisfies the loss reflective property with $h(f^3) < h(f^0)$; hence $f^0$ cannot be the minimizer. Therefore, we can conclude that in both cases the minimizer must satisfy the loss reflective property, which completes the proof of the theorem. ∎
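The construction $f^0 \to f^1 \to f^2 \to f^3$ can be traced on the worked 7-class example. In the sketch below, the probability vector $P$ is hypothetical (chosen so that the first branch of the $f^2$ construction is taken), and the exceptional pair $(a, b) = (4, 5)$ is hard-coded from the example rather than detected:

```python
import numpy as np

# Worked 7-class example from the proof: l* = 2, j* = 6 (1-based labels).
# The probability vector P is hypothetical (it satisfies sum(P[:a]) <= 0.5).
k = 7
ls, js = 2, 6
f0 = np.array([-3.0, -1.4, -0.8, -0.2, -0.7, 0.0, -1.2])
P = np.array([0.05, 0.1, 0.1, 0.1, 0.25, 0.2, 0.2])

def h(f):
    # h(f) = max_{j,l} (f_j + f_l + j - l)/2 - sum_y P_y f_y  (1-based j, l)
    idx = range(1, k + 1)
    best = max((f[j - 1] + f[l - 1] + j - l) / 2.0 for j in idx for l in idx)
    return best - P @ f

# f1: raise every f_m (m not in {l*, j*}) to its tightest upper bound,
# f_m <= min(f_{l*} + (m - l*), f_{j*} + (j* - m)), which covers (45)-(47).
f1 = f0.copy()
for m in range(1, k + 1):
    if m not in (ls, js):
        f1[m - 1] = min(f0[ls - 1] + (m - ls), f0[js - 1] + (js - m))

# Exceptional adjacent pair (a, b) = (4, 5), hard-coded from the example.
a, b = 4, 5

# f2: shift the lighter-probability side so all consecutive steps become +/-1.
f2 = f1.copy()
if P[:a].sum() <= 0.5:
    f2[:a] -= f1[a - 1] - f1[b - 1] + 1
else:
    f2[b - 1:] -= f1[b - 1] - f1[a - 1] + 1

# f3: subtract the maximum so that max_j f_j = 0 (loss reflective form).
f3 = f2 - f2.max()

assert h(f1) <= h(f0) + 1e-9 and h(f2) <= h(f1) + 1e-9
assert abs(h(f3) - h(f2)) < 1e-9               # shifting by a constant keeps h
diffs = np.diff(f3)
assert set(np.round(diffs, 6)) <= {1.0, -1.0}  # loss reflective property
```

On this example the intermediate vectors match the proof: $f^1 = [-2.4, -1.4, -0.4, 0.6, 1, 0, -1]$ and $f^3 = [-4, -3, -2, -1, 0, -1, -2]$.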
Theorem 3. The adversarial ordinal regression surrogate loss $\mathrm{AL}^{\mathrm{ord}}$ from Eq. (5) is Fisher consistent.

C Optimization

The idea of SAG [37] is to take gradient steps using the average of the per-example loss gradients $g_i$, reusing for each example the gradient computed in the last iteration in which that example was sampled. However, a naïve implementation of SAG requires storing the gradient of each sample, which may be expensive in terms of memory.
Algorithm 1 SAG for adversarial ordinal regression with multiclass representation
1: Input: training dataset with pairs $\{x_i, y_i\}$, learning rate $\eta$, regularization constant $\lambda$
2: $m \leftarrow 0$ {the number of sampled pairs so far}
3: $d \leftarrow 0$ {for storing $\sum_{i=1}^{m} g_i$}
4: $j_i \leftarrow 0$, $l_i \leftarrow 0$ for $i = 1, 2, \dots, n$
5: repeat
6: Sample $i$ from $\{1, \dots, n\}$
7: $j^*, l^* \leftarrow \operatorname{argmax}_{j,l} \frac{w_j \cdot x_i + w_l \cdot x_i + j - l}{2} - w_{y_i} \cdot x_i$
8: if this is the first time we sample $i$ then
9: $m \leftarrow m + 1$
10: $d_{j^*} \leftarrow d_{j^*} + \frac{1}{2} x_i$, $d_{l^*} \leftarrow d_{l^*} + \frac{1}{2} x_i$
11: $d_{y_i} \leftarrow d_{y_i} - x_i$
12: else
13: $d_{j_i} \leftarrow d_{j_i} - \frac{1}{2} x_i$, $d_{l_i} \leftarrow d_{l_i} - \frac{1}{2} x_i$
14: $d_{j^*} \leftarrow d_{j^*} + \frac{1}{2} x_i$, $d_{l^*} \leftarrow d_{l^*} + \frac{1}{2} x_i$
15: end if
16: $j_i \leftarrow j^*$, $l_i \leftarrow l^*$
17: $w \leftarrow (1 - \eta\lambda) w - \frac{\eta}{m} d$
18: until converged
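A minimal Python sketch of Algorithm 1, assuming a multiclass representation with one weight vector per class and 0-based labels; the function name and hyperparameter defaults are ours, not the paper's. The pairwise argmax in line 7 decomposes into two independent argmaxes, which the code exploits:

```python
import numpy as np

def sag_adversarial_ordinal(X, y, n_classes, lr=0.01, lam=0.1, epochs=50, seed=0):
    """Illustrative SAG sketch for the adversarial ordinal loss (multiclass
    representation). Labels y are 0-based here."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((n_classes, d))   # one weight vector w_j per class
    D = np.zeros((n_classes, d))   # running sum of the stored gradients
    stored = {}                    # i -> (j*, l*) from the last visit
    m = 0                          # number of distinct examples seen so far
    for _ in range(epochs * n):
        i = int(rng.integers(n))
        fpot = W @ X[i]            # potentials f_j = w_j . x_i
        # argmax_{j,l} (f_j + f_l + j - l)/2 decomposes into two argmaxes:
        j_star = int(np.argmax(fpot + np.arange(n_classes)))
        l_star = int(np.argmax(fpot - np.arange(n_classes)))
        if i not in stored:
            m += 1
            D[y[i]] -= X[i]        # the -x_i term at y_i never changes: add once
        else:                      # drop the stale half-gradients of example i
            j_old, l_old = stored[i]
            D[j_old] -= 0.5 * X[i]
            D[l_old] -= 0.5 * X[i]
        D[j_star] += 0.5 * X[i]
        D[l_star] += 0.5 * X[i]
        stored[i] = (j_star, l_star)
        W = (1 - lr * lam) * W - (lr / m) * D
    return W
```

A toy call, purely to exercise the update loop:

```python
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0, 2, 1])
W = sag_adversarial_ordinal(X, y, n_classes=3, epochs=5)
```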
Based on Eq. (5), the primal optimization of the regularized adversarial ordinal regression loss can be written as:
$$\min_w \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left[\max_{j \in \{1,\dots,|\mathcal{Y}|\}} \frac{w \cdot \phi(x_i, j) + j}{2} + \max_{j \in \{1,\dots,|\mathcal{Y}|\}} \frac{w \cdot \phi(x_i, j) - j}{2} - w \cdot \phi(x_i, y_i)\right] \qquad (57)$$
$$= \min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{2} \sum_{i=1}^{n} \max_{j \in \{1,\dots,|\mathcal{Y}|\}} \left(w \cdot \phi(x_i, j) - w \cdot \phi(x_i, y_i) + j\right) \qquad (58)$$
$$\qquad + \frac{C}{2} \sum_{i=1}^{n} \max_{j \in \{1,\dots,|\mathcal{Y}|\}} \left(w \cdot \phi(x_i, j) - w \cdot \phi(x_i, y_i) - j\right).$$
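The step from Eq. (57) to Eq. (58) relies on the pairwise maximization decomposing into two independent maximizations. A quick numerical sanity check of that identity (with randomly generated stand-in potentials, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
psi = rng.normal(size=k)       # stand-ins for the potentials w . phi(x_i, j)

# Brute force over pairs, as in the loss definition:
pair_max = max((psi[j - 1] + psi[l - 1] + j - l) / 2.0
               for j in range(1, k + 1) for l in range(1, k + 1))

# Decomposition used to pass from Eq. (57) to Eq. (58):
decomposed = (max(psi[j - 1] + j for j in range(1, k + 1)) / 2.0
              + max(psi[j - 1] - j for j in range(1, k + 1)) / 2.0)

assert abs(pair_max - decomposed) < 1e-12
```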
The Lagrangian for the optimization above is:
$$\mathcal{L} = \frac{1}{2}\|w\|^2 + \frac{C}{2} \sum_{i=1}^{n} \xi_i + \frac{C}{2} \sum_{i=1}^{n} \delta_i - \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}|} \alpha_{i,j} \left[\xi_i - w \cdot \phi(x_i, j) + w \cdot \phi(x_i, y_i) - j\right] \qquad (60)$$
$$\qquad - \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}|} \beta_{i,j} \left[\delta_i - w \cdot \phi(x_i, j) + w \cdot \phi(x_i, y_i) + j\right],$$
where $\xi_i$ and $\delta_i$ are slack variables for the two inner maximizations in Eq. (58), and $\alpha_{i,j}, \beta_{i,j} \ge 0$ are the corresponding dual variables. Rearranging the Lagrangian, plugging in the definition of $w$ in terms of the dual variables, and applying the constraints yields:
$$\mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}|} j\,(\alpha_{i,j} - \beta_{i,j}) \qquad (61)$$
$$\qquad - \frac{1}{2} \sum_{i,k=1}^{n} \sum_{j,l=1}^{|\mathcal{Y}|} (\alpha_{i,j} + \beta_{i,j})(\alpha_{k,l} + \beta_{k,l}) \left(\phi(x_i, j) - \phi(x_i, y_i)\right) \cdot \left(\phi(x_k, l) - \phi(x_k, y_k)\right).$$