Fathony 2017 Adversarial
Abstract
Ordinal regression seeks class label predictions when the penalty incurred for
mistakes increases according to an ordering over the labels. The absolute error
is a canonical example. Many existing methods for this task reduce to binary
classification problems and employ surrogate losses, such as the hinge loss. We
instead derive uniquely defined surrogate ordinal regression loss functions by
seeking the predictor that is robust to the worst-case approximations of training
data labels, subject to matching certain provided training data statistics. We
demonstrate the advantages of our approach over other surrogate losses based on
hinge loss approximations using UCI ordinal prediction tasks.
1 Introduction
For many classification tasks, the discrete class labels being predicted have an inherent order (e.g.,
poor, fair, good, very good, and excellent labels). Confusing two classes that are distant from one
another (e.g., poor instead of excellent) is more detrimental than confusing two classes that are nearby.
The absolute error, |ŷ − y|, between the label prediction (ŷ ∈ Y) and actual label (y ∈ Y), is a canonical
ordinal regression loss function. The ordinal regression task seeks class label predictions for new
datapoints that minimize losses of this kind.
Many prevalent methods reduce the ordinal regression task to subtasks solved using existing super-
vised learning techniques. Some view the task from the regression perspective and learn both a linear
regression function and a set of thresholds that define class boundaries [1–5]. Other methods take a
classification perspective and use tools from cost-sensitive classification [6–8]. However, since the
absolute error of a predictor on training data is typically a non-convex (and non-continuous) function
of the predictor's parameters for each of these formulations, surrogate losses that approximate the
absolute error must be optimized instead. Under both perspectives, surrogate losses for ordinal
regression are constructed by transforming the surrogate losses for binary zero-one loss problems,
such as the hinge loss, the logistic loss, and the exponential loss, to take into account the different penalties
of the ordinal regression problem. Empirical evaluations have compared the appropriateness of
different surrogate losses, but these still leave the possibility of undiscovered surrogates that align
better with the ordinal regression loss.
To address these limitations, we seek the most robust [9] ordinal regression predictions by focusing
on the following adversarial formulation of the ordinal regression task: what predictor best minimizes
absolute error in the worst case given partial knowledge of the conditional label distribution? We
answer this question by considering the Nash equilibrium for a game defined by combining the loss
function with Lagrangian potential functions [10]. We derive a surrogate loss function for empirical
risk minimization that realizes this same adversarial predictor. We show that different types of
available knowledge about the conditional label distribution lead to thresholded regression-based
predictions or classification-based predictions. In both cases, the surrogate loss is novel compared to
existing surrogate losses. We also show that our surrogate losses enjoy Fisher consistency, a desirable
theoretical property guaranteeing that minimizing the surrogate loss produces Bayes optimal decisions
for the original loss in the limit. We develop two different approaches for optimizing the loss: a
stochastic optimization of the primal objective and a quadratic program formulation of the dual
objective. The second approach enables us to efficiently employ the kernel trick to provide a richer
feature representation without an overly burdensome time complexity. We demonstrate the benefits
of our adversarial formulation over previous ordinal regression methods based on hinge loss for a
range of prediction tasks using UCI datasets.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Ordinal regression is a discrete label prediction problem characterized by an ordered penalty for
making mistakes: loss(ŷ1, y) < loss(ŷ2, y) if y < ŷ1 < ŷ2 or y > ŷ1 > ŷ2. Though many loss
functions possess this property, the absolute error |ŷ − y| is the most widely studied, and we similarly
restrict our consideration to this loss function in this paper. The full loss matrix L for absolute error
with four labels is shown in Table 1.

Table 1: Ordinal regression loss matrix.
        y=1  y=2  y=3  y=4
ŷ=1      0    1    2    3
ŷ=2      1    0    1    2
ŷ=3      2    1    0    1
ŷ=4      3    2    1    0

The expected loss incurred using a probabilistic predictor P̂(ŷ|x) evaluated on the true data
distribution P(x, y) is: E_{X,Y∼P; Ŷ|X∼P̂}[L_{Ŷ,Y}] = Σ_{x,ŷ,y} P(x, y) P̂(ŷ|x) L_{ŷ,y}. The supervised
learning objective for this problem setting is to construct a probabilistic predictor P̂(ŷ|x) in a way
that minimizes this expected loss using training samples distributed according to the empirical
distribution P̃(x, y), which are drawn from the unknown true data generating distribution, P(x, y).
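As a concrete illustration of this expected loss (the predictor and label distributions below are hypothetical values chosen only for the example), the computation for a single x under the Table 1 loss matrix can be sketched as:

```python
import numpy as np

# Absolute-error loss matrix L[yhat, y] = |yhat - y| for four labels (Table 1)
L = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :])

def expected_loss(p_hat, p_true, L):
    """E[L] = sum over (yhat, y) of P_hat(yhat|x) * P(y|x) * L[yhat, y] for one x."""
    return p_hat @ L @ p_true

# Hypothetical predictor and true conditional label distributions for one x
p_hat = np.array([0.1, 0.7, 0.1, 0.1])
p_true = np.array([0.0, 1.0, 0.0, 0.0])
print(expected_loss(p_hat, p_true, L))  # mass off the true label pays its absolute error
```

Any probability mass the predictor places away from the true label is penalized in proportion to its distance from that label, which is what distinguishes ordinal regression from zero-one loss classification.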
A naïve ordinal regression approach relaxes the task to a continuous prediction problem, minimizes
the least absolute deviation [11], and then rounds predictions to the nearest integral label [12]. More
sophisticated methods range from using a cumulative link model [13] that assumes the cumulative
conditional probability P(Y ≤ j | x) follows a link function, to Bayesian non-parametric approaches
[14] and many others [15–22]. We narrow our focus over this broad range of methods found in the
related work to those that can be viewed as empirical risk minimization methods with piece-wise
convex surrogates, which are more closely related to our approach.
Threshold methods are one popular family of techniques that treat the ordinal response variable,
f ≜ w · x, as a continuous real-valued variable and introduce |Y| − 1 thresholds θ_1, θ_2, ..., θ_{|Y|−1}
that partition the real line into |Y| segments: θ_0 = −∞ < θ_1 < θ_2 < ... < θ_{|Y|−1} < θ_{|Y|} = ∞
[4]. Each segment corresponds to a label, with x_i assigned label j if θ_{j−1} < f ≤ θ_j. There are two
different approaches for constructing surrogate losses based on the threshold methods to optimize the
choice of w and θ_1, ..., θ_{|Y|−1}: one is based on penalizing all thresholds involved when a mistake is
made and one is based on only penalizing the most immediate thresholds.
All thresholds methods penalize every erroneous threshold using a surrogate loss, δ, for sets of binary
classification problems: loss_AT(f, y) = Σ_{k=1}^{y−1} δ(−(θ_k − f)) + Σ_{k=y}^{|Y|−1} δ(θ_k − f). Shashua and Levin
[1] studied the hinge loss under the name of support vector machines with a sum-of-margins strategy,
while Chu and Keerthi [2] proposed a similar approach under the name of support vector ordinal
regression with implicit constraints (SVORIM). Lin and Li [3] proposed ordinal regression boosting,
an all thresholds method using the exponential loss as a surrogate. Finally, Rennie and Srebro [4]
proposed a unifying approach for all threshold methods under a variety of surrogate losses.
Rather than penalizing all erroneous thresholds when an error is made, immediate thresholds methods
only penalize the threshold of the true label and the threshold immediately beneath the true label:
loss_IT(f, y) = δ(−(θ_{y−1} − f)) + δ(θ_y − f).¹ Similar to the all thresholds methods, immediate
threshold methods have also been studied in the literature under different names. For hinge loss surrogates,
Shashua and Levin [1] called the model support vector with fixed-margin strategy while Chu and
Keerthi [2] use the term support vector ordinal regression with explicit constraints (SVOREX). For
¹ For the boundary labels, the method defines δ(−(θ_0 − f)) = δ(θ_{|Y|} − f) = 0.
the exponential loss, Lin and Li [3] introduced ordinal regression boosting with left-right margins.
Rennie and Srebro [4] also proposed a unifying framework for immediate threshold methods.
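Both families of threshold surrogates can be sketched in a few lines of Python; here the base binary surrogate δ is instantiated as the hinge loss, and the thresholds are hypothetical values used only for illustration:

```python
def hinge(z):
    # Base binary surrogate: delta(z) = max(0, 1 - z)
    return max(0.0, 1.0 - z)

def loss_at(f, y, theta):
    """All thresholds: penalize every threshold below and above the true label y.
    theta holds theta_1..theta_{|Y|-1}; labels run from 1 to |Y|."""
    below = sum(hinge(-(theta[k] - f)) for k in range(y - 1))           # k = 1..y-1
    above = sum(hinge(theta[k] - f) for k in range(y - 1, len(theta)))  # k = y..|Y|-1
    return below + above

def loss_it(f, y, theta):
    """Immediate thresholds: only the two thresholds adjacent to the true label,
    with the boundary terms defined as zero."""
    lo = hinge(-(theta[y - 2] - f)) if y >= 2 else 0.0
    hi = hinge(theta[y - 1] - f) if y <= len(theta) else 0.0
    return lo + hi

theta = [-1.0, 0.0, 1.0]  # hypothetical thresholds for |Y| = 4
print(loss_at(2.5, 2, theta), loss_it(2.5, 2, theta))
```

Since the immediate thresholds sum is a subset of the all thresholds sum, loss_IT never exceeds loss_AT for the same f, y, and thresholds.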
Li and Lin [5] proposed a reduction framework to convert ordinal regression problems to binary
classification problems by extending training examples. For each training sample (x, y), the reduction
framework creates |Y| − 1 extended samples (x^(j), y^(j)) and assigns weight w_{y,j} to each extended
sample. The binary label associated with the extended sample is equivalent to the answer of the
question: is the rank of x greater than j? The reduction framework allows a choice for how extended
samples x(j) are constructed from original samples x and how to perform binary classification. If
the threshold method is used to construct the extended sample and SVM is used as the binary
classification algorithm, the classifier can be obtained by solving a family of quadratic optimization
problems that includes SVORIM and SVOREX as special instances.
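A minimal sketch of the extended-example construction follows; the pairing of x with j and the unit weights are illustrative assumptions (for absolute error, the standard cost-difference weighting reduces to equal weights):

```python
def extend(x, y, n_labels):
    """Reduction framework sketch: one sample (x, y) becomes |Y|-1 weighted binary
    samples answering 'is the rank of x greater than j?' for j = 1..|Y|-1."""
    extended = []
    for j in range(1, n_labels):
        x_j = (x, j)             # e.g., x augmented with an encoding of threshold j
        label = +1 if y > j else -1
        weight = 1.0             # assumed equal weights, as for absolute error
        extended.append((x_j, label, weight))
    return extended

samples = extend(x=[0.3, 0.7], y=3, n_labels=4)
print([label for (_, label, _) in samples])  # binary answers for j = 1, 2, 3
```

A rank-3 example among four labels answers "greater than j?" positively for j = 1 and j = 2 and negatively for j = 3.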
Rather than using thresholding or the reduction framework, ordinal regression can also be cast as a
special case of cost-sensitive multiclass classification. Two of the most popular classification-based
ordinal regression techniques are extensions of one-versus-one (OVO) and one-versus-all (OVA) cost-
sensitive classification [6, 7]. Both algorithms leverage a transformation that converts a cost-sensitive
classification problem to a set of weighted binary classification problems. Rather than reducing
to binary classification, Tu and Lin [8] reduce cost-sensitive classification to one-sided regression
(OSR), which can be viewed as an extension of the one-versus-all (OVA) technique.
Foundational results establish a duality between adversarial logarithmic loss minimization and
constrained maximization of the entropy [23]. This takes the form of a zero-sum game between
a predictor seeking to minimize expected logarithmic loss and an adversary seeking to maximize
this same loss. Additionally, the adversary is constrained to choose a distribution that matches
certain sample statistics. Ultimately, through the duality to maximum entropy, this is equivalent
to maximum likelihood estimation of probability distributions that are members of the exponential
family [23]. Grünwald and Dawid [9] emphasize this formulation as a justification for the principle of
maximum entropy [24] and generalize the adversarial formulation to other loss functions. Extensions
to multivariate performance measures [25] and non-IID settings [26] have demonstrated the versatility
of this perspective.
Recent analysis [27, 28] has shown that for the special case of zero-one loss classification, this
adversarial formulation is equivalent to empirical risk minimization with a surrogate loss function:
AL^{0-1}_f(x_i, y_i) = max_{S ⊆ {1,...,|Y|}, S ≠ ∅} ( Σ_{j∈S} ψ_{j,y_i}(x_i) + |S| − 1 ) / |S|,   (1)

where ψ_{j,y_i}(x_i) is the potential difference ψ_{j,y_i}(x_i) = f_j(x_i) − f_{y_i}(x_i). This surrogate loss function
provides a key theoretical advantage compared to the Crammer-Singer hinge loss surrogate for
multiclass classification [29]: it guarantees Fisher consistency [27], while Crammer-Singer, despite
its popularity in many applications such as Structured SVM [30, 31], does not [32, 33]. We extend
this type of analysis to the ordinal regression setting with the absolute error as the loss function in
this paper, producing novel surrogate loss functions that provide better predictions than other convex,
piece-wise linear surrogates.
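For concreteness, Eq. (1) can be evaluated by brute force over the non-empty label subsets S; the potential differences below are hypothetical values for a three-class problem:

```python
from itertools import combinations

def al_01(psi):
    """AL^{0-1}: max over non-empty S of (sum_{j in S} psi_j + |S| - 1) / |S|,
    where psi_j = f_j(x) - f_y(x) are the potential differences."""
    n = len(psi)
    best = float("-inf")
    for size in range(1, n + 1):
        for S in combinations(range(n), size):
            value = (sum(psi[j] for j in S) + size - 1) / size
            best = max(best, value)
    return best

# Hypothetical potential differences; the true label's entry is psi_y = 0
psi = [0.0, -0.4, -0.9]
print(al_01(psi))
```

This exhaustive form is exponential in |Y| and serves only to make the definition concrete; practical evaluations exploit the structure of the maximization.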
We seek the ordinal regression predictor that is the most robust to uncertainty given partial knowledge
of the evaluating distribution's characteristics. This takes the form of a zero-sum game between a
predictor player choosing a predicted label distribution P̂(ŷ|x) that minimizes loss and an adversarial
player choosing an evaluation distribution P̌(y̌|x) that maximizes loss while closely matching the
feature-based statistics of the training data:

min_{P̂(ŷ|x)} max_{P̌(y̌|x)} E_{X∼P̃; Ŷ|X∼P̂; Y̌|X∼P̌}[ |Ŷ − Y̌| ]  such that:  E_{X∼P̃; Y̌|X∼P̌}[φ(X, Y̌)] = φ̃.   (2)

The vector of feature moments, φ̃ = E_{X,Y∼P̃}[φ(X, Y)], is measured from sample training data
distributed according to the empirical distribution P̃(x, y).
An ordinal regression problem can be viewed as a cost-sensitive loss with the entries of the cost
matrix defined by the absolute loss between the row and column labels (an example of the cost
matrix for the case of a problem with four labels is shown in Table 1). Following the construction of
adversarial prediction games for cost-sensitive classification [10], the optimization of Eq. (2) reduces
to minimizing the equilibrium game values of a new set of zero-sum games characterized by matrix
L′_{x_i,w}. We consider two feature representations for φ(x, y): a threshold-based representation, φ_th,
and a multiclass representation, φ_mc:

φ_th(x, y) = [ y x,  I(y ≤ 1),  I(y ≤ 2),  ...,  I(y ≤ |Y| − 1) ]ᵀ ;  and
φ_mc(x, y) = [ I(y = 1) x,  I(y = 2) x,  I(y = 3) x,  ...,  I(y = |Y|) x ]ᵀ.   (4)
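A sketch of the two feature representations in Eq. (4), assuming labels run from 1 to |Y| and x is a dense feature vector:

```python
import numpy as np

def phi_th(x, y, n_labels):
    """Threshold-based features: y*x stacked with threshold indicators I(y <= k)."""
    indicators = np.array([1.0 if y <= k else 0.0 for k in range(1, n_labels)])
    return np.concatenate([y * x, indicators])

def phi_mc(x, y, n_labels):
    """Multiclass features: x placed in the block for class y, zeros elsewhere."""
    out = np.zeros(n_labels * len(x))
    out[(y - 1) * len(x): y * len(x)] = x
    return out

x = np.array([0.5, -1.0])
print(phi_th(x, 2, 4))  # [1.0, -2.0, 0.0, 1.0, 1.0]
print(phi_mc(x, 2, 4))
```

The threshold representation shares one weight vector across labels plus |Y| − 1 threshold terms, while the multiclass representation gives each label its own block of weights, matching the parameter counts discussed in the experiments (m + |Y| − 1 versus m|Y|).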
3.3 Adversarial Loss from the Nash Equilibrium
We now present the main technical contribution of our paper: a surrogate loss function that, when
minimized, produces a solution to the adversarial ordinal regression problem of Eq. (3).2
Theorem 1. An adversarial ordinal regression predictor is obtained by choosing parameters w that
minimize the empirical risk of the surrogate loss function:
AL^ord_w(x_i, y_i) = max_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2 − f_{y_i}
                   = max_j (f_j + j)/2 + max_l (f_l − l)/2 − f_{y_i},   (5)

where f_j = w · φ(x_i, j) for all j ∈ {1, . . . , |Y|}.
Proof sketch. Let j*, l* be the solution of argmax_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2. We show that the Nash
equilibrium value of a game matrix that contains only rows j* and l* and columns j* and l* from
matrix L′_{x_i,w} is exactly (f_{j*} + f_{l*} + j* − l*)/2 − f_{y_i}. We then show that adding the other rows and
columns of L′_{x_i,w} to the game matrix does not change the game value. Given the resulting closed-form
solution of the game (instead of a minimax), we can recast the adversarial framework for ordinal
regression as an empirical risk minimization with the proposed loss.
We note that the AL^ord_w surrogate is the maximization over pairs of different potential functions
associated with each class (including pairs of identical class labels) added to the distance between the
pair. For both of our feature representations, we make use of the fact that maximization over each
element of the pair can be independently realized, as shown on the right-hand side of Eq. (5).
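The pairwise form of Eq. (5) and its decomposed form can be computed and compared directly; the potentials below are hypothetical values for a three-class problem:

```python
def al_ord(f, y):
    """AL^ord from Eq. (5): max over label pairs (j, l) of (f_j + f_l + j - l)/2,
    minus the true-label potential. Labels are 1..|Y|; f is 0-indexed."""
    n = len(f)
    pairwise = max((f[j - 1] + f[l - 1] + j - l) / 2
                   for j in range(1, n + 1) for l in range(1, n + 1))
    return pairwise - f[y - 1]

def al_ord_decomposed(f, y):
    """Equivalent O(|Y|) form: max_j (f_j + j)/2 + max_l (f_l - l)/2 - f_y."""
    n = len(f)
    left = max((f[j - 1] + j) / 2 for j in range(1, n + 1))
    right = max((f[l - 1] - l) / 2 for l in range(1, n + 1))
    return left + right - f[y - 1]

f = [0.2, -0.1, 0.4]  # hypothetical potentials for a 3-class problem
print(al_ord(f, 2), al_ord_decomposed(f, 2))
```

The decomposition replaces the O(|Y|²) search over pairs with two independent O(|Y|) maximizations, which is what makes the surrogate cheap to evaluate.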
Figure 3: Loss function contour plots of AL^ord over the space of potential differences ψ_j ≜ f_j − f_{y_i}
for the prediction task with three classes when the true label is y_i = 1 (a), y_i = 2 (b), and y_i = 3 (c).
We can also view this as the maximization over |Y|(|Y| + 1)/2 linear hyperplanes. For an ordinal
regression problem with three classes, the loss has six facets with different shapes for each true label
value, as shown in Figure 3. In contrast with AL^ord-th, the class label potentials for AL^ord-mc may
differ from one another in more-or-less arbitrary ways. Thus, searching for the maximal j and l class
labels requires O(|Y|) time.
The behavior of a prediction method in ideal learning settings (i.e., trained on the true evaluation
distribution and given an arbitrarily rich feature representation, or, equivalently, considering the space
of all measurable functions) provides a useful theoretical validation. Fisher consistency requires that
the prediction model yields the Bayes optimal decision boundary [32, 33, 35] in this setting. Given
the true label conditional probability P_j(x) ≜ P(Y = j | x), a surrogate loss function δ_f is said to
be Fisher consistent with respect to the loss ℓ if the minimizer f* of the expected surrogate loss achieves the
Bayes optimal risk, i.e.:

f* = argmin_f E_{Y|X∼P}[δ_f(X, Y) | X = x]   (8)
  ⇒  E_{Y|X∼P}[ℓ_{f*}(X, Y) | X = x] = min_f E_{Y|X∼P}[ℓ_f(X, Y) | X = x].
Ramaswamy and Agarwal [36] provide a necessary and sufficient condition for a surrogate loss to be
Fisher consistent with respect to general multiclass losses, which includes ordinal regression losses.
A recent analysis by Pedregosa et al. [35] shows that the all thresholds and the immediate thresholds
methods are Fisher consistent provided that the base binary surrogate losses they use are convex
with a negative derivative at zero.
For our proposed approach, the condition for Fisher consistency above is equivalent to:

f* = argmin_f Σ_y P_y [ max_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2 − f_y ]
  ⇒  argmax_j f*_j(x) ⊆ argmin_j Σ_y P_y |j − y|.   (9)

Since adding a constant to all f_j changes the value of neither AL^ord_f nor argmax_j f_j(x), we
employ the constraint max_j f_j(x) = 0 to remove redundant solutions for the consistency analysis.
We establish an important property of the minimizer for AL^ord_f in the following theorem.

Theorem 2. The minimizer vector f* of E_{Y|X∼P}[AL^ord_f(X, Y) | X = x] satisfies the loss reflective
property, i.e., it complements the absolute error by starting with a negative integer value,
increasing by one until reaching zero, and then incrementally decreasing again.
Proof sketch. We show that for any f⁰ that does not satisfy the loss reflective property, we can
construct, in several steps, an f¹ that satisfies the loss reflective property and has an expected loss
value less than the expected loss of f⁰.
Example vectors f* that satisfy Theorem 2 are [0, −1, −2]ᵀ, [−1, 0, −1]ᵀ, and [−2, −1, 0]ᵀ for
three-class problems, and [−3, −2, −1, 0, −1]ᵀ for five-class problems. Using this key property of the
minimizer, we establish the consistency of our loss functions in the following theorem.
Theorem 3. The adversarial ordinal regression surrogate loss AL^ord from Eq. (5) is Fisher consistent.
Proof sketch. We only consider |Y| possible values of f that satisfy the loss reflective property. For
the f that corresponds to class j, the value of the expected loss is equal to the Bayes loss if we predict
j as the label. Therefore minimizing over f that satisfy the loss reflective property is equivalent to
finding the Bayes optimal response.
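Theorems 2 and 3 can be checked numerically: for each class j, the expected surrogate of the class-j loss reflective vector equals the Bayes risk of predicting j (the label distribution below is illustrative):

```python
def al_ord(f, y):
    """AL^ord from Eq. (5) for labels 1..|Y| (f is 0-indexed)."""
    n = len(f)
    left = max((f[j - 1] + j) / 2 for j in range(1, n + 1))
    right = max((f[l - 1] - l) / 2 for l in range(1, n + 1))
    return left + right - f[y - 1]

def expected_surrogate(f, p):
    # Expectation of the surrogate over the true label Y for one x
    return sum(p_y * al_ord(f, y) for y, p_y in enumerate(p, start=1))

p = [0.2, 0.5, 0.3]  # an illustrative conditional label distribution
# Loss reflective vectors for three classes (Theorem 2)
reflective = {1: [0, -1, -2], 2: [-1, 0, -1], 3: [-2, -1, 0]}
for j, f in reflective.items():
    bayes = sum(p_y * abs(j - y) for y, p_y in enumerate(p, start=1))
    # The expected surrogate of the class-j reflective vector equals its Bayes risk
    assert abs(expected_surrogate(f, p) - bayes) < 1e-9
print("loss reflective vectors match Bayes risks")
```

Since each reflective vector's expected surrogate matches the expected absolute loss of predicting the corresponding class, minimizing over these vectors recovers the Bayes optimal response, as the proof sketch argues.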
3.5 Optimization
Note that our dual formulation only depends on the dot product of the features. Therefore, we can
also easily apply the kernel trick to our algorithm. Appendix D describes the derivation from the
primal optimization to the dual optimization above.
Table 3: The average of the mean absolute error (MAE) for each model. Bold numbers in each case
indicate that the result is the best or not significantly worse than the best (paired t-test with α = 0.05).
Threshold-based models: AL^ord-th, RED^th, AT, IT. Multiclass-based models: AL^ord-mc, RED^mc, CSOSR, CSOVO, CSOVA.
Dataset       AL^ord-th  RED^th  AT     IT     AL^ord-mc  RED^mc  CSOSR  CSOVO  CSOVA
diabetes 0.696 0.715 0.731 0.827 0.629 0.700 0.715 0.738 0.762
pyrimidines 0.654 0.678 0.615 0.626 0.509 0.565 0.520 0.576 0.526
triazines 0.607 0.683 0.649 0.654 0.670 0.673 0.677 0.738 0.732
wisconsin 1.077 1.067 1.097 1.175 1.136 1.141 1.208 1.275 1.338
machinecpu 0.449 0.456 0.458 0.467 0.518 0.515 0.646 0.602 0.702
autompg 0.551 0.550 0.550 0.617 0.599 0.602 0.741 0.598 0.731
boston 0.316 0.304 0.306 0.298 0.311 0.311 0.353 0.294 0.363
stocks 0.324 0.317 0.315 0.324 0.168 0.175 0.204 0.147 0.213
abalone 0.551 0.547 0.546 0.571 0.521 0.520 0.545 0.558 0.556
bank 0.461 0.460 0.461 0.461 0.445 0.446 0.732 0.448 0.989
computer 0.640 0.635 0.633 0.683 0.625 0.624 0.889 0.649 1.055
calhousing 1.190 1.183 1.182 1.225 1.164 1.144 1.237 1.202 1.601
average 0.626 0.633 0.629 0.661 0.613 0.618 0.706 0.652 0.797
# bold 5 5 4 2 5 5 2 2 1
The baselines we use for the threshold-based models include an SVM-based reduction framework
algorithm (RED^th) [5], an all thresholds method with hinge loss (AT) [1, 2], and an immediate thresholds
method with hinge loss (IT) [1, 2]. For the multiclass-based models, we compare our method with an
SVM-based reduction algorithm using multiclass features (RED^mc) [5], with cost-sensitive one-sided
support vector regression (CSOSR) [8], with cost-sensitive one-versus-one SVM (CSOVO) [7], and
with cost-sensitive one-versus-all SVM (CSOVA) [6]. For our Gaussian kernel experiment, we
compare our threshold-based model (AL^ord-th) with SVORIM and SVOREX [2].
In our experiments, we first make 20 random splits of each dataset into training and testing sets. We
perform two stages of five-fold cross validation on the first split's training set for tuning each model's
regularization constant λ. In the first stage, the possible values for λ are 2^i, i ∈ {1, 3, 5, 7, 9, 11, 13}.
Using the best λ from the first stage, we set the possible values for λ in the second stage as 2^{i/2} λ_0,
i ∈ {−3, −2, −1, 0, 1, 2, 3}, where λ_0 is the best parameter obtained in the first stage. Using the selected
parameter from the second stage, we train each model on the 20 training sets and evaluate the MAE
performance on the corresponding testing sets. We then perform a statistical test to determine whether the
performance of a model differs with statistical significance from that of the other models. We perform the
Gaussian kernel experiment similarly, with the model parameter C set to 2^i, i ∈ {0, 3, 6, 9, 12}, and the
kernel parameter γ set to 2^i, i ∈ {−12, −9, −6, −3, 0}, in the first stage. In the second stage, we
set C to 2^i C_0, i ∈ {−2, −1, 0, 1, 2}, and γ to 2^i γ_0, i ∈ {−2, −1, 0, 1, 2}, where C_0
and γ_0 are the best parameters obtained in the first stage.
4.2 Results
We report the mean absolute error (MAE) averaged over the dataset splits, as shown in Table 3 and
Table 4. We highlight in boldface the result that is either the best or not worse than the best with
statistical significance (under a paired t-test with α = 0.05). At the bottom of each table, we also
provide a summary for each model: the MAE averaged over all datasets and the number of datasets
for which the model is marked in boldface.
As we can see from Table 3, in the experiment with the original feature space, threshold-based
models perform well on relatively small datasets, whereas multiclass-based models perform well on
relatively large datasets. A possible explanation for this result is that multiclass-based models have
more flexibility in creating decision boundaries, hence they perform better if the training data size is
sufficient. However, since multiclass-based models have many more parameters than threshold-based
models (m|Y| parameters rather than m + |Y| − 1 parameters), multiclass methods may need more
data, and hence, may not perform well on relatively small datasets.
In the threshold-based models comparison, AL^ord-th, RED^th, and AT perform competitively on
relatively small datasets like triazines, wisconsin, machinecpu, and autompg. AL^ord-th has a
slight advantage over RED^th on the overall accuracy, and a slight advantage over AT on the number
of indistinguishably best performances across all datasets. We can also see that AT is superior to IT in
the experiments under the original feature space.
Among the multiclass-based models, AL^ord-mc and RED^mc perform competitively on datasets
like abalone, bank, and computer, with a slight advantage of the AL^ord-mc model on the overall
accuracy. In general, the cost-sensitive models perform poorly compared with AL^ord-mc and
RED^mc. A notable exception is the CSOVO model, which performs very well on the stocks
and the boston datasets.

Table 4: The average of MAE for models with Gaussian kernel.
Dataset       AL^ord-th  SVORIM  SVOREX
diabetes      0.696      0.665   0.688
pyrimidines   0.478      0.539   0.550
triazines     0.609      0.612   0.604
wisconsin     1.090      1.113   1.049
machinecpu    0.452      0.652   0.628
autompg       0.529      0.589   0.593
boston        0.278      0.324   0.316
stocks        0.103      0.099   0.100
average       0.531      0.574   0.566
# bold        7          3       4

In the Gaussian kernel experiment, we can see from Table 4 that the kernelized version of
AL^ord-th performs significantly better than the threshold-based models SVORIM and SVOREX
in terms of both the overall accuracy and the number of indistinguishably best performances
across all datasets. We also note that the immediate-threshold-based model (SVOREX) performs better than the
all-threshold-based model (SVORIM) in our experiment using the Gaussian kernel. We can conclude
that our proposed adversarial losses for ordinal regression perform competitively compared to the
state-of-the-art ordinal regression models using both original feature spaces and kernel feature spaces
with a significant performance improvement in the Gaussian kernel experiments.
Acknowledgments
This research was supported as part of the Future of Life Institute (futureoflife.org) FLI-RFP-AI1
program, grant #2016-158710, and by NSF grant RI-#1526379.
References
[1] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems 15, pages 961–968. MIT Press, 2003.
[2] Wei Chu and S. Sathiya Keerthi. New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, pages 145–152. ACM, 2005.
[3] Hsuan-Tien Lin and Ling Li. Large-margin thresholded ensembles for ordinal regression: Theory and practice. In International Conference on Algorithmic Learning Theory, pages 319–333. Springer, 2006.
[4] Jason D. M. Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186, 2005.
[5] Ling Li and Hsuan-Tien Lin. Ordinal regression by extended binary classification. Advances in Neural Information Processing Systems, 19:865, 2007.
[6] Hsuan-Tien Lin. From ordinal ranking to binary classification. PhD thesis, California Institute of Technology, 2008.
[7] Hsuan-Tien Lin. Reduction from cost-sensitive multiclass classification to one-versus-one binary classification. In Proceedings of the Sixth Asian Conference on Machine Learning, pages 371–386, 2014.
[8] Han-Hsing Tu and Hsuan-Tien Lin. One-sided support vector regression for multiclass cost-sensitive classification. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1095–1102, 2010.
[9] Peter D. Grünwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2004.
[10] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.
[11] Subhash C. Narula and John F. Wellington. The minimum sum of absolute errors regression: A state of the art survey. International Statistical Review / Revue Internationale de Statistique, pages 317–326, 1982.
[12] Koby Crammer and Yoram Singer. Pranking with ranking. In Advances in Neural Information Processing Systems 14, 2001.
[13] Peter McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society, Series B (Methodological), pages 109–142, 1980.
[14] Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6(Jul):1019–1041, 2005.
[15] Krzysztof Dembczyński, Wojciech Kotłowski, and Roman Słowiński. Ordinal classification with decision rules. In International Workshop on Mining Complex Data, pages 169–181. Springer, 2007.
[16] Mark J. Mathieson. Ordinal models for neural networks. Neural Networks in Financial Engineering, pages 523–536, 1996.
[17] Shipeng Yu, Kai Yu, Volker Tresp, and Hans-Peter Kriegel. Collaborative ordinal regression. In Proceedings of the 23rd International Conference on Machine Learning, pages 1089–1096. ACM, 2006.
[18] Jianlin Cheng, Zheng Wang, and Gianluca Pollastri. A neural network approach to ordinal regression. In IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), pages 1279–1284. IEEE, 2008.
[19] Wan-Yu Deng, Qing-Hua Zheng, Shiguo Lian, Lin Chen, and Xin Wang. Ordinal extreme learning machine. Neurocomputing, 74(1):447–456, 2010.
[20] Bing-Yu Sun, Jiuyong Li, Desheng Dash Wu, Xiao-Ming Zhang, and Wen-Bo Li. Kernel discriminant learning for ordinal regression. IEEE Transactions on Knowledge and Data Engineering, 22(6):906–910, 2010.
[21] Jaime S. Cardoso and Joaquim F. Costa. Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8(Jul):1393–1429, 2007.
[22] Yang Liu, Yan Liu, and Keith C. C. Chan. Ordinal regression via manifold learning. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 398–403. AAAI Press, 2011.
[23] Flemming Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.
[24] Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
[25] Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, pages 2710–2718, 2015.
[26] Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems, pages 37–45, 2014.
[27] Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective. In Advances in Neural Information Processing Systems 29, pages 559–567, 2016.
[28] Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4233–4241, 2016.
[29] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.
[30] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. In JMLR, pages 1453–1484, 2005.
[31] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the International Conference on Machine Learning, pages 377–384, 2005.
[32] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007.
[33] Yufeng Liu. Fisher consistency of multicategory support vector machines. In International Conference on Artificial Intelligence and Statistics, pages 291–298, 2007.
[34] Miroslav Dudík and Robert E. Schapire. Maximum entropy distribution estimation with generalized regularization. In International Conference on Computational Learning Theory, pages 123–138. Springer, 2006.
[35] Fabian Pedregosa, Francis Bach, and Alexandre Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18(55):1–35, 2017.
[36] Harish G. Ramaswamy and Shivani Agarwal. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems, pages 2078–2086, 2012.
[37] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
[38] Mark Schmidt, Reza Babanezhad, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. 2015.
[39] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
Supplementary Materials
A Proof for the Adversarial Ordinal Regression Loss (Theorem 1)
Before proving Theorem 1, we review the game matrix L′_{x_i,w} for ordinal regression problems. Below
is the matrix when the number of classes is four:

L′_{x_i,w} =
[ f_1 − f_{y_i}      f_2 − f_{y_i} + 1   f_3 − f_{y_i} + 2   f_4 − f_{y_i} + 3
  f_1 − f_{y_i} + 1  f_2 − f_{y_i}       f_3 − f_{y_i} + 1   f_4 − f_{y_i} + 2
  f_1 − f_{y_i} + 2  f_2 − f_{y_i} + 1   f_3 − f_{y_i}       f_4 − f_{y_i} + 1
  f_1 − f_{y_i} + 3  f_2 − f_{y_i} + 2   f_3 − f_{y_i} + 1   f_4 − f_{y_i} ]   (11)

= [ f_1      f_2 + 1  f_3 + 2  f_4 + 3
    f_1 + 1  f_2      f_3 + 1  f_4 + 2
    f_1 + 2  f_2 + 1  f_3      f_4 + 1
    f_1 + 3  f_2 + 2  f_3 + 1  f_4 ] − f_{y_i}   (12)

= L″_{x_i,w} − f_{y_i}.   (13)
Proof. Our proof strategy is to use the inequalities implied by the definition of AL^ord_w and then show
that the value of AL^ord_w is equal to the game value of sub-matrices of L′_{x_i,w}. We start by showing
the equality for a small 2-by-2 sub-matrix and build up until we show that the value of AL^ord_w is
indeed equal to the game value of the whole game matrix L′_{x_i,w}. Empirically minimizing AL^ord_w
will then conclude the theorem.
Let us begin the proof by denoting v(G) as the Nash equilibrium value of a game characterized
by game matrix G. We would like to prove that for a zero-sum game characterized by L′_{x_i,w} as
described in Eq. (3), v(L′_{x_i,w}) = max_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2 − f_{y_i}.

Note that for any game matrix G and any constant c, v(G + c) = v(G) + c. We denote
L″_{x_i,w} = L′_{x_i,w} + f_{y_i}. Thus, proving the theorem is equivalent to proving v(L″_{x_i,w}) =
max_{j,l∈{1,...,|Y|}} (f_j + f_l + j − l)/2. The matrix L″_{x_i,w} is similar to the matrix in Eq. (3), but without
the f_{y_i} term in each of its cells, i.e.:
$$L''_{x_i,w} = \begin{bmatrix}
f_1 & f_2 + 1 & \cdots & f_{|\mathcal{Y}|-1} + |\mathcal{Y}| - 2 & f_{|\mathcal{Y}|} + |\mathcal{Y}| - 1 \\
f_1 + 1 & f_2 & \cdots & f_{|\mathcal{Y}|-1} + |\mathcal{Y}| - 3 & f_{|\mathcal{Y}|} + |\mathcal{Y}| - 2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
f_1 + |\mathcal{Y}| - 2 & f_2 + |\mathcal{Y}| - 3 & \cdots & f_{|\mathcal{Y}|-1} & f_{|\mathcal{Y}|} + 1 \\
f_1 + |\mathcal{Y}| - 1 & f_2 + |\mathcal{Y}| - 2 & \cdots & f_{|\mathcal{Y}|-1} + 1 & f_{|\mathcal{Y}|}
\end{bmatrix} \qquad (15)$$
Let $j^*$ and $l^*$ be the solution of $\operatorname{argmax}_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2}$ (if there are ties, pick any of them), and let
$$u^* = \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} = \frac{f_{j^*} + f_{l^*} + j^* - l^*}{2}.$$
We know the following inequalities hold:
$$f_{j^*} + f_{l^*} + j^* - l^* \ge f_j + f_l + j - l, \quad \forall j, l \in \{1,\dots,|\mathcal{Y}|\} \qquad (16)$$
$$f_{j^*} + j^* \ge f_j + j, \quad \forall j \in \{1,\dots,|\mathcal{Y}|\} \qquad (17)$$
$$f_{l^*} - l^* \ge f_l - l, \quad \forall l \in \{1,\dots,|\mathcal{Y}|\}. \qquad (18)$$
We also know that $j^* \ge l^*$; otherwise, we could swap them to obtain a larger value.

We first focus on the case where $j^* \ne l^*$. We analyze three different games characterized by sub-matrices of $L''_{x_i,w}$ and show that the value of each of those games is $u^*$.
Case 1: Let $G_1$ be a game characterized by a 2-by-2 matrix whose values are taken from rows and columns $j^*$ and $l^*$ of matrix $L''_{x_i,w}$, i.e.,
$$G_1 = \begin{bmatrix} f_{l^*} & f_{j^*} + j^* - l^* \\ f_{l^*} + j^* - l^* & f_{j^*} \end{bmatrix}. \qquad (19)$$
We will show that $v(G_1) = u^*$. Let $p$ be the vector of the adversary's mixed strategy; then finding $v(G_1)$ is equivalent to solving the following optimization:
$$\max_{p, V} \; V \qquad (20)$$
$$\text{s.t. } V \le p_{l^*} f_{l^*} + p_{j^*}(f_{j^*} + j^* - l^*) = p_{l^*} f_{l^*} + p_{j^*} f_{j^*} + p_{j^*}(j^* - l^*)$$
$$\phantom{\text{s.t. }} V \le p_{l^*}(f_{l^*} + j^* - l^*) + p_{j^*} f_{j^*} = p_{l^*} f_{l^*} + p_{j^*} f_{j^*} + p_{l^*}(j^* - l^*).$$
We now analyze the optimization above. Let $p_{l^*} = 0.5 - \epsilon$ and $p_{j^*} = 0.5 + \epsilon$ for some $\epsilon$ with $-0.5 \le \epsilon \le 0.5$. The optimization above becomes:
$$\max_{\epsilon, V} \; V \qquad (21)$$
$$\text{s.t. } V \le (0.5 - \epsilon) f_{l^*} + (0.5 + \epsilon) f_{j^*} + (0.5 + \epsilon)(j^* - l^*)$$
$$\phantom{\text{s.t. } V} = 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon \left[(f_{j^*} - f_{l^*}) + (j^* - l^*)\right]$$
$$\phantom{\text{s.t. }} V \le (0.5 - \epsilon) f_{l^*} + (0.5 + \epsilon) f_{j^*} + (0.5 - \epsilon)(j^* - l^*)$$
$$\phantom{\text{s.t. } V} = 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon \left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right].$$
Since $j^* \ne l^*$, based on Eq. (16), we know that:
$$f_{j^*} + f_{l^*} + j^* - l^* \ge f_{j^*} + f_{j^*} + j^* - j^* \;\Rightarrow\; (f_{j^*} - f_{l^*}) - (j^* - l^*) \le 0, \qquad (22)$$
$$f_{j^*} + f_{l^*} + j^* - l^* \ge f_{l^*} + f_{l^*} + l^* - l^* \;\Rightarrow\; (f_{j^*} - f_{l^*}) + (j^* - l^*) \ge 0. \qquad (23)$$
Therefore, the optimal solution is to set $\epsilon = 0$, since setting $\epsilon$ nonzero would decrease the right-hand side of one of the constraints and hence decrease the value of $V$. Thus, the solution is achieved when we set $p_{l^*} = p_{j^*} = 0.5$, which results in a game value of $\frac{f_{j^*} + f_{l^*} + j^* - l^*}{2} = u^*$.³
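The $\epsilon$-analysis of Case 1 can be checked numerically. The sketch below is illustrative only: $f_{l^*}$, $f_{j^*}$, $l^*$, $j^*$ are hypothetical values chosen so that inequalities (22)-(23) hold. It scans the adversary's strategies $p = (0.5 - \epsilon,\, 0.5 + \epsilon)$ and confirms that the maximin value is attained at $\epsilon = 0$ with value $u^*$:

```python
import numpy as np

# Hypothetical values with j* > l*; chosen so (22) and (23) hold.
f_ls, f_js = -0.2, 0.1   # f_{l*}, f_{j*}
ls, js = 1, 3            # l*, j*  (so j* - l* = 2)

G1 = np.array([[f_ls,           f_js + js - ls],
               [f_ls + js - ls, f_js          ]])

u_star = (f_js + f_ls + js - ls) / 2.0

# Adversary plays p = (0.5 - eps, 0.5 + eps); the game value is the maximum
# over eps of the minimum of the two row payoffs (the constraints on V).
eps_grid = np.linspace(-0.5, 0.5, 10001)
values = [min(G1[0] @ [0.5 - e, 0.5 + e], G1[1] @ [0.5 - e, 0.5 + e])
          for e in eps_grid]
best = max(values)
assert abs(best - u_star) < 1e-3   # v(G1) = u*, attained at eps = 0
```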
Case 2: Let $G_2$ be a game characterized by a $|\mathcal{Y}|$-by-2 matrix whose values are taken from columns $j^*$ and $l^*$ of matrix $L''_{x_i,w}$, i.e.,
$$G_2 = \begin{bmatrix}
f_{l^*} + l^* - 1 & f_{j^*} + j^* - 1 \\
\vdots & \vdots \\
f_{l^*} & f_{j^*} + j^* - l^* \\
f_{l^*} + 1 & f_{j^*} + j^* - l^* - 1 \\
\vdots & \vdots \\
f_{l^*} + j^* - l^* - 1 & f_{j^*} + 1 \\
f_{l^*} + j^* - l^* & f_{j^*} \\
\vdots & \vdots \\
f_{l^*} + |\mathcal{Y}| - l^* & f_{j^*} + |\mathcal{Y}| - j^*
\end{bmatrix}. \qquad (24)$$
Finding $v(G_2)$ is equivalent to solving an optimization similar to that of Eq. (20), with $|\mathcal{Y}|$ constraints corresponding to the rows of matrix $G_2$ instead of just two. We can easily see that the solution is achieved if we set $p_{l^*} = p_{j^*} = 0.5$ as in the previous case. The right-hand side of any $m$-th constraint with $m < l^*$ or $m > j^*$ is dominated, i.e., it has value greater than or equal to $u^*$, and the right-hand side of any $m$-th constraint with $l^* \le m \le j^*$ is equal to $u^*$. Assigning other values to $p_{l^*}$ and $p_{j^*}$ would decrease the right-hand side of some of the $m$-th ($l^* \le m \le j^*$) constraints (as explained in Case 1), and hence decrease the value of $V$. Therefore, we can conclude that $v(G_2) = u^*$.
³In this analysis and the other analyses in this proof, we omit the trivial cases where the terms associated with $\epsilon$ (in the case above, $(f_{j^*} - f_{l^*}) + (j^* - l^*)$ and $(f_{j^*} - f_{l^*}) - (j^* - l^*)$) are zero. In those cases, the value of $\epsilon$ can be anything, but the game value remains the same.

Case 3: Let $G_3$ be a game characterized by a $|\mathcal{Y}|$-by-3 matrix whose values are taken from columns $j^*$, $l^*$, and any other column $m$ of matrix $L''_{x_i,w}$. We consider three variations of the game: $G_3^1$ where $m < l^*$, $G_3^2$ where $l^* < m < j^*$, and $G_3^3$ where $m > j^*$. Below is the game matrix for the first variation:
$$G_3^1 = \begin{bmatrix}
\vdots & \vdots & \vdots \\
f_m & f_{l^*} + l^* - m & f_{j^*} + j^* - m \\
\vdots & \vdots & \vdots \\
f_m + l^* - m & f_{l^*} & f_{j^*} + j^* - l^* \\
\vdots & \vdots & \vdots \\
f_m + j^* - m & f_{l^*} + j^* - l^* & f_{j^*} \\
\vdots & \vdots & \vdots
\end{bmatrix}. \qquad (25)$$
Let us analyze the optimization for finding the game value of $G_3^1$, in particular the $l^*$-th and $j^*$-th constraints:
$$\max_{p, V} \; V \qquad (26)$$
$$\text{s.t. } \;\dots$$
$$V \le p_m (f_m + l^* - m) + p_{l^*} f_{l^*} + p_{j^*}(f_{j^*} + j^* - l^*)$$
$$V \le p_m (f_m + j^* - m) + p_{l^*}(f_{l^*} + j^* - l^*) + p_{j^*} f_{j^*}$$
$$\dots$$
Let us use notation similar to Case 1. Let $p_m = \gamma$, $p_{l^*} = 0.5 - \epsilon - \gamma$, and $p_{j^*} = 0.5 + \epsilon$, where $-0.5 \le \epsilon \le 0.5$, $0 \le \gamma \le 1$, and $-0.5 \le \epsilon + \gamma \le 0.5$. We can write the constraints above as:
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) + (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{l^*} - l^*)\right]$$
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{l^*} - l^*)\right].$$
Since $(f_{j^*} - f_{l^*}) + (j^* - l^*) \ge 0$, $(f_{j^*} - f_{l^*}) - (j^* - l^*) \le 0$, and $(f_m - m) - (f_{l^*} - l^*) \le 0$, the optimal solution is to set $\epsilon = 0$ and $\gamma = 0$. Since $p_m = \gamma = 0$, we are left with the same game matrix as $G_2$. Therefore $v(G_3^1) = u^*$.
For $G_3^3$, we let $p_m = \gamma$, $p_{l^*} = 0.5 - \epsilon$, and $p_{j^*} = 0.5 + \epsilon - \gamma$, where $-0.5 \le \epsilon \le 0.5$, $0 \le \gamma \le 1$, and $-0.5 \le \epsilon - \gamma \le 0.5$. Similar to the previous case, the $l^*$-th and $j^*$-th constraints can be written as:
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) + (j^* - l^*)\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right]$$
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right].$$
For reasons similar to the previous case, and since $(f_m + m) - (f_{j^*} + j^*) \le 0$, the optimal solution is to set $\epsilon = 0$ and $\gamma = 0$, and hence $v(G_3^3) = u^*$.
For $G_3^2$, we analyze the $l^*$-th, $m$-th, and $j^*$-th constraints. Let $p_m = \gamma$, $p_{l^*} = 0.5 - \epsilon$, and $p_{j^*} = 0.5 + \epsilon - \gamma$, where $-0.5 \le \epsilon \le 0.5$, $0 \le \gamma \le 1$, and $-0.5 \le \epsilon - \gamma \le 0.5$. The constraints can be written as:
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) + (j^* - l^*)\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right]$$
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[f_{j^*} - f_{l^*} + j^* + l^* - 2m\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right]$$
$$V \le 0.5\,(f_{l^*} + f_{j^*} + j^* - l^*) + \epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{j^*} - j^*)\right].$$
We know that $(f_{j^*} - f_{l^*}) + (j^* - l^*) \ge 0$, $(f_{j^*} - f_{l^*}) - (j^* - l^*) \le 0$, and $(f_m + m) - (f_{j^*} + j^*) \le 0$. If $f_{j^*} - f_{l^*} + j^* + l^* - 2m \le 0$, or $(f_m - m) - (f_{j^*} - j^*) \le 0$, or both, then both $\epsilon$ and $\gamma$ are forced to be 0. If both are positive, we need the following additional analysis.
We focus on the $m$-th and $j^*$-th constraints. Since we want to check whether there is a combination of $\epsilon$ and $\gamma$ values that makes the game value greater than $u^*$, $\epsilon$ and $\gamma$ have to satisfy:
$$\epsilon\left[f_{j^*} - f_{l^*} + j^* + l^* - 2m\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right] \ge 0 \qquad (27)$$
$$\Rightarrow\; \epsilon \ge \gamma\, \frac{(f_{j^*} + j^*) - (f_m + m)}{f_{j^*} - f_{l^*} + j^* + l^* - 2m} = \gamma\, \frac{(f_{j^*} + j^*) - (f_m - m) - 2m}{(f_{j^*} + j^*) - (f_{l^*} - l^*) - 2m}, \qquad (28)$$
$$\epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{j^*} - j^*)\right] \ge 0 \qquad (29)$$
$$\Rightarrow\; \gamma \ge \epsilon\, \frac{(j^* - l^*) - (f_{j^*} - f_{l^*})}{(f_m - m) - (f_{j^*} - j^*)} = \epsilon\, \frac{(f_{l^*} - l^*) - (f_{j^*} - j^*)}{(f_m - m) - (f_{j^*} - j^*)}. \qquad (30)$$
We know that $(f_{j^*} + j^*) - (f_m - m) - 2m \ge (f_{j^*} + j^*) - (f_{l^*} - l^*) - 2m$, and $(f_{l^*} - l^*) - (f_{j^*} - j^*) \ge (f_m - m) - (f_{j^*} - j^*)$. If at least one of those inequalities is strict, e.g., the first one, it is better to set $\epsilon = \gamma = 0$: to keep the right-hand side of the $m$-th constraint from decreasing, $\epsilon$ has to be strictly greater than $\gamma$, which decreases the right-hand side of the $j^*$-th constraint and thus decreases the game value. If both hold with equality, then many solutions exist, i.e., $\epsilon = \gamma$, but the game value remains the same, namely $u^*$, since in this case $\epsilon\left[f_{j^*} - f_{l^*} + j^* + l^* - 2m\right] + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right] = \epsilon\left[(f_{j^*} - f_{l^*}) - (j^* - l^*)\right] + \gamma\left[(f_m - m) - (f_{j^*} - j^*)\right] = 0$. Therefore $v(G_3^2) = u^*$.
Note that we omit the analysis of the trivial cases where the terms associated with $\epsilon$ and $\gamma$ are zero. In those cases, many values of $\epsilon$ and $\gamma$ satisfy the constraints, but the game value remains the same.
Conclusion: We are now ready to analyze the game value of $L''_{x_i,w}$. Since adding any single column $m \in \{1,\dots,|\mathcal{Y}|\} \setminus \{l^*, j^*\}$ to $G_2$ does not change the game value, adding any combination of them does not change the game value either. Therefore, we can conclude that $v(L''_{x_i,w}) = u^*$.

For the case where $j^* = l^*$, we know that $\max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} = f_{j^*}$. It is clear that $f_{j^*}$ is the value of the game defined by column $j^*$ of matrix $L''_{x_i,w}$ alone. For any other column $m$, if we include it in the game, the corresponding $j^*$-th constraint becomes (letting $p_m = \gamma$ and $p_{j^*} = 1 - \gamma$):
$$V \le f_{j^*} + \gamma\left[(f_m - m) - (f_{j^*} - j^*)\right] \quad \text{if } m < j^*, \text{ or} \qquad (31)$$
$$V \le f_{j^*} + \gamma\left[(f_m + m) - (f_{j^*} + j^*)\right] \quad \text{if } m > j^*. \qquad (32)$$
Since we know that $(f_{j^*} - j^*) \ge (f_m - m)$ and $(f_{j^*} + j^*) \ge (f_m + m)$, the optimal solution is to set $\gamma = 0$, and the game value remains the same. We can also generalize this to any combination of columns $m \in \{1,\dots,|\mathcal{Y}|\} \setminus \{j^*\}$ to show that $v(L''_{x_i,w}) = f_{j^*} = u^*$.

Therefore, we can conclude that the value of the game matrix is $v(L''_{x_i,w}) = \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2}$, which proves the theorem. ∎
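As a numerical illustration (not part of the proof), the game value of $L''_{x_i,w}$ can be computed by solving the zero-sum game as a linear program and compared against the closed form. The sketch assumes SciPy is available; the potentials are random stand-ins:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
k = 6                        # |Y|; the potentials f are random stand-ins
f = rng.normal(size=k)

# L'' from Eq. (15): entry (j, l) = f_l + |j - l| (0-based here; only the
# difference j - l matters).
J, L = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
G = f[L] + np.abs(J - L)

# v(G) = max_p min_j (G p)_j as an LP over x = (p_1, ..., p_k, V):
# minimize -V subject to V - (G p)_j <= 0, sum(p) = 1, p >= 0.
c = np.zeros(k + 1)
c[-1] = -1.0
A_ub = np.hstack([-G, np.ones((k, 1))])
b_ub = np.zeros(k)
A_eq = np.append(np.ones(k), 0.0).reshape(1, -1)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * k + [(None, None)])
game_value = res.x[-1]

# Closed form from Theorem 1, with 1-based indices j, l:
closed_form = max((f[j - 1] + f[l - 1] + j - l) / 2.0
                  for j in range(1, k + 1) for l in range(1, k + 1))
assert abs(game_value - closed_form) < 1e-6
```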
B Proof of the Fisher Consistency of the Adversarial Ordinal Regression Loss (Theorem 3)

Proof. We start the proof by analyzing the minimizer $f^*$ under $P_y \triangleq P(y|x)$ as follows:
$$f^* = \operatorname{argmin}_{f} \; \mathbb{E}_{Y|X \sim P}\left[\mathrm{AL}^{\mathrm{ord}}_f(X, Y) \mid X = x\right] \qquad (33)$$
$$= \operatorname{argmin}_{f} \sum_y P_y \left[\max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - f_y\right] \qquad (34)$$
$$= \operatorname{argmin}_{f} \left[\sum_y P_y \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - \sum_y P_y f_y\right] \qquad (35)$$
$$= \operatorname{argmin}_{f} \left[\max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - \sum_y P_y f_y\right]. \qquad (36)$$
In this proof, we impose a constraint on the potential function, $\max_j f_j(x) = 0$, in order to remove redundant solutions, as adding any constant $c$ to $f$ changes neither $\operatorname{argmax}_j f_j(x)$ nor $\mathbb{E}_{Y|X \sim P}\left[\mathrm{AL}^{\mathrm{ord}}_f(X, Y) \mid X = x\right]$:
$$\max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{(f_j + c) + (f_l + c) + j - l}{2} - \sum_y P_y (f_y + c) \qquad (37)$$
$$= c + \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - c - \sum_y P_y f_y \qquad (38)$$
$$= \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - \sum_y P_y f_y. \qquad (39)$$
Let $j^*$ and $l^*$ be the solution of $\operatorname{argmax}_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2}$. We start with the first case, where $j^* = l^*$. In this case, the minimization in Eq. (36) reduces to $\operatorname{argmin}_f \left[\max_{j \in \{1,\dots,|\mathcal{Y}|\}} f_j - \sum_y P_y f_y\right]$. Since $j^* = l^*$, we know that the following inequalities hold:
$$f_{j^*} \ge f_j, \quad \forall j \in \{1,\dots,|\mathcal{Y}|\} \qquad (40)$$
$$f_{j^*} + j^* \ge f_j + j, \quad \forall j \in \{1,\dots,|\mathcal{Y}|\} \qquad (41)$$
$$f_{j^*} - j^* \ge f_j - j, \quad \forall j \in \{1,\dots,|\mathcal{Y}|\}. \qquad (42)$$
Therefore, by Eq. (40) and the constraint $\max_j f_j(x) = 0$, we have $f_{j^*} = 0$. Then by Eq. (41), for any $i > 0$, $f_{j^*+i} \le f_{j^*} - i = -i$; and by Eq. (42), for any $i > 0$, $f_{j^*-i} \le f_{j^*} - i = -i$. Since we want to minimize $f_{j^*} - \sum_y P_y f_y = -\sum_y P_y f_y$, the optimal solution is to set $f_{j^*+i} = -i$ and $f_{j^*-i} = -i$ for any $i > 0$. Therefore we get a vector $f^*$ that satisfies the loss reflective property, i.e., it complements the absolute error by starting with a negative integer value, increasing by one until reaching zero, and then decreasing by one again.
We next analyze the second case, where $j^* \ne l^*$. In this case, the following inequalities hold:
$$f_{j^*} + j^* \ge f_{j^*+i} + j^* + i \;\Rightarrow\; f_{j^*+i} \le f_{j^*} - i, \quad i \in \{1, \dots, |\mathcal{Y}| - j^*\} \qquad (43)$$
$$f_{l^*} - l^* \ge f_{l^*-i} - l^* + i \;\Rightarrow\; f_{l^*-i} \le f_{l^*} - i, \quad i \in \{1, \dots, l^* - 1\}. \qquad (44)$$
We also know that for any $m \in \{1,\dots,|\mathcal{Y}|\}$ the following hold:
$$m < l^* \;\Rightarrow\; f_m \le f_{l^*} - (l^* - m) \;\text{ and }\; f_m \le f_{j^*} + (j^* - m) \qquad (45)$$
$$m > j^* \;\Rightarrow\; f_m \le f_{j^*} - (m - j^*) \;\text{ and }\; f_m \le f_{l^*} + (m - l^*) \qquad (46)$$
$$l^* < m < j^* \;\Rightarrow\; f_m \le f_{l^*} + (m - l^*) \;\text{ and }\; f_m \le f_{j^*} + (j^* - m). \qquad (47)$$
The following relations between $f_{j^*}$ and $f_{l^*}$ also hold:
$$f_{j^*} \le f_{l^*} + j^* - l^* \qquad (48)$$
$$f_{l^*} \le f_{j^*} + j^* - l^*. \qquad (49)$$
Let $f^0$ be any potential function that falls into the second case (the solution of $(j^*, l^*) = \operatorname{argmax}_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f^0_j + f^0_l + j - l}{2}$ satisfies $j^* \ne l^*$) and does not satisfy the loss reflective property. Let us define $h(f) = \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - \sum_y P_y f_y$. We construct $f^1$ as follows. Starting from $f^1 = f^0$, we increase each value $f^1_m$ for $m \in \{1,\dots,|\mathcal{Y}|\} \setminus \{l^*, j^*\}$ until the tightest of the constraints above (the one with minimum value) holds with equality. For example, in a 7-class ordinal regression problem where $l^* = 2$ and $j^* = 6$, one possible value of $f^0$ is $[-3, -1.4, -0.8, -0.2, -0.7, 0, -1.2]^\top$, which satisfies all the constraints above. In this case $f^1$ is $[-2.4, -1.4, -0.4, 0.6, 1, 0, -1]^\top$. Since the value of $\frac{f_{j^*} + f_{l^*} + j^* - l^*}{2}$ remains the same and the value of $\sum_y P_y f_y$ increases, we know that $h(f^1) < h(f^0)$. We also know that in $f^1$, each consecutive difference $f^1_{j+1} - f^1_j$ is equal to $1$ or $-1$, except for one pair $(a, b)$, where $l^* \le a < b \le j^*$. In the example above, $a = 4$, $b = 5$, $f^1_a = 0.6$, and $f^1_b = 1$. We also know that $\frac{f^1_{j^*} + f^1_{l^*} + j^* - l^*}{2} = \frac{f^1_a + f^1_b + 1}{2}$.
We now construct $f^2$ from $f^1$ as follows. If $\sum_{y=1}^{a} P_y \le 0.5$, we set $f^2_j = f^1_j - (f^1_a - f^1_b + 1)$ for $j \in \{1,\dots,a\}$ and $f^2_j = f^1_j$ for $j \in \{b,\dots,|\mathcal{Y}|\}$; otherwise, we set $f^2_j = f^1_j$ for $j \in \{1,\dots,a\}$ and $f^2_j = f^1_j - (f^1_b - f^1_a + 1)$ for $j \in \{b,\dots,|\mathcal{Y}|\}$. For the example above, if $\sum_{y=1}^{a} P_y \le 0.5$ then $f^2 = [-3, -2, -1, 0, 1, 0, -1]^\top$; otherwise $f^2 = [-2.4, -1.4, -0.4, 0.6, -0.4, -1.4, -2.4]^\top$. We claim that $h(f^2) \le h(f^1)$, as shown below for the case $\sum_{y=1}^{a} P_y \le 0.5$ (the other case follows similarly):
$$h(f^2) = \max_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f^2_j + f^2_l + j - l}{2} - \sum_y P_y f^2_y = f^2_b - \sum_y P_y f^2_y \qquad (50)$$
$$= f^2_b - \sum_{y=1}^{a} P_y f^2_y - \sum_{y=b}^{|\mathcal{Y}|} P_y f^2_y \qquad (51)$$
$$= f^1_b - \sum_{y=1}^{a} P_y \left[f^1_y - (f^1_a - f^1_b + 1)\right] - \sum_{y=b}^{|\mathcal{Y}|} P_y f^1_y \qquad (52)$$
$$= f^1_b + \left(\sum_{y=1}^{a} P_y\right)\left(f^1_a - f^1_b + 1\right) - \sum_y P_y f^1_y \qquad (53)$$
$$\le f^1_b + 0.5\left(f^1_a - f^1_b + 1\right) - \sum_y P_y f^1_y = \frac{f^1_a + f^1_b + 1}{2} - \sum_y P_y f^1_y = h(f^1). \qquad (54)$$
Finally, we construct $f^3 = f^2 - \max_j f^2_j$. Since adding a constant to $f$ does not change the value of $h(f)$, we know that $h(f^3) = h(f^2)$. We also know that $f^3$ satisfies the loss reflective property described above. As an example, in the case $\sum_{y=1}^{a} P_y \le 0.5$, we have $f^3 = [-4, -3, -2, -1, 0, -1, -2]^\top$.

Thus, for any $f^0$ that falls into the second case (the solution of $(j^*, l^*) = \operatorname{argmax}_{j,l \in \{1,\dots,|\mathcal{Y}|\}} \frac{f^0_j + f^0_l + j - l}{2}$ satisfies $j^* \ne l^*$) and does not satisfy the loss reflective property, we can construct an $f^3$ that satisfies the loss reflective property with $h(f^3) < h(f^0)$; hence $f^0$ cannot be the minimizer. Therefore, we can conclude that in both cases the minimizer must satisfy the loss reflective property, which completes the proof of the theorem. ∎
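The construction $f^0 \to f^1 \to f^2 \to f^3$ can be traced on the worked 7-class example. In the sketch below, the probability vector $P$ is hypothetical (chosen so that the first branch of the $f^2$ construction is taken), and the exceptional pair $(a, b) = (4, 5)$ is hard-coded from the example rather than detected:

```python
import numpy as np

# Worked 7-class example from the proof: l* = 2, j* = 6 (1-based labels).
# The probability vector P is hypothetical (it satisfies sum(P[:a]) <= 0.5).
k = 7
ls, js = 2, 6
f0 = np.array([-3.0, -1.4, -0.8, -0.2, -0.7, 0.0, -1.2])
P = np.array([0.05, 0.1, 0.1, 0.1, 0.25, 0.2, 0.2])

def h(f):
    # h(f) = max_{j,l} (f_j + f_l + j - l)/2 - sum_y P_y f_y  (1-based j, l)
    idx = range(1, k + 1)
    best = max((f[j - 1] + f[l - 1] + j - l) / 2.0 for j in idx for l in idx)
    return best - P @ f

# f1: raise every f_m (m not in {l*, j*}) to its tightest upper bound,
# f_m <= min(f_{l*} + (m - l*), f_{j*} + (j* - m)), which covers (45)-(47).
f1 = f0.copy()
for m in range(1, k + 1):
    if m not in (ls, js):
        f1[m - 1] = min(f0[ls - 1] + (m - ls), f0[js - 1] + (js - m))

# Exceptional adjacent pair (a, b) = (4, 5), hard-coded from the example.
a, b = 4, 5

# f2: shift the lighter-probability side so all consecutive steps become +/-1.
f2 = f1.copy()
if P[:a].sum() <= 0.5:
    f2[:a] -= f1[a - 1] - f1[b - 1] + 1
else:
    f2[b - 1:] -= f1[b - 1] - f1[a - 1] + 1

# f3: subtract the maximum so that max_j f_j = 0 (loss reflective form).
f3 = f2 - f2.max()

assert h(f1) <= h(f0) + 1e-9 and h(f2) <= h(f1) + 1e-9
assert abs(h(f3) - h(f2)) < 1e-9               # shifting by a constant keeps h
diffs = np.diff(f3)
assert set(np.round(diffs, 6)) <= {1.0, -1.0}  # loss reflective property
```

On this example the intermediate vectors match the proof: $f^1 = [-2.4, -1.4, -0.4, 0.6, 1, 0, -1]$ and $f^3 = [-4, -3, -2, -1, 0, -1, -2]$.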
Theorem 3. The adversarial ordinal regression surrogate loss $\mathrm{AL}^{\mathrm{ord}}$ from Eq. (5) is Fisher consistent.

C Optimization

The idea of SAG [37] is to take gradient steps using the average of the per-example loss gradients $g_i$, reusing for each example the gradient computed in the last iteration in which that example was sampled. However, a naïve implementation of SAG requires storing the gradient of each sample, which may be expensive in terms of memory.
Algorithm 1 SAG for adversarial ordinal regression with multiclass representation
1: Input: training dataset with pairs $\{x_i, y_i\}$, learning rate $\eta$, regularization constant $\lambda$
2: $m \leftarrow 0$ {the number of sampled pairs so far}
3: $d \leftarrow 0$ {for storing $\sum_{i=1}^{m} g_i$}
4: $j_i \leftarrow 0$, $l_i \leftarrow 0$ for $i = 1, 2, \dots, n$
5: repeat
6: Sample $i$ from $\{1, \dots, n\}$
7: $j^*, l^* \leftarrow \operatorname{argmax}_{j,l} \frac{w_j \cdot x_i + w_l \cdot x_i + j - l}{2} - w_{y_i} \cdot x_i$
8: if this is the first time we sample $i$ then
9: $m \leftarrow m + 1$
10: $d_{j^*} \leftarrow d_{j^*} + \frac{1}{2} x_i$, $d_{l^*} \leftarrow d_{l^*} + \frac{1}{2} x_i$
11: $d_{y_i} \leftarrow d_{y_i} - x_i$
12: else
13: $d_{j_i} \leftarrow d_{j_i} - \frac{1}{2} x_i$, $d_{l_i} \leftarrow d_{l_i} - \frac{1}{2} x_i$
14: $d_{j^*} \leftarrow d_{j^*} + \frac{1}{2} x_i$, $d_{l^*} \leftarrow d_{l^*} + \frac{1}{2} x_i$
15: end if
16: $j_i \leftarrow j^*$, $l_i \leftarrow l^*$
17: $w \leftarrow (1 - \eta\lambda) w - \frac{\eta}{m} d$
18: until converged
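A minimal Python sketch of Algorithm 1, assuming a multiclass representation with one weight vector per class and 0-based labels; the function name and hyperparameter defaults are ours, not the paper's. The pairwise argmax in line 7 decomposes into two independent argmaxes, which the code exploits:

```python
import numpy as np

def sag_adversarial_ordinal(X, y, n_classes, lr=0.01, lam=0.1, epochs=50, seed=0):
    """Illustrative SAG sketch for the adversarial ordinal loss (multiclass
    representation). Labels y are 0-based here."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((n_classes, d))   # one weight vector w_j per class
    D = np.zeros((n_classes, d))   # running sum of the stored gradients
    stored = {}                    # i -> (j*, l*) from the last visit
    m = 0                          # number of distinct examples seen so far
    for _ in range(epochs * n):
        i = int(rng.integers(n))
        fpot = W @ X[i]            # potentials f_j = w_j . x_i
        # argmax_{j,l} (f_j + f_l + j - l)/2 decomposes into two argmaxes:
        j_star = int(np.argmax(fpot + np.arange(n_classes)))
        l_star = int(np.argmax(fpot - np.arange(n_classes)))
        if i not in stored:
            m += 1
            D[y[i]] -= X[i]        # the -x_i term at y_i never changes: add once
        else:                      # drop the stale half-gradients of example i
            j_old, l_old = stored[i]
            D[j_old] -= 0.5 * X[i]
            D[l_old] -= 0.5 * X[i]
        D[j_star] += 0.5 * X[i]
        D[l_star] += 0.5 * X[i]
        stored[i] = (j_star, l_star)
        W = (1 - lr * lam) * W - (lr / m) * D
    return W
```

A toy call, purely to exercise the update loop:

```python
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0, 2, 1])
W = sag_adversarial_ordinal(X, y, n_classes=3, epochs=5)
```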
Based on Eq. (5), the primal optimization of the regularized adversarial ordinal regression loss can be written as:
$$\min_w \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left[\max_{j \in \{1,\dots,|\mathcal{Y}|\}} \frac{w \cdot \phi(x_i, j) + j}{2} + \max_{j \in \{1,\dots,|\mathcal{Y}|\}} \frac{w \cdot \phi(x_i, j) - j}{2} - w \cdot \phi(x_i, y_i)\right] \qquad (57)$$
$$= \min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{2} \sum_{i=1}^{n} \max_{j \in \{1,\dots,|\mathcal{Y}|\}} \left(w \cdot \phi(x_i, j) - w \cdot \phi(x_i, y_i) + j\right) \qquad (58)$$
$$\qquad + \frac{C}{2} \sum_{i=1}^{n} \max_{j \in \{1,\dots,|\mathcal{Y}|\}} \left(w \cdot \phi(x_i, j) - w \cdot \phi(x_i, y_i) - j\right).$$
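The step from Eq. (57) to Eq. (58) relies on the pairwise maximization decomposing into two independent maximizations. A quick numerical sanity check of that identity (with randomly generated stand-in potentials, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
psi = rng.normal(size=k)       # stand-ins for the potentials w . phi(x_i, j)

# Brute force over pairs, as in the loss definition:
pair_max = max((psi[j - 1] + psi[l - 1] + j - l) / 2.0
               for j in range(1, k + 1) for l in range(1, k + 1))

# Decomposition used to pass from Eq. (57) to Eq. (58):
decomposed = (max(psi[j - 1] + j for j in range(1, k + 1)) / 2.0
              + max(psi[j - 1] - j for j in range(1, k + 1)) / 2.0)

assert abs(pair_max - decomposed) < 1e-12
```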
The Lagrangian for the optimization above is:
$$\mathcal{L} = \frac{1}{2}\|w\|^2 + \frac{C}{2} \sum_{i=1}^{n} \xi_i + \frac{C}{2} \sum_{i=1}^{n} \delta_i - \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}|} \alpha_{i,j} \left[\xi_i - w \cdot \phi(x_i, j) + w \cdot \phi(x_i, y_i) - j\right] \qquad (60)$$
$$\qquad - \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}|} \beta_{i,j} \left[\delta_i - w \cdot \phi(x_i, j) + w \cdot \phi(x_i, y_i) + j\right],$$
where $\xi_i$ and $\delta_i$ are slack variables for the two inner maximizations in Eq. (58), and $\alpha_{i,j}, \beta_{i,j} \ge 0$ are the corresponding dual variables. Rearranging the Lagrangian, plugging in the definition of $w$ in terms of the dual variables, and applying the constraints yields:
$$\mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{|\mathcal{Y}|} j\,(\alpha_{i,j} - \beta_{i,j}) \qquad (61)$$
$$\qquad - \frac{1}{2} \sum_{i,k=1}^{n} \sum_{j,l=1}^{|\mathcal{Y}|} (\alpha_{i,j} + \beta_{i,j})(\alpha_{k,l} + \beta_{k,l}) \left(\phi(x_i, j) - \phi(x_i, y_i)\right) \cdot \left(\phi(x_k, l) - \phi(x_k, y_k)\right).$$