
Journal of Machine Learning Research 7 (2005) Submitted 10/2005

Nonparametric Quantile Regression

Ichiro Takeuchi takeuchi@pa.info.mie-u.ac.jp
Dept. of Information Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu 514-8507, Japan
Quoc V. Le quoc.le@anu.edu.au
Tim Sears tim.sears@anu.edu.au
Alexander J. Smola alex.smola@nicta.com.au
National ICT Australia and the Australian National University, Canberra ACT, Australia

Editor: U.N. Known

Abstract
In regression, the desired estimate of y|x is not always given by a conditional mean, although this is most
common. Sometimes one wants to obtain a good estimate that satisfies the property that a proportion, τ, of
y|x will be below the estimate. For τ = 0.5 this is an estimate of the median. What might be called median
regression is subsumed under the term quantile regression. We present a nonparametric version of a quantile
estimator, which can be obtained by solving a simple quadratic programming problem, and provide uniform
convergence statements and bounds on the quantile property of our estimator. Experimental results show the
feasibility of the approach and the competitiveness of our method with existing ones. We discuss several types of
extensions, including an approach to solve the quantile crossing problem, as well as a method to incorporate
prior qualitative knowledge such as monotonicity constraints.

1. Introduction
Regression estimation is typically concerned with finding a real-valued function f such that its values f(x)
correspond to the conditional mean of y, or closely related quantities. Many methods have been developed
for this purpose, e.g. least mean square (LMS) regression (Vinod (1978)), robust regression (Huber (1981)),
or ε-insensitive regression (Vapnik (1995); Vapnik et al. (1997)). Regularized variants include Wahba (1990),
penalized by a Reproducing Kernel Hilbert Space (RKHS) norm, and Hoerl and Kennard (1970), regularized
via ridge regression.

1.1 Motivation
While these estimates of the mean serve their purpose, there exists a large area of problems where we are
more interested in estimating a quantile. That is, we might wish to know other features of the distribution
of the random variable y|x:
• A device manufacturer may wish to know what are the 10% and 90% quantiles for some feature of the
production process, so as to tailor the process to cover 80% of the devices produced.
• For risk management and regulatory reporting purposes, a bank may need to estimate a lower bound
on the changes in the value of its portfolio which will hold with high probability.


• A pediatrician requires a growth chart for children given their age and perhaps even medical back-
ground, to help determine whether medical interventions are required, e.g. while monitoring the
progress of a premature infant.
These problems are addressed by a technique called Quantile Regression (QR) championed by Koenker (see
Koenker (2005) for a description, practical guide, and extensive list of references). These methods have been
deployed in econometrics, social sciences, ecology, etc. The purpose of our paper is:
• To bring the technique of quantile regression to the attention of the machine learning community and
show its relation to ν-Support Vector Regression (Schölkopf et al. (2000)).
• To demonstrate a nonparametric version of QR which outperforms the currently available nonlinear
QR formulations (Koenker (2005)). See Section 5 for details.

• To derive small sample size results for the algorithms. Most statements in the statistical literature for
QR methods are of an asymptotic nature (Koenker (2005)). Empirical process results permit us to define
two quality criteria and show tail bounds for both of them in the finite-sample-size case.
• To extend the technique to permit commonly desired constraints to be incorporated. As examples we
show how to enforce non-crossing constraints and a monotonicity constraint. These constraints allow
us to incorporate prior knowledge about the data.

1.2 Notation and Basic Definitions


In the following we denote by X, Y the domains of x and y respectively. X = {x1 , . . . , xm } denotes
the training set with corresponding targets Y = {y1 , . . . , ym }, both drawn independently and identically
distributed (iid) from some distribution p(x, y). With some abuse of notation y also denotes the vector of all
yi in matrix and vector expressions, whenever the distinction is obvious.
Unless specified otherwise H denotes a Reproducing Kernel Hilbert Space (RKHS) on X, k is the cor-
responding kernel function, and K ∈ R^{m×m} is the kernel matrix obtained via K_ij = k(x_i, x_j). θ denotes a
vector in feature space and φ(x) is the corresponding feature map of x. That is, k(x, x′) = ⟨φ(x), φ(x′)⟩.
Finally, α ∈ R^m is the vector of Lagrange multipliers.

Definition 1 (Quantile) Denote by y ∈ R a random variable and let τ ∈ (0, 1). Then the τ -quantile of y,
denoted by µτ is given by the infimum over µ for which Pr {y ≤ µ} = τ . Likewise, the conditional quantile
µτ (x) for a pair of random variables (x, y) ∈ X × R is defined as the function µτ : X → R for which
pointwise µτ is the infimum over µ for which Pr {y ≤ µ|x} = τ .

1.3 Examples
To illustrate regression analyses with conditional quantile functions, we provide two simple examples here.

1.3.1 Artificial Data


The definition of conditional quantiles may be best illustrated by a simple example. Consider a situation where
x is drawn uniformly from [0, 1] and y is given by

y(x) = sin(πx) + ξ   where ξ ∼ N(0, e^{sin 2πx}).





Figure 1: Illustration of the nonparametric quantile regression on a toy dataset. On the left, τ = 0.9. On the
right, τ = 0.5; the quantile regression line approximates the median of the data very closely (since
ξ is normally distributed, median and mean are identical).

Here the amount of noise is a function of the location. Since ξ is symmetric with mean and mode 0 we have
µ_{0.5}(x) = sin πx. Moreover, we can compute the quantiles by solving for Pr{y ≤ µ|x} = τ explicitly.
Since ξ is normal we know that the quantiles of ξ are given by Φ^{-1}(τ) sin 2πx, where Φ is the cumulative
distribution function of the normal distribution with unit variance. This means that

µ_τ(x) = sin πx + Φ^{-1}(τ) sin 2πx.   (1)
Figure 1 shows two quantile estimates. We see that depending on the choice of the quantile, we obtain a
close approximation of the median (τ = 0.5), or a curve which tracks just inside the upper envelope of the
data (τ = 0.9). The error bars of many regression estimates can be viewed as crude quantile regressions:
one tries to specify the interval within which, with high probability, the data may lie. Note, however, that the
latter does not entirely correspond to a quantile regression: error bars just give an upper bound on the range
within which an estimate lies, whereas QR aims to estimate the exact boundary at which a certain quantile
is achieved. In other words, it corresponds to tight error bars.

1.3.2 Real Data


The next example is based on actual measurements of bone mineral density (BMD) in adolescents. The data was
originally reported in Bachrach et al. (1999) and is also analyzed in Hastie et al. (2001)1. Figure 2(a) shows
a regression analysis with the conditional mean and Figure 2(b) shows one with a set of conditional quantiles
for the variable BMD.
The response on the vertical axis is the relative change in spinal BMD and the covariate on the horizontal
axis is the age of the adolescents. The conditional mean analysis (a) provides only the central tendency of
the conditional distribution, while apparently the entire distribution of BMD changes with age. The
conditional quantile analysis (b) gives us a more detailed description of these changes. For example, we can
see that the variance of BMD changes with age (heteroscedasticity) and that the conditional distribution
is slightly positively skewed.
1. The data is also available from the website http://www-stat.stanford.edu/ElemStatLearn


(a) Conditional mean analysis (b) Conditional quantile analysis

Figure 2: An illustration of (a) conditional mean analysis and (b) conditional quantile analysis for a data set
on bone mineral density (BMD) in adolescents. In (a) the conditional mean curve is estimated by
a regression spline with a least squares criterion. In (b) the nine curves are the estimated conditional
quantile curves at orders 0.1, 0.2, . . . , 0.9. The set of conditional quantile curves provides a more
informative description of the relationship among the variables, such as non-constant variance or non-
normality of the noise (error) distribution. In this paper, we are concerned with the problem of
estimating these conditional quantiles.

2. Quantile Regression
Given the definition of µτ(x) and knowledge of support vector machines we might be tempted to use a
version of the ε-insensitive tube regression to estimate µτ(x). More specifically one might want to estimate
quantiles nonparametrically using an extension of the ν-trick, as outlined in Schölkopf et al. (2000). However
this approach carries the disadvantage of requiring us to estimate both an upper and lower quantile
simultaneously. While this can be achieved by quadratic programming, in doing so we estimate “too many”
parameters simultaneously. More to the point, if we are interested in finding an upper bound on y which
holds with 0.95 probability we may not want to use information about the 0.05 probability bound in the
estimation. Following Vapnik’s paradigm of estimating only the relevant parameters directly Vapnik (1982)
we attack the problem by estimating each quantile separately. For completeness and comparison, we provide
a detailed description of a symmetric quantile regression in Appendix A.

2.1 Loss Function


The basic strategy behind quantile estimation arises from the observation that minimizing the ℓ1-loss func-
tion for a location estimator yields the median. Observe that to minimize Σ_{i=1}^m |y_i − µ| by choice of µ,
an equal number of terms y_i − µ have to lie on either side of zero in order for the derivative w.r.t. µ to van-
ish. Koenker (2005) generalizes this idea to obtain a regression estimate for any quantile by tilting the loss
function in a suitable fashion. More specifically one may show that the following “pinball” loss leads to
estimates of the τ -quantile:


lτ(ξ) = τ ξ          if ξ ≥ 0
        (τ − 1) ξ    if ξ < 0                                   (2)

Lemma 2 (Quantile Estimator) Let Y = {y1, . . . , ym} ⊂ R and let τ ∈ (0, 1). Then the minimizer µτ of
Σ_{i=1}^m lτ(yi − µ) with respect to µ satisfies:

1. The number of terms, m− , with yi < µτ is bounded from above by τ m.


2. The number of terms, m+ , with yi > µτ is bounded from above by (1 − τ )m.
3. For m → ∞, the fraction m−/m converges to τ if Pr(y) does not contain discrete components.

Proof Assume that we are at an optimal solution. Then, increasing the minimizer µ by δµ changes the ob-
jective by [(m − m+)(1 − τ) − m+ τ] δµ. Likewise, decreasing the minimizer µ by δµ changes the objective
by [−m−(1 − τ) + (m − m−)τ] δµ. Requiring that both terms are nonnegative at optimality in conjunction
with the fact that m− + m+ ≤ m proves the first two claims. To see the last claim, simply note that the
event yi = yj for i ≠ j has probability measure zero for distributions not containing discrete components.
Taking the limit m → ∞ shows the claim.
The idea is to use the same loss function for functions, f(x), rather than just constants in order to obtain
quantile estimates conditional on x. Koenker (2005) uses this approach to obtain linear estimates
and certain nonlinear spline models. In the following we will use kernels for the same purpose.
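As a quick illustration of Lemma 2, the pinball loss (2) can be minimized over a constant by brute force and compared against the empirical quantile; the two should essentially coincide. The following Python sketch is our own illustration (the grid search, sample size, and random data are arbitrary choices, not part of the paper):

import numpy as np

def pinball_loss(xi, tau):
    # l_tau(xi) = tau * xi if xi >= 0, (tau - 1) * xi otherwise, cf. (2)
    return np.where(xi >= 0, tau * xi, (tau - 1) * xi)

rng = np.random.default_rng(0)
y = rng.normal(size=1000)
tau = 0.9

# minimize sum_i l_tau(y_i - mu) over a grid of candidate constants mu
grid = np.linspace(y.min(), y.max(), 2001)
losses = [pinball_loss(y - mu, tau).sum() for mu in grid]
mu_tau = grid[int(np.argmin(losses))]

print(mu_tau, np.quantile(y, tau))  # the two values should nearly agree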

2.2 Optimization Problem


Based on lτ(ξ) we define the expected quantile risk as

R[f] := E_{p(x,y)} [lτ(y − f(x))].   (3)

By the same reasoning as in Lemma 2 it follows that for f : X → R the minimizer of R[f ] is the quantile
µτ (x). Since p(x, y) is unknown and we only have X, Y at our disposal we resort to minimizing the
empirical risk plus a regularizer:
R_reg[f] := (1/m) Σ_{i=1}^m lτ(yi − f(xi)) + (λ/2) ‖g‖²_H   where f = g + b and b ∈ R.   (4)

Here ‖·‖_H is the RKHS norm and we require g ∈ H. Notice that we do not regularize the constant offset, b, in
the optimization problem. This ensures that the minimizer of (4) will satisfy the quantile property:
Lemma 3 (Empirical Conditional Quantile Estimator) Assuming that f contains a scalar unregularized
term, the minimizer of (4) satisfies:
1. The number of terms m− with yi < f (xi ) is bounded from above by τ m.


2. The number of terms m+ with yi > f (xi ) is bounded from above by (1 − τ )m.
3. If (x, y) is drawn iid from a distribution Pr(x, y), with Pr(y|x) continuous and the expectation of
the modulus of absolute continuity of its density satisfying lim_{δ→0} E[ε(δ)] = 0, then with probability 1,
asymptotically, m−/m equals τ.
Proof For the first two claims, denote by f* the minimizer of Rreg[f] with f* = g* + b*. Then Rreg[g* + b] has
to be minimal for b = b*. With respect to b, however, minimizing Rreg amounts to finding the τ quantile in
terms of yi − g(xi). Application of Lemma 2 proves the first two parts of the claim.
For the third claim, an analogous reasoning to (Schölkopf et al., 2000, Proposition 1) applies. In a nut-
shell, one uses the fact that the measure of the δ-neighborhood of f(x) converges to 0 for δ → 0. Moreover,
for kernel functions the entropy numbers are well behaved (Williamson et al. (2001)). The application of the
union bound over a cover of such function classes completes the proof. Details are omitted, as the proof is
identical to that in Schölkopf et al. (2000).
Later, in Section 4 we discuss finite sample size results regarding the convergence of m−/m → τ and re-
lated quantities. These statements will make use of scale sensitive loss functions. Before we do that, let us
consider the practical problem of minimizing the regularized risk functional.

2.3 Dual Optimization Problem


Here we compute the dual optimization problem to (4) for efficient numerical implementation. Using the
connection between RKHS and feature spaces we write f(x) = ⟨φ(x), w⟩ + b and we obtain the following
equivalent to minimizing Rreg[f].
minimize_{w, b, ξ_i^{(*)}}   C Σ_{i=1}^m [τ ξi + (1 − τ) ξi*] + (1/2) ‖w‖²   (5a)
subject to   yi − ⟨φ(xi), w⟩ − b ≤ ξi   and   ⟨φ(xi), w⟩ + b − yi ≤ ξi*   where ξi, ξi* ≥ 0   (5b)

Here we used C := 1/(λm). The dual of this problem can be computed straightforwardly using Lagrange
multipliers. The dual constraints for ξ and ξ ∗ can be combined into one variable. This yields the following
dual optimization problem
minimize_α   (1/2) α⊤ K α − α⊤ y   subject to   C(τ − 1) ≤ αi ≤ Cτ for all 1 ≤ i ≤ m and 1⊤ α = 0.   (6)
We recover f via the familiar kernel expansion

w = Σ_i αi φ(xi)   or equivalently   f(x) = Σ_i αi k(xi, x) + b.   (7)

Note that the constant b is the dual variable to the constraint 1⊤α = 0. Alternatively, b can be obtained
by using the fact that f(xi) = yi for αi ∉ {C(τ − 1), Cτ}. The latter holds as a consequence of the
KKT conditions on the primal optimization problem of minimizing Rreg[f].
Note that the optimization problem is very similar to that of an ε-SV regression estimator (Vapnik et al.
(1997)). The key difference between the two estimation problems is that in ε-SVR we have an additional
ε‖α‖₁ penalty in the objective function. This ensures that observations with small deviations from the estimate,
i.e. with |yi − f(xi)| < ε, do not appear in the support vector expansion.


Figure 3: The data set measures acceleration in the head of a crash test dummy vs. time in tests of motorcycle
crashes. Shown are three regularized versions of the median regression estimate (τ = 0.5). While all three
variants satisfy the quantile property, the degree of smoothness is controlled by the regularization constant λ.
All three estimates compare favorably to a similar graph of nonlinear QR estimates reported in Koenker (2005).

Moreover, the upper and lower constraints on the Lagrange multipliers αi are matched. This means that we
balance excess in both directions. The latter is useful for a regression estimator. In our case, however, we
obtain an estimate which penalizes loss unevenly, depending on whether f(x) exceeds y or vice versa. This
is exactly what we want from a quantile estimator: by this procedure errors in one direction have a larger
influence than those in the converse direction, which leads to the shifted estimate we expect from QR.
The practical advantage of (6) is that it can be solved directly with standard quadratic programming code
rather than using pivoting, as is needed in SVM regression (Vapnik et al. (1997)). Figure 3 shows how QR
behaves subject to changing the model class, that is, subject to changing the regularization parameter. All
three estimates in Figure 3 attempt to compute the median subject to different smoothness constraints. While
they all satisfy the quantile property of having a fraction of τ = 0.5 points on either side of the regression,
they track the observations more or less closely. Therefore a practical estimate requires a procedure for
setting the regularization parameter. This question is taken up again in Section 5 where we compute
quantile regression estimates on a range of datasets.
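To make the computation concrete, the dual (6) can be handed to an off-the-shelf QP solver. The Python sketch below uses cvxopt and a Gaussian RBF kernel; the function names, the small ridge added to K, and the default parameters are our own choices for illustration, not the paper's implementation:

import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel(X1, X2, sigma=1.0):
    # Gaussian RBF kernel matrix, k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def npqr_fit(X, y, tau, C=1.0, sigma=1.0):
    # Solve (6): min_a 0.5 a'Ka - a'y  s.t.  C(tau - 1) <= a_i <= C tau,  1'a = 0
    m = len(y)
    K = rbf_kernel(X, X, sigma)
    P = matrix(K + 1e-8 * np.eye(m))                 # small ridge for numerical stability
    q = matrix(-y.astype(float))
    G = matrix(np.vstack([np.eye(m), -np.eye(m)]))   # box constraints written as Gx <= h
    h = matrix(np.hstack([C * tau * np.ones(m), C * (1.0 - tau) * np.ones(m)]))
    A = matrix(np.ones((1, m)))
    b_eq = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b_eq)
    alpha = np.array(sol['x']).ravel()
    # offset b from the KKT conditions: f(x_i) = y_i whenever alpha_i is strictly inside the box
    inside = (alpha > C * (tau - 1) + 1e-6) & (alpha < C * tau - 1e-6)
    b = float(np.mean(y[inside] - K[inside] @ alpha))
    return alpha, b

def npqr_predict(alpha, b, Xtrain, Xtest, sigma=1.0):
    # f(x) = sum_i alpha_i k(x_i, x) + b, cf. (7)
    return rbf_kernel(Xtest, Xtrain, sigma) @ alpha + b

The offset b is recovered from the KKT conditions, using points whose αi lie strictly inside (C(τ − 1), Cτ).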

3. Extensions and Modifications


The mathematical programming framework lends itself naturally to a series of extensions and modifications
of the regularized risk minimization framework for quantile regression. In the following we discuss several
of them.

3.1 Non-crossing constraints


When we want to estimate several conditional quantiles (e.g. τ = 0.1, 0.2, . . . , 0.9), two or more estimated
conditional quantile functions can cross or overlap. This embarrassing phenomenon, called quantile crossing,
occurs because each conditional quantile function is estimated independently (Koenker (2005); He (1997)).
Figure 4(a) shows the BMD data presented in Section 1.3.2 and the τ = 0.1, 0.2, . . . , 0.9 conditional quantile
functions estimated by the kernel-based estimator described in the previous section. Both the input and the output
variables are standardized to [0, 1]. We note quantile crossings at several places, especially outside


of the training data range (x < 0 and x > 1). In this subsection, we address this problem by introducing
non-crossing constraints. Figure 4(b) shows a family of conditional quantile functions estimated with the
non-crossing constraints.
Suppose that we want to estimate n conditional quantiles at 0 < τ1 < τ2 < . . . < τn < 1. We enforce
non-crossing constraints at l points {x_j}_{j=1}^l in the input domain X. Let us write the model for the τh-th
conditional quantile function as fh(x) = ⟨φ(x), wh⟩ + bh for h = 1, 2, . . . , n. In H the non-crossing
constraints are represented as the linear constraints

⟨φ(xj), wh⟩ + bh ≤ ⟨φ(xj), w_{h+1}⟩ + b_{h+1},   for all 1 ≤ h ≤ n − 1, 1 ≤ j ≤ l.   (8)

Solving (5) or (6) for 1 ≤ h ≤ n with non-crossing constraints (8) allows us to estimate n conditional
quantile functions not crossing at l points x1 , . . . , xl ∈ X. The primal optimization problem is given by
minimize_{w_h, b_h, ξ_{hi}^{(*)}}   Σ_{h=1}^n [ C Σ_{i=1}^m (τh ξ_{hi} + (1 − τh) ξ*_{hi}) + (1/2) ‖wh‖² ]   (9a)
subject to   yi − ⟨φ(xi), wh⟩ − bh = ξ_{hi} − ξ*_{hi}   where ξ_{hi}, ξ*_{hi} ≥ 0, for all 1 ≤ h ≤ n, 1 ≤ i ≤ m,   (9b)
             {⟨φ(xj), w_{h+1}⟩ + b_{h+1}} − {⟨φ(xj), wh⟩ + bh} ≥ 0,   for all 1 ≤ h ≤ n − 1, 1 ≤ j ≤ l.   (9c)

Using Lagrange multipliers, we can obtain the dual optimization problem:

minimize_{α_h, θ_h}   Σ_{h=1}^n [ (1/2) αh⊤ K αh + αh⊤ K̃ (θ_{h−1} − θh) + (1/2) (θ_{h−1} − θh)⊤ K̄ (θ_{h−1} − θh) − αh⊤ y ]   (10a)
subject to   C(τh − 1) ≤ α_{hi} ≤ Cτh,   for all 1 ≤ h ≤ n, 1 ≤ i ≤ m,   (10b)
             θ_{hj} ≥ 0, for all 1 ≤ h ≤ n, 1 ≤ j ≤ l,   and   1⊤ αh = 0, for all 1 ≤ h ≤ n,   (10c)

where θ_{hj} is the Lagrange multiplier of (9c) for all 1 ≤ h ≤ n, 1 ≤ j ≤ l, K̃ is the m × l matrix with (i, j)-th
entry k(xi, xj), K̄ is the l × l matrix with (j1, j2)-th entry k(x_{j1}, x_{j2}), and θh is the l-vector with j-th entry
θ_{hj} for all 1 ≤ h ≤ n. For notational convenience we define θ_{0j} = θ_{nj} = 0 for all 1 ≤ j ≤ l. The model
for the τh-th conditional quantile function is now represented as
fh(x) = Σ_{i=1}^m α_{hi} k(x, xi) + Σ_{j=1}^l (θ_{h−1,j} − θ_{hj}) k(x, xj) + bh.   (11)

In Section 5.2.1 we empirically investigate the effect of non-crossing constraints on the generalization
performance.

3.2 Monotonicity and Growth Curves


Consider the situation of a health statistics office which wants to produce growth curves. That is, it wants
to generate estimates of y being the height of a child given parameters x such as age, ethnic background,
gender, parent’s height, etc. Such curves can be used to assess whether a child’s growth is abnormal.
A naive approach is to apply QR directly to the problem of estimating y|x. Note, however, that we
have additional information about the biological process at hand: the height of every individual child is a
monotonically increasing function of age. Without observing large amounts of data, there is no guarantee
that the estimates f(x) will also be monotonic functions of age.


(a) Without non-crossing constraints (b) With non-crossing constraints

Figure 4: An example of the quantile crossing problem in the BMD data set presented in Section 1. Both the
input and the output variables are standardized to [0, 1]. In (a) the set of conditional quantiles at
0.1, 0.2, . . . , 0.9 is estimated by the kernel-based estimator presented in the previous section.
Quantile crossings are found at several points, especially outside the training data range
(x < 0 and x > 1). The curves plotted in (b) are the conditional quantile functions obtained
with the non-crossing constraints explained in Section 3.1. There is no quantile crossing even
outside the training data range.

To address this problem we adopt an approach similar to Vapnik et al. (1997); Smola and Schölkopf
(1998) and impose constraints on the derivatives of f directly. While this only ensures that f is mono-
tonic on the observed data X, we could always add more locations x′_i for the express purpose of enforcing
monotonicity.
Formally, we require that for a differential operator D, such as D = ∂_{x_age}, the estimate satisfies Df(x) ≥ 0 for
all x ∈ X. Using the linearity of inner products we have

Df(x) = D(⟨φ(x), w⟩ + b) = ⟨Dφ(x), w⟩ = ⟨ψ(x), w⟩   where ψ(x) := Dφ(x).   (12)

Note that accordingly inner products between ψ and φ can be obtained via ⟨ψ(x), φ(x′)⟩ = D1 k(x, x′) and
⟨ψ(x), ψ(x′)⟩ = D1 D2 k(x, x′), where D1 and D2 denote the action of D on the first and second argument
of k respectively. Consequently the optimization problem (5) acquires an additional set of constraints and
we need to solve

minimize_{w, b, ξ_i}   C Σ_{i=1}^m [τ ξi + (1 − τ) ξi*] + (1/2) ‖w‖²   (13)
subject to   yi − ⟨φ(xi), w⟩ − b ≤ ξi,   ⟨φ(xi), w⟩ + b − yi ≤ ξi*   and   ⟨ψ(xi), w⟩ ≥ 0   where ξi, ξi* ≥ 0.


Figure 5: Example plots from quantile regression with and without monotonicity constraints. The thin line
represents the nonparametric quantile regression without monotonicity constraints whereas the
thick line represents the nonparametric quantile regression with monotonicity constraints.

Since the additional constraint does not depend on b it is easy to see that the quantile property still holds.
The dual optimization problem yields
minimize_{α, β}   (1/2) [α; β]⊤ [ K, D1K; D2K, D1D2K ] [α; β] − α⊤ y   (14a)
subject to   C(τ − 1) ≤ αi ≤ Cτ and 0 ≤ βi for all 1 ≤ i ≤ m and 1⊤ α = 0.   (14b)

Here D1K is shorthand for the matrix with entries D1 k(xi, xj), and D2K, D1D2K are defined analogously.
Here w = Σ_i [αi φ(xi) + βi ψ(xi)] or equivalently f(x) = Σ_i [αi k(xi, x) + βi D1 k(xi, x)] + b.
Example Assume that x ∈ R^n and that x1 is the coordinate with respect to which we wish to enforce
monotonicity. Moreover, assume that we use a Gaussian RBF kernel, that is

k(x, x′) = exp( −‖x − x′‖² / (2σ²) ).   (15)

In this case D1 = ∂1 with respect to x and D2 = ∂1 with respect to x′. Consequently we have

D1 k(x, x′) = (x′1 − x1)/σ² · k(x, x′);   D2 k(x, x′) = (x1 − x′1)/σ² · k(x, x′)   (16a)
D1 D2 k(x, x′) = [ σ^{−2} − (x1 − x′1)²/σ⁴ ] k(x, x′).   (16b)

Plugging the values of (16) into (14) yields the quadratic program. Note also that both k(x, x′) and D1 k(x, x′)
from (16a) are used in the function expansion.
If x1 were drawn from a discrete (yet ordered) domain we could replace D1, D2 with a finite difference
operator. This is still a linear operation on k and consequently the optimization problem remains unchanged
besides a different functional form for D1 k.
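For concreteness, the derivative kernels (16a) and (16b) of the Gaussian RBF kernel (15) can be coded directly. The Python sketch below is our own illustration; the parameter coord, which selects the coordinate with respect to which monotonicity is enforced, is a name we introduce here:

import numpy as np

def rbf(x, xp, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), cf. (15)
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def d1_rbf(x, xp, sigma=1.0, coord=0):
    # derivative of k with respect to x_coord, cf. (16a)
    return (xp[coord] - x[coord]) / sigma ** 2 * rbf(x, xp, sigma)

def d2_rbf(x, xp, sigma=1.0, coord=0):
    # derivative of k with respect to x'_coord, cf. (16a)
    return (x[coord] - xp[coord]) / sigma ** 2 * rbf(x, xp, sigma)

def d1d2_rbf(x, xp, sigma=1.0, coord=0):
    # mixed second derivative, cf. (16b)
    return (1.0 / sigma ** 2 - (x[coord] - xp[coord]) ** 2 / sigma ** 4) * rbf(x, xp, sigma)

These entries populate the blocks D1K, D2K and D1D2K of the quadratic form in (14).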

3.3 Other Function Classes


Semiparametric Estimates RKHS expansions may not be the only function classes desired for quantile
regression. For instance, in the social sciences a semiparametric model may be more desirable, as it allows
for interpretation of the linear coefficients (Gu and Wahba (1993); Smola et al. (1999); Bickel et al. (1994)).
In this case we add a set of parametric functions fi and solve
minimize   (1/m) Σ_{i=1}^m lτ(yi − f(xi)) + (λ/2) ‖g‖²_H   where f(x) = g(x) + Σ_{i=1}^n βi fi(x) + b.   (17)

For instance, the function class fi could consist of linear coordinate functions, that is, fi(x) = xi. The main
difference to (6) is that the resulting optimization problem exhibits a larger number of equality constraints.
We obtain (6) with the additional constraints
Σ_{j=1}^m αj fi(xj) = 0   for all i.   (18)

Linear Programming Regularization Convex function classes with ℓ1 penalties can be obtained by im-
posing an ‖α‖₁ penalty instead of the ‖g‖²_H penalty in the optimization problem. The advantage of this
setting is that the minimization problem

minimize   (1/m) Σ_{i=1}^m lτ(yi − f(xi)) + λ Σ_{i=1}^n |αi|   where f(x) = Σ_{i=1}^n αi fi(x) + b   (19)

is a linear program which can be solved efficiently by existing codes for large scale problems. In the context
of (19) the functions fi constitute the generators of the convex function class. This approach is similar to
Koenker and Park (1996) and Bosch et al. (1995). The former discuss ℓ1 regularization of expansion coeffi-
cients whereas the latter discuss an explicit second order smoothing spline method for the purpose of quantile
regression. Most of the discussion in the present paper can be adapted to this case without much modifica-
tion. For details on how to achieve this see Schölkopf and Smola (2002). Note that smoothing splines are a
special instance of kernel expansions where one assumes explicit knowledge of the basis functions.
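As an illustration of why (19) is a linear program, the following Python sketch poses it in standard LP form and hands it to scipy.optimize.linprog; the variable splitting, function name and solver choice are our own (a sketch under these assumptions, not the implementation used in the paper):

import numpy as np
from scipy.optimize import linprog

def pinball_l1_fit(F, y, tau, lam):
    # Fit f(x) = sum_j beta_j f_j(x) + b under the l1-regularized pinball loss (19).
    # F is the m x n design matrix with F[i, j] = f_j(x_i).
    m, n = F.shape
    # variables: beta+ (n), beta- (n), b+, b-, xi+ (m), xi- (m), all nonnegative
    c = np.concatenate([lam * np.ones(2 * n), [0.0, 0.0],
                        tau / m * np.ones(m), (1.0 - tau) / m * np.ones(m)])
    # residual split: F (beta+ - beta-) + (b+ - b-) + xi+ - xi- = y
    A_eq = np.hstack([F, -F, np.ones((m, 1)), -np.ones((m, 1)), np.eye(m), -np.eye(m)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    z = res.x
    beta = z[:n] - z[n:2 * n]
    b = z[2 * n] - z[2 * n + 1]
    return beta, b

At the optimum only one element of each pair (beta+_j, beta-_j) and (xi+_i, xi-_i) is nonzero, so the objective value equals the ℓ1-penalized empirical pinball risk of (19).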


Relevance Vector Regularization and Sparse Coding Finally, for sparse expansions one can use more
aggressive penalties on linear function expansions than those given in (19). For instance, we could use
a staged regularization as in the RVM (Tipping (2001)), where a quadratic penalty on each coefficient is
exerted with a secondary regularization on the penalty itself. This corresponds to a Student-t penalty on α.
Likewise we could use a mix between an ℓ1 and ℓ0 regularizer as used in Fung et al. (2002) and apply
successive linear approximation. In short, there exists a large number of regularizers, and (non)parametric
families which can be used. In this sense the RKHS parameterization is but one possible choice. Even so,
we show in Section 5 that QR using the RKHS penalty yields excellent performance in experiments.

4. Theoretical Analysis
4.1 Performance Indicators
In this section we state some performance bounds for our estimator. For this purpose we first need to discuss
how to evaluate the performance of the estimate f versus the true conditional quantile µτ (x). Two criteria
are important for a good quantile estimator fτ :
• fτ needs to satisfy the quantile property as well as possible. That is, we want that

Pr_{X,Y} { | Pr{y < fτ(x)} − τ | ≥ ε } ≤ δ.   (20)

In other words, we want that the probability that y < fτ(x) does not deviate from τ by more than ε
with high probability, when viewed over all draws (X, Y ) of training data. Note however, that (20)
does not imply having a conditional quantile estimator at all. For instance, the constant function based
on the unconditional quantile estimator with respect to Y performs extremely well under this criterion.
Hence we need a second quantity to assess how closely fτ (x) tracks µτ (x).
• Since µτ itself is not available, we take recourse to (3) and the fact that µτ is the minimizer of the
expected risk R[f ]. While this will not allow us to compare µτ and fτ directly, we can at least compare
it by assessing how close to the minimum R[fτ∗ ] the estimate R[fτ ] is. Here fτ∗ is the minimizer of
R[f ] with respect to the chosen function class. Hence we will strive to bound
Pr_{X,Y} { R[fτ] − R[fτ*] > ε } ≤ δ.   (21)

These statements will be given in terms of the Rademacher complexity of the function class of the estimator
as well as some properties of the loss function used to select it. The technique itself is standard and we believe
that the bounds can be tightened considerably by the use of localized Rademacher averages (Mendelson
(2003)), or similar tools for empirical processes. However, for the sake of simplicity, we use the tools from
Bartlett and Mendelson (2002), as the key point of the derivation is to describe a new setting rather than a
new technique.

4.2 Bounding R[fτ∗ ]


Definition 4 (Rademacher Complexity) Let X := {x1, . . . , xm} be drawn iid from p(x) and let F be a
class of functions mapping from X to R. Let σi be independent uniform {±1}-valued random variables.
Then the Rademacher complexity Rm and its empirical variant R̂m are defined as follows:

R̂m(F) := E_σ [ sup_{f∈F} (2/m) Σ_{i=1}^m σi f(xi) | X ]   and   Rm(F) := E_X [ R̂m(F) ].   (22)


Conveniently, if Φ is a Lipschitz continuous function with Lipschitz constant L, one can show (Bartlett and
Mendelson (2002)) that

Rm(Φ ◦ F) ≤ 2L Rm(F)   where Φ ◦ F := {g | g = Φ ◦ f and f ∈ F}.   (23)

An analogous result exists for empirical quantities bounding R̂m (Φ ◦ F) ≤ 2LR̂m (F). The combination of
(23) with (Bartlett and Mendelson, 2002, Theorem 8) yields:
Theorem 5 (Concentration for Lipschitz Continuous Functions) For any Lipschitz continuous function
Φ with Lipschitz constant L, a function class F of real-valued functions on X, and a probability measure
on X, the following bound holds with probability 1 − δ for all draws of X from X:

sup_{f∈F} [ E_x[Φ(f(x))] − (1/m) Σ_{i=1}^m Φ(f(xi)) ] ≤ 2L Rm(F) + √( 8 log(2/δ) / m ).   (24)

We can immediately specialize the theorem to the following statement about the loss for QR:

Theorem 6 Denote by fτ* the minimizer of R[f] with respect to f ∈ F. Moreover assume that all f ∈ F
are uniformly bounded by some constant B. With the conditions listed above, for any sample size m and
0 < δ < 1, every quantile regression estimate fτ satisfies with probability at least (1 − δ)

R[fτ] − R[fτ*] ≤ 2L Rm(F) + (4 + LB) √( log(2/δ) / (2m) )   where L = max{τ, 1 − τ}.   (25)
Proof We use the standard bounding trick that

R[fτ] − R[fτ*] ≤ |R[fτ] − Remp[fτ]| + Remp[fτ*] − R[fτ*]   (26)
             ≤ sup_{f∈F} |R[f] − Remp[f]| + Remp[fτ*] − R[fτ*]   (27)

where (26) follows from Remp[fτ] ≤ Remp[fτ*]. The first term can be bounded directly by Theorem 5.
For the second part we use Hoeffding's bound (Hoeffding (1963)), which states that the deviation between a
bounded random variable and its expectation exceeds B √( log(1/δ) / (2m) ) with probability at most δ. Applying a union
bound argument for the two terms with probabilities 2δ/3 and δ/3 yields the confidence-dependent term.
Finally, using the fact that lτ is Lipschitz continuous with L = max(τ, 1 − τ) completes the proof.

Example Assume that H is an RKHS with a radial basis function kernel k for which k(x, x) = 1. Moreover
assume that for all f ∈ F we have ‖f‖_H ≤ C. In this case it follows from Mendelson (2003) that Rm(F) ≤
2C/√m. This means that the bounds of Theorem 6 translate into a rate of convergence of

R[fτ] − R[fτ*] = O(m^{−1/2}).   (28)

This is as good as it gets for nonlocalized estimates. Since we do not expect R[f ] to vanish except for patho-
logical applications where quantile regression is inappropriate (that is, cases where we have a deterministic
dependency between y and x), the use of localized estimates Bartlett et al. (2002) provides only limited re-
turns. We believe, however, that the constants in the bounds could benefit from considerable improvement.
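For the RKHS-ball example above, the empirical Rademacher average has a simple closed form given the data: sup_{‖f‖_H ≤ C} Σ_i σi f(xi) = C √(σ⊤Kσ), so R̂m(F) = (2C/m) E_σ[√(σ⊤Kσ)], which can be estimated by Monte Carlo. The Python sketch below is our own illustration of this standard computation, not something taken from the paper:

import numpy as np

def empirical_rademacher_rkhs_ball(K, C, n_draws=1000, seed=0):
    # Monte Carlo estimate of R_hat_m(F) for F = {f in H : ||f||_H <= C},
    # using sup_{||f|| <= C} sum_i sigma_i f(x_i) = C * sqrt(sigma' K sigma)
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    quad = np.einsum('ij,jk,ik->i', sigma, K, sigma)   # sigma' K sigma for each draw
    vals = np.sqrt(np.maximum(quad, 0.0))              # guard against tiny negative round-off
    return 2.0 * C / m * vals.mean()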


r_ε^+(ξ) := min{1, max{0, 1 − ξ/ε}}   (29a)
r_ε^−(ξ) := min{1, max{0, −ξ/ε}}   (29b)

Figure 6: Ramp functions bracketing the characteristic function via r_ε^+ ≥ χ_{(−∞,0]} ≥ r_ε^−.

4.3 Bounds on the Quantile Property


The theorem of the previous section gave us some idea about how far the sample average quantile loss is
from its true value under p. We now proceed to stating bounds on the degree to which fτ satisfies the quantile
property, i.e. (20).
In this view (20) is concerned with the deviation E[χ_{(−∞,0]}(y − fτ(x))] − τ. Unfortunately χ_{(−∞,0]} ◦ F
is not scale sensitive. In other words, small changes in fτ(x) around the point y = fτ(x) can have large
impact on (20). One solution to this problem is to use an artificial margin ε and the ramp functions r_ε^+, r_ε^− as
defined in Figure 6. These functions are Lipschitz continuous with constant L = 1/ε. This leads to:

Theorem 7 Under the assumptions of Theorem 6 the expected quantile is bounded with probability 1 − δ
each from above and below by

(1/m) Σ_{i=1}^m r_ε^−(yi − f(xi)) − ∆ ≤ E[χ_{(−∞,0]}(y − fτ(x))] ≤ (1/m) Σ_{i=1}^m r_ε^+(yi − f(xi)) + ∆,   (30)

where the statistical confidence term is given by ∆ = (2/ε) Rm(F) + √( −8 log δ / m ).

Proof The claim follows directly from Theorem 5 and the Lipschitz continuity of r_ε^+ and r_ε^−. Note that r_ε^+
and r_ε^− majorize and minorize χ_{(−∞,0]} respectively, which bounds the expectations. Next use a Rademacher bound on
the class of loss functions induced by r_ε^+ ◦ F and r_ε^− ◦ F and note that the ramp loss has Lipschitz constant
L = 1/ε. Finally apply the union bound on upper and lower deviations.
Note that Theorem 7 allows for some flexibility: we can decide to use a very conservative bound in terms of
ε, i.e. a large value of ε, to reap the benefits of having a ramp function with small L. This leads to a smaller
Rademacher average for the induced function class. Likewise, a small ε amounts to a potentially
tight approximation of the empirical quantile, while risking loose statistical confidence terms.
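The empirical averages appearing in (30) are straightforward to compute; the Python sketch below (our own illustration) evaluates the ramp functions (29) and the data-dependent part of the sandwich, leaving out the confidence term ∆:

import numpy as np

def ramp_plus(xi, eps):
    # r_eps^+(xi) = min{1, max{0, 1 - xi/eps}}, cf. (29a)
    return np.clip(1.0 - xi / eps, 0.0, 1.0)

def ramp_minus(xi, eps):
    # r_eps^-(xi) = min{1, max{0, -xi/eps}}, cf. (29b)
    return np.clip(-xi / eps, 0.0, 1.0)

def empirical_quantile_sandwich(y, f_x, eps):
    # lower and upper empirical ramp averages from (30), without the Delta term
    resid = y - f_x
    return ramp_minus(resid, eps).mean(), ramp_plus(resid, eps).mean()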

5. Experiments
5.1 Experiments with standard nonparametric quantile regression
The present section mirrors the theoretical analysis of the previous section. We check the performance of
various quantile estimators with respect to two criteria:
• Expected risk with respect to the lτ loss function. Since computing the true conditional quantile is
impossible and all approximations of the latter rely on intermediate density estimation (Koenker (2005)),
this is the only objective criterion we could find.


• Simultaneously we need to ensure that the estimate satisfies the quantile property, that is, we want
to ensure that the estimator we obtained does indeed produce numbers fτ (x) which exceed y with
probability close to τ .

5.1.1 Models
We compare the following four models:
• An unconditional quantile estimator. Given the simplicity of the function class (constants!) this model
should tend to underperform all other estimates in terms of minimizing the empirical risk. By the same
token, it should perform best in terms of preserving the quantile property.
• Linear QR as described in Koenker (2005). This uses a linear unregularized model to minimize lτ.
In the experiments, we used the rq routine available in the quantreg package for R.2
• Nonparametric QR as described by Koenker (2005) (Ch. 7). This uses a spline model for each coordi-
nate individually, with linear effect. The fitting routine used was rqss, also available in quantreg.3
• Nonparametric quantile regression as described in Section 2. We used Gaussian RBF kernels with au-
tomatic kernel width (ω²) and regularization (C) adjustment by 10-fold cross-validation. This appears
as npqr.4
As we increase the complexity of the function class (from constant to linear to nonparametric) we expect
that (subject to good capacity control) the expected risk will decrease. Simultaneously we expect that the
quantile property becomes less and less maintained, as the function class grows. This is exactly what one
would expect from Theorems 6 and 7. As the experiments show, the npqr method outperforms all other
estimators significantly in most cases. Moreover, it compares favorably in terms of preserving the quantile
property.

5.1.2 Datasets
We chose 20 regression datasets from the following R packages: mlbench, quantreg, alr3 and
MASS. The first library contains datasets from the UCI repository. The last two were made available as
illustrations for regression textbooks. The data sets are all documented and available in R. Data sets were
chosen not to have any missing variables, to have suitable datatypes, and to be of a size where all models
would run on them. 5 In most cases either there was an obvious variable of interest, which was selected
as the y-variable, or else we chose a continuous variable arbitrarily. The sample sizes vary from m = 38
(CobarOre) to m = 1375 (heights), and the number of regressors varies from d = 1 (5 sets) to d =
13 (BostonHousing). Some of the data sets contain categorical variables. We omitted variables which
were effectively record identifiers, or obviously produced very small groupings of records. Finally, we
standardized all datasets coordinate-wise to have zero mean and unit variance before running the algorithms.
This had a side benefit of putting the pinball loss on similar scale for comparison purposes.
2. See http://cran.r-project.org/
3. Additional code containing bugfixes and other operations necessary to carry out our experiments is available at
http://users.rsise.anu.edu.au/~timsears.
4. Code will be available as part of the CREST toolbox for research purposes.
5. The last requirement proved to be challenging for rqss. The underlying spline routines do not allow extrapolation beyond
the previously seen range of a coordinate, only permitting interpolation. This does not prevent fitting, but does randomly prevent
forecasting on unseen examples, which was part of our performance metric.


Data Set Sample Size No. Regressors (x) Y Var. Dropped Vars.
caution 100 2 y -
ftcollinssnow 93 1 Late YR1
highway 39 11 Rate -
heights 1375 1 Dheight -
sniffer 125 4 Y -
snowgeese 45 4 photo -
ufc 372 4 Height -
birthwt 189 7 bwt ftv, low
crabs 200 6 CW index
GAGurine 314 1 GAG -
geyser 299 1 waiting -
gilgais 365 8 e80 -
topo 52 2 z -
BostonHousing 506 13 medv -
CobarOre 38 2 z -
engel 235 1 y -
mcycle 133 1 accel -
BigMac2003 69 9 BigMac City
UN3 125 6 Purban Locality
cpus 209 7 estperf name

Table 1: Dataset facts


5.1.3 Results
We tested the performance of the 4 algorithms on 3 different quantiles (τ ∈ {0.1, 0.5, 0.9}). For each model
we used 10-fold cross-validation to assess the confidence of our results. For the npqr model, kernel width
and smoothness parameters were automatically chosen by cross-validation within the training sample. We
performed 10 runs on the training set to adjust parameters, then chose the best parameter setting based on
the pinball loss averaged over 10 splits. To compare across all four models we measured both pinball loss
and quantile performance.
The full results are shown in Appendix B. The 20 data sets and three quantile levels yield 60 trials for
each model. In terms of pinball loss averaged across 10 tests the npqr model performed best or tied on
51 of the 60 trials, showing the clear advantage of the proposed method. The results are consistent across
quantile levels. We can get another impression of performance by looking at the loss in each of the 10 test
runs that enter each trial. This is depicted in Figure 7. In a large majority of test cases the npqr model
error is smaller than that of the other models, resulting in a "cloud" centered below the 45 degree line.
Moreover, the quantile properties of all four methods are comparable. All four models produced ramp
losses close to the desired quantile, although the rqss and npqr models were noisier in this regard. The
complete results for the ramp loss are presented in the last three tables in Appendix B. A slight downward bias
seen in all models is reviewed in the Discussion.

5.2 Experiments on nonparametric quantile regression with additional constraints


We empirically investigate the performance of the nonparametric quantile regression estimator with the addi-
tional constraints described in Section 3. Imposing constraints is one way to introduce prior knowledge
about the data set being analyzed. Although additional constraints always increase training errors, we will see
that these constraints can sometimes reduce test errors.

5.2.1 Non-crossing Constraints


First we look at the effect of non-crossing constraints on the generalization performance. We used the same
20 data sets mentioned in the previous subsection, using only the npqr model. We denote the npqr models trained
with non-crossing constraints as noncross, while npqr indicates the standard one here. We made comparisons
between npqr and noncross with τ ∈ {0.1, 0.5, 0.9}. The results for noncross with τ = 0.1 were
obtained by training a pair of non-crossing models with τ = 0.1 and 0.2. The results with τ = 0.5 were
obtained by training three non-crossing models with τ = 0.4, 0.5 and 0.6. The results with τ = 0.9 were
obtained by training a pair of non-crossing models with τ = 0.8 and 0.9. In this experiment, we simply
impose non-crossing constraints only at a single test point to be evaluated. The kernel width and smoothing
parameter were always set to the values selected in the standard npqr experiments above. The confidences
were assessed by 10-fold cross-validation in the same way as in the previous section. The complete results are
found in the tables in Appendix B. The performances of npqr and noncross are quite similar, since npqr
itself already produces almost non-crossing estimates and the constraints make only small adjustments
when violations happen to occur.

5.2.2 Monotonicity Constraints


We compare two models:
• Nonparametric QR as described in Section 2 (npqr).
• Nonparametric QR with monotonicity constraints as described in Section 3.2 (npqrm).


Figure 7: A log-log plot of out-of-sample performance obtained from the cross-validation runs. The plots
show npqr versus uncond, linear, and rqss, combining the values from all three estimated quantiles.
Each point below the 45-degree line represents a case where npqr achieves a better loss than
the alternative, and vice versa. The location of the cloud provides an impression of the relative
generalization performance of each pair of models.

We use two datasets:


• The cars dataset as described in Mammen et al. (2001). Fuel efficiency (in miles per gallon) is studied
as a function of engine output.


• The onions dataset as described in Ruppert and Carroll (2003). log(Yield) is studied as a function of
density; we use only the measurements taken at Purnong Landing.
We tested the performance of the two methods on 3 different quantiles (τ ∈ {0.1, 0.5, 0.9}). In the exper-
iments with cars, we noticed that the data is not truly monotonic. Monotonic models (npqrm) tend to do
worse than standard models (npqr) for lower quantiles. With higher quantiles, npqrm tends to do better
than the standard npqr.
For the onions dataset, as the data is truly monotonic, npqrm does better than the standard npqr
in terms of the pinball loss.

6. Discussion and Extensions


Frequently in the literature of regression, including quantile regression, we encounter the term "exploratory
data analysis". This is meant to describe a phase before the user has settled on a "model", after which some
statistical tests are performed, justifying the choice of the model. Quantile regression, which allows the user
to highlight many aspects of the distribution, is indeed a useful tool for this type of analysis. We also note
that no attempts at statistical modeling beyond automatic parameter choice via cross-validation were made
to tune the results. So the effort here stays true to that spirit, yet may provide useful estimates immediately.
In the Machine Learning literature the emphasis is more on short-circuiting many aspects of the modeling
process. While not truly model-free, the experience of comparing the models in this paper shows how easy
it is to estimate the quantities of interest in QR, without any of the angst of model selection. It is interesting
to consider whether kernel methods, with proper regularization, are a good substitute for some traditional
modeling activity. In particular we were able to outperform some simpler traditional statistical estimates
significantly, which allows the human modeler to focus on statistical concerns at a higher level.
In summary, we have presented a Quadratic Programming method for estimating quantiles which bests
the state of the art in statistics. It is easy to implement, and we provide uniform convergence results and
experimental evidence for its soundness. We also introduce non-crossing and monotonicity constraints as
extensions to avoid embarrassing behaviours of the model when doing quantile regression.
Overly Optimistic Estimates for Ramp Loss The experiments show us that there is a bias towards
the median in terms of the ramp loss. For example, if we run a quantile estimator at τ = 0.05, then we will
not necessarily get an empirical quantile of 0.05; it is more likely to be at 0.08 or higher. Likewise,
the empirical quantile will be 0.87 or lower if the estimator is run at τ = 0.9. This affects all estimators using
the pinball loss as the loss function, not just the kernel version.
This is because the algorithm tends to aggressively push a number of points to the kink in the training
set, and these points are then miscounted. However, in the test set it is very unlikely that points lie
exactly at the kink. Figure 8 shows that there is a linear relationship between the fraction of points at
and below the kink (for low quantiles), or below the kink (for higher quantiles), and the empirical ramp
loss.
Accordingly, to get a better performance in terms of the ramp loss, we just estimate the quantiles, and
if they turn out to be too optimistic on the training set, we use a slightly lower (for τ < 0.5) or higher (for
τ > 0.5) value of τ until we have exactly the right quantity.
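One possible reading of this calibration heuristic is sketched below in Python; it reuses the hypothetical npqr_fit/npqr_predict helpers from the sketch in Section 2.3 and adjusts τ until the fraction of training points strictly below the fit matches the target (step size and tolerance are arbitrary choices of ours):

import numpy as np

def calibrate_tau(X, y, tau_target, C=1.0, sigma=1.0, step=0.01, tol=0.005, max_iter=50):
    # shift the training quantile level until the observed below-fraction matches tau_target
    tau = tau_target
    for _ in range(max_iter):
        alpha, b = npqr_fit(X, y, tau, C=C, sigma=sigma)
        below = np.mean(y < npqr_predict(alpha, b, X, X, sigma) - 1e-9)
        if abs(below - tau_target) < tol:
            break
        # too large a below-fraction -> lower tau; too small -> raise tau
        tau = float(np.clip(tau - step * np.sign(below - tau_target), 0.01, 0.99))
    return tau, alpha, b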
The fact that a number of points sit exactly on the kink (quantile regression, this paper), on the
edge of the tube (ν-SVR, see Schölkopf et al. (2000)), or on the supporting hyperplane (single-class problems
and novelty detection, see Schölkopf et al. (1999)), and that this might affect the overall accuracy control on the
test set, has not been carefully studied so far and thus needs more attention from the community.


Figure 8: Illustration of the relationship between quantile in training and ramp loss.

Estimation with constraints We introduce non-crossing and monotonicity constraints in the context of
nonparametric quantile regression. However, as discussed in Mammen et al. (2001), other constraints can
also be applied very similarly to the constraints described in this paper, though possibly in different estimation
contexts. Here are some variations:
• Boundary conditions. The regression function is defined on [a, b] and assumed to take a given value v at the
boundary point a or b.
• Additive models with monotone components. The regression function f : R^n → R is of additive form
f(x1, ..., xn) = f1(x1) + ... + fn(xn) where each additive component fi is monotonic.
• Observed derivatives. Assume that m samples are observed, corresponding to m regression functions.
Now, the constraint is that fj coincides with the derivative of fj−1 (using the notation of the previous point) (Cox
(1988)).
• Bivariate extreme-value distributions. See Hall and Tajvidi (2000).
• Positivity constraints. The regression function is positive.
Future Work Quantile regression has been mainly used as a data analysis tool to assess the influence
of individual variables. This is an area where we expect that nonparametric estimates will lead to better
performance.
Being able to estimate an upper bound on a random variable y|x which holds with probability τ is useful
when it comes to determining the so-called Value at Risk of a portfolio. Note, however, that in this situation
we want to be able to estimate the regression quantile for a large set of different portfolios. For example,
an investor may try to optimize their portfolio allocation to maximize return while keeping risk within a
constant bound. Such uniform statements will need further analysis if we are to perform nonparametric
estimates. We also need a more efficient optimization algorithm for the non-crossing constraints, since we have to
work with O(nm) dual variables. SimpleSVM (Vishwanathan et al. (2003)) would be a promising candidate
for this purpose.


Acknowledgments National ICT Australia is funded through the Australian Government’s Backing Aus-
tralia’s Ability initiative, in part through the Australian Research Council. This work was supported by
grants of the ARC, by the Pascal Network of Excellence and by Japanese Grants-in-Aid for Scientific Re-
search 16700258. We thank Roger Koenker for providing us with the latest version of the R package
quantreg, and for technical advice. We thank Shahar Mendelson and Bob Williamson for useful discus-
sions and suggestions.

References
L.K. Bachrach, T. Hastie, M.C. Wang, B. Narashimhan, and R. Marcus. Bone mineral acquisition in
healthy asian, hispanic, black and caucasian youth, a longitudinal study. Journal of Clinical Endocrinal
Metabolism, 84:4702– 4712, 1999.
P.L. Bartlett, O. Bousquet, and S. Mendelson. Localized rademacher averages. In Proceedings of the 15th
conference on Computational Learning Theory COLT’02, pages 44–58, 2002.
P.L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results.
Journal of Machine Learning Research, 3:463–482, 2002.
P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and adaptive estimation for semipara-
metric models. J. Hopkins Press, Baltimore, ML, 1994.
R.J. Bosch, Y. Ye, and G.G. Woodworth. A convergent algorithm for quantile regression with smoothing
splines. Computational Statistics and Data Analysis, 19:613–630, 1995.
D.D. Cox. Approximation of method of regularization estimators. Annals of Statistics, 1988.
G. Fung, O. L. Mangasarian, and A. J. Smola. Minimal kernel classifiers. Journal of Machine Learning
Research, 3:303–321, 2002.
C. Gu and G. Wahba. Semiparametric analysis of variance with tensor product thin plate splines. Journal of
the Royal Statistical Society B, 55:353–368, 1993.
P. Hall and N. Tajvidi. Distribution and dependence-function estimation for bivariate extreme-value distri-
butions. Bernoulli, 2000.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.
X. He. Quantile curves without crossing. The American Statistician, 51(2):186–192, May 1997. URL
http://www.amstat.org/publications/tas/abstracts/he.html.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American
Statistical Association, 58:13–30, 1963.
A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Techno-
metrics, 12:55–67, 1970.
P. J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.
R. Koenker. Quantile Regression. Cambridge University Press, 2005.


R. Koenker and B.J. Park. An interior point algorithm for nonlinear quantile regression. Journal of Econo-
metrics, 71:265–283, 1996.
E. Mammen, J.S. Marron, B.A. Turlach, and M.P. Wand. A general projection framework for constrained
smoothing. Statistical Science, 16(3):232–248, August 2001.

S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. J. Smola, editors, Advanced
Lectures on Machine Learning, number 2600 in LNAI, pages 1–40. Springer, 2003.
D. Ruppert and R.J. Carroll. Semiparametric Regression. Wiley, 2003.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural
Computation, 12:1207–1245, 2000.
B. Schölkopf, R. C. Williamson, A. J. Smola, and J. Shawe-Taylor. Single-class support vector machines.
In J. Buhmann, W. Maass, H. Ritter, and N. Tishby, editors, Unsupervised Learning, Dagstuhl-Seminar-
Report 235, pages 19–20, 1999.
A. J. Smola, T. Frieß, and B. Schölkopf. Semiparametric support vector and linear programming machines.
In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems
11, pages 585–591, Cambridge, MA, 1999. MIT Press.
A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation
and operator inversion. Algorithmica, 22:211–231, 1998.
M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning
Research, 1:211–244, 2001.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
V. Vapnik, S. Golowich, and A. J. Smola. Support vector method for function approximation, regression
estimation, and signal processing. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in
Neural Information Processing Systems 9, pages 281–287, Cambridge, MA, 1997. MIT Press.
V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
H. D. Vinod. A survey of ridge regression and related techniques for improvements over ordinary least
squares. Review of Economics and Statistics, 60:121–131, February 1978.
S. V. N. Vishwanathan, A. J. Smola, and M.N. Murty. SimpleSVM. In Proc. of the International Conference
on Machine Learning (ICML), 2003. URL http://users.rsise.anu.edu.au/~vishy/papers/VisSmoMur03.pdf.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in
Applied Mathematics. SIAM, Philadelphia, 1990.
R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization bounds for regularization networks and
support vector machines via entropy numbers of compact operators. IEEE Transaction on Information
Theory, 47(6):2516–2532, 2001.


Appendix A. Nonparametric ν-Support Vector Regression


In this section we explore an alternative to the quantile regression framework proposed in Section 2. It derives from Schölkopf et al. (2000), where the authors suggest a method for adapting SV regression and classification estimates such that, automatically, only a fraction ν of the data lies beyond the desired confidence region. In particular, if p(y|x) can be modeled by additive noise of equal degree (i.e. y = f(x) + ξ, where ξ is a random variable independent of x), Schölkopf et al. (2000) show that the ν-SV regression estimate converges to a quantile estimate.

A.1 Heteroscedastic Regression


Whenever the above assumption on p(y|x) is violated, ν-SVR will not perform as desired. This problem can be remedied as follows: one turns ε(x) into a nonparametric estimate itself. This means that we solve the following optimization problem:
\begin{align}
\min_{\theta_1, \theta_2, b, \epsilon}\quad & \frac{\lambda_1}{2}\|\theta_1\|^2 + \frac{\lambda_2}{2}\|\theta_2\|^2 + \sum_{i=1}^{m} (\xi_i + \xi_i^*) - \nu m \epsilon \tag{31a}\\
\text{subject to}\quad & \langle \phi_1(x_i), \theta_1 \rangle + b - y_i \le \epsilon + \langle \phi_2(x_i), \theta_2 \rangle + \xi_i \tag{31b}\\
& y_i - \langle \phi_1(x_i), \theta_1 \rangle - b \le \epsilon + \langle \phi_2(x_i), \theta_2 \rangle + \xi_i^* \tag{31c}\\
& \xi_i, \xi_i^* \ge 0 \tag{31d}
\end{align}
Here φ1, φ2 are feature maps, θ1, θ2 are the corresponding parameters, ξi, ξi* are slack variables, and b, ε are scalars. The key difference to the heteroscedastic estimation problem described in Schölkopf et al. (2000) is that in the latter the authors assume that the specific form of the noise is known. In (31), by contrast, we make no such assumption and instead estimate ε(x) as ⟨φ2(x), θ2⟩ + ε.
By Lagrange multiplier methods one may check that the dual of (31) is given by
\begin{align}
\min_{\alpha, \alpha^*}\quad & \frac{1}{2\lambda_1} (\alpha - \alpha^*)^\top K_1 (\alpha - \alpha^*) + \frac{1}{2\lambda_2} (\alpha + \alpha^*)^\top K_2 (\alpha + \alpha^*) + (\alpha - \alpha^*)^\top y \tag{32a}\\
\text{subject to}\quad & \mathbf{1}^\top (\alpha - \alpha^*) = 0 \tag{32b}\\
& \mathbf{1}^\top (\alpha + \alpha^*) = C m \nu \tag{32c}\\
& 0 \le \alpha_i, \alpha_i^* \le 1 \quad \text{for all } 1 \le i \le m \tag{32d}
\end{align}

Here K1, K2 are kernel matrices with entries [Ki]jl = ki(xj, xl), and 1 denotes the vector of ones. Moreover, we have the usual kernel expansion, this time for the regression f(x) and the margin ε(x), via
\begin{align}
f(x) = \sum_{i=1}^{m} (\alpha_i - \alpha_i^*)\, k_1(x_i, x) + b
\quad\text{and}\quad
\epsilon(x) = \sum_{i=1}^{m} (\alpha_i + \alpha_i^*)\, k_2(x_i, x) + \epsilon. \tag{33}
\end{align}

The scalars b and ε can be computed conveniently as dual variables of (32) when solving the problem with an interior point code.
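
To make this concrete, the following is a minimal sketch (not the authors' implementation) of how the dual (32) could be solved with a generic convex solver. It assumes the numpy and cvxpy packages, Gaussian RBF kernels for k1 and k2, and purely illustrative values of λ1, λ2, ν and C; the function and variable names are ours. Recovering b and ε from the dual variables of the equality constraints may require a sign flip, depending on the solver's conventions.

import numpy as np
import cvxpy as cp

def rbf_kernel(X, Z, gamma):
    # k(x, z) = exp(-gamma * ||x - z||^2); X and Z are 2-d arrays of inputs
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def heteroscedastic_nu_svr(X, y, lam1=1.0, lam2=10.0, nu=0.2, C=1.0,
                           gamma1=1.0, gamma2=1.0):
    m = len(y)
    K1 = rbf_kernel(X, X, gamma1) + 1e-8 * np.eye(m)   # kernel matrix for f
    K2 = rbf_kernel(X, X, gamma2) + 1e-8 * np.eye(m)   # kernel matrix for the margin
    a = cp.Variable(m, nonneg=True)        # alpha
    a_star = cp.Variable(m, nonneg=True)   # alpha^*
    u = a - a_star                         # expansion coefficients of f, cf. (33)
    v = a + a_star                         # expansion coefficients of the margin, cf. (33)
    objective = (cp.quad_form(u, K1) / (2 * lam1)
                 + cp.quad_form(v, K2) / (2 * lam2)
                 + u @ y)                  # (32a)
    constraints = [cp.sum(u) == 0,             # (32b)
                   cp.sum(v) == C * m * nu,    # (32c)
                   a <= 1, a_star <= 1]        # (32d)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    # b and eps are read off as the dual variables of (32b) and (32c)
    b = float(constraints[0].dual_value)
    eps = float(constraints[1].dual_value)
    f = lambda Xnew: rbf_kernel(Xnew, X, gamma1) @ u.value + b
    margin = lambda Xnew: rbf_kernel(Xnew, X, gamma2) @ v.value + eps
    return f, margin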

A.2 The ν-Property


As in the parametric case, (31) also has the ν-property. It is worth noting, however, that the solution ε(x) need not be positive throughout unless we change the optimization problem slightly by imposing a nonnegativity constraint on ε. The following theorem makes this reasoning more precise:


Theorem 8 The minimizer of (31) satisfies:

1. The fraction of points for which |yi − f(xi)| < ε(xi) is bounded by 1 − ν.
2. The fraction of constraints (31b) and (31c) with ξi > 0 or ξi* > 0 is bounded from above by ν.
3. If (x, y) is drawn iid from a distribution Pr(x, y) with Pr(y|x) continuous and the expectation of the modulus of absolute continuity of its density satisfying limδ→0 E[ε(δ)] = 0, then with probability 1 the fraction of points satisfying |yi − f(xi)| = ε(xi) asymptotically converges to 0.

Moreover, imposing ε ≥ 0 is equivalent to relaxing (32c) to 1ᵀ(α + α*) ≤ Cmν. If, in addition, K2 has only nonnegative entries, then also ε(xi) ≥ 0 for all xi.
Proof The proof is essentially identical to that of Lemma 3 and that of Schölkopf et al. (2000). Note, however, that the flexibility in ε and the possibility of ε(x) < 0 lead to additional complications; still, if both f and ε(x) have well behaved entropy numbers, then f ± ε is also well behaved.
To see the last set of claims, note that the constraint 1ᵀ(α + α*) ≤ Cmν is obtained directly from dualization under the condition ε ≥ 0. Since αi, αi* ≥ 0 for all i, it follows that ε(x) contains only nonnegative expansion coefficients, which proves the last part of the claim.
Note that in principle we could enforce ε(xi) ≥ 0 for all xi. This way, however, we would lose the ν-property and complicate the optimization problem even further, since a third set of Lagrange multipliers would have to be added.
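
As a rough empirical sanity check of parts 1 and 2 of the theorem, one can simply count how many training points fall strictly inside and strictly outside the estimated tube. The snippet below is a sketch in our own notation; f and margin stand for the fitted functions of the previous section and are not names used in the paper.

import numpy as np

def check_nu_property(f, margin, X, y, nu):
    # fraction strictly outside the tube should not exceed nu (part 2);
    # fraction strictly inside should not exceed 1 - nu (part 1)
    residual = np.abs(y - f(X))
    outside = np.mean(residual > margin(X))
    inside = np.mean(residual < margin(X))
    print(f"outside: {outside:.3f} (<= {nu}),  inside: {inside:.3f} (<= {1 - nu})")
    return outside, inside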

A.3 An Example
The above derivation raises the question of why one should not use (32) instead of (6) for the purpose of quantile regression. After all, both estimators yield an estimate for the upper and lower quantiles.
Firstly, the combined approach is numerically more costly, as it requires optimization over twice the number of parameters, albeit with the distinct advantage of a sparse solution, whereas (6) always leads to a dense solution.
The key difference, however, is that (32) is prone to producing estimates where the margin ε(x) < 0. While such a solution is clearly unreasonable, it occurs whenever the margin is rather small and the overall tradeoff between a simple f and a simple ε yields an advantage by keeping f simple. With enough data this effect vanishes; however, it occurs quite frequently, even for supposedly distant quantiles, as can be seen in Figure 9.
In addition, the latter relies on the assumption that the additive noise is symmetrically distributed. In other words, if we are only interested in obtaining the 0.95 quantile estimate, we end up estimating the 0.05 quantile along the way.
We present this derivation and the accompanying experiments mainly to make the point that the adaptive margin approach of Schölkopf et al. (2000) is insufficient to address the problems posed by quantile regression. Empirically, we found it much easier to adjust QR than the symmetric variant.
In summary, the symmetric approach is probably useful only for parametric estimates where the number of parameters is small and where the expansion coefficients ensure that ε(x) ≥ 0 for all x.
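
If one nevertheless wants to guard against negative margins within the sketch of Section A.1, Theorem 8 suggests relaxing the equality (32c) to an inequality, which corresponds to imposing ε ≥ 0. The helper below is a hypothetical modification of our earlier sketch (same variable names, not code from the paper).

def relaxed_constraints(u, v, a, a_star, C, m, nu):
    # relaxing (32c) to an inequality corresponds to imposing eps >= 0 (Theorem 8)
    import cvxpy as cp
    return [cp.sum(u) == 0,            # (32b)
            cp.sum(v) <= C * m * nu,   # relaxed (32c)
            a <= 1, a_star <= 1]       # (32d)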

Figure 9: Illustration of heteroscedastic SVM regression on a toy dataset. On the left, λ1 = 1, λ2 = 10 and ν = 0.2: the algorithm successfully fits the data. On the right, λ1 = 1, λ2 = 0.1 and ν = 0.2: the algorithm fails to fit the data because ε becomes negative.

Appendix B. Experimental Results

Here we assemble six tables to display the results across the four models. The first three tables report the pinball loss for each data set and the standard deviation across the 10 test runs. A lower figure is preferred in each case. NA denotes cases where rqss (Koenker, 2005) was unable to produce estimates, due to its construction of the function system.
In the next three tables we measure the ramp loss. In each table a figure close to the intended quantile (10, 50 or 90) is preferred. For further discussion see the Results section of the paper.
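
For reference, here is a minimal sketch (our notation, not the authors' evaluation code) of the two measures as we read them: the pinball loss averaged over the test set, and the ramp loss taken as the empirical fraction of test responses falling below the estimated quantile curve, reported ×100 and ideally close to τ.

import numpy as np

def pinball_loss(y, q, tau):
    # l_tau(r) = tau * r for r >= 0 and (tau - 1) * r otherwise, averaged over the test set
    r = y - q
    return np.mean(np.where(r >= 0, tau * r, (tau - 1) * r))

def ramp_loss(y, q):
    # empirical fraction of observations lying below the estimated quantile
    return np.mean(y < q)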


Data Set uncond linear rqss npqr noncross


caution 11.09 ± 02.56 11.18 ± 03.37 09.18 ± 03.09 09.56 ± 03.06 09.56 ± 03.05
ftcollinssnow 16.31 ± 05.31 16.55 ± 06.00 17.52 ± 05.12 16.31 ± 05.18 16.31 ± 05.18
highway 11.38 ± 05.79 16.36 ± 09.65 20.51 ± 19.52 09.16 ± 05.04 10.14 ± 06.12
heights 17.20 ± 02.23 15.28 ± 02.21 15.28 ± 02.23 15.26 ± 02.19 15.26 ± 02.20
sniffer 13.98 ± 02.63 06.66 ± 01.67 05.29 ± 01.79 05.30 ± 01.91 05.25 ± 01.91
snowgeese 08.71 ± 04.21 04.64 ± 02.40 04.65 ± 02.44 05.22 ± 02.70 05.22 ± 02.70
ufc 17.03 ± 02.86 10.01 ± 01.35 10.11 ± 01.12 09.69 ± 01.37 09.68 ± 01.29
birthwt 18.31 ± 02.59 18.39 ± 02.39 18.73 ± 02.93 17.41 ± 03.50 18.84 ± 04.77
crabs 18.27 ± 03.36 01.03 ± 00.33 NA 00.91 ± 00.24 00.92 ± 00.24
GAGurine 11.08 ± 01.47 07.22 ± 01.30 05.82 ± 01.03 06.03 ± 01.56 06.03 ± 01.55
geyser 17.11 ± 01.97 11.51 ± 01.15 11.10 ± 01.39 10.91 ± 01.35 10.91 ± 01.35
gilgais 12.88 ± 01.51 05.92 ± 01.59 05.75 ± 01.79 05.44 ± 01.66 05.43 ± 01.64
topo 20.38 ± 08.61 09.22 ± 03.68 08.19 ± 03.53 06.01 ± 02.36 06.03 ± 02.36
BostonHousing 14.07 ± 01.77 06.61 ± 01.05 NA 05.11 ± 01.23 05.05 ± 01.31
CobarOre 17.72 ± 08.95 16.55 ± 06.49 12.83 ± 06.36 13.26 ± 11.29 13.19 ± 11.19
engel 11.93 ± 01.82 06.51 ± 02.33 05.70 ± 01.17 05.55 ± 00.76 05.55 ± 00.76
mcycle 20.03 ± 02.38 17.81 ± 03.43 10.98 ± 02.43 07.43 ± 03.09 07.43 ± 03.10
BigMac2003 08.67 ± 02.40 06.46 ± 02.08 NA 06.22 ± 01.77 06.31 ± 02.56
UN3 18.02 ± 04.53 11.57 ± 02.28 NA 11.61 ± 02.18 11.63 ± 02.19
cpus 05.25 ± 01.75 01.73 ± 00.89 00.74 ± 00.37 00.77 ± 00.64 00.73 ± 00.61

Table 2: Method Comparison: Pinball Loss (×100, τ = 0.1)


Data Set uncond linear rqss npqr noncross


caution 38.16 ± 10.19 32.40 ± 08.39 23.76 ± 08.09 22.07 ± 08.66 22.07 ± 08.66
ftcollinssnow 41.96 ± 11.03 41.00 ± 11.34 42.28 ± 11.21 39.40 ± 11.31 39.91 ± 11.25
highway 41.86 ± 22.46 39.47 ± 19.26 26.05 ± 12.27 26.43 ± 16.67 26.42 ± 16.45
heights 40.09 ± 02.99 34.50 ± 02.88 34.66 ± 02.86 34.56 ± 02.92 34.56 ± 02.92
sniffer 35.64 ± 06.12 12.63 ± 03.88 10.23 ± 02.76 09.64 ± 02.85 09.64 ± 02.82
snowgeese 31.31 ± 15.80 13.23 ± 09.00 10.95 ± 08.82 18.89 ± 15.01 18.95 ± 15.03
ufc 40.17 ± 05.26 23.20 ± 02.64 21.21 ± 02.68 21.20 ± 02.47 21.22 ± 02.46
birthwt 41.13 ± 07.31 38.14 ± 06.97 37.28 ± 05.97 37.23 ± 06.94 37.25 ± 06.94
crabs 41.47 ± 07.03 02.24 ± 00.44 NA 02.15 ± 00.58 02.15 ± 00.57
GAGurine 36.60 ± 04.26 23.61 ± 04.19 16.08 ± 03.14 14.60 ± 03.87 14.60 ± 03.86
geyser 41.28 ± 07.17 32.30 ± 04.55 30.79 ± 03.88 30.51 ± 04.05 30.47 ± 04.07
gilgais 42.02 ± 05.30 16.11 ± 03.91 11.76 ± 02.92 12.35 ± 02.33 12.32 ± 02.32
topo 41.23 ± 15.98 26.13 ± 08.79 18.02 ± 09.43 14.21 ± 05.55 15.19 ± 05.97
BostonHousing 35.63 ± 05.28 17.51 ± 03.54 NA 10.77 ± 01.90 10.74 ± 01.89
CobarOre 42.14 ± 19.73 41.65 ± 18.84 44.24 ± 12.18 37.35 ± 22.05 38.46 ± 22.11
engel 35.83 ± 07.13 13.73 ± 03.15 13.23 ± 02.05 13.01 ± 01.74 12.95 ± 01.67
mcycle 38.73 ± 09.72 38.19 ± 09.16 21.02 ± 05.18 17.20 ± 05.25 17.17 ± 05.26
BigMac2003 34.97 ± 10.89 21.99 ± 07.11 NA 18.16 ± 08.08 18.08 ± 08.10
UN3 40.83 ± 08.81 26.45 ± 04.30 NA 24.50 ± 03.51 24.35 ± 03.51
cpus 23.03 ± 08.61 05.69 ± 02.23 02.49 ± 01.79 01.34 ± 01.18 01.33 ± 01.18

Table 3: Method Comparison: Pinball Loss (×100, τ = 0.5)


Data Set uncond linear rqss npqr noncross


caution 23.28 ± 09.63 15.04 ± 03.37 13.19 ± 03.36 15.16 ± 03.95 15.15 ± 03.94
ftcollinssnow 18.80 ± 04.45 19.73 ± 06.14 20.18 ± 06.41 19.70 ± 05.78 19.55 ± 05.75
highway 25.89 ± 13.58 21.83 ± 18.57 17.63 ± 14.94 12.84 ± 06.69 22.85 ± 14.07
heights 17.64 ± 01.28 15.47 ± 00.85 15.50 ± 00.91 15.47 ± 00.86 15.48 ± 00.86
sniffer 23.38 ± 09.69 05.82 ± 01.63 05.84 ± 01.57 05.17 ± 00.97 05.11 ± 00.98
snowgeese 26.60 ± 18.81 07.79 ± 08.98 08.51 ± 012.5 08.47 ± 08.07 08.48 ± 08.07
ufc 18.03 ± 02.89 10.94 ± 01.31 10.83 ± 01.51 10.54 ± 01.70 10.49 ± 01.65
birthwt 16.18 ± 03.34 16.13 ± 03.22 16.36 ± 03.72 15.16 ± 03.02 15.17 ± 03.02
crabs 17.09 ± 03.08 00.99 ± 00.24 NA 01.13 ± 00.30 01.13 ± 00.30
GAGurine 22.65 ± 05.14 15.72 ± 05.07 10.57 ± 03.27 10.16 ± 03.17 10.16 ± 03.17
geyser 14.12 ± 02.53 12.83 ± 02.34 12.37 ± 02.47 11.99 ± 02.56 12.00 ± 02.55
gilgais 18.91 ± 01.99 06.75 ± 02.07 05.07 ± 01.68 05.51 ± 00.81 05.51 ± 00.81
topo 16.96 ± 07.12 13.46 ± 11.52 13.16 ± 11.01 09.75 ± 06.13 09.66 ± 06.15
BostonHousing 22.62 ± 05.33 11.59 ± 02.94 NA 06.97 ± 02.76 06.86 ± 02.71
CobarOre 17.21 ± 04.31 21.76 ± 06.03 19.38 ± 05.21 14.98 ± 08.57 15.08 ± 08.50
engel 22.59 ± 06.86 05.43 ± 01.08 05.64 ± 01.81 05.53 ± 01.18 05.53 ± 01.18
mcycle 16.10 ± 03.21 14.16 ± 03.44 10.69 ± 03.57 07.03 ± 01.95 07.01 ± 01.96
BigMac2003 24.48 ± 17.33 13.47 ± 06.21 NA 09.94 ± 09.97 09.97 ± 09.96
UN3 16.36 ± 02.97 10.38 ± 02.19 NA 08.80 ± 01.82 08.81 ± 01.82
cpus 23.61 ± 10.46 02.69 ± 00.57 01.83 ± 02.31 01.31 ± 01.84 01.31 ± 01.84

Table 4: Method Comparison: Pinball Loss (×100, τ = 0.9)


Data Set uncond linear rqss npqr noncross


caution 11.0 ± 08.8 12.0 ± 09.2 16.0 ± 10.7 12.0 ± 14.0 12.0 ± 14.0
ftcollinssnow 10.0 ± 09.7 11.1 ± 09.1 12.2 ± 11.0 12.2 ± 09.7 12.2 ± 09.7
highway 10.8 ± 15.7 20.0 ± 23.3 26.7 ± 37.8 20.0 ± 23.3 13.3 ± 17.2
heights 09.6 ± 02.8 10.0 ± 02.4 10.0 ± 02.2 10.0 ± 02.3 09.9 ± 02.3
sniffer 07.8 ± 10.1 13.7 ± 09.6 12.0 ± 13.1 15.9 ± 11.4 15.9 ± 11.4
snowgeese 12.5 ± 17.7 09.7 ± 12.6 09.7 ± 12.6 13.6 ± 17.1 13.6 ± 17.1
ufc 09.7 ± 03.9 09.9 ± 05.4 11.8 ± 04.0 10.5 ± 04.6 10.7 ± 03.9
birthwt 10.0 ± 07.8 12.0 ± 06.7 12.6 ± 05.1 11.6 ± 06.7 10.0 ± 08.6
crabs 10.0 ± 08.5 12.0 ± 09.8 NA 13.3 ± 08.1 13.0 ± 08.9
GAGurine 10.4 ± 05.1 09.9 ± 04.7 10.7 ± 06.4 12.1 ± 06.4 11.6 ± 06.5
geyser 09.7 ± 08.3 11.2 ± 06.2 10.7 ± 06.9 12.2 ± 07.0 12.1 ± 06.8
gilgais 09.5 ± 06.9 10.4 ± 04.9 13.5 ± 04.6 12.4 ± 05.2 12.4 ± 05.2
topo 08.9 ± 15.0 13.4 ± 13.3 16.0 ± 24.6 19.4 ± 16.4 19.4 ± 16.4
BostonHousing 09.7 ± 04.7 11.5 ± 04.6 NA 15.0 ± 04.2 15.1 ± 04.4
CobarOre 08.5 ± 14.3 12.7 ± 22.8 16.1 ± 17.0 16.1 ± 23.2 16.1 ± 23.2
engel 10.2 ± 07.1 09.4 ± 06.8 10.2 ± 07.9 12.2 ± 07.4 12.2 ± 07.4
mcycle 10.0 ± 09.6 11.5 ± 09.1 11.4 ± 09.1 12.0 ± 08.2 12.0 ± 08.2
BigMac2003 09.0 ± 11.4 18.0 ± 22.9 NA 14.3 ± 19.4 16.0 ± 18.7
UN3 09.5 ± 10.0 12.0 ± 09.7 NA 10.3 ± 07.7 10.3 ± 07.7
cpus 09.4 ± 08.9 12.2 ± 10.2 15.3 ± 07.9 19.1 ± 08.3 20.6 ± 11.7

Table 5: Method Comparison: Ramp Loss (×100, τ = 0.1)


Data Set uncond linear rqss npqr noncross


caution 52.0 ± 22.5 49.0 ± 13.7 51.0 ± 14.5 49.0 ± 17.3 49.0 ± 17.3
ftcollinssnow 50.6 ± 14.0 49.7 ± 16.9 48.6 ± 19.8 51.4 ± 24.3 51.4 ± 26.0
highway 48.3 ± 31.9 44.2 ± 38.5 45.0 ± 38.5 41.7 ± 26.4 45.0 ± 31.5
heights 49.3 ± 05.7 50.1 ± 05.1 49.8 ± 04.9 50.3 ± 05.0 50.3 ± 05.0
sniffer 47.8 ± 08.1 51.0 ± 13.0 51.0 ± 11.8 51.3 ± 15.2 51.3 ± 15.2
snowgeese 48.1 ± 27.6 49.2 ± 32.7 51.7 ± 26.9 50.6 ± 23.6 50.6 ± 23.6
ufc 49.2 ± 08.6 50.0 ± 06.8 51.6 ± 06.8 50.6 ± 04.0 50.6 ± 04.0
birthwt 48.9 ± 14.3 50.0 ± 14.3 47.8 ± 13.9 50.3 ± 10.7 50.2 ± 10.8
crabs 49.5 ± 10.9 50.5 ± 09.8 NA 50.0 ± 08.2 49.5 ± 08.3
GAGurine 49.2 ± 11.8 50.9 ± 08.0 51.4 ± 17.0 49.8 ± 11.8 49.9 ± 12.0
geyser 48.6 ± 11.2 49.8 ± 07.8 49.5 ± 06.8 49.2 ± 07.5 49.6 ± 07.2
gilgais 48.7 ± 10.5 50.0 ± 10.6 49.7 ± 10.0 50.7 ± 10.6 50.8 ± 10.7
topo 47.7 ± 23.3 47.7 ± 19.1 47.7 ± 21.3 54.8 ± 22.7 56.3 ± 22.5
BostonHousing 49.7 ± 06.0 49.6 ± 08.4 NA 51.7 ± 05.4 51.5 ± 05.6
CobarOre 46.4 ± 23.0 44.5 ± 22.2 47.9 ± 27.7 59.4 ± 27.0 59.4 ± 27.0
engel 50.9 ± 09.0 49.7 ± 08.6 49.6 ± 08.6 50.0 ± 09.6 50.1 ± 09.9
mcycle 49.1 ± 11.7 51.3 ± 11.6 51.4 ± 13.7 48.8 ± 15.4 48.1 ± 14.4
BigMac2003 49.3 ± 14.6 50.0 ± 20.8 NA 44.2 ± 21.5 43.7 ± 21.2
UN3 49.4 ± 09.6 50.6 ± 11.8 NA 48.6 ± 11.7 47.8 ± 11.3
cpus 49.2 ± 13.7 51.3 ± 18.3 49.7 ± 11.7 51.8 ± 10.8 46.9 ± 11.6

Table 6: Method Comparison: Ramp Loss (×100, τ = 0.5)


Data Set uncond linear rqss npqr noncross


caution 90.0 ± 10.5 90.0 ± 10.5 89.0 ± 12.0 89.0 ± 09.9 89.0 ± 09.9
ftcollinssnow 90.3 ± 11.1 89.2 ± 12.9 88.3 ± 12.9 89.2 ± 12.9 89.2 ± 12.9
highway 89.2 ± 22.2 64.2 ± 32.4 61.7 ± 29.4 70.0 ± 29.2 56.7 ± 38.7
heights 89.5 ± 02.3 90.0 ± 01.8 89.8 ± 01.8 90.1 ± 01.9 90.1 ± 01.9
sniffer 89.4 ± 07.0 87.6 ± 12.4 86.8 ± 10.4 84.6 ± 10.1 85.4 ± 10.3
snowgeese 88.9 ± 12.4 85.0 ± 17.5 85.0 ± 17.5 83.9 ± 20.5 83.9 ± 20.5
ufc 89.8 ± 05.1 90.3 ± 05.2 88.5 ± 06.3 88.3 ± 05.0 88.2 ± 05.2
birthwt 88.7 ± 09.7 87.6 ± 10.0 88.0 ± 09.0 88.9 ± 08.3 88.9 ± 08.3
crabs 89.0 ± 09.7 87.0 ± 08.9 NA 87.1 ± 10.6 87.0 ± 10.6
GAGurine 89.5 ± 03.8 89.8 ± 06.3 89.4 ± 05.6 87.8 ± 08.2 88.1 ± 07.8
geyser 88.5 ± 05.6 89.4 ± 06.4 90.4 ± 06.0 89.1 ± 04.9 89.0 ± 05.1
gilgais 89.1 ± 06.0 88.3 ± 04.5 87.1 ± 06.7 83.9 ± 04.7 84.2 ± 04.7
topo 89.1 ± 15.0 87.1 ± 14.8 85.7 ± 19.4 77.7 ± 17.9 77.7 ± 17.9
BostonHousing 90.1 ± 04.4 88.8 ± 06.1 NA 80.3 ± 05.4 80.8 ± 05.5
CobarOre 89.1 ± 15.7 85.8 ± 16.7 79.1 ± 22.6 85.8 ± 16.7 85.8 ± 16.7
engel 88.9 ± 06.3 90.0 ± 06.6 89.1 ± 06.9 89.4 ± 06.6 89.4 ± 06.6
mcycle 88.6 ± 07.6 88.8 ± 07.4 87.7 ± 07.4 86.2 ± 05.7 87.4 ± 05.7
BigMac2003 89.3 ± 08.0 84.3 ± 16.0 NA 77.7 ± 21.7 79.3 ± 21.4
UN3 88.0 ± 14.8 86.7 ± 09.8 NA 85.8 ± 09.7 86.7 ± 10.5
cpus 89.3 ± 07.1 87.8 ± 07.7 82.6 ± 06.4 82.1 ± 11.6 82.5 ± 12.3

Table 7: Method Comparison: Ramp Loss (×100, τ = 0.9)

1031
TAKEUCHI , L E , S EARS AND S MOLA

Dataset   Loss           τ      npqr            npqrm

cars      pinball loss   0.1    0.65 ± 0.15     0.66 ± 0.16
                         0.5    1.59 ± 0.32     1.61 ± 0.23
                         0.9    0.79 ± 0.16     0.77 ± 0.16
          ramp loss      0.1    0.12 ± 0.05     0.11 ± 0.05
                         0.5    0.51 ± 0.05     0.51 ± 0.08
                         0.9    0.89 ± 0.05     0.89 ± 0.05
onions    pinball loss   0.1    2.68 ± 1.21     2.27 ± 0.71
                         0.5    4.93 ± 1.58     4.89 ± 1.47
                         0.9    1.86 ± 0.73     1.84 ± 0.37
          ramp loss      0.1    0.18 ± 0.20     0.17 ± 0.16
                         0.5    0.48 ± 0.24     0.48 ± 0.27
                         0.9    0.86 ± 0.16     0.80 ± 0.27

Table 8: Comparison between quantile regression without (npqr) and with (npqrm) monotonicity constraints. We tested on the cars and onions datasets for monotonicity with respect to engine size and diameter, respectively. Note that on the cars dataset the monotonicity assumption is not perfectly satisfied, as is manifest in the somewhat worse performance of npqrm, whereas for onion yield estimation performance is improved when imposing monotonicity constraints.
