Notes On Subgradients
1 Definition
We say a vector g ∈ R^n is a subgradient of f : R^n → R at x ∈ dom f if for all z ∈ dom f,

f(z) ≥ f(x) + g^T(z − x).   (1)
If f is convex and differentiable, then its gradient at x is a subgradient. But a subgradient
can exist even when f is not differentiable at x, as illustrated in figure 1. The same example
shows that there can be more than one subgradient of a function f at a point x.
There are several ways to interpret a subgradient. A vector g is a subgradient of f at x
if the affine function (of z) f(x) + g^T(z − x) is a global underestimator of f. Geometrically,
g is a subgradient of f at x if (g, −1) supports epi f at (x, f(x)), as illustrated in figure 2.
A function f is called subdifferentiable at x if there exists at least one subgradient at
x. The set of subgradients of f at the point x is called the subdifferential of f at x, and
is denoted ∂f (x). A function f is called subdifferentiable if it is subdifferentiable at all
x ∈ dom f .
Example. Absolute value. Consider f(z) = |z|. For x < 0 the subgradient is unique:
∂f(x) = {−1}. Similarly, for x > 0 we have ∂f(x) = {1}. At x = 0 the subdifferential
is defined by the inequality |z| ≥ gz for all z, which is satisfied if and only if g ∈ [−1, 1].
Therefore we have ∂f(0) = [−1, 1]. This is illustrated in figure 3.
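As a quick numerical sanity check of this example, the following snippet (my own sketch, assuming numpy; the grid and tolerance are arbitrary choices, not from the notes) verifies that every g ∈ [−1, 1] satisfies |z| ≥ gz on a grid, while values outside [−1, 1] fail:

import numpy as np

z = np.linspace(-2.0, 2.0, 401)  # test grid for the inequality |z| >= g*z

# every g in [-1, 1] satisfies the subgradient inequality everywhere
for g in [-1.0, -0.3, 0.0, 0.7, 1.0]:
    assert np.all(np.abs(z) >= g * z - 1e-12)

# any g outside [-1, 1] violates it at some z
for g in [-1.5, 1.1]:
    assert np.any(np.abs(z) < g * z)

print("∂f(0) = [-1, 1] confirmed on the grid")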
2 Basic properties
The subdifferential ∂f (x) is always a closed convex set, even if f is not convex. This follows
from the fact that it is the intersection of an infinite set of halfspaces:

∂f(x) = ⋂_{z∈dom f} {g | f(z) ≥ f(x) + g^T(z − x)}.
Figure 1: A convex function f with subgradient g1 at x1, where f is differentiable, and subgradients g2, g3 at x2, where it is not. (Graphic not recoverable from the extraction.)

Figure 2: A vector (g, −1) supporting epi f at (x, f(x)). (Graphic not recoverable from the extraction.)
Figure 3: The absolute value function (left), and its subdifferential ∂f (x) as a
function of x (right).
Moreover, ∂f(x) is bounded whenever x ∈ int dom f. If ∂f(x) were unbounded, there would be a
sequence g_n ∈ ∂f(x) such that ‖g_n‖2 → ∞. Choosing ε > 0 small enough that the points
y_n = x + ε g_n/‖g_n‖2 lie in dom f, we find that f(y_n) ≥ f(x) + g_n^T(y_n − x) = f(x) + ε‖g_n‖2 → ∞,
which contradicts the fact that f(y_n) is bounded (the y_n lie in a compact ball around x inside
int dom f, on which the convex function f is bounded above).
Conversely, suppose f is convex and x ∈ int dom f. Since epi f is convex, it has a supporting
hyperplane at (x, f(x)): there exists (a, b) ≠ 0 with

a^T(z − x) + b(t − f(x)) ≤ 0 for all (z, t) ∈ epi f.

Letting t → ∞ shows that b ≤ 0. If b < 0, taking t = f(z) and dividing by −b gives

f(z) ≥ f(x) + (−a/b)^T(z − x) for all z ∈ dom f,

which shows that −a/b ∈ ∂f(x). Now we show that b ≠ 0, i.e., that the supporting
hyperplane cannot be vertical. If b = 0 we conclude that a^T(z − x) ≤ 0 for all z ∈ dom f.
This is impossible since x ∈ int dom f (it would force a = 0, contradicting (a, b) ≠ 0).
This discussion shows that a convex function has a subgradient at x if there is at least
one nonvertical supporting hyperplane to epi f at (x, f (x)). This is the case, for example, if
f is continuous. There are pathological convex functions which do not have subgradients at
some points, but we will assume in the sequel that all convex functions are subdifferentiable
(at every point in dom f ).
A point x⋆ is a global minimizer of a function f (convex or not) if and only if f is
subdifferentiable at x⋆ and 0 ∈ ∂f(x⋆): indeed, 0 ∈ ∂f(x⋆) means precisely that
f(z) ≥ f(x⋆) + 0^T(z − x⋆) = f(x⋆) for all z ∈ dom f. While this simple characterization
of optimality via the subdifferential holds for nonconvex functions, it is not particularly
useful in that case, since we generally cannot find the subdifferential of a nonconvex function.
The condition 0 ∈ ∂f(x⋆) reduces to ∇f(x⋆) = 0 when f is convex and differentiable at x⋆.
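As a worked instance of the condition 0 ∈ ∂f(x⋆) (my own example, not from the notes), minimizing f(x) = (1/2)(x − a)² + λ|x| over x ∈ R gives the familiar soft-thresholding formula; the brute-force grid comparison below is purely illustrative:

import numpy as np

def soft_threshold(a, lam):
    """Solve min_x 0.5*(x - a)**2 + lam*|x| using 0 ∈ ∂f(x*):
    0 ∈ (x - a) + lam*∂|x|, so x = a - lam if a > lam, x = a + lam if
    a < -lam, and x = 0 is optimal iff a ∈ lam*[-1, 1], i.e. |a| <= lam."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

for a in [-2.0, -0.3, 0.0, 0.7, 2.5]:
    x = soft_threshold(a, lam=1.0)
    grid = np.linspace(-4.0, 4.0, 100001)          # brute-force comparison
    fbest = (0.5 * (grid - a) ** 2 + np.abs(grid)).min()
    assert 0.5 * (x - a) ** 2 + abs(x) <= fbest + 1e-9
print("soft-thresholding matches brute force")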
The subdifferential also gives a variational characterization of directional derivatives: for convex f,

f′(x; v) = sup_{g∈∂f(x)} g^T v.

To see this, note first that f′(x; v) ≥ sup_{g∈∂f(x)} g^T v by the definition of a subgradient:
f(x + tv) − f(x) ≥ t g^T v for any t ∈ R and g ∈ ∂f(x), so dividing by t > 0 and taking t ↓ 0
gives f′(x; v) ≥ sup_{g∈∂f(x)} g^T v. For the other direction, we claim that all affine functions
that are below the function v ↦ f′(x; v) may be taken to be linear. Specifically, suppose that
(g, r) ∈ R^n × R and g^T v − r ≤ f′(x; v) for all v. Then r ≥ 0, as taking v = 0 gives
−r ≤ f′(x; 0) = 0. By the positive homogeneity of f′(x; v) in v, we see that for any t > 0 we
have t g^T v − r ≤ f′(x; tv) = t f′(x; v), and thus we have

g^T v − r/t ≤ f′(x; v) for all t > 0.
¹ This is simply the standard definition of differentiability.
Figure 4: The point x⋆ minimizes f over X (the shown level curves) if and only if
for some g ∈ ∂f(x⋆), g^T(y − x⋆) ≥ 0 for all y ∈ X. Note that not all subgradients
satisfy this inequality.
Taking t → +∞ gives that any affine minorizer of f′(x; v) may be taken to be linear. As any
(closed) convex function can be written as the supremum of its affine minorants, we have

f′(x; v) = sup{ g^T v | g^T ∆ ≤ f′(x; ∆) for all ∆ ∈ R^n }.
On the other hand, if g^T ∆ ≤ f′(x; ∆) for all ∆ ∈ R^n, then we have g^T ∆ ≤ f(x + ∆) − f(x), so
that g ∈ ∂f(x), and we may as well have taken the preceding supremum only over g ∈ ∂f(x).
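For the ℓ1 norm the supremum in f′(x; v) = sup_{g∈∂f(x)} g^T v has a closed form, which makes the identity easy to check numerically; the point, direction, and step size below are illustrative choices, not from the notes (a sketch, assuming numpy):

import numpy as np

def dir_deriv_fd(f, x, v, t=1e-8):
    # one-sided difference quotient approximating f'(x; v)
    return (f(x + t * v) - f(x)) / t

f = lambda u: np.abs(u).sum()                 # f(x) = ||x||_1
x = np.array([1.5, -2.0, 0.0, 0.0])
v = np.array([0.3, 1.0, -0.7, 0.2])

# sup over ∂||x||_1: sign(x_i)*v_i where x_i != 0, |v_i| where x_i == 0
zero = x == 0
sup_form = np.sign(x[~zero]) @ v[~zero] + np.abs(v[zero]).sum()

print(dir_deriv_fd(f, x, v), sup_form)        # both ≈ 0.2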
We can also use subgradients to characterize minimizers of f over a convex set X: a point
x⋆ ∈ X minimizes f over X if and only if there exists some g ∈ ∂f(x⋆) such that

g^T(y − x⋆) ≥ 0 for all y ∈ X.

The 'if' direction is immediate: for such a g and any y ∈ X, f(y) ≥ f(x⋆) + g^T(y − x⋆) ≥ f(x⋆).
For the converse, we suppose that f(x) ≥ f(x⋆) for all x ∈ X. In this case, for any x ∈ X, the
directional derivative
f′(x⋆; x − x⋆) = lim_{t↓0} [f(x⋆ + t(x − x⋆)) − f(x⋆)]/t ≥ 0,
that is, for any x ∈ X, the direction ∆ = x − x⋆ pointing into X satisfies f′(x⋆; ∆) ≥ 0.
By our characterization of the directional derivative earlier, we know that f′(x⋆; ∆) =
sup_{g∈∂f(x⋆)} g^T ∆ ≥ 0. Thus, defining for ε > 0 the ball B_ε = {x⋆ + y ∈ R^n | ‖y‖2 ≤ ε}, we have

inf_{x∈X∩B_ε} sup_{g∈∂f(x⋆)} g^T(x − x⋆) ≥ 0.
As ∂f(x⋆) is bounded, we may swap the min and max (see, for example, Exercise 5.25 of
[BV04]), finding that there must exist some g ∈ ∂f(x⋆) such that

inf_{x∈X∩B_ε} g^T(x − x⋆) ≥ 0.
But any y ∈ X may be written as t(x − x⋆) + x⋆ for some t ≥ 0 and x ∈ X ∩ B_ε, which gives
the result.
For fuller explanations of these inequalities and derivations, see also the books by Hiriart-
Urruty and Lemaréchal [HUL93, HUL01].
3 Calculus of subgradients
In this section we describe rules for constructing subgradients of convex functions. We
will distinguish two levels of detail. In the ‘weak’ calculus of subgradients the goal is to
produce one subgradient, even if more subgradients exist. This is sufficient in practice, since
subgradient, localization, and cutting-plane methods require only a subgradient at any point.
A second and much more difficult task is to describe the complete set of subgradients
∂f (x) as a function of x. We will call this the ‘strong’ calculus of subgradients. It is useful
in theoretical investigations, for example, when describing the precise optimality conditions.
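To make the remark concrete that weak calculus suffices for subgradient methods, here is a minimal subgradient-method sketch; the problem instance f(x) = ‖Ax − b‖1, the step sizes 1/k, and the iteration count are illustrative choices, not from the notes:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

f = lambda x: np.abs(A @ x - b).sum()

def subgrad(x):
    # one (weak) subgradient of ||Ax - b||_1: A^T s with s_i = sign((Ax - b)_i)
    return A.T @ np.sign(A @ x - b)

x, best = np.zeros(5), np.inf
for k in range(1, 5001):
    x = x - (1.0 / k) * subgrad(x)   # diminishing steps; any g ∈ ∂f(x) works
    best = min(best, f(x))           # subgradient methods are not descent methods
print("best value found:", best)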
3.1 Nonnegative scaling
For α ≥ 0, we have ∂(αf)(x) = α ∂f(x).
3.2 Sum
Suppose f = f1 + · · · + fm, where f1, . . . , fm are convex and subdifferentiable. Then a sum of
subgradients gi ∈ ∂fi(x) is a subgradient of f at x, and in fact ∂f(x) = ∂f1(x) + · · · + ∂fm(x).
This property extends to infinite sums, integrals, and expectations (provided they exist).
3.3 Affine transformations of domain
Suppose f is convex, and let h(x) = f(Ax + b). Then ∂h(x) = A^T ∂f(Ax + b).
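In code, this rule turns any (weak) subgradient oracle for f into one for h(x) = f(Ax + b). The sketch below uses np.sign as the oracle for f = ‖·‖1, an example choice; the data A, b are arbitrary:

import numpy as np

def affine_subgrad(subgrad_f, A, b):
    """Given an oracle returning some g ∈ ∂f(y), return an oracle for
    h(x) = f(Ax + b); by the rule above, A^T g ∈ ∂h(x)."""
    return lambda x: A.T @ subgrad_f(A @ x + b)

subgrad_h = affine_subgrad(np.sign,                       # sign(y) ∈ ∂||y||_1
                           A=np.array([[1.0, 2.0], [0.0, -1.0]]),
                           b=np.array([0.5, -0.5]))
print(subgrad_h(np.array([1.0, 1.0])))                    # -> [1. 3.]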
3.4 Pointwise maximum
Suppose f is the pointwise maximum of convex functions f1, . . . , fm,

f(x) = max_{i=1,...,m} fi(x),

where the functions fi are subdifferentiable. We first show how to construct a subgradient
of f at x.
Let k be any index for which fk(x) = f(x), and let g ∈ ∂fk(x). Then g ∈ ∂f(x). In other
words, to find a subgradient of the maximum of functions, we can choose one of the functions
that achieves the maximum at the point, and choose any subgradient of that function at the
point. This follows from

f(z) ≥ fk(z) ≥ fk(x) + g^T(z − x) = f(x) + g^T(z − x).

The corresponding 'strong' rule is

∂f(x) = Co ⋃ {∂fi(x) | fi(x) = f(x)},
i.e., the subdifferential of the maximum of functions is the convex hull of the union of
subdifferentials of the ‘active’ functions at x.
At a point x where only one of the functions, say fk , is active, f is differentiable and
has gradient ∇fk (x). At a point x where several of the functions are active, ∂f (x) is
a polyhedron.
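A minimal sketch of the weak rule for a piecewise-linear function f(x) = max_i (a_i^T x + b_i) (the data below are arbitrary, not from the notes): pick an active affine function and return its gradient.

import numpy as np

def max_affine_subgrad(Aa, bb, x):
    """Return a subgradient of f(x) = max_i (Aa[i] @ x + bb[i]):
    the gradient Aa[k] of any function that is active at x."""
    k = int(np.argmax(Aa @ x + bb))   # index of an active (maximizing) function
    return Aa[k]

Aa = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
bb = np.array([0.0, 0.1, 0.2])
print(max_affine_subgrad(Aa, bb, np.array([0.3, 0.4])))   # -> [0. 1.]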
Example. ℓ1-norm. Let f(x) = ‖x‖1. Since f is the maximum of the 2^n linear functions
s^T x, s ∈ {−1, +1}^n,

f(x) = ‖x‖1 = max{s^T x | si ∈ {−1, +1}},

we can apply the rules for the subgradient of the maximum. The first step is to
identify an active function s^T x, i.e., find an s ∈ {−1, +1}^n such that s^T x = ‖x‖1. We
can choose si = +1 if xi > 0, and si = −1 if xi < 0. If xi = 0, more than one function
is active, and both si = +1 and si = −1 work. The function s^T x is differentiable and
has a unique subgradient s. We can therefore take

gi = +1 if xi > 0, gi = −1 if xi < 0, and gi = −1 or +1 if xi = 0.
The subdifferential is the convex hull of all subgradients that can be generated this
way:
∂f(x) = {g | ‖g‖∞ ≤ 1, g^T x = ‖x‖1}.
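The membership test g ∈ ∂‖x‖1 ⟺ ‖g‖∞ ≤ 1 and g^T x = ‖x‖1 is one line of code; a small sketch with an illustrative tolerance:

import numpy as np

def in_l1_subdiff(g, x, tol=1e-9):
    """Test g ∈ ∂||x||_1 = {g : ||g||_inf <= 1, g^T x = ||x||_1}."""
    return np.max(np.abs(g)) <= 1 + tol and abs(g @ x - np.abs(x).sum()) <= tol

x = np.array([2.0, -1.0, 0.0])
print(in_l1_subdiff(np.array([1.0, -1.0, 0.3]), x))  # True: free entry where x_i = 0
print(in_l1_subdiff(np.array([0.9, -1.0, 0.0]), x))  # False: g^T x < ||x||_1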
3.5 Supremum
Next we consider the extension to the supremum over an infinite number of functions, i.e.,
we consider
f(x) = sup_{α∈A} fα(x),
where the functions fα are subdifferentiable. We only discuss the weak property.
Suppose the supremum in the definition of f (x) is attained. Let β ∈ A be an index for
which fβ (x) = f (x), and let g ∈ ∂fβ (x). Then g ∈ ∂f (x). If the supremum in the definition
is not attained, the function may or may not be subdifferentiable at x, depending on the
index set A.
Assume however that A is compact (in some metric), and that the function α ↦ fα(x)
is upper semi-continuous for each x. Then

∂f(x) = Co ⋃ {∂fα(x) | fα(x) = f(x)}.
Example. Maximum eigenvalue of a symmetric matrix. Let f(x) = λmax(A(x)),
where A(x) = A0 + x1 A1 + · · · + xn An, and Ai ∈ S^m. We can express f as the
pointwise supremum of convex functions,

f(x) = λmax(A(x)) = sup_{‖y‖2=1} y^T A(x) y.

The supremum is attained by any unit eigenvector y of A(x) associated with its largest
eigenvalue, and the active function y^T A(x) y is affine in x with gradient
(y^T A1 y, . . . , y^T An y), which is therefore a subgradient of f at x.
Example. Maximum eigenvalue of a symmetric matrix, revisited. Let f(A) = λmax(A),
where A ∈ S^n, the symmetric n-by-n matrices. Then as above, f(A) = λmax(A) =
sup_{‖y‖2=1} y^T A y, but we note that y^T A y = Tr(A yy^T), so that each of the functions
fy(A) = y^T A y is linear in A with gradient ∇fy(A) = yy^T. Then using an identical
argument to that above, we find that

∂f(A) = Co{yy^T | ‖y‖2 = 1, y^T A y = λmax(A)} = Co{yy^T | ‖y‖2 = 1, Ay = λmax(A) y},

the convex hull of the outer products of maximum eigenvectors of the matrix A.
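A numerical sketch of this example (assuming numpy; the test matrices are random illustrative data): form G = yy^T from a maximum eigenvector and spot-check the subgradient inequality λmax(B) ≥ λmax(A) + Tr(G(B − A)).

import numpy as np

def lmax_subgrad(A):
    """Return yy^T ∈ ∂λmax(A) for a unit eigenvector y of the largest eigenvalue."""
    w, V = np.linalg.eigh(A)         # eigenvalues ascending, columns orthonormal
    y = V[:, -1]
    return np.outer(y, y)

A = np.array([[2.0, 1.0], [1.0, 2.0]])
G = lmax_subgrad(A)

rng = np.random.default_rng(1)
for _ in range(100):
    B = rng.standard_normal((2, 2))
    B = (B + B.T) / 2                # random symmetric test matrix
    lhs = np.linalg.eigvalsh(B)[-1]
    rhs = np.linalg.eigvalsh(A)[-1] + np.trace(G @ (B - A))
    assert lhs >= rhs - 1e-9         # λmax(B) >= λmax(A) + <G, B - A>
print("subgradient inequality verified on 100 samples")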
3.6 Minimization
Suppose f(x, y) is defined as the optimal value of the problem

minimize f0(z)
subject to fi(z) ≤ xi, i = 1, . . . , m, Az = y,   (3)

with variable z, where f0, . . . , fm are convex (so that the problem is jointly convex in
x, y, z, and f is convex in (x, y)). Subgradients of f can be related to the dual problem
of (3) as follows.
Suppose we are interested in subdifferentiating f at (x̂, ŷ). We can express the dual
problem of (3) as

maximize g(λ, ν) − x^T λ − y^T ν
subject to λ ⪰ 0,   (4)

where

g(λ, ν) = inf_z ( f0(z) + Σ_{i=1}^{m} λi fi(z) + ν^T Az ).
Suppose strong duality holds for problems (3) and (4) at x = x̂ and y = ŷ, and that the
dual optimum is attained at λ⋆, ν⋆ (for example, because Slater's condition holds). From
the global perturbation inequalities we know that

f(x, y) ≥ f(x̂, ŷ) − λ⋆^T(x − x̂) − ν⋆^T(y − ŷ) for all x, y,

i.e., (−λ⋆, −ν⋆) ∈ ∂f(x̂, ŷ).
4 Quasigradients
If f(x) is quasiconvex, then g is a quasigradient at x0 if

g^T(x − x0) ≥ 0 ⟹ f(x) ≥ f(x0).

Geometrically, g defines a supporting hyperplane to the sublevel set {x | f(x) ≤ f(x0)}.
Note that the set of quasigradients at x0 forms a cone.
Example. Linear fractional function. Let f(x) = (a^T x + b)/(c^T x + d), and suppose
c^T x0 + d > 0. Then g = a − f(x0)c is a quasigradient at x0: if c^T x + d > 0, we have

g^T(x − x0) ≥ 0 ⟹ a^T x + b ≥ f(x0)(c^T x + d) ⟹ f(x) ≥ f(x0).
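A spot check of this implication chain (my own sketch; the data a, b, c, d, the point x0, and the sampling box are arbitrary choices, not from the notes):

import numpy as np

a, b = np.array([1.0, 2.0]), 0.5
c, d = np.array([0.5, -1.0]), 3.0
f = lambda x: (a @ x + b) / (c @ x + d)

x0 = np.array([1.0, 1.0])
g = a - f(x0) * c                      # claimed quasigradient at x0

rng = np.random.default_rng(2)
for _ in range(10000):
    x = x0 + rng.uniform(-1.0, 1.0, size=2)
    if c @ x + d > 0 and g @ (x - x0) >= 0:
        assert f(x) >= f(x0) - 1e-12   # g^T(x - x0) >= 0 forces f(x) >= f(x0)
print("quasigradient property holds on all samples")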
5 Clarke subdifferential
Now we explore a generalization of the notion of subdifferential that enables the analysis of
non-convex and non-smooth functions through convex analysis. We will introduce the Clarke
subdifferential, which is a natural generalization [Cla90] of the subdifferential in terms
of convex hulls.
A well-known result due to Rademacher states that a locally Lipschitz function is differen-
tiable almost everywhere (see e.g., Theorem 9.60 in [RW09]). In particular, every neighbor-
hood of x contains a point y for which ∇f (y) exists. This motivates the following construction
known as the Clarke subdifferential
∂C f(x) = Co{ s ∈ R^n : ∃ xk → x such that ∇f(xk) exists and ∇f(xk) → s }.
We can check that the absolute value function f(x) = |x| satisfies ∂C f(0) = [−1, 1]. Like-
wise, the function −f(x) = −|x| satisfies ∂C(−f)(0) = [−1, 1]. For convex functions, we will see
that the Clarke subdifferential reduces to the ordinary subdifferential defined in Section 1.
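The definition suggests a crude one-dimensional approximation, sketched below: sample finite-difference gradients at points near x where f is differentiable and take their hull (in R, an interval). This is an illustration only, not a robust algorithm; the parameters are arbitrary.

import numpy as np

def clarke_interval_1d(f, x, eps=1e-4, n=2001, h=1e-7):
    """Approximate ∂_C f(x) for f: R -> R by the interval spanned by
    central-difference gradients at points within eps of x."""
    pts = x + np.linspace(-eps, eps, n)
    grads = (f(pts + h) - f(pts - h)) / (2 * h)
    return grads.min(), grads.max()

print(clarke_interval_1d(np.abs, 0.0))                 # ≈ (-1.0, 1.0)
print(clarke_interval_1d(lambda t: -np.abs(t), 0.0))   # ≈ (-1.0, 1.0)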
Compared to the usual directional derivative (2), the Clarke directional derivative (5) is able to
capture the behavior of the function in a neighborhood of x rather than just along a ray
emanating from x.
We have the following generalization of (2):

f°(x, d) = max_{s∈∂C f(x)} s^T d,
which shows that the support function of the Clarke subdifferential at any point x, evaluated
at d, is equal to the Clarke directional derivative at x in the direction d.
Note that the Clarke subdifferential of the sum of two functions is not, in general, equal to
the sum of the subdifferentials (see Figure 5 for an example); this failure of the sum rule is
one of the obstacles to computing Clarke subgradients. Nevertheless, we still have the following
weaker version of the sum rule:

∂C(f1 + f2) ⊆ ∂C f1 + ∂C f2.
It can be shown that the sum rule holds with equality if the functions are subdifferentially
regular, i.e., locally Lipschitz functions for which the ordinary directional derivative (2) and
the Clarke directional derivative (5) coincide: f′(x, d) = f°(x, d) for all x, d. It follows that
convex functions are subdifferentially regular, which implies that for a convex function the
Clarke subdifferential is identical to the ordinary subdifferential; see [28, Proposition 1.12].

Figure 5: Clarke subdifferential of the sum of two non-differentiable functions. Consider
f : R → R given by f(x) = max{x, 0} + min{0, x}, i.e., f = f1 + f2 with f1(x) = max{x, 0}
and f2(x) = min{x, 0}. Computing the Clarke subdifferentials at 0 gives

∂C f1(0) = [0, 1], ∂C f2(0) = [0, 1], ∂C f(0) = {1}.

Observe that the addition rule does not hold in this example since the function f2(x) =
min{x, 0} is not subdifferentially regular [LSM20]:

∂C f(0) = {1} ⊊ ∂C f1(0) + ∂C f2(0) = [0, 2].
Example. Stationary points. Consider the function f(x) = min{x, 0}. Then ∂C f(x) = {0}
for x > 0, and ∂C f(0) = Co{0, 1} = [0, 1], so every x ≥ 0 satisfies the stationarity
condition 0 ∈ ∂C f(x). In particular, x = 0 is Clarke stationary even though it is not a
local minimum of f.
References
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
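[Cla90] F. H. Clarke. Optimization and Nonsmooth Analysis. SIAM, 1990.
[HUL93] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I and II. Springer, 1993.
[HUL01] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.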
[RW09] R. T. Rockafellar and R. J-B. Wets. Variational Analysis, volume 317. Springer
Science & Business Media, 2009.