Three Strategies To Derive A Dual Problem
Ryota Tomioka
May 18, 2010
There are three strategies for deriving a dual problem: (i) using equality constraints, (ii) using conic constraints, and (iii) using Fenchel’s duality.
Using a group-lasso regularized support vector machine (=MKL) problem
as an example, we see how these strategies can be used to derive dual problems
that look different but are actually equivalent.
More specifically, we are interested in a dual of the following problem:
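\[
\text{(P)} \quad \underset{w \in \mathbb{R}^n}{\text{minimize}} \quad \sum_{i=1}^{m} \ell_H\bigl(x^{(i)\top} w,\, y^{(i)}\bigr) + \lambda \sum_{g \in G} \|w_g\|_2 .
\]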
Following the above recipe, we first notice that there is no equality constraint
in the above primal problem (P). Thus we introduce an auxiliary variable z ∈
Rm , and rewrite (P) as follows:
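\[
\begin{aligned}
\text{(P}_1\text{)} \quad \underset{w \in \mathbb{R}^n,\, z \in \mathbb{R}^m}{\text{minimize}} \quad & \sum_{i=1}^{m} (1 - z_i)_+ + \lambda \sum_{g \in G} \|w_g\|_2 , \\
\text{subject to} \quad & y^{(i)} x^{(i)\top} w = z_i \quad (i = 1, \ldots, m).
\end{aligned}
\]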
Note that the way we introduce equality constraints is not unique. For example,
we could have (1 − yi zi )+ in the objective subject to x(i)⊤ w = zi . Nevertheless,
as long as the mapping is one to one, this choice is not important. The current
choice is made to mimic the most common representation of SVM dual.
Now we are ready to form a Lagrangian L(w, z, α), where α = (α1, . . . , αm) are
Lagrangian multipliers associated with the m equality constraints in (P1). The
Lagrangian can be written as follows:
\[
L(w, z, \alpha) = \sum_{i=1}^{m} (1 - z_i)_+ + \lambda \sum_{g \in G} \|w_g\|_2 + \sum_{i=1}^{m} \alpha_i \bigl(z_i - y^{(i)} x^{(i)\top} w\bigr).
\]
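As a sanity check on this Lagrangian, the following minimal sketch evaluates it on a small example. We assume here that ℓH is the hinge loss (1 − y x⊤w)+, which is what the form of (P1) suggests; the toy data, the grouping, and all function names are hypothetical. At any point where the equality constraints hold, the multiplier term vanishes and the Lagrangian coincides with the primal objective, whatever α is.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def group_penalty(w, groups, lam):
    """lam * sum over groups g of ||w_g||_2."""
    return lam * sum(math.sqrt(sum(w[j] ** 2 for j in g)) for g in groups)

def margins(w, X, y):
    """y_i * x_i^T w for every sample i."""
    return [yi * dot(xi, w) for xi, yi in zip(X, y)]

def primal_objective(w, X, y, groups, lam):
    """Objective of (P), assuming the hinge loss."""
    return sum(max(0.0, 1.0 - t) for t in margins(w, X, y)) + group_penalty(w, groups, lam)

def lagrangian(w, z, alpha, X, y, groups, lam):
    """L(w, z, alpha) for the equality-constrained problem (P1)."""
    loss = sum(max(0.0, 1.0 - zi) for zi in z)
    coupling = sum(a * (zi - t) for a, zi, t in zip(alpha, z, margins(w, X, y)))
    return loss + group_penalty(w, groups, lam) + coupling

# Hypothetical toy data: m = 3 samples, n = 4 features, two groups of two features.
X = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, -1.0, 0.0], [-1.0, 2.0, 0.0, 1.0]]
y = [1.0, -1.0, 1.0]
groups = [[0, 1], [2, 3]]
lam = 0.5
w = [0.2, -0.1, 0.3, 0.0]
alpha = [0.3, 0.9, 0.1]

z_feasible = margins(w, X, y)  # satisfies the equality constraints of (P1)
print(primal_objective(w, X, y, groups, lam))
print(lagrangian(w, z_feasible, alpha, X, y, groups, lam))
```

The two printed values agree because z_feasible satisfies the constraints; perturbing any entry of z_feasible makes the multiplier term nonzero.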
2 Using conic constraints
The second strategy to derive a dual problem is based on finding a conic struc-
ture in the primal problem. A cone K is a subset of some vector space such
that if x ∈ K, for any nonnegative α, we have αx ∈ K.
The most common cone we encounter is the positive orthant cone; i.e.,
K = {x ∈ Rn : x ≥ 0}.
The dual cone K∗ of a cone K is defined as
K∗ = {y ∈ Rn : y⊤x ≥ 0 (∀x ∈ K)}.
In other words, the dual cone is a collection of vectors that have nonnegative
inner products with all the vectors in K. Note that both the positive orthant
cone and the second order cone are self-dual; i.e., K ∗ = K.
Why is a cone useful? Because when we minimize a Lagrangian and see a term
like f(α)⊤x with x ∈ K, we know that the minimum is zero if f(α) ∈ K∗ and
−∞ otherwise. Indeed, if f(α) ∉ K∗ we can find a vector x ∈ K such that
f(α)⊤x < 0, and even if f(α)⊤x is very close to zero, we can choose a very
large scaling factor t > 0 and drive f(α)⊤(tx) to −∞, because tx ∈ K.
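The argument can be illustrated for the positive orthant cone, which is self-dual. In the minimal sketch below, v plays the role of f(α) and t is the nonnegative scaling factor; the function names are hypothetical. When v has a negative entry, we pick x ∈ K supported on those entries and scale it up.

```python
def in_dual_of_orthant(v):
    """The positive orthant is self-dual, so v is in K* iff every entry is >= 0."""
    return all(vi >= 0.0 for vi in v)

def values_along_worst_ray(v, scales=(1.0, 10.0, 100.0, 1000.0)):
    """Return v^T (t x) for each scale t, with x in K chosen to make v^T x negative.

    x is the indicator vector of the negative entries of v; it is nonnegative,
    so x (and every t x with t >= 0) belongs to the positive orthant.
    """
    x = [1.0 if vi < 0.0 else 0.0 for vi in v]
    base = sum(vi * xi for vi, xi in zip(v, x))
    return [t * base for t in scales]

# v in K*: the linear term cannot be pushed below zero along this ray.
print(in_dual_of_orthant([1.0, 0.0, 2.0]), values_along_worst_ray([1.0, 0.0, 2.0]))
# v not in K*: the values decrease without bound as t grows.
print(in_dual_of_orthant([1.0, -0.5, 2.0]), values_along_worst_ray([1.0, -0.5, 2.0]))
```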
Let us consider a conic programming problem
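\[
\begin{aligned}
\text{(P}_C\text{)} \quad \underset{x \in \mathbb{R}^n}{\text{minimize}} \quad & c^\top x, \\
\text{subject to} \quad & Ax = b, \quad x \in K.
\end{aligned}
\]
Its dual problem can be written as
\[
\begin{aligned}
\text{(D}_C\text{)} \quad \underset{\alpha \in \mathbb{R}^m}{\text{maximize}} \quad & b^\top \alpha, \\
\text{subject to} \quad & c - A^\top \alpha \in K^*,
\end{aligned}
\]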
where K ∗ is the dual cone of K. The derivation of (DC ) (and some generaliza-
tion) is given in Appendix A.
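To make the pair (PC) and (DC) concrete, here is a minimal sketch for the special case in which K is the positive orthant, so that both problems are linear programs. The data A, b, c and the test points are hypothetical. For any primal-feasible x and dual-feasible α, the gap c⊤x − b⊤α equals (c − A⊤α)⊤x, which is nonnegative because it pairs a vector in K∗ with a vector in K.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def rmatvec(A, alpha):
    """Compute A^T alpha."""
    return [sum(A[i][j] * alpha[i] for i in range(len(A))) for j in range(len(A[0]))]

def primal_feasible(A, b, x, tol=1e-12):
    """Ax = b and x in the positive orthant."""
    return all(abs(r - bi) <= tol for r, bi in zip(matvec(A, x), b)) and all(xi >= 0.0 for xi in x)

def dual_slack(A, c, alpha):
    """c - A^T alpha, which must lie in K* (here, the positive orthant again)."""
    return [cj - aj for cj, aj in zip(c, rmatvec(A, alpha))]

def dual_feasible(A, c, alpha):
    return all(s >= 0.0 for s in dual_slack(A, c, alpha))

# Hypothetical data: minimize x1 + 2*x2 subject to x1 + x2 = 1, x >= 0.
A = [[1.0, 1.0]]
b = [1.0]
c = [1.0, 2.0]

x, alpha = [0.25, 0.75], [0.5]  # a feasible but suboptimal pair
gap = dot(c, x) - dot(b, alpha)
print(gap, dot(dual_slack(A, c, alpha), x))

x_opt, alpha_opt = [1.0, 0.0], [1.0]  # the gap closes at this pair
print(dot(c, x_opt) - dot(b, alpha_opt))
```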
Now we rewrite the primal problem (P) as a conic programming problem as
follows:
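\[
\begin{aligned}
\text{(P}_2\text{)} \quad \underset{w \in \mathbb{R}^n,\ \xi \in \mathbb{R}^m,\ \tilde{\xi} \in \mathbb{R}^m,\ u_g \in \mathbb{R}\ (g \in G)}{\text{minimize}} \quad & \sum_{i=1}^{m} \xi_i + \lambda \sum_{g \in G} u_g, \\
\text{subject to} \quad & \xi_i - \tilde{\xi}_i + y^{(i)} x^{(i)\top} w = 1 \quad (i = 1, \ldots, m), \\
& \xi \ge 0, \quad \tilde{\xi} \ge 0, \quad u_g \ge \|w_g\|_2 \quad (\forall g \in G).
\end{aligned}
\]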
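By defining
\[
\begin{aligned}
x &= (\xi^\top, \tilde{\xi}^\top, u^\top, w^\top)^\top, \\
c &= (1_m^\top, 0_m^\top, \lambda 1_{|G|}^\top, 0_n^\top)^\top, \\
A &= \begin{pmatrix}
1 & & & -1 & & & & y^{(1)} x^{(1)\top} \\
& \ddots & & & \ddots & & 0 & \vdots \\
& & 1 & & & -1 & & y^{(m)} x^{(m)\top}
\end{pmatrix}, \\
b &= 1_m,
\end{aligned}
\]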
we notice that (P2) is a conic programming problem. In fact, the cone K can
be written as
\[
K = \bigl\{ (\xi^\top, \tilde{\xi}^\top, u^\top, w^\top)^\top \in \mathbb{R}^{2m+n+|G|} : \xi \ge 0,\ \tilde{\xi} \ge 0,\ u_g \ge \|w_g\|_2 \ (\forall g \in G) \bigr\}.
\]
Note that K is self dual; i.e., K ∗ = K. Accordingly the dual of (P2 ) can be
written as follows:
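\[
\begin{aligned}
\text{(D}_2\text{)} \quad \text{maximize} \quad & \sum_{i=1}^{m} \alpha_i, \\
\text{subject to} \quad & 1_m - \alpha \ge 0_m, \\
& 0_m + \alpha \ge 0_m, \\
& \lambda \ge \Bigl\| \sum_{i=1}^{m} \alpha_i y^{(i)} x_g^{(i)} \Bigr\|_2 \quad (\forall g \in G).
\end{aligned}
\]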
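The constraints of (D2) are easy to check numerically. The following minimal sketch tests dual feasibility on a small example and compares the dual objective with the primal objective of (P) at a few points w; by weak duality the primal value can never be smaller. We assume the hinge loss for ℓH, and the toy data and function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal_objective(w, X, y, groups, lam):
    """Hinge loss plus group-lasso penalty, as in (P) under the hinge-loss assumption."""
    loss = sum(max(0.0, 1.0 - yi * dot(xi, w)) for xi, yi in zip(X, y))
    penalty = lam * sum(math.sqrt(sum(w[j] ** 2 for j in g)) for g in groups)
    return loss + penalty

def dual_feasible(alpha, X, y, groups, lam, tol=1e-12):
    """Check 0 <= alpha_i <= 1 and ||sum_i alpha_i y_i x_g^(i)||_2 <= lam for every group g."""
    if any(a < -tol or a > 1.0 + tol for a in alpha):
        return False
    n = len(X[0])
    v = [sum(a * yi * xi[j] for a, xi, yi in zip(alpha, X, y)) for j in range(n)]
    return all(math.sqrt(sum(v[j] ** 2 for j in g)) <= lam + tol for g in groups)

def dual_objective(alpha):
    return sum(alpha)

# Hypothetical toy data: m = 3 samples, n = 4 features, two groups of two features.
X = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, -1.0, 0.0], [-1.0, 2.0, 0.0, 1.0]]
y = [1.0, -1.0, 1.0]
groups = [[0, 1], [2, 3]]
lam = 0.5

alpha = [0.1, 0.1, 0.1]
candidates = [[0.0, 0.0, 0.0, 0.0], [0.2, -0.1, 0.3, 0.0], [-0.1, 0.3, 0.1, 0.2]]

print(dual_feasible(alpha, X, y, groups, lam), dual_objective(alpha))
for w in candidates:
    print(primal_objective(w, X, y, groups, lam))
```

Every primal value printed is at least the dual objective of the feasible α; a vector such as α = (1, 1, 1) violates the group-norm constraint for this data and is rejected.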
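\[
\text{(P}_3\text{)} \quad \underset{w \in \mathbb{R}^n}{\text{minimize}} \quad f(Aw) + g(w),
\]
where
\[
f(z) = \sum_{i=1}^{m} (1 - z_i)_+ , \qquad
g(w) = \lambda \sum_{g \in G} \|w_g\|_2 , \qquad
A = \begin{pmatrix} y^{(1)} x^{(1)\top} \\ \vdots \\ y^{(m)} x^{(m)\top} \end{pmatrix}.
\]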
Using Fenchel’s duality theorem, the dual problem of (P3 ) can be written as
follows:
\[
\text{(D}'_3\text{)} \quad \underset{\alpha \in \mathbb{R}^m}{\text{maximize}} \quad - f^*(-\alpha) - g^*(A^\top \alpha).
\]
[Figure 1: The shapes of convex conjugate functions f∗(−α) and g∗(y) in 1D. Panels: f∗(−αi), with αi = 0 and αi = 1 marked on the axis; g∗(yg), with the region ∥yg∥ ≤ λ marked.]
Next we show that the above lower bound is tight. In fact, if ∥yg∥2 ≤ λ, we have
yg⊤wg ≤ λ∥wg∥2 (Cauchy-Schwarz inequality), which implies yg⊤wg − λ∥wg∥2 ≤ 0.
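The behaviour of the two conjugates can be checked numerically in one dimension. The minimal sketch below approximates each supremum on a bounded grid; the grid radius and resolution are arbitrary choices, and the function names are hypothetical. Inside the admissible region the value does not depend on the radius, whereas outside it the value grows with the radius, which is how an infinite conjugate shows up on a finite grid.

```python
def hinge_conjugate_at_minus_alpha(alpha, radius=10.0, steps_per_unit=100):
    """Grid approximation of f*(-alpha) = sup_z (-alpha*z - (1 - z)_+) over |z| <= radius."""
    n = int(radius * steps_per_unit)
    best = float("-inf")
    for k in range(-n, n + 1):
        z = k / steps_per_unit
        best = max(best, -alpha * z - max(0.0, 1.0 - z))
    return best

def norm_conjugate_1d(yv, lam, radius=10.0, steps_per_unit=100):
    """Grid approximation of g*(y) = sup_w (y*w - lam*|w|) over |w| <= radius."""
    n = int(radius * steps_per_unit)
    best = float("-inf")
    for k in range(-n, n + 1):
        w = k / steps_per_unit
        best = max(best, yv * w - lam * abs(w))
    return best

# Inside the admissible regions (0 <= alpha <= 1 and |y| <= lam) the values are finite.
print(hinge_conjugate_at_minus_alpha(0.3), norm_conjugate_1d(0.5, lam=1.0))
# Outside them the grid value keeps growing with the radius.
print(hinge_conjugate_at_minus_alpha(1.5, radius=10.0), hinge_conjugate_at_minus_alpha(1.5, radius=100.0))
print(norm_conjugate_1d(2.0, lam=1.0, radius=10.0), norm_conjugate_1d(2.0, lam=1.0, radius=100.0))
```

For 0 ≤ α ≤ 1 the first function returns −α, so −f∗(−α) contributes α to the dual objective.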
Finally, substituting the above f ∗ and g ∗ into (D′3 ), we obtain the following
dual problem.
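\[
\text{(D}_3\text{)} \quad \text{maximize} \quad
\begin{cases}
\sum_{i=1}^{m} \alpha_i & \text{if } 0 \le \alpha_i \le 1 \ (i = 1, \ldots, m) \text{ and } \Bigl\| \sum_{i=1}^{m} \alpha_i y^{(i)} x_g^{(i)} \Bigr\|_2 \le \lambda \ (\forall g \in G), \\
-\infty & \text{(otherwise)}.
\end{cases}
\]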
Note that the dual problem (D3) obtained with the above f∗ and g∗ is equivalent to
both (D1) and (D2).
Figure 1 shows the rough shape of conjugate functions f ∗ (−α) and g ∗ (y).
References
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge
University Press, 2004.
R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
The dual function d(α) is obtained by minimizing the Lagrangian L(x, α)
with respect to x as follows:
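\[
\begin{aligned}
d(\alpha) &= \inf_{x \in K} \bigl( c^\top x + (b - Ax)^\top \alpha \bigr) \\
&= b^\top \alpha + \inf_{x \in K} (c - A^\top \alpha)^\top x.
\end{aligned}
\]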
subject to Ax = b,
z = x, x ∈ K.
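The Lagrangian L(x, z, α, β) of (P′′C) can be written as follows:
\[
L(x, z, \alpha, \beta) = f(z) + \alpha^\top (b - Ax) + \beta^\top (x - z),
\]
where α ∈ Rm and β ∈ Rn are Lagrangian multipliers, and the cone constraint x ∈ K is kept explicit.
The dual function d(α, β) can be obtained by minimizing L(x, z, α, β) with respect to x ∈ K and z as follows:
\[
\begin{aligned}
d(\alpha, \beta) &= \inf_{x \in K,\, z} \bigl( f(z) + \alpha^\top (b - Ax) + \beta^\top (x - z) \bigr) \\
&= b^\top \alpha + \inf_{z \in \mathbb{R}^n} \bigl( f(z) - \beta^\top z \bigr) + \inf_{x \in K} \bigl( \beta - A^\top \alpha \bigr)^\top x \\
&= b^\top \alpha - \sup_{z \in \mathbb{R}^n} \bigl( \beta^\top z - f(z) \bigr) + \inf_{x \in K} \bigl( \beta - A^\top \alpha \bigr)^\top x \\
&= b^\top \alpha - f^*(\beta) + \inf_{x \in K} \bigl( \beta - A^\top \alpha \bigr)^\top x,
\end{aligned}
\]
where f∗ is the convex conjugate of f. Note that the minimization with respect
to x takes a finite value zero if and only if β − A⊤α ∈ K∗ (otherwise d(α, β) =
−∞).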
Accordingly, the dual problem is written as follows:
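\[
\begin{aligned}
\text{(D}''_C\text{)} \quad \underset{\alpha \in \mathbb{R}^m,\, \beta \in \mathbb{R}^n}{\text{maximize}} \quad & b^\top \alpha - f^*(\beta), \\
\text{subject to} \quad & \beta - A^\top \alpha \in K^*.
\end{aligned}
\]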
subject to Ax = z.
The Lagrangian L(x, z, α) of the equality constrained problem (PF ) can be
written as follows:
L(x, z, α) = f (z) + g(x) + α⊤ (z − Ax).
Minimizing the Lagrangian L(x, z, α) with respect to x and z we obtain the
dual function d(α) as follows:
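\[
\begin{aligned}
d(\alpha) &= \inf_{x, z} \bigl( f(z) + g(x) + \alpha^\top (z - Ax) \bigr) \\
&= \inf_{z} \bigl( f(z) + \alpha^\top z \bigr) + \inf_{x} \bigl( g(x) - \alpha^\top A x \bigr) \\
&= - \sup_{z} \bigl( (-\alpha)^\top z - f(z) \bigr) - \sup_{x} \bigl( (A^\top \alpha)^\top x - g(x) \bigr) \\
&= - f^*(-\alpha) - g^*(A^\top \alpha).
\end{aligned}
\]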
If both f and g are convex, (PF) satisfies Slater’s condition [Boyd and Vandenberghe, 2004] and strong duality holds. Therefore we have Eq. (1).
□
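The duality can be checked by hand on a one-dimensional example. The minimal sketch below uses hypothetical smooth functions f(z) = (z − 1)²/2 and g(x) = x²/2 with a scalar A = a, which are not the functions of the MKL problem; they are chosen only because their conjugates f∗(β) = β + β²/2 and g∗(y) = y²/2 are available in closed form. The primal minimizer is x = a/(1 + a²) and the dual maximizer is α = 1/(1 + a²), and both objectives take the value 1/(2(1 + a²)) there.

```python
def f(z):
    return 0.5 * (z - 1.0) ** 2

def g(x):
    return 0.5 * x ** 2

def f_conj(beta):
    """Convex conjugate of f: sup_z (beta*z - f(z)) = beta + beta^2 / 2."""
    return beta + 0.5 * beta ** 2

def g_conj(yv):
    """Convex conjugate of g: sup_x (y*x - g(x)) = y^2 / 2."""
    return 0.5 * yv ** 2

def primal(x, a):
    return f(a * x) + g(x)

def dual(alpha, a):
    """d(alpha) = -f*(-alpha) - g*(A^T alpha) with scalar A = a."""
    return -f_conj(-alpha) - g_conj(a * alpha)

a = 2.0
x_star = a / (1.0 + a * a)          # minimizer of the primal objective
alpha_star = 1.0 / (1.0 + a * a)    # maximizer of the dual objective

print(primal(x_star, a), dual(alpha_star, a))

# Weak duality on a coarse grid: every primal value dominates every dual value.
grid = [k / 10.0 for k in range(-30, 31)]
print(min(primal(x, a) for x in grid) >= max(dual(al, a) for al in grid))
```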
Fenchel’s duality theorem can be generalized to the following problem:
\[
\text{(P}'_F\text{)} \quad \underset{x}{\text{minimize}} \quad f(Ax) + g(Bx),
\]
subject to A⊤ α = B ⊤ β.
If B = In (the identity matrix), the above duality is equivalent to Fenchel’s duality
theorem.