Convex Optimization in Classification Problems: MIT/ORC Spring Seminar
Laurent El Ghaoui
goal
outline
• convex optimization
• SVMs and robust linear programming
• minimax probability machine
• learning the kernel matrix
convex optimization
standard form:
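a minimal sketch of the usual convention for a convex program in standard form (an assumption about what "standard form" refers to here):

min_x  f_0(x)  :  f_i(x) ≤ 0,  i = 1, …, m,   Ax = b

with f_0, f_1, …, f_m convex functions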
conic optimization
min_x  c^T x  :  Ax = b,  x ∈ K
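as a reminder (standard special cases, not specific to this slide), the choice of cone K recovers the familiar classes:
• K = R^n_+ (nonnegative orthant): linear programming (LP)
• K = {(u, t) : ‖u‖_2 ≤ t} (second-order cone): SOCP
• K = S^n_+ (positive semidefinite matrices): SDP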
conic duality
the dual of
min_x  c^T x  :  Ax = b,  x ∈ K
is
max_y  b^T y  :  c − A^T y ∈ K*
where
K* = {z : ⟨z, x⟩ ≥ 0  ∀ x ∈ K}
is the cone dual to K
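weak duality can be checked in one line from the definitions above: for any primal-feasible x and dual-feasible y,

c^T x − b^T y = c^T x − (Ax)^T y = (c − A^T y)^T x ≥ 0

since c − A^T y ∈ K* and x ∈ K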
robust optimization
example: robust LP
robust LP: SOCP representation
robust LP equivalent to
min_x  c^T x  :  â_i^T x + ‖Γ_i^{1/2} x‖_2 ≤ b_i,   i = 1, …, m
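a minimal CVXPY sketch of this SOCP reformulation on made-up data (the sizes, the Γ_i^{1/2} factors, and the box on x are hypothetical, chosen only to keep the toy problem bounded):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 3, 4                                        # hypothetical problem size
c = rng.standard_normal(n)
a_hat = rng.standard_normal((m, n))                # nominal rows â_i
Gamma_sqrt = [0.1 * np.eye(n) for _ in range(m)]   # Γ_i^{1/2}, isotropic here for illustration
b = np.ones(m)

x = cp.Variable(n)
# robust counterpart of each row: â_i^T x + ‖Γ_i^{1/2} x‖_2 ≤ b_i
constraints = [a_hat[i] @ x + cp.norm(Gamma_sqrt[i] @ x, 2) <= b[i] for i in range(m)]
constraints.append(cp.norm(x, "inf") <= 10)        # box, only to keep the toy LP bounded

prob = cp.Problem(cp.Minimize(c @ x), constraints)
prob.solve()
print("robust optimal x:", x.value)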
LP with Gaussian coefficients
assuming a ∼ N(â, Γ), the chance constraint
Prob{a^T x ≤ b} ≥ 1 − ε
is equivalent to:
â^T x + κ ‖Γ^{1/2} x‖_2 ≤ b
where κ = Φ^{-1}(1 − ε) and Φ is the c.d.f. of N(0, 1)
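the equivalence is a one-line computation with the quantities above: since a^T x ∼ N(â^T x, x^T Γ x),

Prob{a^T x ≤ b} = Φ( (b − â^T x) / ‖Γ^{1/2} x‖_2 ) ≥ 1 − ε   ⟺   b − â^T x ≥ Φ^{-1}(1 − ε) ‖Γ^{1/2} x‖_2

(for Γ^{1/2} x ≠ 0; otherwise the constraint is just â^T x ≤ b)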
hence, the chance-constrained LP is again an SOCP (for ε ≤ 1/2, so that κ ≥ 0)
LP with random coefficients
Chebyshev inequality: assuming only that a has mean â and covariance Γ, the worst-case chance constraint
inf_{a ∼ (â, Γ)}  Prob{a^T x ≤ b} ≥ 1 − ε
is equivalent to:
â^T x + κ ‖Γ^{1/2} x‖_2 ≤ b
where
κ = √((1 − ε)/ε)
leads to an SOCP similar to the ones obtained previously
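a sketch of where κ comes from, via the one-sided Chebyshev (Cantelli) bound: with σ² = x^T Γ x,

Prob{a^T x > b} ≤ σ² / (σ² + (b − â^T x)²)   whenever b ≥ â^T x

requiring the right-hand side to be at most ε gives b − â^T x ≥ σ √((1 − ε)/ε); the bound is attained by a worst-case two-point distribution, hence the equivalence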
outline
• convex optimization
• SVMs and robust linear programming
• minimax probability machine
• kernel optimization
SVMs: setup
SVMs: robust optimization interpretation
variations
separation with hypercube uncertainty
assume each data point is unknown-but-bounded in a hypercube C_i:
x_i ∈ C_i := {x̂_i + ρP u : ‖u‖_∞ ≤ 1}
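for any classifier direction a, the worst case over C_i is available in closed form (a standard computation, stated here as a sketch):

sup_{‖u‖_∞ ≤ 1}  a^T (x̂_i + ρP u) = a^T x̂_i + ρ ‖P^T a‖_1

so the robust separation constraints pick up an ℓ_1-type term ρ ‖P^T a‖_1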
separation with ellipsoidal uncertainty
x_i ∈ E_i := {x̂_i + ρP u : ‖u‖_2 ≤ 1}
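similarly, for the ellipsoidal sets (again a standard computation):

sup_{‖u‖_2 ≤ 1}  a^T (x̂_i + ρP u) = a^T x̂_i + ρ ‖P^T a‖_2

which yields second-order cone constraints of the same type as in the robust LP above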
outline
• convex optimization
• SVMs and robust linear programming
• minimax probability machine
• kernel optimization
minimax probability machine
goal:
• make only minimal assumptions about the data-generating process
• do not assume Gaussian distributions
• use second-moment analysis of the two classes
MPMs: optimization problem
dual problem
robust optimization interpretation
[figure: separating hyperplane a^T x − b = 0 between the class means x̂+ and x̂−]
experimental results
variations
inf_{x ∼ (x̂+, Γ+)}  Prob{x ∈ Q} ≥ 1 − ε
inf_{x ∼ (x̂−, Γ−)}  Prob{x ∉ Q} ≥ 1 − ε
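when Q is a half-space determined by the classifier, say Q = {x : a^T x ≤ b}, the Chebyshev bound from earlier turns each worst-case constraint into a second-order cone condition (sketch):

b − a^T x̂+ ≥ κ ‖Γ+^{1/2} a‖_2    and    a^T x̂− − b ≥ κ ‖Γ−^{1/2} a‖_2,    with κ = √((1 − ε)/ε)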
outline
• convex optimization
• SVMs and robust linear programming
• minimax probability machine
• learning the kernel matrix
transduction
kernel methods
a^T φ(x) = b
kernel methods: idea of proof
at the optimum, a is in the range of the labeled data:
a = Σ_i λ_i x_i
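this is the usual kernel-trick step: assuming the λ_i multiply the mapped points φ(x_i), the decision function involves the data only through inner products,

a^T φ(x) = Σ_i λ_i ⟨φ(x_i), φ(x)⟩ = Σ_i λ_i K(x_i, x)

so only the kernel values K(x_i, x) are needed, never φ itself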
partition training/test
kernel optimization
kernel optimization and semidefinite programming
main idea: the kernel can be described via the Gram matrix of the data points,
which is a positive semidefinite matrix
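concretely, with K_ij = ⟨φ(x_i), φ(x_j)⟩ one has, for every vector v,

v^T K v = ‖ Σ_i v_i φ(x_i) ‖_2² ≥ 0,   i.e.  K ⪰ 0

so searching over kernels on a fixed data set amounts to searching over positive semidefinite matrices, which is exactly what semidefinite programming handles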
margin of SVM classifier
geometrically:
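assuming the usual canonical normalization y_i (a^T φ(x_i) − b) ≥ 1 (a convention, not stated on this slide), the margin is

γ = 1 / ‖a‖_2

i.e. the distance from the separating hyperplane a^T φ(x) = b to the closest training point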
(can work with "soft" margin when data is not linearly separable)
generalization error
how well will the SVM classifier work on the test set?
from learning theory (Bartlett; Rademacher-complexity bounds), the generalization error is
bounded above by a quantity proportional to
√(Tr K) / γ(K_tr)
where γ(K_tr) is the margin of the SVM classifier computed with K_tr, the training-set block of the kernel matrix K
hence, the constraints
Tr K = c,   γ(K_tr)^{-1} ≤ w
keep this bound under control
margin constraint
γ(K_tr) ≥ γ
can be expressed as a linear matrix inequality in which
• G(K_tr) is linear in K_tr
• e = vector of ones
• λ, ν are new variables
avoiding overfitting
optimizing kernels: example problem
K ⪰ 0,   Tr K = c
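a minimal CVXPY sketch of this kind of problem on made-up data; K is restricted to a linear combination of three fixed kernels (as in the experiments that follow), and kernel-target alignment is used as a simple stand-in objective in place of the margin-based criterion of the previous slides:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 6                                         # hypothetical number of data points
y = np.array([1, -1, 1, 1, -1, -1])           # hypothetical labels

def random_psd(n):
    # random PSD matrix standing in for a precomputed kernel matrix
    B = rng.standard_normal((n, n))
    return B @ B.T

K_list = [random_psd(n) for _ in range(3)]    # K1, K2, K3
c = float(n)                                  # trace budget

mu = cp.Variable(3)                           # combination weights
K = cp.Variable((n, n), PSD=True)             # learned kernel matrix, K ⪰ 0

constraints = [K == sum(mu[i] * K_list[i] for i in range(3)),
               cp.trace(K) == c]              # Tr K = c
# stand-in objective: alignment of K with the label matrix y y^T
objective = cp.Maximize(cp.sum(cp.multiply(np.outer(y, y), K)))

cp.Problem(objective, constraints).solve()
print("kernel weights mu:", mu.value)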
experimental results
                     K1       K2       K3       K*
Breast cancer (d = 2, σ = 0.5)
  margin           0.010    0.136      -      0.300
  TSE               19.7     28.8      -       11.4
Sonar (d = 2, σ = 0.1)
  margin           0.035    0.198    0.006    0.352
  TSE               15.5     19.4     21.9     13.8
Heart (d = 2, σ = 0.5)
  margin             -      0.159      -      0.285
  TSE                -       49.2      -       36.6

(TSE = test-set error; "-" = not available)
wrap-up
see also