Convex Optimization in Classification Problems: MIT/ORC Spring Seminar
Laurent El Ghaoui
goal
outline
• convex optimization
• SVMs and robust linear programming
• minimax probability machine
• learning the kernel matrix
convex optimization
standard form:
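a minimal sketch of the usual convention for a convex program in standard form (an assumption about what "standard form" refers to here):

min_x  f_0(x)  :  f_i(x) ≤ 0,  i = 1, …, m,   Ax = b

with f_0, f_1, …, f_m convex functions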
conic optimization
min_x  c^T x  :  Ax = b,  x ∈ K
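as a reminder (standard special cases, not specific to this slide), the choice of cone K recovers the familiar classes:
• K = R^n_+ (nonnegative orthant): linear programming (LP)
• K = {(u, t) : ‖u‖_2 ≤ t} (second-order cone): SOCP
• K = S^n_+ (positive semidefinite matrices): SDP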
conic duality
the dual of
min_x  c^T x  :  Ax = b,  x ∈ K
is
max_y  b^T y  :  c − A^T y ∈ K*
where
K* = {z : ⟨z, x⟩ ≥ 0  ∀ x ∈ K}
is the cone dual to K
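weak duality can be checked in one line from the definitions above: for any primal-feasible x and dual-feasible y,

c^T x − b^T y = c^T x − (Ax)^T y = (c − A^T y)^T x ≥ 0

since c − A^T y ∈ K* and x ∈ K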
robust optimization
example: robust LP
robust LP: SOCP representation
robust LP equivalent to
min_x  c^T x  :  â_i^T x + ‖Γ_i^{1/2} x‖_2 ≤ b_i,   i = 1, …, m
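a minimal CVXPY sketch of this SOCP reformulation on made-up data (the sizes, the Γ_i^{1/2} factors, and the box on x are hypothetical, chosen only to keep the toy problem bounded):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 3, 4                                        # hypothetical problem size
c = rng.standard_normal(n)
a_hat = rng.standard_normal((m, n))                # nominal rows â_i
Gamma_sqrt = [0.1 * np.eye(n) for _ in range(m)]   # Γ_i^{1/2}, isotropic here for illustration
b = np.ones(m)

x = cp.Variable(n)
# robust counterpart of each row: â_i^T x + ‖Γ_i^{1/2} x‖_2 ≤ b_i
constraints = [a_hat[i] @ x + cp.norm(Gamma_sqrt[i] @ x, 2) <= b[i] for i in range(m)]
constraints.append(cp.norm(x, "inf") <= 10)        # box, only to keep the toy LP bounded

prob = cp.Problem(cp.Minimize(c @ x), constraints)
prob.solve()
print("robust optimal x:", x.value)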
LP with Gaussian coefficients
assuming a ∼ N(â, Γ), the chance constraint
Prob{a^T x ≤ b} ≥ 1 − ε
is equivalent to:
â^T x + κ ‖Γ^{1/2} x‖_2 ≤ b
where κ = Φ^{-1}(1 − ε) and Φ is the c.d.f. of N(0, 1)
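the equivalence is a one-line computation with the quantities above: since a^T x ∼ N(â^T x, x^T Γ x),

Prob{a^T x ≤ b} = Φ( (b − â^T x) / ‖Γ^{1/2} x‖_2 ) ≥ 1 − ε   ⟺   b − â^T x ≥ Φ^{-1}(1 − ε) ‖Γ^{1/2} x‖_2

(for Γ^{1/2} x ≠ 0; otherwise the constraint is just â^T x ≤ b)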
hence, the chance-constrained LP is again an SOCP (for ε ≤ 1/2, so that κ ≥ 0)
LP with random coefficients
Chebyshev inequality: assuming only that a has mean â and covariance Γ, the worst-case chance constraint
inf_{a ∼ (â, Γ)}  Prob{a^T x ≤ b} ≥ 1 − ε
is equivalent to:
â^T x + κ ‖Γ^{1/2} x‖_2 ≤ b
where
κ = √((1 − ε)/ε)
leads to an SOCP similar to the ones obtained previously
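a sketch of where κ comes from, via the one-sided Chebyshev (Cantelli) bound: with σ² = x^T Γ x,

Prob{a^T x > b} ≤ σ² / (σ² + (b − â^T x)²)   whenever b ≥ â^T x

requiring the right-hand side to be at most ε gives b − â^T x ≥ σ √((1 − ε)/ε); the bound is attained by a worst-case two-point distribution, hence the equivalence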
outline
• convex optimization
• SVMs and robust linear programming
• minimax probability machine
• kernel optimization
SVMs: setup
SVMs: robust optimization interpretation
variations
separation with hypercube uncertainty
assume each data point is unknown-but-bounded in a hypercube C_i:
x_i ∈ C_i := {x̂_i + ρP u : ‖u‖_∞ ≤ 1}
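for any classifier direction a, the worst case over C_i is available in closed form (a standard computation, stated here as a sketch):

sup_{‖u‖_∞ ≤ 1}  a^T (x̂_i + ρP u) = a^T x̂_i + ρ ‖P^T a‖_1

so the robust separation constraints pick up an ℓ_1-type term ρ ‖P^T a‖_1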
separation with ellipsoidal uncertainty
x_i ∈ E_i := {x̂_i + ρP u : ‖u‖_2 ≤ 1}
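similarly, for the ellipsoidal sets (again a standard computation):

sup_{‖u‖_2 ≤ 1}  a^T (x̂_i + ρP u) = a^T x̂_i + ρ ‖P^T a‖_2

which yields second-order cone constraints of the same type as in the robust LP above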
outline
• convex optimization
• SVMs and robust linear programming
• minimax probability machine
• kernel optimization
minimax probability machine
goal:
• make only minimal assumptions about the data-generating process
• do not assume Gaussian distributions
• use second-moment analysis of the two classes
MPMs: optimization problem
dual problem
robust optimization interpretation
[figure: separating hyperplane a^T x − b = 0 between the class means x̂+ and x̂−]
experimental results
variations
inf_{x ∼ (x̂+, Γ+)}  Prob{x ∈ Q} ≥ 1 − ε
inf_{x ∼ (x̂−, Γ−)}  Prob{x ∉ Q} ≥ 1 − ε
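when Q is a half-space determined by the classifier, say Q = {x : a^T x ≤ b}, the Chebyshev bound from earlier turns each worst-case constraint into a second-order cone condition (sketch):

b − a^T x̂+ ≥ κ ‖Γ+^{1/2} a‖_2    and    a^T x̂− − b ≥ κ ‖Γ−^{1/2} a‖_2,    with κ = √((1 − ε)/ε)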
outline
• convex optimization
• SVMs and robust linear programming
• minimax probability machine
• learning the kernel matrix
transduction
kernel methods
a^T φ(x) = b
kernel methods: idea of proof
at the optimum, a is in the range of the labeled data:
a = Σ_i λ_i x_i
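this is the usual kernel-trick step: assuming the λ_i multiply the mapped points φ(x_i), the decision function involves the data only through inner products,

a^T φ(x) = Σ_i λ_i ⟨φ(x_i), φ(x)⟩ = Σ_i λ_i K(x_i, x)

so only the kernel values K(x_i, x) are needed, never φ itself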
partition training/test
kernel optimization
kernel optimization and semidefinite programming
main idea: the kernel can be described via the Gram matrix of the data points,
which is a positive semidefinite matrix
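concretely, with K_ij = ⟨φ(x_i), φ(x_j)⟩ one has, for every vector v,

v^T K v = ‖ Σ_i v_i φ(x_i) ‖_2² ≥ 0,   i.e.  K ⪰ 0

so searching over kernels on a fixed data set amounts to searching over positive semidefinite matrices, which is exactly what semidefinite programming handles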
margin of SVM classifier
geometrically:
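assuming the usual canonical normalization y_i (a^T φ(x_i) − b) ≥ 1 (a convention, not stated on this slide), the margin is

γ = 1 / ‖a‖_2

i.e. the distance from the separating hyperplane a^T φ(x) = b to the closest training point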
(can work with "soft" margin when data is not linearly separable)
generalization error
how well will the SVM classifier work on the test set?
from learning theory (Bartlett; Rademacher-complexity bounds), the generalization error is
bounded above by a quantity proportional to
√(Tr K) / γ(K_tr)
where γ(K_tr) is the margin of the SVM classifier computed with K_tr, the training-set block of the kernel matrix K
hence, the constraints
Tr K = c,   γ(K_tr)^{-1} ≤ w
keep this bound under control
margin constraint
γ(K_tr) ≥ γ
can be expressed as a linear matrix inequality in which
• G(K_tr) is linear in K_tr
• e = vector of ones
• λ, ν are new variables
avoiding overfitting
optimizing kernels: example problem
K ⪰ 0,   Tr K = c
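a minimal CVXPY sketch of this kind of problem on made-up data; K is restricted to a linear combination of three fixed kernels (as in the experiments that follow), and kernel-target alignment is used as a simple stand-in objective in place of the margin-based criterion of the previous slides:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 6                                         # hypothetical number of data points
y = np.array([1, -1, 1, 1, -1, -1])           # hypothetical labels

def random_psd(n):
    # random PSD matrix standing in for a precomputed kernel matrix
    B = rng.standard_normal((n, n))
    return B @ B.T

K_list = [random_psd(n) for _ in range(3)]    # K1, K2, K3
c = float(n)                                  # trace budget

mu = cp.Variable(3)                           # combination weights
K = cp.Variable((n, n), PSD=True)             # learned kernel matrix, K ⪰ 0

constraints = [K == sum(mu[i] * K_list[i] for i in range(3)),
               cp.trace(K) == c]              # Tr K = c
# stand-in objective: alignment of K with the label matrix y y^T
objective = cp.Maximize(cp.sum(cp.multiply(np.outer(y, y), K)))

cp.Problem(objective, constraints).solve()
print("kernel weights mu:", mu.value)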
experimental results
                     K1       K2       K3       K*
Breast cancer (d = 2, σ = 0.5)
  margin           0.010    0.136      -      0.300
  TSE               19.7     28.8      -       11.4
Sonar (d = 2, σ = 0.1)
  margin           0.035    0.198    0.006    0.352
  TSE               15.5     19.4     21.9     13.8
Heart (d = 2, σ = 0.5)
  margin             -      0.159      -      0.285
  TSE                -       49.2      -       36.6

(TSE = test-set error; "-" = not available)
wrap-up
see also