
Data Science - Convex optimization and application

Summary
We begin with some illustrations of challenging topics in modern data science. Then this session introduces (or recalls) some basics of optimization and illustrates some key applications in supervised classification.

1 Data Science

1.1 What is data science?
Extract from data some knowledge for industrial or academic exploitation. It generally involves:

1. Signal processing (how to record the data and represent it?)
2. Modeling (what is the problem, what kind of mathematical model and answer?)
3. Statistics (how reliable are the estimation procedures?)
4. Machine learning (what kind of efficient optimization algorithm?)
5. Implementation (software needs)
6. Visualization (how can I represent the resulting knowledge?)

As a whole, this sequence of questions is at the core of Artificial Intelligence and may also be referred to as Computer Science problems. In this lecture, we will address some of the issues raised in the list above. Each time, practical examples will be provided.

Most of our motivation comes from the Big Data world, encountered in image processing, finance, genetics and many other fields where knowledge extraction is needed when facing many observations described by many variables:
● n: number of observations
● p: number of variables per observation
p >> n >> O(1).

1.2 Several examples

Spam detection From a set of labelled messages (spam or not), build a classifier for automatic spam rejection.
● Select the meaningful elements among the words?
● Automatic classification?

Gene expression profile analysis One measures micro-array datasets built from a huge number of gene expression profiles. Number of genes p (of order thousands). Number of samples n (of order hundreds).
Diagnostic help: healthy or ill?
● Select the meaningful elements among the genes?
● Automatic classification?

Credit scoring Build an indicator (a Q score) from a dataset for the probability of interest in a financial product (Visa Premier credit card).
1. Define a model, a question?
2. Use a supervised classification algorithm to rank the best clients.
3. Use logistic regression to provide a score.

Recommendation problems And more recently:
● What kind of database?
● Reliable recommendations for clients?
● Online strategy?

1.3 What about maths?

Various mathematical fields we will talk about:
● Analysis: convex optimization, approximation theory
● Statistics: penalized procedures and their reliability
● Probabilistic methods: concentration inequalities, stochastic processes, stochastic approximations
Famous keywords:
● Lasso
● Boosting
● Convex relaxation
● Supervised classification
● Support Vector Machine
● Aggregation rules
● Gradient descent
● Stochastic gradient descent
● Sequential prediction
● Bandit games, minimax policies
● Matrix completion

In this session: we will mainly deal with optimization problems that are convex in our statistical world. Non-convex problems are also very interesting, even though much more difficult to handle from a numerical point of view.

2 Standard convex optimization procedures


2.1 Convex functions

We recall some background material that is necessary for a clear understanding of how some machine learning procedures work. We will cover some basic relationships between convexity, positive semidefiniteness, and local and global minimizers.

DEFINITION 1. — [Convex sets, convex functions] A set D is convex if and only if for any (x1, x2) ∈ D² and all α ∈ [0, 1],
x = αx1 + (1 − α)x2 ∈ D.
A function f is convex if
● its domain D is convex,
● for any (x1, x2) ∈ D² and all α ∈ [0, 1], f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2).

DEFINITION 2. — [Positive semidefinite (PSD) matrix] A p × p matrix H is PSD if for all p × 1 vectors z, we have z^t H z ≥ 0.

There is a strong link between PSD matrices and convex functions, given by the following proposition.

PROPOSITION 3. — A smooth C²(D) function f is convex if and only if its Hessian D²f is PSD at every point of D.

The proof follows easily from a second-order Taylor expansion.

2.2 Examples of convex functions

● Exponential function: θ ∈ R ↦ exp(aθ), whatever a is.
● Affine function: θ ∈ R^d ↦ a^t θ + b.
● Negative entropy function: θ ∈ R+ ↦ θ log(θ).
● p-norm: θ ∈ R^d ↦ ∥θ∥_p := (∑_{i=1}^d |θ_i|^p)^{1/p}, with p ≥ 1.
● Quadratic form: θ ∈ R^d ↦ θ^t P θ + 2 q^t θ + r, where P is symmetric and positive semidefinite.

2.3 Why such an interest in convexity?

From external motivations:
● Many problems in machine learning come from the minimization of a convex criterion and provide meaningful results for the initial statistical task.
● Many optimization problems admit a convex reformulation (SVM classification or regression, LASSO regression, ridge regression, permutation recovery, . . . ).

From a numerical point of view:
● Local minimizer = global minimizer. This is a powerful point since, in general, descent methods involve ∇f(x) (or something related to it), which is only local information on f.
● x is a (local, hence global) minimizer of f if and only if 0 ∈ ∂f(x).
● Many fast algorithms for the optimization of convex functions exist, and some are independent of the dimension d of the original space.
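As a quick numerical sanity check (my own sketch, not part of the original notes), the snippet below tests the convexity inequality of Definition 1 on the p-norm of Section 2.2 (with the arbitrary choice p = 3) and the PSD Hessian criterion of Proposition 3 on a quadratic form; the random test data and variable names are illustrative choices only.

```python
# Toy check of Definition 1 and Proposition 3 -- not from the lecture notes.
import numpy as np

rng = np.random.default_rng(0)
d, p = 5, 3.0

def pnorm(theta):
    # p-norm of Section 2.2, here with p = 3
    return np.sum(np.abs(theta) ** p) ** (1.0 / p)

# Definition 1: f(a*x1 + (1-a)*x2) <= a*f(x1) + (1-a)*f(x2) for all a in [0, 1].
violations = 0
for _ in range(10_000):
    x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
    a = rng.uniform()
    lhs = pnorm(a * x1 + (1 - a) * x2)
    rhs = a * pnorm(x1) + (1 - a) * pnorm(x2)
    violations += lhs > rhs + 1e-12
print("convexity violations:", int(violations))       # expected: 0

# Proposition 3: theta -> theta^t P theta + 2 q^t theta + r has Hessian 2P,
# which is PSD whenever P = M M^t is (eigenvalues >= 0 up to rounding).
M = rng.standard_normal((d, d))
P = M @ M.T
print("smallest Hessian eigenvalue:", np.linalg.eigvalsh(2 * P).min())
```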
2.4 Why convexity is powerful

Two kinds of optimization problems:
● On the left: a non-convex optimization problem; use Travelling-Salesman-type methods with greedy exploration steps (simulated annealing, genetic algorithms).
● On the right: a convex optimization problem; use local descent methods with gradients or subgradients.

DEFINITION 4. — [Subgradient (nonsmooth functions)] For any function f: R^d → R and any x in R^d, the subdifferential ∂f(x) is the set of all vectors g (the subgradients) such that, for all y,
f(x) − f(y) ≤ ⟨g, x − y⟩.

This set of subgradients may be empty. Fortunately, this is not the case for convex functions.

PROPOSITION 5. — f: R^d → R is convex if and only if ∂f(x) ≠ ∅ for any x of R^d.

3 Gradient descent method

3.1 Projected descent

In either constrained or unconstrained problems, descent methods are powerful with convex functions. In particular, consider a constrained problem over a convex set X ⊂ R^d. The most famous local descent method relies on
y_{t+1} = x_t − η g_t, where g_t ∈ ∂f(x_t),
and
x_{t+1} = Π_X(y_{t+1}),
where η > 0 is a fixed step-size parameter and Π_X denotes the Euclidean projection onto X.

THEOREM 6. — [Convergence of the projected gradient descent method, fixed step-size] If f is convex over X with X ⊂ B(0, R) and ∥∂f∥_∞ ≤ L, the choice η = R/(L√t) leads to
f((1/t) ∑_{s=1}^t x_s) − min_X f ≤ RL/√t.
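To make the iteration above concrete, here is a minimal sketch (my own illustration, not code from the lecture) of the projected subgradient method of Theorem 6 on a toy nonsmooth convex problem: minimize f(x) = ∥Ax − b∥₁ over the Euclidean ball B(0, R). The data A, b, the radius R and the iteration count are arbitrary choices.

```python
# Projected subgradient method with fixed step size (Theorem 6) -- a sketch.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
R = 2.0                                    # constraint set X = B(0, R)
L = np.linalg.norm(A, axis=1).sum()        # bound on subgradient norms: ||A^t s|| <= sum_i ||a_i||

def f(x):
    return np.abs(A @ x - b).sum()

def subgrad(x):
    return A.T @ np.sign(A @ x - b)        # one element of the subdifferential

def project(y):
    # Euclidean projection onto B(0, R)
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

T = 20_000
eta = R / (L * np.sqrt(T))                 # fixed step size of Theorem 6
x = np.zeros(5)
x_bar = np.zeros(5)
for t in range(1, T + 1):
    x = project(x - eta * subgrad(x))      # y_{t+1} = x_t - eta g_t ; x_{t+1} = Pi_X(y_{t+1})
    x_bar += (x - x_bar) / t               # running average (1/t) sum_{s<=t} x_s

print(f(x_bar))                            # approaches min_X f, within R*L/sqrt(T)
```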
3.2 Smooth unconstrained case

Results can be seriously improved for smooth functions with bounded second derivatives.

DEFINITION 7. — f is β-smooth if ∇f is β-Lipschitz:
∥∇f(x) − ∇f(y)∥ ≤ β∥x − y∥.

Standard gradient descent over R^d becomes
x_{t+1} = x_t − η∇f(x_t).

THEOREM 8. — [Convergence of the gradient descent method, β-smooth function] If f is a convex and β-smooth function, then η = 1/β leads to
f((1/t) ∑_{s=1}^t x_s) − min f ≤ 2β∥x1 − x⋆∥² / (t − 1),
where x⋆ denotes a minimizer of f.

Remark. —
● Note that the two previous results do not depend on the dimension d of the state space.
● The last result can be extended to the constrained situation.
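As an illustration of Theorem 8 (again a sketch of mine, not part of the notes), the snippet below runs gradient descent with the step size η = 1/β on a β-smooth convex least-squares objective; the problem data, dimensions and number of iterations are arbitrary choices.

```python
# Gradient descent with step 1/beta on a smooth convex objective -- a sketch.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 8))
b = rng.standard_normal(50)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)   # smooth convex objective

def grad(x):
    return A.T @ (A @ x - b)

beta = np.linalg.eigvalsh(A.T @ A).max()    # grad f is beta-Lipschitz with beta = lambda_max(A^t A)

x = np.zeros(8)
for _ in range(500):
    x = x - grad(x) / beta                  # x_{t+1} = x_t - (1/beta) grad f(x_t)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f(x) - f(x_star))                     # optimality gap, should be tiny
```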

3.3 Constrained case

Elements of the problem:
● θ: unknown vector of R^d to be recovered
● J: R^d ↦ R: function to be minimized
● fi and gi: differentiable functions defining a set of constraints.
Definition of the problem:
● min_{θ∈R^d} J(θ) such that
● fi(θ) = 0, ∀i = 1, . . . , n and gi(θ) ≤ 0, ∀i = 1, . . . , m.
Set of admissible vectors:
Ω := {θ ∈ R^d | fi(θ) = 0, ∀i and gj(θ) ≤ 0, ∀j}
Typical situation: Ω is the disk of radius 2, with optimal solution θ⋆ = (−1, −1)^t and J(θ⋆) = −2.

Important restriction: we will restrict our study to convex functions J.

DEFINITION 9. — A constrained problem is convex iff
● J is a convex function,
● the fi are linear or affine functions and the gi are convex functions.

Example
min_θ J(θ) such that a^t θ − b = 0
● Descent direction h: ∇J(θ)^t h < 0.
● Admissible direction h: a^t(θ + h) − b = 0 ⟺ a^t h = 0.
Optimality: θ∗ is optimal if there is no admissible descent direction starting from θ∗. The only possible case is when ∇J(θ∗) and a are linearly dependent:
∃λ ∈ R, ∇J(θ∗) + λa = 0.
In this situation:
∇J(θ) = (2θ1 + θ2 − 2, θ1 + 2θ2 + 2)^t and a = (1, −1)^t.
Hence, we are looking for θ such that ∇J(θ) ∝ a. Computations lead to θ1 = −θ2. The optimal value is reached for θ1 = 1/2 (and J(θ∗) = −15/4).

3.4 Lagrangian function

min_θ J(θ) such that f(θ) := a^t θ − b = 0
We have seen the important role of the scalar value λ above.

DEFINITION 10. — [Lagrangian function]
L(λ, θ) = J(θ) + λf(θ)
λ is the Lagrange multiplier. The optimal choice of (θ∗, λ∗) corresponds to
∇_θ L(λ∗, θ∗) = 0 and ∇_λ L(λ∗, θ∗) = 0.

Argument: θ∗ is optimal if there is no admissible descent direction h. Hence, ∇J and ∇f are linearly dependent. As a consequence, there exists λ∗ such that
∇_θ L(λ∗, θ∗) = ∇J(θ∗) + λ∗ ∇f(θ∗) = 0 (dual equation).
Since θ∗ must be admissible, we have
∇_λ L(λ∗, θ∗) = f(θ∗) = 0 (primal equation).

3.5 Inequality constraint

Case of a unique inequality constraint:
min_θ J(θ) such that g(θ) ≤ 0
● Descent direction h: ∇J(θ)^t h < 0.
● Admissible direction h: ∇g(θ)^t h ≤ 0 guarantees that g(θ + αh) is decreasing in α.
Optimality: θ∗ is optimal if there is no admissible descent direction starting from θ∗. The only possible case is when ∇J(θ∗) and ∇g(θ∗) are linearly dependent and opposite:
∇J(θ∗) = −µ∇g(θ∗) for some µ ≥ 0.
We can check that θ∗ = (−1, −1).

3.5.1 Lagrangian in general settings

We consider the minimization problem:
● min_θ J(θ) such that
● gj(θ) ≤ 0, ∀j = 1, . . . , m and fi(θ) = 0, ∀i = 1, . . . , n.

DEFINITION 11. — [Lagrangian function] We associate to this problem the Lagrange multipliers (λ, µ) = (λ1, . . . , λn, µ1, . . . , µm) and the Lagrangian
L(θ, λ, µ) = J(θ) + ∑_{i=1}^n λi fi(θ) + ∑_{j=1}^m µj gj(θ)
● θ: primal variables
● (λ, µ): dual variables
3.5.2 KKT conditions

DEFINITION 12. — [KKT conditions] If J and f, g are smooth, we define the Karush-Kuhn-Tucker (KKT) conditions as
● Stationarity: ∇_θ L(θ, λ, µ) = 0.
● Primal admissibility: f(θ) = 0 and g(θ) ≤ 0.
● Dual admissibility: µj ≥ 0, ∀j = 1, . . . , m.

THEOREM 13. — A convex minimization problem of J under convex constraints f and g has a solution θ∗ if and only if there exist λ∗ and µ∗ such that the KKT conditions hold.

Example:
J(θ) = (1/2)∥θ∥₂² s.t. θ1 − 2θ2 + 2 ≤ 0.
We get L(θ, µ) = (1/2)∥θ∥₂² + µ(θ1 − 2θ2 + 2) with µ ≥ 0.
Stationarity: (θ1 + µ, θ2 − 2µ) = (0, 0), hence θ1 = −µ and θ2 = 2µ, so θ2 = −2θ1 with θ1 ≤ 0.
Since the unconstrained minimizer θ = 0 is not admissible, the constraint is active; θ1 − 2θ2 + 2 = 0 then gives µ = 2/5.
We deduce that θ∗ = (−2/5, 4/5).

3.5.3 Dual function

We introduce the dual function:
L(λ, µ) = min_θ L(θ, λ, µ).
We have the following important result.

THEOREM 14. — Denote by p∗ = min {J(θ) | f(θ) = 0, g(θ) ≤ 0} the optimal value of the constrained problem. Then
L(λ, µ) ≤ p∗.

Remark:
● The dual function L is lower than p∗ for any (λ, µ) ∈ R^n × R^m_+.
● We aim to make this lower bound as close as possible to p∗: the idea is to maximize the function L with respect to (λ, µ).

DEFINITION 15. — [Dual problem]
max_{λ∈R^n, µ∈R^m_+} L(λ, µ).

L(θ, λ, µ) is affine in (λ, µ); hence the dual function L, as a pointwise minimum of affine functions, is concave, and the dual problem is a convex, almost unconstrained problem.
● Dual problems are easier than primal ones (almost all constraints are dropped; only µ ≥ 0 remains).
● Dual problems are equivalent to primal ones: maximization of the dual ⇔ minimization of the primal (not shown in this lecture).
● Dual solutions permit to recover primal ones through the KKT conditions (Lagrange multipliers).

Example:
● Lagrangian: L(θ, µ) = (θ1² + θ2²)/2 + µ(θ1 − 2θ2 + 2).
● Dual function: L(µ) = min_θ L(θ, µ) = −(5/2)µ² + 2µ.
● Dual solution: max L(µ) such that µ ≥ 0: µ = 2/5.
● Primal solution: KKT ⟹ θ = (−µ, 2µ) = (−2/5, 4/5).
To obtain further details, see von Neumann's Minimax Theorem . . .
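The worked example above can be checked numerically. The short sketch below (my own illustration, not part of the notes) evaluates the dual function L(µ) = −(5/2)µ² + 2µ on a grid, recovers µ⋆ ≈ 2/5 and θ⋆ ≈ (−2/5, 4/5), and compares with a brute-force search over feasible points; grid sizes are arbitrary.

```python
# Numerical check of the KKT / dual example -- a sketch, not from the notes.
import numpy as np

def J(theta):
    return 0.5 * float(theta @ theta)            # J(theta) = ||theta||^2 / 2

def g(theta):
    return theta[0] - 2.0 * theta[1] + 2.0       # constraint g(theta) <= 0

# Dual function: minimizing L(theta, mu) over theta gives theta(mu) = (-mu, 2*mu),
# hence L(mu) = -(5/2) mu^2 + 2 mu, to be maximized over mu >= 0.
mus = np.linspace(0.0, 1.0, 100_001)
dual = -2.5 * mus**2 + 2.0 * mus
mu_star = mus[dual.argmax()]
theta_star = np.array([-mu_star, 2.0 * mu_star])
print(mu_star, theta_star, J(theta_star), g(theta_star))   # ~0.4, (-0.4, 0.8), 0.4, ~0

# Brute-force check: no feasible point on a fine grid does better than J(theta_star).
grid = np.linspace(-3.0, 3.0, 601)
T1, T2 = np.meshgrid(grid, grid)
values = 0.5 * (T1**2 + T2**2)
feasible = T1 - 2.0 * T2 + 2.0 <= 0.0
print(values[feasible].min())                               # ~0.4 as well
```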
3.6 Take-home message from convex optimization

● Big Data problems arise in a large variety of fields. They are complicated for a computational reason (and also for a statistical one, see later).
● Many Big Data problems can be translated into the optimization of a convex problem.
● Efficient algorithms are available to optimize them, sometimes independently of the dimension of the underlying space.
● Primal-dual formulations are important to overcome some constraints on the optimization.
● Numerical convex solvers are widely and freely distributed.

4 Applications & Homework

Length limitation: 5 pages!
Deadline: 8th of February.
Groups of 2 students allowed.

● This report should be short: strictly less than 5 pages, including the references.
● The work relies either on a widespread academic subject or on a group of selected papers. In any case, you have to highlight the relationship between the concerned chapter and the theme you selected.
For the chosen subject, the report should be organized as follows.
1. First, motivate the problem with a concrete application and propose a reasonable modeling.
2. Second, the report should explain the mathematical difficulties in solving the model and some recent developments to bypass these difficulties. You can also describe the behaviour of some algorithms.
3. Third, the report should propose either:
● numerical simulations using packages found on the web or your own experiments,
● some sketches of proofs of baseline theoretical results,
● or a discussion part that presents alternative methods (with references), exposing the pros and cons of each method.
You can choose to exploit only a subsample of the proposed references, as long as the content of your work is interesting enough. You can also complement your report with a reproducible set of simulations (use R or Matlab please) that can be inspired from existing packages. (If packages are not public, send the whole source files.) These simulations are not counted in the 4/5 pages of the report.

The report files should be named lastname.doc or lastname.pdf and are expected in my mailbox before the 8th of February.

And to do this, anything is fair game (you can do what you want and find sources everywhere, but take care to avoid plagiarism!).

4.1 Classification with NN & SVM

The supervised classification problem is a long-standing issue in statistics and machine learning, and many algorithms are available for this standard framework. After a brief introduction, a concrete example and a modeling of this statistical problem, explain the important role of the Bayes classifier and of the NN rule. Then present the geometric interpretation of the SVM classifier, the role of convexity and the maths behind it. Finally, discuss the influence of the several parameters: number of observations, dimension of the ambient space, etc. A minimal numerical starting point is sketched after the reference list below.

References:
● CRAN repository
● Journal of Statistical Software webpage
● Hastie, Tibshirani and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
● Gyorfi, Lugosi, A Probabilistic Theory of Pattern Recognition
● My website perso.math.univ-toulouse.fr/gadat/
● Wikistat wikistat.fr/
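For illustration only (the report itself should use R or Matlab as requested above), here is a minimal scikit-learn sketch of a linear SVM on synthetic two-class data; the dataset, parameters and printed quantities are my own arbitrary choices, not part of the assignment.

```python
# A tiny linear SVM illustration on synthetic data -- a sketch, not assignment code.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated Gaussian blobs as a toy two-class problem.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)     # the SVM training problem is convex (quadratic)

w = clf.coef_[0]                                # normal vector of the separating hyperplane
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_vectors_))
print("geometric margin:", 2.0 / np.linalg.norm(w))
```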
4.2 Transport problems

The optimal transportation problem is a growing field of interest in machine learning, big data and statistics. You will find below some interesting readings. Try mainly to understand the motivations, the mathematical tools, and the nature of the several applications. You can also find some software. A tiny numerical illustration is sketched after the reference list below.

References:
● CRAN repository: lpSolve and transport packages
● Gabriel Peyré's webpage (Matlab software)
● Marco Cuturi's webpage
● Nicolas Papadakis's webpage
● Benamou, Carlier, Cuturi, Nenna and Peyré, Iterative Bregman Projections for Regularized Transportation Problems
● Cuturi and Peyré, A Smoothed Dual Approach for Variational Wasserstein Problems
● Cuturi, Sinkhorn Distances: Lightspeed Computation of Optimal Transport
● Cuturi and Doucet, Fast Computation of Wasserstein Barycenters
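As a tiny illustration of the entropic-regularization approach popularized in the Sinkhorn Distances reference above, here is a minimal numpy sketch of the Sinkhorn iterations (my own toy example; the measures, the cost matrix and the regularization strength are arbitrary choices).

```python
# Sinkhorn iterations for entropy-regularized optimal transport -- a sketch.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(6)                 # support of the source measure
y = rng.standard_normal(7)                 # support of the target measure
C = (x[:, None] - y[None, :]) ** 2
C = C / C.max()                            # rescale costs to [0, 1] for numerical safety
a = np.full(6, 1.0 / 6.0)                  # uniform source histogram
b = np.full(7, 1.0 / 7.0)                  # uniform target histogram

eps = 0.1                                  # entropic regularization strength
K = np.exp(-C / eps)
u, v = np.ones(6), np.ones(7)
for _ in range(1000):                      # Sinkhorn fixed-point iterations
    u = a / (K @ v)
    v = b / (K.T @ u)

P = u[:, None] * K * v[None, :]            # regularized transport plan diag(u) K diag(v)
print(np.abs(P.sum(axis=1) - a).max())     # marginal constraints ~ 0
print(np.abs(P.sum(axis=0) - b).max())
print((P * C).sum())                       # approximate transport cost <P, C>
```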

4.3 Permutation recovery

The statistical recovery of a permutation is a perfect example of an NP-hard problem for which a non-trivial convex relaxation should be studied. You can either focus on global optimization with simulated annealing or genetic algorithms, or on convex methods for solving relaxed convex problems. The problem of permutation recovery is useful in seriation, graphs, . . . Instead of focusing on the statistical part, focus on the optimization problem, the principle of the relaxation and the potential applications. A toy example of the underlying assignment problem is sketched after the reference list below.

References:
● CRAN repository: lpSolve and transport packages
● Francis Bach's webpage (Matlab software)
● Alexandre d'Aspremont's webpage
● Fogel, Jenatton, Bach and d'Aspremont, Convex Relaxations for Permutation Problems
● Lim and Wright, Beyond the Birkhoff Polytope: Convex Relaxations for Vector Permutation Problems
● Collier and Dalalyan, Minimax Rates in Permutation Estimation for Feature Matching
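As a toy starting point (my own sketch, not taken from the references), the snippet below solves a small feature-matching instance with SciPy's assignment solver. The assignment problem is the simplest case where the LP relaxation over the Birkhoff polytope (doubly stochastic matrices) is tight, which is a useful reference point for the convex relaxations discussed in the papers above; the data and noise level are arbitrary.

```python
# Exact permutation recovery via the assignment problem -- a sketch.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 8
X = rng.standard_normal((n, 2))                     # n reference features
perm = rng.permutation(n)                           # hidden permutation to recover
Y = X[perm] + 0.01 * rng.standard_normal((n, 2))    # permuted, slightly noisy copy

# Cost matrix C[i, j] = squared distance between Y_i and X_j. The assignment
# problem min_P <C, P> over permutation matrices has the same value as its LP
# relaxation over the Birkhoff polytope, so it can be solved exactly.
C = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
row, col = linear_sum_assignment(C)                 # Hungarian algorithm

print(np.array_equal(col, perm))                    # expected: True (exact recovery for small noise)
```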
