Sparse Coding and Dictionary Learning for Image Analysis
Part: Optimization for Sparse Coding
Francis Bach, Julien Mairal, Jean Ponce and Guillermo Sapiro
What is a Sparse Linear Model?
Let x in R^m be a signal, and let D = [d_1, . . . , d_p] in R^{m×p} be a dictionary whose columns (atoms) are usually normalized to unit ℓ2-norm.
min_{α∈R^p}  (1/2)||x − Dα||_2^2  +  λψ(α),

where the first term is the data-fitting term and ψ is the sparsity-inducing regularization.
Finding your way in the sparse coding literature. . .
. . . is not easy. The literature is vast, redundant, sometimes
confusing, and many papers claim victory. . .
1 Greedy Algorithms
Matching Pursuit
1: α ← 0
2: r ← x (residual)
3: while ||α||_0 < L do
4:   Select the atom with maximum correlation with the residual:
       ı̂ ← arg max_{i=1,...,p} |d_i^T r|
5:   Update the coefficient and the residual:
       α[ı̂] ← α[ı̂] + d_ı̂^T r
       r ← r − (d_ı̂^T r) d_ı̂
6: end while
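A minimal MATLAB sketch of this loop, assuming the columns of D have unit ℓ2-norm (illustrative code, not the SPAMS implementation):

function alpha = matching_pursuit(x, D, L)
% Matching Pursuit: greedy decomposition of x on a dictionary D with
% unit-norm columns; stops when alpha has L nonzero coefficients.
p = size(D, 2);
alpha = zeros(p, 1);
r = x;                                % residual
while nnz(alpha) < L
    c = D' * r;                       % correlations with the residual
    [~, i] = max(abs(c));             % atom most correlated with r
    if abs(c(i)) < 1e-12, break; end  % residual is (numerically) zero
    alpha(i) = alpha(i) + c(i);       % coefficient update
    r = r - c(i) * D(:, i);           % residual update
end
end

The while-loop mirrors the pseudocode above; with MP the same atom may be selected at several iterations, each time correcting its coefficient.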
Matching Pursuit: 2D illustration
[Figure sequence: the signal x is decomposed on atoms d1, d2, d3. Starting from α = (0, 0, 0), the atom most correlated with the residual r is selected (here d3), the inner product ⟨r, d3⟩ is added to its coefficient, and the residual becomes r − ⟨r, d3⟩ d3, giving α = (0, 0, 0.75). The next iterations select d2 and then d3 again, giving α = (0, 0.24, 0.75) and finally α = (0, 0.24, 0.65): with MP the same atom can be re-selected and its coefficient corrected.]
Orthogonal Matching Pursuit
1: Γ = ∅
2: for iter = 1, . . . , L do
3:   Select the atom which most reduces the objective:
       ı̂ ← arg min_{i ∈ Γ^C}  min_{α′} ||x − D_{Γ∪{i}} α′||_2^2
4:   Update the active set: Γ ← Γ ∪ {ı̂}
5:   Update the coefficients: α_Γ ← (D_Γ^T D_Γ)^{-1} D_Γ^T x
6:   Update the residual: r ← x − D_Γ α_Γ
7: end for
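A straightforward (unoptimized) MATLAB sketch of OMP; it uses the usual practical selection rule based on correlations with the current residual rather than literally solving the inner least-squares problem for every candidate atom, and re-fits the active set by least squares at each iteration (variable names are illustrative):

function alpha = omp_naive(x, D, L)
% Orthogonal Matching Pursuit: grow an active set Gamma greedily and
% refit x on the selected atoms by least squares at every iteration.
p = size(D, 2);
alpha = zeros(p, 1);
Gamma = [];
r = x;                                % residual, orthogonal to D(:,Gamma)
for iter = 1:L
    c = D' * r;
    c(Gamma) = 0;                     % an atom cannot be selected twice
    [~, i] = max(abs(c));             % selection by maximum correlation
    Gamma = [Gamma, i];
    alphaG = D(:, Gamma) \ x;         % least-squares fit on the active set
    r = x - D(:, Gamma) * alphaG;     % new residual (orthogonal projection)
end
alpha(Gamma) = alphaG;
end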
Orthogonal Matching Pursuit: 2D illustration
[Figure sequence: starting from Γ = ∅ and α = (0, 0, 0), the first selection gives Γ = {3} and α = (0, 0, 0.75); the second gives Γ = {3, 2} and α = (0, 0.29, 0.63). At each step x is projected orthogonally onto span{d_i : i ∈ Γ} (projections π in the figure), so the residual is orthogonal to all previously selected atoms and the coefficients of the already-selected atoms are re-adjusted.]
Orthogonal Matching Pursuit
Contrary to MP, an atom can be selected only once with OMP. It is,
however, more difficult to implement efficiently. The keys to a good
implementation in the case of a large number of signals are (see the sketch below):
Precompute the Gram matrix G = D^T D once and for all,
Maintain the computation of D^T r for each signal,
Maintain a Cholesky decomposition of (D_Γ^T D_Γ)^{-1} for each signal.
The total complexity for decomposing n L-sparse signals of size m with a
dictionary of size p then follows from these three points.
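A simplified batch sketch of the first two points (the Gram matrix and D^T X are computed once; D^T r is maintained without ever forming r). For brevity the Cholesky update of (D_Γ^T D_Γ)^{-1} is replaced by a direct solve, so this only illustrates the bookkeeping, not the fully optimized implementation:

function A = omp_batch(X, D, L)
% Decompose every column of X on D (unit-norm atoms), L nonzeros per signal.
p = size(D, 2);
n = size(X, 2);
G   = D' * D;                          % Gram matrix, computed once for all signals
DtX = D' * X;                          % D'*x for every signal, computed once
A   = zeros(p, n);
for j = 1:n
    Gamma = [];
    c = DtX(:, j);                     % current D'*r (initially r = x)
    for iter = 1:L
        cc = c; cc(Gamma) = 0;
        [~, i] = max(abs(cc));
        Gamma = [Gamma, i];
        aG = G(Gamma, Gamma) \ DtX(Gamma, j);   % coefficients on the active set
        c  = DtX(:, j) - G(:, Gamma) * aG;      % maintain D'*r = D'*x - G_Gamma * alpha_Gamma
    end
    A(Gamma, j) = aG;
end
end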
Example with the software SPAMS
Software available at http://www.di.ens.fr/willow/SPAMS/
>> I=double(imread('data/lena.png'))/255;
>> % extract all 8x8 patches of I
>> X=im2col(I,[8 8],'sliding');
>> % load a dictionary of size 64 x 256
>> % (assuming the .mat file stores the dictionary in a variable named D)
>> D=load('dict.mat'); D=D.D;
>> % set the sparsity parameter L to 10
>> param.L=10;
>> alpha=mexOMP(X,D,param);
Why does the ℓ1-norm induce sparsity?
Analysis of the norms in 1D
The quadratic penalty ψ(α) = α² has derivative ψ′(α) = 2α, which vanishes at α = 0: the penalty exerts no force at zero, so the solution is shrunk but generally not exactly zero.
The absolute value ψ(α) = |α| has one-sided derivatives ψ′_−(0) = −1 and ψ′_+(0) = +1: the penalty keeps a constant pull of magnitude 1 toward zero, which can pin the solution at exactly α = 0.
Why does the ℓ1-norm induce sparsity?
Example: quadratic problem in 1D

min_{α∈R} (1/2)(x − α)² + λ|α|

Right derivative at 0: g_+ = −x + λ; left derivative at 0: g_− = −x − λ.
If |x| ≤ λ, then g_+ ≥ 0 and g_− ≤ 0, so α⋆ = 0; otherwise the minimizer is x shrunk toward zero by λ. In both cases,

α⋆ = sign(x)(|x| − λ)_+   (soft-thresholding).
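In MATLAB, this scalar solution (the soft-thresholding operator) is a one-liner that can be applied componentwise; a quick check of the two regimes:

soft = @(x, lambda) sign(x) .* max(abs(x) - lambda, 0);
soft(0.1, 0.15)    % |x| <= lambda : returns 0
soft(0.5, 0.15)    % |x| >  lambda : returns 0.35, i.e. x shrunk by lambda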
Why does the ℓ1-norm induce sparsity?
Physical illustration

A spring of stiffness k1 and rest position x0 holds a point at position x with potential energy E1 = (k1/2)(x0 − x)².
If a second spring of stiffness k2 pulls the point toward 0, the added energy is quadratic, E2 = (k2/2)x²: at equilibrium, x is reduced but remains nonzero.
If instead a weight of mass m pulls the point down, the added energy is linear, E2 = mgx, and when the weight is heavy enough the equilibrium sits exactly at x = 0. A quadratic penalty only shrinks the solution, while a linear (ℓ1-like) penalty can set it exactly to zero.
Why does the ℓ1-norm induce sparsity?
The geometric explanation
[Figure: the ℓ1 ball has corners on the coordinate axes, unlike the ℓ2 ball, so the level sets of the data-fitting term tend to touch it at sparse points.]
Optimality conditions of the Lasso
Nonsmooth optimization: the ℓ1-norm is not differentiable at points with zero entries, so the classical optimality condition ∇f(α⋆) = 0 cannot be used directly; directional derivatives give optimality conditions for such convex nonsmooth problems.
Optimality conditions of the Lasso
Directional derivatives: ∇f(α, u) = lim_{t→0+} [f(α + tu) − f(α)] / t; for a convex function f, α⋆ is a minimizer if and only if ∇f(α⋆, u) ≥ 0 for all u in R^p.
Optimality conditions of the Lasso
min_{α∈R^p} (1/2)||x − Dα||_2^2 + λ||α||_1

α⋆ is optimal if and only if, for all u in R^p, ∇f(α⋆, u) ≥ 0, that is,

−u^T D^T(x − Dα⋆) + λ Σ_{i: α⋆[i]≠0} sign(α⋆[i]) u[i] + λ Σ_{i: α⋆[i]=0} |u[i]| ≥ 0.
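This condition can be probed numerically: for a candidate solution, the left-hand side must be nonnegative for every direction u. A small MATLAB sketch (it assumes x, D, alphastar and lambda are already defined; sampling random directions gives a necessary check, not a proof of optimality):

r  = x - D * alphastar;
on = (alphastar ~= 0);                  % support of alphastar
for t = 1:1000
    u   = randn(size(alphastar));       % random direction
    val = -u' * (D' * r) ...
          + lambda * (sign(alphastar(on))' * u(on)) ...
          + lambda * sum(abs(u(~on)));
    assert(val >= -1e-8);               % directional derivative must be >= 0
end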
Homotopy
A homotopy method provides a set of solutions indexed by a
parameter.
The regularization path (λ, α⋆(λ)), for instance!
It can be useful when the path has some “nice” properties
(piecewise linear, piecewise quadratic).
LARS [Efron et al., 2004] starts from a trivial solution and follows
the regularization path of the Lasso, which is piecewise linear.
Homotopy, LARS
[Osborne et al., 2000], [Efron et al., 2004]
∀i = 1, . . . , p:   |d_i^T(x − Dα⋆)| ≤ λ            if α⋆[i] = 0,
                     d_i^T(x − Dα⋆) = λ sign(α⋆[i])   if α⋆[i] ≠ 0.      (1)

The regularization path is piecewise linear:

D_Γ^T(x − D_Γ α⋆_Γ) = λ sign(α⋆_Γ)
α⋆_Γ(λ) = (D_Γ^T D_Γ)^{-1}(D_Γ^T x − λ sign(α⋆_Γ)) = A + λB
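As a small illustration of the last formula, the vectors A and B (so that α⋆_Γ(λ) = A + λB) can be computed directly in MATLAB from the current active set and signs (assuming x, D and a current solution alpha are in the workspace; the affine expression is valid only until the active set or one of the signs changes):

Gamma = find(alpha);                    % current active set
s     = sign(alpha(Gamma));             % current signs
DG    = D(:, Gamma);
A     = (DG' * DG) \ (DG' * x);         % (D_G' D_G)^{-1} D_G' x
B     = -((DG' * DG) \ s);              % -(D_G' D_G)^{-1} sign(alpha_Gamma)
alpha_of_lambda = @(lambda) A + lambda * B;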
Example with the software SPAMS
http://www.di.ens.fr/willow/SPAMS/
>> I=double(imread('data/lena.png'))/255;
>> % extract all 8x8 patches of I and normalize them
>> X=normalize(im2col(I,[8 8],'sliding'));
>> % load a dictionary of size 64 x 256
>> % (assuming the .mat file stores the dictionary in a variable named D)
>> D=load('dict.mat'); D=D.D;
>> % set the regularization parameter lambda to 0.15
>> param.lambda=0.15;
>> alpha=mexLasso(X,D,param);
Coordinate descent for the Lasso: each coordinate-wise update reduces to a soft-thresholding step, as implemented by mexCD below.
Example with the software SPAMS
http://www.di.ens.fr/willow/SPAMS/
>> I=double(imread('data/lena.png'))/255;
>> % extract all 8x8 patches of I and normalize them
>> X=normalize(im2col(I,[8 8],'sliding'));
>> % load a dictionary of size 64 x 256
>> % (assuming the .mat file stores the dictionary in a variable named D)
>> D=load('dict.mat'); D=D.D;
>> % set the regularization parameter lambda to 0.15
>> param.lambda=0.15;
>> param.tol=1e-2;
>> param.itermax=200;
>> alpha=mexCD(X,D,param);
First-order/proximal methods
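A minimal sketch of the basic proximal-gradient iteration for the Lasso (ISTA, as in Daubechies et al. [2004]; FISTA of Beck and Teboulle [2009] is the accelerated variant): each iteration takes a gradient step on the quadratic term followed by a soft-thresholding step, with step size 1/L_f where L_f is the largest eigenvalue of D^T D.

function alpha = ista(x, D, lambda, niter)
% ISTA for min_a 0.5*||x - D*a||_2^2 + lambda*||a||_1
Lf    = norm(D)^2;                       % Lipschitz constant of the gradient
alpha = zeros(size(D, 2), 1);
soft  = @(z, t) sign(z) .* max(abs(z) - t, 0);
for k = 1:niter
    g     = D' * (D * alpha - x);        % gradient of the data-fitting term
    alpha = soft(alpha - g / Lf, lambda / Lf);   % proximal step = soft-thresholding
end
end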
Summary of this part
Greedy methods directly address the NP-hard ℓ0-decomposition problem.
ℓ1 can be used as a convex relaxation of ℓ0.
Homotopy methods can be extremely efficient for small or medium-sized problems, or when the solution is very sparse.
Coordinate descent usually provides a solution of small/medium precision quickly, but slows down when the dictionary atoms are strongly correlated.
First-order methods are very attractive in the large-scale setting.
Other good alternatives exist: active-set methods, reweighted-ℓ2 methods, stochastic variants, variants of OMP, . . .
References I
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
D. P. Bertsekas. Nonlinear programming. Athena Scientific Belmont, Mass, 1999.
J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical
optimization: theoretical and practical aspects. Springer-Verlag New York Inc,
2006.
J. M. Borwein and A. S. Lewis. Convex analysis and nonlinear optimization: Theory
and examples. Springer, 2006.
S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for
linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math, 57:
1413–1457, 2004.
I. Daubechies, R. DeVore, M. Fornasier, and S. Gunturk. Iteratively re-weighted least
squares minimization for sparse recovery. Commun. Pure Appl. Math, 2009.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of
statistics, 32(2):407–499, 2004.
References II
J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate
optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
W. J. Fu. Penalized regressions: The bridge versus the Lasso. Journal of
computational and graphical statistics, 7:397–416, 1998.
S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE
Transactions on Signal Processing, 41(12):3397–3415, 1993.
H. M. Markowitz. The optimization of a quadratic function subject to linear
constraints. Naval Research Logistics Quarterly, 3:111–133, 1956.
Y. Nesterov. Gradient methods for minimizing composite objective function.
Technical report, CORE, 2007.
Y. Nesterov. A method for solving a convex programming problem with convergence
rate O(1/k 2 ). Soviet Math. Dokl., 27:372–376, 1983.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 2nd
edition, 2006.
M. R. Osborne, B. Presnell, and B. A. Turlach. On the Lasso and its dual. Journal of
Computational and Graphical Statistics, 9(2):319–37, 2000.
References III
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of
solutions and efficient algorithms. In Proceedings of the International Conference
on Machine Learning (ICML), 2008.
S. Weisberg. Applied Linear Regression. Wiley, New York, 1980.