Sparse Bayesian Learning - Analysis and Applications
Huanhuan Chen
1 / 57
Outline
1 Introduction
3 Experimental Analysis
5 Conclusion
2 / 57
Outline
1 Introduction
3 Experimental Analysis
5 Conclusion
3 / 57
What is Bayesian Inference?
Bayesian inference: a method of inference that uses Bayes' rule to combine the likelihood with our prior belief distribution, together with proper model selection.
$$P(\mathbf{w} \mid D) = \frac{P(D \mid \mathbf{w})\, P(\mathbf{w})}{P(D)}$$
5 / 57
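To make the rule concrete, here is a minimal numerical sketch (a toy coin-flip example assumed purely for illustration, not taken from the slides): the posterior is the likelihood times the prior, normalised by the evidence P(D).

```python
# Bayes' rule on a discrete grid: posterior ∝ likelihood × prior, normalised by P(D).
import numpy as np

w_grid = np.array([0.2, 0.5, 0.8])          # three candidate parameter values
prior = np.array([1/3, 1/3, 1/3])           # P(w): uniform belief before seeing data
# Data D: 7 successes out of 10 Bernoulli trials, so P(D|w) ∝ w^7 (1-w)^3
likelihood = w_grid**7 * (1 - w_grid)**3
evidence = np.sum(likelihood * prior)       # P(D) = Σ_w P(D|w) P(w)
posterior = likelihood * prior / evidence   # P(w|D)
print(posterior)                            # most of the mass moves to w = 0.8 and w = 0.5
```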
What is a sparse model?
$$\underbrace{\begin{pmatrix} f_1 \\ f_2 \\ \cdots \\ f_N \end{pmatrix}}_{f}
= \underbrace{\begin{pmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} & \cdots & x_{2p} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
x_{N1} & x_{N2} & x_{N3} & x_{N4} & x_{N5} & \cdots & x_{Np}
\end{pmatrix}}_{X}
\cdot \underbrace{\begin{pmatrix} w_1 \\ 0 \\ 0 \\ w_4 \\ 0 \\ \cdots \\ w_p \end{pmatrix}}_{w}$$
6 / 57
How to generate sparsity in sparse Bayesian learning?
A sparseness-generating prior encourages sparsity:
[Figure: a sparsity-inducing prior P(w), sharply peaked at w = 0, plotted over w ∈ [−1, 1].]
7 / 57
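One standard way such a prior arises, assumed here as an illustration in the spirit of Tipping's relevance vector machine (not a detail given on this slide): give each weight a zero-mean Gaussian prior with its own precision α and place a Gamma hyperprior on α; marginalising α out yields a Student-t density that is sharply peaked at zero with heavy tails.

```python
# Sketch of a hierarchical sparsity-generating prior (assumed Tipping-style hierarchy):
# w | alpha ~ N(0, 1/alpha), alpha ~ Gamma(a, rate b).  Integrating alpha out gives a
# Student-t marginal p(w) with 2a degrees of freedom and scale sqrt(b/a).
import numpy as np
from scipy.stats import t

a, b = 1e-2, 1e-2                                   # assumed (broad) Gamma hyperparameters
marginal = lambda w: t.pdf(w, df=2 * a, loc=0.0, scale=np.sqrt(b / a))

for w in [0.0, 0.2, 0.5, 1.0]:
    print(w, marginal(w))                           # density drops quickly away from w = 0
```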
A Regression Example: Parametric Bayesian Solution
Given a training set $D = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^p$, $y_n \in \mathbb{R}$.
Likelihood: the training mean square error (MSE), assuming zero-mean Gaussian noise
$$P(D \mid \mathbf{w}) = (2\pi\sigma^2)^{-N/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}\bigl(f(\mathbf{x}_n;\mathbf{w}) - y_n\bigr)^2\right),$$
Prior: the regularization term
$$P(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{p} \mathcal{N}(w_i \mid 0, \alpha_i^{-1}),$$
Posterior: the optimized weight vector
$$\max_{\mathbf{w}} \log P(\mathbf{w} \mid D) \;\propto\; \min_{\mathbf{w}} \sum_{n=1}^{N}\bigl(f(\mathbf{x}_n;\mathbf{w}) - y_n\bigr)^2 + \sum_{i=1}^{p} \alpha_i w_i^2$$
Sparse Learning
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} R(\mathbf{w}) + \lambda\, g(\mathbf{w})$$
$R(\mathbf{w})$ is the likelihood (cost) function, e.g. MSE, cross entropy, etc.
$$\arg\max_{\mathbf{w}} \log P(\mathbf{w} \mid D) \;\propto\; \min_{\mathbf{w}} R(\mathbf{w}) + \sum_{n=1}^{N} \alpha_n g(w_n)$$
The parameter $\alpha_n$ is equivalent to the trade-off parameter $\lambda$.
9 / 57
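A minimal sketch of the MAP objective above for a linear-in-parameters model f(x; w) = Φw; the data, the fixed precisions α and the noise variance σ² below are all assumptions made for illustration. With α and σ² fixed, the minimiser has a closed form.

```python
# MAP estimate for a linear-Gaussian model: minimise the squared error plus the
# weighted penalty sum_i alpha_i * w_i^2 implied by the Gaussian prior.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 10
Phi = rng.normal(size=(N, p))                                # design matrix (basis functions)
y = Phi[:, 0] - 2.0 * Phi[:, 3] + 0.1 * rng.normal(size=N)   # only two relevant weights

alpha = np.full(p, 1.0)                                      # per-weight precisions
sigma2 = 0.01                                                # assumed noise variance

# Closed-form MAP weights: w = (Phi^T Phi + sigma^2 * diag(alpha))^{-1} Phi^T y
A = np.diag(alpha)
w_map = np.linalg.solve(Phi.T @ Phi + sigma2 * A, Phi.T @ y)
print(np.round(w_map, 2))      # entries 0 and 3 are recovered (≈ 1 and -2); the rest stay near zero
```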
The benefits of Bayesian Inference
10 / 57
Ignore the normalization term, or not?
$$P(\mathbf{w} \mid D) = \frac{P(D \mid \mathbf{w})\, P(\mathbf{w} \mid \boldsymbol{\alpha})}{P(D \mid \boldsymbol{\alpha})}$$
11 / 57
How to automatically select the model in SBL?
$$P(\boldsymbol{\alpha} \mid D) \propto P(D \mid \boldsymbol{\alpha}).$$
12 / 57
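A hedged sketch of evidence-based model selection for the linear-Gaussian case: for y = Φw + ε with w ∼ N(0, diag(α)⁻¹) and noise variance σ², the log evidence is log P(D | α) = log N(y | 0, σ²I + Φ diag(α)⁻¹Φᵀ) and can be compared across candidate settings of α. The data and the two candidate α vectors below are assumptions for illustration only.

```python
import numpy as np

def log_evidence(Phi, y, alpha, sigma2):
    """log P(D | alpha) = log N(y | 0, sigma^2 I + Phi diag(1/alpha) Phi^T)."""
    N = len(y)
    C = sigma2 * np.eye(N) + Phi @ np.diag(1.0 / alpha) @ Phi.T   # marginal covariance of y
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(1)
Phi = rng.normal(size=(40, 5))
y = 2.0 * Phi[:, 0] + 0.1 * rng.normal(size=40)       # only the first basis function matters

dense = np.full(5, 1.0)                               # keep every weight
sparse = np.array([1.0, 1e6, 1e6, 1e6, 1e6])          # prune all but the first weight
print(log_evidence(Phi, y, dense, 0.01),
      log_evidence(Phi, y, sparse, 0.01))             # the pruned (sparse) setting wins
```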
Iteratively optimize posterior and evidence in SBL
13 / 57
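The sketch below illustrates one common form of this loop, namely the MacKay/Tipping re-estimation rules for a linear-Gaussian model; this particular update is an assumption for illustration, since the slide does not commit to one. It alternates between the weight posterior given the current α and the evidence-driven update of α.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 60, 8
Phi = rng.normal(size=(N, p))                        # basis/design matrix
y = 1.5 * Phi[:, 2] + 0.05 * rng.normal(size=N)      # only basis 2 is relevant
sigma2 = 0.05**2                                     # assumed known noise variance
alpha = np.ones(p)                                   # initial precisions

for _ in range(20):
    # Posterior of w given the current alpha: N(mu, Sigma)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ y / sigma2
    # Evidence-driven re-estimation: alpha_i <- gamma_i / mu_i^2,
    # where gamma_i = 1 - alpha_i * Sigma_ii measures how well-determined weight i is
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha = gamma / mu**2

print(np.round(mu, 2))       # weight 2 stays near 1.5 while the others shrink towards zero
```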
What are the critical problems in SBL?
14 / 57
Rethinking two questions in SBL
15 / 57
Outline
1 Introduction
3 Experimental Analysis
5 Conclusion
16 / 57
Support Vector Machines: Margin Maximisation
Formulation
$$f(\mathbf{x}; \mathbf{w}) = \operatorname{sign}\!\left(\sum_{n=1}^{N} y_n w_n K(\mathbf{x}, \mathbf{x}_n) + b\right)$$
18 / 57
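A minimal sketch of this decision function with an assumed Gaussian kernel; the dual coefficients w_n and the bias b are assumed to come from an already-trained SVM.

```python
import numpy as np

def gaussian_kernel(x, xn, sigma=0.5):
    # K(x, x_n) = exp(-||x - x_n||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xn) ** 2) / (2 * sigma**2))

def svm_predict(x, X_train, y_train, w, b, sigma=0.5):
    # f(x) = sign( sum_n y_n w_n K(x, x_n) + b )
    s = sum(yn * wn * gaussian_kernel(x, xn, sigma)
            for xn, yn, wn in zip(X_train, y_train, w))
    return np.sign(s + b)

X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0])
w = np.array([0.7, 0.7])                                    # assumed trained dual coefficients
print(svm_predict(np.array([0.9, 0.8]), X, y, w, b=0.0))    # -> 1.0
```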
Analysis of Support Vector Machines
Advantages
Good generalization
Disadvantages
19 / 57
Relevance Vector Machine
A Bayesian treatment of a generalized linear model
$$f(\mathbf{x}; \mathbf{w}) = \sigma\!\left(\sum_{n=1}^{N} w_n \phi_n(\mathbf{x}) + b\right),$$
where $\sigma(\cdot)$ is the sigmoid function for probabilistic outputs.
RVM is a Bayesian linear model with a sparse prior on the weights $\mathbf{w}$:
$$p(w_n \mid \alpha_n) = \mathcal{N}(w_n \mid 0, \alpha_n^{-1}).$$
[Figure: the prior P(w) on a single weight, plotted over w ∈ [−1, 1].]
20 / 57
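A minimal sketch of the RVM output: a sigmoid over a sparse kernel expansion, returning P(y = +1 | x) instead of a hard label. The relevance vectors, their weights and the kernel width are assumed inputs from a trained model.

```python
import numpy as np

def rvm_probability(x, X_rel, w_rel, b, sigma=0.5):
    """X_rel, w_rel: the few retained 'relevance vectors' and their nonzero weights."""
    phi = np.exp(-np.sum((X_rel - x) ** 2, axis=1) / (2 * sigma**2))   # Gaussian basis values
    return 1.0 / (1.0 + np.exp(-(phi @ w_rel + b)))                    # sigmoid -> probability

X_rel = np.array([[0.0, 0.0], [1.0, 0.0]])
w_rel = np.array([2.0, -1.5])                       # assumed trained (sparse) weights
print(rvm_probability(np.array([0.1, 0.0]), X_rel, w_rel, b=0.0))   # ≈ 0.84, i.e. P(y=+1|x)
```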
Analysis of RVM
Advantages
probabilistic output
Disadvantages
21 / 57
Some Discussions on Voting and Learning
[Figure: 3-D surface of Kw + b over the two-dimensional input region.]
23 / 57
Unstable RVM with respect to kernel parameter (Gaussian kernel)
[Figure: decision boundaries and numbers of retained vectors for RVM, SVM and PCVM with a Gaussian kernel. Panel titles: RVM (σ = 0.5, 7 vectors, error 9.9%), SVM (σ = 0.5, C = 10, 94 vectors, error 9.4%), PCVM (σ = 0.5, 5 vectors, error 9.4%); RVM (σ = 0.3, 243 vectors, error 12.6%), SVM (σ = 0.3, C = 10, 98 vectors, error 9.7%), PCVM (σ = 0.3, 4 vectors, error 8.5%); RVM (σ = 0.5, 16 vectors, error 11.63%), SVM (σ = 0.5, C = 10, 104 vectors, error 11.43%), PCVM (σ = 0.5, 15 vectors, error 11.55%).]
25 / 57
Theoretical Analysis (Chen09)
26 / 57
Discussions
27 / 57
Probabilistic Classification Vector Machines
$$y(\mathbf{x}; \mathbf{w}) = \sigma\!\left(\sum_{n=1}^{N} y_n w_n \phi_n(\mathbf{x}) + b\right),$$
Left-truncated Gaussian prior on $w_n$, enforcing non-negative $w_n$:
$$p(w_n \mid \alpha_n) = \begin{cases} 2\,\mathcal{N}(w_n \mid 0, \alpha_n^{-1}) & \text{if } w_n \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$
Hyper-parameters: one precision $\alpha_n$ per weight.
28 / 57
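A small sketch that evaluates the left-truncated Gaussian prior defined above; the density is doubled on w ≥ 0 so that it still integrates to one.

```python
import numpy as np

def truncated_gaussian_prior(w, alpha):
    # p(w | alpha) = 2 N(w | 0, 1/alpha) for w >= 0, and 0 otherwise
    density = 2.0 * np.sqrt(alpha / (2 * np.pi)) * np.exp(-0.5 * alpha * w**2)
    return np.where(w >= 0, density, 0.0)

w = np.linspace(-1, 3, 5)
print(truncated_gaussian_prior(w, alpha=1.0))   # zero for negative w, mass folded onto w >= 0
```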
Truncated Prior for PCVM
[Figure: the truncated prior p(w_i | α_i), plotted for w_i ∈ [0, 6]; the density is zero for negative w_i.]
29 / 57
PCVM Formulation
Non-negative prior
$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \mathcal{N}(w_0 \mid 0, \alpha_0^{-1}) \prod_{i=1}^{N} 2\,\mathcal{N}(w_i \mid 0, \alpha_i^{-1}) \cdot \delta(w_i),$$
where $\delta(\cdot)$ is the indicator function $\mathbb{1}_{x \ge 0}(x)$.
Bernoulli likelihood
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{i=1}^{N} \sigma_i^{t_i}\, [1 - \sigma_i]^{1 - t_i},$$
where $\sigma_i = \sigma\!\left(\sum_{n=0}^{N} w_n \phi_n(\mathbf{x}_i)\right)$, $\mathbf{t} = (t_1, t_2, \cdots, t_N)^{T}$ is the vector of targets, and $t_i = \frac{y_i + 1}{2} \in \{0, 1\}$ is the probabilistic target.
30 / 57
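A minimal sketch of the Bernoulli (log-)likelihood above; Φ is an assumed design matrix whose first column is the constant bias basis φ₀.

```python
import numpy as np

def log_likelihood(w, Phi, y):
    # p(t | w) = prod_i sigma_i^{t_i} (1 - sigma_i)^{1 - t_i}, with t_i = (y_i + 1)/2
    t = (y + 1) / 2                              # map labels {-1, +1} -> {0, 1}
    s = 1.0 / (1.0 + np.exp(-(Phi @ w)))         # sigma_i = sigma(sum_n w_n phi_n(x_i))
    eps = 1e-12                                  # guard against log(0)
    return np.sum(t * np.log(s + eps) + (1 - t) * np.log(1 - s + eps))
```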
Derivations
31 / 57
Solutions
Hidden variables
Integral Approximation
32 / 57
Case Study using Laplace Approximation
33 / 57
Posterior of weight vector
$$\mathbf{w}_{\mathrm{MAP}} = A^{-1}\, \Phi^{T}(\mathbf{t} - \boldsymbol{\sigma}) + \mathbf{k},$$
$$\Sigma_{\mathrm{MAP}} = \left(\Phi^{T} B \Phi + A + D\right)^{-1},$$
where $\sigma_i = \sigma\!\left(\sum_{n=0}^{N} y_n w_n \phi_n(\mathbf{x}_i)\right)$,
$A = \mathrm{diag}(\alpha_0, \alpha_1, \cdots, \alpha_N)$,
$D = \mathrm{diag}(0, d_1, \cdots, d_N) = \mathrm{diag}\bigl(0,\ \sigma(\beta w_1)(1 - \sigma(\beta w_1))\beta^2,\ \cdots,\ \sigma(\beta w_N)(1 - \sigma(\beta w_N))\beta^2\bigr)$, and
$\mathbf{k} = \bigl[0,\ \beta(1 - \sigma(\beta w_1)),\ \cdots,\ \beta(1 - \sigma(\beta w_N))\bigr]^{T}$ is the $(N+1)$-vector that keeps the weights $w_i$ non-negative.
34 / 57
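A hedged sketch that transcribes the Σ_MAP expression above into code. B is assumed to be the usual diagonal logistic-regression weighting matrix diag(σ_i(1 − σ_i)), since the slide does not define it, and Φ is an assumed N × (M+1) design matrix with the bias column first.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_posterior_cov(Phi, w, alpha, beta):
    """Sigma_MAP = (Phi^T B Phi + A + D)^{-1} at the current weight vector w.

    Phi   : N x (M+1) design matrix (bias column first), entries y_n * phi_n(x_i) assumed
    alpha : length M+1 precisions, A = diag(alpha)
    B     : diag(sigma_i (1 - sigma_i)), assumed logistic curvature term
    D     : diag(0, d_1, ..., d_M) with d_i = sigma(beta w_i)(1 - sigma(beta w_i)) beta^2
    """
    s = sigmoid(Phi @ w)
    B = np.diag(s * (1.0 - s))
    A = np.diag(alpha)
    sw = sigmoid(beta * w)
    d = sw * (1.0 - sw) * beta**2
    d[0] = 0.0                                   # no non-negativity barrier on the bias weight
    return np.linalg.inv(Phi.T @ B @ Phi + A + np.diag(d))
```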
Efficient PCVM by sequentially maximizing the model evidence
35 / 57
Outline
1 Introduction
3 Experimental Analysis
5 Conclusion
36 / 57
MCMC vs. Laplace Approximation
[Figure: two panels plotting the second component against the first component, comparing MCMC samples with the Laplace approximation of the posterior.]
37 / 57
MCMC, EP and Laplace Approximation
[Figure: generalization error versus CPU time (seconds, log scale) for MCMC, EP and the Laplace approximation on two data sets.]
38 / 57
MCMC, EP and Laplace Approximation
39 / 57
Synthetic Data Sets
[Figure: results of the compared classifiers on two synthetic data sets (three panels per data set).]
40 / 57
Synthetic Data Sets
[Figure: further results on the same two synthetic data sets (three panels per data set).]
41 / 57
Setup for Benchmark Tests
42 / 57
Summary of Benchmark Data Sets
43 / 57
Benchmark Results
[Figure: performance ranking of PCVM, SMLR, SVM, RVM, kNN, QDA and LDA across the benchmark data sets (one panel for error rate, one for AUC); PCVM sits above the other methods in both panels.]
PCVM achieves the best performance in terms of both error rate and AUC.
44 / 57
Scalability
[Figure: left panel, CPU time (s) versus the number of training points (1,000 to 10,000); right panel, error rate versus the number of training points, for fast PCVM, SVMlight, SMLR and RVM.]
Figure: Comparison of the CPU time and error rate of fast PCVM, SVM, SMLR and RVM on the Adult data set.
45 / 57
Analysis
RVM and SMLR do not scale well as the number of training points increases.
46 / 57
Outline
1 Introduction
3 Experimental Analysis
5 Conclusion
47 / 57
Rademacher Complexity Bound
Rademacher complexity measures the richness of a class of real-valued functions.
With probability at least $1 - \delta$,
$$P\bigl(y f(\mathbf{x}, q) < 0\bigr) \;\le\; R_{\mathrm{emp}}[f, D] + \frac{2}{s}\sqrt{\frac{2\,\tilde{g}(q)}{N}} + \sqrt{\frac{\ln \log_r\!\frac{\tilde{g}(q)}{g} + \ln\frac{1}{\delta}}{2N}},$$
where $R_{\mathrm{emp}}$ is the empirical loss,
$$R_{\mathrm{emp}}[f, D] = \frac{1}{N}\sum_{n=1}^{N} \ell_s\bigl(y_n f(\mathbf{x}_n, q)\bigr) \quad \text{and} \quad \tilde{g}(q) = r \cdot \max\bigl(\mathrm{KL}(q \,\|\, p),\, g\bigr),$$
where KL(q ||p) is the Kullback-Leibler divergence from the
posterior q to the prior p over parameters w .
48 / 57
KL Divergence between Prior and Posterior
49 / 57
Kullback-Leibler Divergence Between Prior and Posterior
$$\mathrm{KL}(q \,\|\, p) = \sum_{i,\, w_i \neq 0} \left[\frac{w_i^2\, \hat{\alpha}_i}{2} - \ln \operatorname{erfc}\!\left(-w_i \sqrt{\frac{\hat{\alpha}_i}{2}}\right)\right],
$$
50 / 57
KL divergence between Truncated Posterior and Gaussian prior
[Figure: 3-D surface of the KL divergence between the truncated posterior and the Gaussian prior as a function of w_i and α̂_i.]
51 / 57
Sparseness and the Bound
$$\tilde{g}(q) = r \cdot \max\bigl(\mathrm{KL}(q \,\|\, p),\, g\bigr),$$
52 / 57
Outline
1 Introduction
3 Experimental Analysis
5 Conclusion
53 / 57
Conclusion
54 / 57
For Further Reading 1
55 / 57
For Further Reading 2
56 / 57
Demo and Thank you!
57 / 57