
Sparse Bayesian Learning: Analysis and Applications

Huanhuan Chen

School of Computer Science and Technology


University of Science & Technology of China

VALSE, Nanjing, 2013

1 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

2 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

3 / 57
What is Bayesian Inference?
Bayesian inference: a method of inference that uses Bayes' rule to
combine the likelihood and our belief (prior) distributions, with
proper model selection.

P(w|D) = P(D|w) P(w) / P(D)

w: the weight vector of the model, e.g. the weights of a neural
network. D is the observed data set.

prior P(w): the probability of w before the data D are observed.
This can encode expert knowledge or a preference about the model,
e.g. sparseness.

likelihood P(D|w): the probability of observing the data D given w.

posterior P(w|D): the probability of w after D is observed.

P(D): the marginal likelihood or model evidence. It is crucial for
model selection in Bayesian inference.
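As a concrete illustration (not part of the original slides), the sketch below evaluates Bayes' rule numerically on a grid for a single weight w, with a Gaussian prior and Gaussian likelihood; the distributions and "data" are made up purely for illustration.

```python
# A toy numerical illustration of Bayes' rule for a single weight w.
# The prior, likelihood and data below are hypothetical, chosen only to
# show how P(w|D) is proportional to P(D|w) P(w) and how P(D) normalizes it.
import numpy as np
from scipy.stats import norm

w_grid = np.linspace(-3.0, 3.0, 601)          # candidate values of the weight w
dw = w_grid[1] - w_grid[0]
prior = norm.pdf(w_grid, loc=0.0, scale=1.0)  # P(w): belief before seeing data

D = np.array([0.9, 1.1, 1.4])                 # assumed observations y_n = w + noise
noise_std = 0.5
# P(D|w): product of Gaussian likelihood terms, evaluated for every w on the grid
likelihood = np.prod(norm.pdf(D[:, None], loc=w_grid[None, :], scale=noise_std), axis=0)

unnormalized = likelihood * prior
evidence = np.sum(unnormalized) * dw          # P(D) = integral of P(D|w) P(w) dw
posterior = unnormalized / evidence           # P(w|D)

print("posterior mean of w:", np.sum(w_grid * posterior) * dw)
```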
4 / 57
Parametric or Nonparametric Bayesian

Parametric Bayesian model: prior on parameters, with a fixed or
bounded number of parameters.

Prior on parameters: a sparseness-generating prior → sparse model.
Examples: Bayesian neural networks, the Relevance Vector Machine,
the Probabilistic Classification Vector Machine (PCVM), etc.

Nonparametric Bayesian model: ∞-dimensional parameter space.

Prior on functions → very flexible models.
Not sparse and computationally intensive: training O(N³),
testing O(N²).
Examples: Gaussian Processes, Dirichlet Processes, etc.

This talk focuses on parametric/sparse Bayesian models.

5 / 57
What is sparse model?

In the estimated model f(X; w) = Xw, if many of the weights are zero
(i.e. wi = 0), the obtained model is referred to as a sparse model.

$$
\underbrace{\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{pmatrix}}_{f}
=
\underbrace{\begin{pmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} & \cdots & x_{2p} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
x_{N1} & x_{N2} & x_{N3} & x_{N4} & x_{N5} & \cdots & x_{Np}
\end{pmatrix}}_{X}
\cdot
\underbrace{\begin{pmatrix} 0 \\ 0 \\ 0 \\ w_4 \\ 0 \\ \vdots \\ w_p \end{pmatrix}}_{w}
$$

Sparsity → variable selection → model interpretability.

Sparsity → regularization → less overfitting & better prediction.
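A small sketch (illustrative only, with made-up numbers) of how a sparse weight vector turns prediction into a sum over a few selected columns of X:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 8
X = rng.normal(size=(N, p))      # design matrix (rows: samples, columns: variables)

w = np.zeros(p)                  # sparse weight vector: most entries are exactly zero
w[3], w[7] = 1.5, -0.8           # only two variables are selected

f = X @ w                        # f(X; w) = X w
# Equivalently, only the non-zero weights contribute:
idx = np.flatnonzero(w)
f_sparse = X[:, idx] @ w[idx]
assert np.allclose(f, f_sparse)  # identical predictions, cheaper and interpretable
```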

6 / 57
How to generate sparsity in sparse Bayesian learning?
A sparseness-generating prior encourages sparseness:

P(w) has its highest probability at w = 0.

The higher P(w) is at 0, the sparser the solution.

Examples: Gaussian prior; Gaussian prior with a Gamma hyperprior
(giving P(wi) ∝ 1/|wi|); Laplace prior; ...
[Figure: probability density functions of the Gaussian and Laplace priors,
both peaked at w = 0.]
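To make the peak-at-zero comparison concrete, the snippet below (an illustration, not from the slides) evaluates unit-variance Gaussian and Laplace densities at the origin and in the tail:

```python
import numpy as np
from scipy.stats import norm, laplace

# Unit-variance versions of both priors: Gaussian std 1, Laplace scale b with 2*b**2 = 1.
gauss = norm(loc=0.0, scale=1.0)
lap = laplace(loc=0.0, scale=1.0 / np.sqrt(2.0))

for w in (0.0, 1.0, 3.0):
    print(f"w = {w:3.1f}   Gaussian pdf = {gauss.pdf(w):.4f}   Laplace pdf = {lap.pdf(w):.4f}")
# The Laplace prior has a sharper peak at w = 0 and heavier far tails,
# which is why it encourages exact zeros more strongly than the Gaussian prior.
```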

7 / 57
A Regression Example: Parametric Bayesian Solution
Given a training set D = {(x_n, y_n)}_{n=1}^N, x_n ∈ R^p, y_n ∈ R.

Likelihood: training mean square error (MSE), assuming zero-mean
Gaussian noise:

$$
P(D \mid \mathbf{w}) = (2\pi\sigma^2)^{-N/2} \exp\!\Big( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big( f(\mathbf{x}_n; \mathbf{w}) - y_n \big)^2 \Big),
$$

Prior: regularization term:

$$
P(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{p} \mathcal{N}(w_i \mid 0, \alpha_i^{-1}),
$$

Posterior: the optimized weight vector:

$$
\max_{\mathbf{w}} \log P(\mathbf{w} \mid D) \;\propto\; \min_{\mathbf{w}} \sum_{n=1}^{N} \big( f(\mathbf{x}_n; \mathbf{w}) - y_n \big)^2 + \sum_{i=1}^{p} \alpha_i w_i^2
$$

Maximization of the posterior in Bayesian inference is equivalent to
regularized regression, with the prior as the regularization term.
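A minimal check of this equivalence (illustrative, assuming a linear model f(x; w) = xᵀw and a single shared precision α, which reduces the MAP problem to ridge regression):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 5
X = rng.normal(size=(N, p))
w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

sigma2 = 0.1 ** 2            # noise variance assumed in the Gaussian likelihood
alpha = 2.0                  # shared prior precision: p(w_i) = N(w_i | 0, 1/alpha)

# MAP estimate: maximize log P(D|w) + log P(w)
#   <=>  minimize sum_n (x_n^T w - y_n)^2 + sigma2 * alpha * ||w||^2
# (the slide absorbs sigma^2 into the alpha_i).
lam = sigma2 * alpha
w_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# This is exactly the closed form of ridge regression with penalty lam,
# so the Gaussian prior plays the role of an L2 regularization term.
print(np.round(w_map, 3))
```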
8 / 57
Relationship between Sparse Learning and Bayesian Inference

Sparse Learning

ŵ = arg min_w R(w) + λ g(w)

R(w): likelihood (cost) function, e.g. MSE, cross entropy, etc.

g(w): (prior) sparse regularization, e.g. l0, l1 (lasso).

The parameter λ needs to be tuned by cross validation.

Sparse Bayesian Learning (SBL)

arg max_w log P(w|D) ∝ min_w R(w) + Σ_{n=1}^N α_n g(w_n)

The parameters α_n play the role of the trade-off parameter λ.
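For instance, scikit-learn exposes both routes: LassoCV tunes the single trade-off λ by cross-validation, while ARDRegression (a sparse Bayesian / automatic relevance determination model) learns a separate α_n per weight by evidence maximization. A brief sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LassoCV, ARDRegression

rng = np.random.default_rng(2)
N, p = 200, 20
X = rng.normal(size=(N, p))
w_true = np.zeros(p)
w_true[[0, 5, 12]] = [2.0, -1.5, 0.7]          # only 3 informative features
y = X @ w_true + 0.3 * rng.normal(size=N)

lasso = LassoCV(cv=5).fit(X, y)                # lambda chosen by cross-validation
ard = ARDRegression().fit(X, y)                # per-weight alphas by evidence maximization

print("lasso non-zeros:", np.flatnonzero(np.abs(lasso.coef_) > 1e-6))
print("ARD   non-zeros:", np.flatnonzero(np.abs(ard.coef_) > 1e-6))
```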

9 / 57
The benefits of Bayesian Inference

Automatic model selection, i.e. regularization parameters αi,
(potential) kernel parameters, feature selection, ..., by maximizing
the model evidence P(D|α).

Expert knowledge or preferences about the model can be easily
incorporated through the prior distribution.

Probabilistic outputs with confidence intervals (covariance matrix).

10 / 57
Ignore the normalization term, or not?

P(w|D) = P(D|w) P(w|α) / P(D|α)

To simplify calculation, the normalization term
P(D|α) = ∫ P(D|w) P(w|α) dw is often ignored.

In fact, P(D|α) is crucial for automatic model selection, i.e. for
automatically choosing the best αn.

11 / 57
How to automatically select model in SBL?

For the best hyperparameters α after observing the data D, we need to
maximize the posterior P(α|D):

P(α|D) = P(D|α) P(α) / P(D)

If a uniform prior P(α) is adopted, then

P(α|D) ∝ P(D|α).

Maximizing the evidence P(D|α) therefore maximizes the posterior
P(α|D) of the hyperparameters.

12 / 57
Iteratively optimize posterior and evidence in SBL

1 Initialization: choose an initial value for the hyperparameters α.

2 Posterior maximization: update the optimal weight vector w by
maximizing the posterior of the weights P(w|D) with the previous α.

3 Evidence maximization: update the hyperparameters α by maximizing
the evidence P(D|α).

4 Loop steps (2) and (3) until convergence (see the sketch below).
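A compact sketch of this loop for the linear-Gaussian regression case, using the classical evidence-maximization updates in the style of Tipping01. The data, the fixed noise precision and the stopping rule are illustrative; a classification model would additionally need a Laplace or EP step, as discussed later.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 10
Phi = rng.normal(size=(N, p))                    # design / basis matrix
w_true = np.zeros(p)
w_true[[1, 4]] = [1.0, -2.0]
t = Phi @ w_true + 0.1 * rng.normal(size=N)

alpha = np.ones(p)                               # step 1: initial hyperparameters
beta = 100.0                                     # noise precision 1/sigma^2 (kept fixed here)

for _ in range(100):
    # step 2: posterior of w given the current alpha (Gaussian, closed form)
    A = np.diag(alpha)
    Sigma = np.linalg.inv(beta * Phi.T @ Phi + A)
    mu = beta * Sigma @ Phi.T @ t
    # step 3: evidence-maximization update of alpha (alpha_i = gamma_i / mu_i^2)
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha_new = np.minimum(gamma / (mu ** 2 + 1e-12), 1e12)  # huge alpha => weight pruned
    if np.max(np.abs(np.log(alpha_new) - np.log(alpha))) < 1e-6:
        break
    alpha = alpha_new

print("posterior mean weights:", np.round(mu, 3))
```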

13 / 57
What are the critical problems in SBL?

Choose proper prior and likelihood distributions for specific
problems.

Effective optimization approaches to maximize the posterior of the
parameters and the model evidence: gradient-based approaches,
coordinate descent, etc.

The posterior and the evidence P(D|α) = ∫ P(D|w) P(w|α) dw are
important but often intractable if the prior or likelihood is not
Gaussian!

Hidden-variable solutions: Expectation Maximization. Pros: simple
derivations and implementations; cons: sensitive to initialization,
local minima.

Integral approximation techniques for analytical solutions: Laplace
approximation, Variational Bayes, Expectation Propagation (EP).

14 / 57
Rethinking of two questions in SBL

Is the Gaussian prior appropriate for all problems?

Bayesian methods are most powerful when your prior adequately
captures your beliefs. An improper prior yields unreasonable
inferences.

The Gaussian prior has been used for several decades. Is it proper
for classification?

Does more sparsity mean better solutions?

More sparsity: a simpler model, but it might lack the freedom to
approximate the feature-label mapping.

15 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

16 / 57
Support Vector Machines: Margin Maximisation

The largest margin

SVM maximises the margin between the two classes and thereby tries to
minimise the generalisation error.

The training points that are nearest to the separating function are
called support vectors. The model is immune to the removal of any
non-support-vector data points.
17 / 57
Support Vector Machines

Formulation

SVM makes predictions based on the function:

$$
f(\mathbf{x}; \mathbf{w}) = \operatorname{sign}\Big( \sum_{n=1}^{N} y_n w_n K(\mathbf{x}, \mathbf{x}_n) + b \Big)
$$

x_n are the training examples.
K(x, x_n) is the kernel function.
y_n ∈ {−1, +1} is the label of x_n.
N is the total number of training examples.
w_n is a non-negative Lagrange multiplier: w_n is either zero or
positive.
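A sketch of this decision rule with a Gaussian (RBF) kernel; the multipliers w_n below are made-up placeholder values rather than the output of an actual SVM solver:

```python
import numpy as np

def rbf_kernel(x, z, sigma=0.5):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svm_predict(x, X_train, y_train, w, b, sigma=0.5):
    """f(x) = sign( sum_n y_n w_n K(x, x_n) + b ); the w_n >= 0 are the multipliers."""
    s = sum(y_n * w_n * rbf_kernel(x, x_n, sigma)
            for x_n, y_n, w_n in zip(X_train, y_train, w) if w_n > 0)
    return np.sign(s + b)

# Tiny hypothetical example: two support vectors, everything else has w_n = 0.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
y_train = np.array([-1, +1, +1])
w = np.array([0.7, 0.7, 0.0])      # non-negative Lagrange multipliers (illustrative values)
b = 0.0
print(svm_predict(np.array([0.2, 0.1]), X_train, y_train, w, b))   # -> -1.0
```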

18 / 57
Analysis of Support Vector Machines

Advantages

Good generalization

Sparse solution: some weights wn are 0.

Disadvantages

Non-probabilistic, hard binary decisions.

The kernel parameters and the parameter C (controlling the error
tolerance) need to be tuned by cross validation: time consuming.

19 / 57
Relevance Vector Machine
A Bayesian treatment of a generalized linear model

$$
f(\mathbf{x}; \mathbf{w}) = \sigma\Big( \sum_{n=1}^{N} w_n \phi_n(\mathbf{x}) + b \Big),
$$

where σ(·) is the sigmoid function, giving probabilistic outputs.
RVM is a Bayesian linear model with a sparse prior on the weights w:

$$
p(w_n \mid \alpha_n) = \mathcal{N}(w_n \mid 0, \alpha_n^{-1}),
$$

where αn is the inverse variance of the Gaussian.

[Figure: probability density function of the Gaussian prior, peaked at w = 0.]
20 / 57
Analysis of RVM

Advantages

probabilistic output

sparser than SVM

Disadvantages

Some training points that belong to the positive class (yn = +1) may
receive negative weights, and vice versa. The decision of the RVM is
then based on some untrustworthy vectors, and it is therefore
sensitive to the kernel parameter (even with well-selected kernel
parameters).

21 / 57
Some Discussions on Voting and Learning

In kernel methods, every point has an impact on the decision
boundary.

In SVM, every point either votes for the decision region of its own
class (according to its label) or does not vote at all.

In RVM, every point can vote either for or against the decision
region of its own class.

Voting for or/and Against?

"Any voting system permits some expression of disapproval, but these
are necessarily confused with expressions of choice or approval,
leading some to conclude that separating these expressions is best."
(Wikipedia)

Is this the same in machine learning?
22 / 57


Illustration

[Figure: 3-D illustration of the decision surface Kw + b over the input plane.]
23 / 57
Unstable RVM with respect to kernel parameter (Gaussian kernel)
[Figure: decision boundaries on a synthetic data set.
Top row (σ = 0.5): RVM, 7 vectors, error 9.9%; SVM (C = 10), 94 vectors, error 9.4%; PCVM, 5 vectors, error 9.4%.
Bottom row (σ = 0.3): RVM, 243 vectors, error 12.6%; SVM (C = 10), 98 vectors, error 9.7%; PCVM, 4 vectors, error 8.5%.]

The used vectors (wi ≠ 0) whose weights have opposite signs are shown
circled.

More redundant vectors with small weights of opposite sign may lead
to unstable solutions.
24 / 57
Unstable RVM with respect to kernel parameter

[Figure: decision boundaries on a second synthetic data set.
Top row (σ = 0.5): RVM, 16 vectors, error 11.63%; SVM (C = 10), 104 vectors, error 11.43%; PCVM, 15 vectors, error 11.55%.
Bottom row (σ = 0.7): RVM, 355 vectors, error 12.47%; SVM (C = 10), 101 vectors, error 11.04%; PCVM, 13 vectors, error 11.63%.]
25 / 57
Theoretical Analysis (Chen09)

Maximum a posteriori (MAP) analysis: PCVM with truncated priors has a
higher posterior than models with Gaussian priors.

Margin analysis: PCVM with truncated priors has a larger margin than
models with Gaussian priors, especially with localized basis
functions.

26 / 57
Discussions

SVM always assigns positive/negative weights to positive/negative
points. This principle is implemented in SVM by enforcing the
Lagrange multipliers to be non-negative.

How can we combine the advantages of RVM and SVM and discard the
unstable characteristics?

27 / 57
Probabilistic Classication Vector Machines

Combine the advantages of SVM and RVM:

$$
y(\mathbf{x}; \mathbf{w}) = \sigma\Big( \sum_{n=1}^{N} y_n w_n \phi_n(\mathbf{x}) + b \Big),
$$

Left-truncated Gaussian prior on wn, enforcing non-negative wn:

$$
p(w_n \mid \alpha_n) =
\begin{cases}
2\,\mathcal{N}(w_n \mid 0, \alpha_n^{-1}) & \text{if } w_n \ge 0 \\
0 & \text{otherwise}
\end{cases}
$$

Hyper-parameters: the precisions αn.
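A small sketch of this prior. For w_n ≥ 0 it is exactly twice the zero-mean Gaussian density, i.e. the half-normal distribution, so scipy's halfnorm can evaluate or sample it (illustrative code, not the PCVM implementation):

```python
import numpy as np
from scipy.stats import norm, halfnorm

alpha_n = 4.0                      # prior precision; variance of the parent Gaussian is 1/alpha_n
scale = 1.0 / np.sqrt(alpha_n)

w = np.linspace(0.0, 2.0, 5)
p_truncated = 2.0 * norm.pdf(w, loc=0.0, scale=scale)   # 2 N(w | 0, alpha_n^{-1}) for w >= 0
p_halfnorm = halfnorm.pdf(w, loc=0.0, scale=scale)      # same density, via the half-normal
assert np.allclose(p_truncated, p_halfnorm)

samples = halfnorm.rvs(scale=scale, size=5, random_state=0)   # non-negative draws of w_n
print(samples)
```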

28 / 57
Truncated Prior for PCVM

[Figure: probability density p(wi|αi) of the left-truncated Gaussian prior,
supported on wi ≥ 0.]

The mean of the truncated Gaussian prior is larger than zero.

It is less sparse than the RVM with a (hierarchical) Student-t prior.

Question: does more sparseness mean better generalization?

29 / 57
PCVM Formulation
Non-negative Prior

$$
p(\mathbf{w} \mid \boldsymbol{\alpha}) = \mathcal{N}(w_0 \mid 0, \alpha_0^{-1}) \prod_{i=1}^{N} 2\,\mathcal{N}(w_i \mid 0, \alpha_i^{-1}) \cdot \delta(w_i),
$$

where δ(·) is the indicator function 1_{x ≥ 0}(x).

Bernoulli Likelihood

$$
p(\mathbf{t} \mid \mathbf{w}) = \prod_{i=1}^{N} \sigma_i^{t_i} \, [1 - \sigma_i]^{1 - t_i},
$$

where σ_i = σ( Σ_{n=0}^{N} w_n φ_n(x_i) ), t = (t_1, t_2, ..., t_N)^T is the
vector of targets, and t_i = (y_i + 1)/2 ∈ {0, 1} is the probabilistic
target.
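A short sketch of evaluating this Bernoulli log-likelihood for a given weight vector. The basis matrix Φ and the weights below are made up, and the class labels are mapped to t_i = (y_i + 1)/2 as above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_log_likelihood(w, Phi, y):
    """log p(t|w) = sum_i [ t_i log(sigma_i) + (1 - t_i) log(1 - sigma_i) ]."""
    t = (y + 1) / 2.0                      # map labels {-1, +1} to targets {0, 1}
    sig = sigmoid(Phi @ w)                 # sigma_i = sigma( sum_n w_n phi_n(x_i) )
    eps = 1e-12                            # guard against log(0)
    return np.sum(t * np.log(sig + eps) + (1 - t) * np.log(1 - sig + eps))

rng = np.random.default_rng(4)
Phi = rng.normal(size=(6, 3))              # hypothetical basis matrix (6 points, 3 basis functions)
w = np.array([0.5, 0.0, -1.0])
y = np.array([+1, -1, +1, +1, -1, -1])
print(bernoulli_log_likelihood(w, Phi, y))
```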

30 / 57
Derivations

According to Bayes' theorem, the posterior is:

p(w|t) = p(t|w) p(w|α) / p(t|α).

The integrals needed to calculate the posterior p(w|t) and the model
evidence p(t|α) = ∫ p(t|w) p(w|α) dw are intractable because of the
truncated prior.

31 / 57
Solutions

Hidden variables

Expectation Maximization (EM): simple derivations, simultaneously
optimizes kernel parameters, but sensitive to initialization and may
converge to local minima (Chen09).

Integral Approximation

Laplace approximation: deterministic and fast, and the performance is
acceptable (verified by MCMC) (Chen13).

Expectation Propagation (EP): accurate but slow (Chen13).

Markov chain Monte Carlo (MCMC): the most accurate but very slow
(Chen13).

32 / 57
Case Study using Laplace Approximation

The most probable w, i.e. the posterior mode, can be obtained by
maximizing the following log-posterior:

$$
\begin{aligned}
Q &= \log \{ p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol{\alpha}) \} - \log p(\mathbf{t} \mid \boldsymbol{\alpha}) \\
  &= \sum_{i=1}^{N} \big[ t_i \log \sigma_i + (1 - t_i) \log(1 - \sigma_i) \big] - \frac{1}{2} \sum_{i=0}^{N} \alpha_i w_i^2 \\
  &\quad + \sum_{i=1}^{N} \log \delta(w_i) - \mathrm{const}.
\end{aligned}
$$

33 / 57
Posterior of weight vector

Analyzing the first and second gradients of the above equation, we
obtain the optimal values

$$
\mathbf{w}_{\mathrm{MAP}} = A^{-1} \big( \Phi^T (\mathbf{t} - \boldsymbol{\sigma}) + \mathbf{k} \big), \qquad
\Sigma_{\mathrm{MAP}} = \big( \Phi^T B \Phi + A + D \big)^{-1},
$$

where σ_i = σ( Σ_{n=0}^{N} y_n w_n φ_n(x_i) ),

A = diag(α_0, α_1, ..., α_N),

D = diag(0, d_1, ..., d_N)
  = diag(0, σ(βw_1)(1 − σ(βw_1))β², ..., σ(βw_N)(1 − σ(βw_N))β²),

k = [0, β(1 − σ(βw_1)), ..., β(1 − σ(βw_N))]^T is the (N + 1)-vector
that ensures the weights w_i remain non-negative.
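For intuition, here is a hedged sketch of the generic Laplace step for Bayesian logistic regression with a plain Gaussian prior: Newton iterations find w_MAP, and the inverse of the negative Hessian at w_MAP gives the Gaussian covariance. It deliberately omits the truncation terms D and k that the PCVM derivation above adds, and the data are synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approximation(Phi, t, alpha, n_iter=50):
    """Gaussian approximation N(w_MAP, Sigma) to the posterior of a logistic model
    with prior w_i ~ N(0, alpha_i^{-1}). Generic sketch, no truncated prior."""
    N, M = Phi.shape
    A = np.diag(alpha)
    w = np.zeros(M)
    for _ in range(n_iter):
        s = sigmoid(Phi @ w)
        grad = Phi.T @ (t - s) - A @ w                 # gradient of the log-posterior
        B = np.diag(s * (1.0 - s))
        H = Phi.T @ B @ Phi + A                        # negative Hessian of the log-posterior
        step = np.linalg.solve(H, grad)
        w = w + step                                   # Newton update towards w_MAP
        if np.linalg.norm(step) < 1e-8:
            break
    Sigma = np.linalg.inv(H)                           # Laplace covariance around w_MAP
    return w, Sigma

# Tiny illustrative run with random basis outputs and targets t_i in {0, 1}.
rng = np.random.default_rng(5)
Phi = rng.normal(size=(40, 4))
t = (rng.uniform(size=40) < sigmoid(Phi @ np.array([1.0, -2.0, 0.0, 0.5]))).astype(float)
w_map, Sigma = laplace_approximation(Phi, t, alpha=np.ones(4))
print(np.round(w_map, 3))
```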

34 / 57
Efficient PCVM by sequentially maximizing the model evidence

The model evidence L(α) = P(t|α) can be written as

L(α) = L(α_{−i}) + l(α_i),

where

L(α_{−i}): the model evidence with basis function φ_i deleted.

l(α_i): the contribution of α_i to the evidence when φ_i is included.

Analyzing each l(α_i) → sequentially maximize the model evidence →
incremental PCVM.

35 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

36 / 57
MCMC vs. Laplace Approximation

[Figure: PCA projections of the posterior over combination weights on the
Synth (m) and Heart (n) data sets, showing MCMC (Metropolis-Hastings)
sampling points, the Laplace mean, and the Gaussian ellipse contour given
by the Laplace approximation.]

Figure: The posteriors of the combination weights calculated by MCMC
(40,000 sampling points) and by the Laplace approximation.

37 / 57
MCMC, EP and Laplace Approximation

[Figure: generalization error versus CPU time of the posterior mean for the
Laplace approximation, EP, and HMC on the (a) Synth and (b) Heart data
sets.]

Figure: Comparison of the Laplace approximation, expectation propagation
and hybrid Monte Carlo (2,000,000 sampling points) in terms of
generalization error and CPU time.

38 / 57
MCMC, EP and Laplace Approximation

Table: Comparisons of MCMC, EP and Laplace approximation on four data sets.

          Cancer                              Diabetics
Methods   error  AUC    #vec  CPU time        error  AUC    #vec  CPU time
MCMC      26.61  71.94  12    669.1 s         23.17  82.86  23    764.1 s
EP        26.65  72.53   9      3.2 s         23.18  82.89  17    357.2 s
Laplace   26.71  72.03  16      0.2 s         23.11  83.12  22      1.1 s

          Heart                               Thyroid
Methods   error  AUC    #vec  CPU time        error  AUC    #vec  CPU time
MCMC      16.37  90.67  16    707.4 s          4.94  98.71  22    913.1 s
EP        16.65  90.91  13    254.7 s          5.16  98.63  10     61.2 s
Laplace   16.65  90.83  15      0.3 s          5.02  98.87  21      0.2 s

39 / 57
Synthetic Data Sets

[Figure: decision boundaries of SVM, RVM and PCVM on two synthetic data sets.
(a) Spiral: SVM   (b) Spiral: RVM   (c) Spiral: PCVM
(d) Bumpy: SVM    (e) Bumpy: RVM    (f) Bumpy: PCVM]

40 / 57
Synthetic Data Sets

[Figure: decision boundaries of SVM, RVM and PCVM on two further synthetic
data sets.
(g) Relevance: SVM   (h) Relevance: RVM   (i) Relevance: PCVM
(j) Overlap: SVM     (k) Overlap: RVM     (l) Overlap: PCVM]

PCVM can also handle predominantly linear data.

41 / 57
Setup for Benchmark Tests

Compared algorithms: PCVM, SVM, relevance vector machine (RVM) and
sparse multinomial logistic regression (SMLR).

Baseline algorithms: linear/quadratic discriminant analysis
(LDA/QDA) and k-Nearest Neighbor (kNN).

Parameter optimization by cross validation, including the kernel
parameters of SVM, RVM, EPCVM and SMLR.

SMLR stands for sparse multinomial logistic regression
(Krishnapuram05: "Sparse Multinomial Logistic Regression: Fast
Algorithms and Generalization Bounds", IEEE TPAMI, 27(6), 2005).

42 / 57
Summary of Benchmark Data Sets

Data       No. Train  No. Test  Positive %  Negative %  Dim
Abalone    2089       2088      50.18%      49.82%       8
Banana     2650       2650      44.83%      55.17%       2
Cancer      132        131      29.28%      70.72%       9
Diabetics   384        384      34.90%      65.10%       8
German      500        500      30.00%      70.00%      20
Heart       135        135      44.44%      55.56%      13
Image      1043       1043      56.95%      43.05%      18
Ringnorm   3700       3700      49.51%      50.49%      20
Splice     1496       1495      44.93%      55.07%      60
Thyroid     108        107      30.23%      69.77%       5
Titanic    1101       1100      58.33%      41.67%       3
Twonorm    3700       3700      50.04%      49.96%      20
Waveform   2500       2500      32.94%      67.06%      21

43 / 57
Benchmark Results

[Figure: normalized (m) error rate and (n) AUC versus sparsity degree for
PCVM, SMLR, SVM, RVM, kNN, QDA and LDA across the 13 benchmark data sets.]

x-axis: sparsity degree, i.e. the % of data points used in prediction.

y-axis: normalized performance across the 13 data sets.

PCVM is less sparse than RVM.

PCVM achieves the best performance in both error rate and AUC.

44 / 57
Scalability

[Figure: (o) CPU time and (p) error rate versus the number of training
points on the Adult data set for fast PCVM, SVMlight, SMLR and RVM.]

Figure: Comparison of the CPU time and the error rate of fast PCVM, SVM,
SMLR and RVM on the Adult data set.

45 / 57
Analysis

PCVM scales well with the number of training points without
compromising performance.

RVM and SMLR do not scale well as the number of data points grows.

SVMlight is the fastest algorithm, as it is optimised with the
sequential minimal optimization (SMO) algorithm and its optimizations
for large problems have already been implemented.

46 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

47 / 57
Rademacher Complexity Bound
Rademacher complexity measures the richness of a class of real-valued
functions.

(Meir03) Consider arbitrary scalars g > 0, r > 0. Then, for
δ ∈ (0, 1), with probability at least (1 − δ) over draws of training
sets, the following bound holds:

$$
P\big( y f(\mathbf{x}, q) < 0 \big) \;\le\; R_{\mathrm{emp}}[f, D]
+ \frac{2}{s} \sqrt{\frac{2\,\tilde{g}(q)}{N}}
+ \sqrt{\frac{\ln \log_r\!\big( \tilde{g}(q)/g \big) + \tfrac{1}{2} \ln \tfrac{1}{\delta}}{N}},
$$

where R_emp is the empirical loss,

$$
R_{\mathrm{emp}}[f, D] = \frac{1}{N} \sum_{n=1}^{N} \ell_s\big( y_n f(\mathbf{x}_n, q) \big),
\qquad
\tilde{g}(q) = r \cdot \max\big( \mathrm{KL}(q \,\|\, p),\, g \big),
$$

where KL(q‖p) is the Kullback-Leibler divergence from the posterior q
to the prior p over the parameters w.
48 / 57
KL Divergence between Prior and Posterior

The KL divergence is a non-symmetric measure of the difference
between two probability distributions.

The bound is related to R_emp[f, D] and KL(q‖p). Given the same
R_emp[f, D], the bound is tighter for smaller KL(q‖p).

The KL divergence from the normalized truncated posterior p(w|t) to
the truncated Gaussian prior p(w|α) is

$$
\mathrm{KL}(q \,\|\, p) = \frac{1}{A_0} \int_{0}^{\infty} \tilde{p}(\mathbf{w} \mid \mathbf{t}) \ln \frac{\tilde{p}(\mathbf{w} \mid \mathbf{t})}{p(\mathbf{w} \mid \boldsymbol{\alpha})} \, d\mathbf{w} - \ln A_0,
$$

where p̃ stands for the un-normalized posterior/prior, and A_0 is the
cumulative probability of the posterior p(w|t) over the non-negative
weights.

49 / 57
Kullback-Leibler Divergence Between Prior and Posterior

Adopting an independence assumption on the weight vector, we obtain

$$
\mathrm{KL}(q \,\|\, p) = \sum_{i,\, w_i \neq 0}
\left\{
\frac{1}{2}\left[ \frac{\alpha_i}{\hat{\alpha}_i} - 1 + \ln \frac{\hat{\alpha}_i}{\alpha_i} + \alpha_i w_i^2 \right]
+ \frac{(2\pi \hat{\alpha}_i)^{-1/2} (\alpha_i + \hat{\alpha}_i)\, w_i}{\mathrm{erfcx}\!\left( -w_i \sqrt{\hat{\alpha}_i / 2} \right)}
- \ln \mathrm{erfc}\!\left( -w_i \sqrt{\hat{\alpha}_i / 2} \right)
\right\},
$$

where erfcx(a) = e^{a²} erfc(a).


αi are the initial hyperparameters.

α̂i are the optimised hyperparameters.
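The per-weight terms can be evaluated directly with scipy's erfc/erfcx. The sketch below is a literal transcription of the expression as reconstructed above (so it inherits any error in that reconstruction) and uses arbitrary example values for w_i, α_i and α̂_i:

```python
import numpy as np
from scipy.special import erfc, erfcx

def kl_term(w_i, alpha_i, alpha_hat_i):
    """One summand of KL(q||p) as reconstructed above (only for w_i != 0)."""
    a = -w_i * np.sqrt(alpha_hat_i / 2.0)
    term1 = 0.5 * (alpha_i / alpha_hat_i - 1.0
                   + np.log(alpha_hat_i / alpha_i)
                   + alpha_i * w_i ** 2)
    term2 = ((2.0 * np.pi * alpha_hat_i) ** -0.5
             * (alpha_i + alpha_hat_i) * w_i / erfcx(a))
    term3 = -np.log(erfc(a))
    return term1 + term2 + term3

def kl_divergence(w, alpha, alpha_hat):
    return sum(kl_term(wi, ai, ahi)
               for wi, ai, ahi in zip(w, alpha, alpha_hat) if wi != 0.0)

# Illustrative values only: initial alpha_i = 0.5 as in the paper, arbitrary w and alpha_hat.
w = np.array([0.0, 0.8, 2.5])
alpha = np.full(3, 0.5)
alpha_hat = np.array([1.0, 3.0, 0.7])
print(kl_divergence(w, alpha, alpha_hat))
```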

Fixing the initial hyperparameters to αi = 0.5 (the value used in the
paper), we obtain:

50 / 57
KL divergence between Truncated Posterior and Gaussian prior

[Figure: surface of KL(q‖p) as a function of the weight wi and the
optimized hyperparameter α̂i.]

KL(q‖p) is much more sensitive to the weights wi than to the optimized
posterior hyperparameters α̂i.

Sparseness helps to minimize the KL(q‖p) divergence.

51 / 57
Sparseness and the Bound

g̃(q) = r · max( KL(q‖p), g ).

Minimizing the KL divergence does not necessarily lead to a minimal
g̃: a KL value lower than g does not help to further reduce g̃.

Minimizing the generalization bound means minimizing both the
empirical loss term (which needs sufficient, i.e. not too sparse,
model parameters) and the sparsity term (represented by KL(q‖p)
and g).

More sparseness may not be better; e.g. RVM is more sparse than SVM
and PCVM (the mean of the truncated normal distribution is not zero).

Adequate sparsity is preferred in sparse Bayesian learning.

52 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

53 / 57
Conclusion

EPCVM makes Bayesian classification more stable with respect to
kernel parameters by addressing the weakness of the standard Gaussian
prior (used for decades).

The solution of EPCVM is fully Bayesian, using the Laplace
approximation and expectation propagation.

EPCVM can incrementally add basis functions to the model by
maximizing the model evidence, which makes EPCVM computationally more
efficient.

Theoretical analysis of EPCVM and a comprehensive empirical analysis.

54 / 57
For Further Reading 1

(Chen09) H. Chen, P. Tino, and X. Yao, "Probabilistic classification
vector machines", IEEE Transactions on Neural Networks, vol. 20,
pp. 901-914, 2009.

(Chen13) H. Chen, P. Tino, and X. Yao, "Efficient Probabilistic
Classification Vector Machine with Incremental Basis Function
Selection", IEEE Transactions on Neural Networks, 2013. Accepted.

(Tipping01) M. E. Tipping, "Sparse Bayesian learning and the
relevance vector machine", Journal of Machine Learning Research,
vol. 1, pp. 211-244, 2001.

55 / 57
For Further Reading 2

(Tipping03) M. E. Tipping and A. Faul, "Fast marginal likelihood
maximisation for sparse Bayesian models", in Proceedings of the Ninth
International Workshop on Artificial Intelligence and Statistics,
vol. 1, no. 3, 2003.

(Meir03) R. Meir and T. Zhang, "Generalization error bounds for
Bayesian mixture algorithms", Journal of Machine Learning Research,
vol. 4, pp. 839-860, 2003.

56 / 57
Demo and Thank you!

Many thanks for your attention!

57 / 57
