
Sparse Bayesian Learning: Analysis and Applications

Huanhuan Chen

School of Computer Science and Technology


University of Science & Technology of China

VALSE, Nanjing, 2013

1 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

2 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

3 / 57
What is Bayesian Inference?
Bayesian inference: a method of inference that uses Bayes' rule to
combine the likelihood and our belief (prior) distributions, with
proper model selection.

P(w|D) = P(D|w) P(w) / P(D)

w: the weight vector of the model, e.g. the weights of a neural
network. D is the observed data set.

prior P(w): the probability of w before the data D are observed.
This can encode expert knowledge or a preference about the model,
e.g. sparseness.

likelihood P(D|w): the probability of observing the data D given w.

posterior P(w|D): the probability of w after D is observed.

P(D): the marginal likelihood or model evidence. It is crucial for
model selection in Bayesian inference.
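As a concrete illustration (not part of the original slides), the sketch below evaluates Bayes' rule numerically on a grid for a single weight w, with a Gaussian prior and Gaussian likelihood; the distributions and "data" are made up purely for illustration.

```python
# A toy numerical illustration of Bayes' rule for a single weight w.
# The prior, likelihood and data below are hypothetical, chosen only to
# show how P(w|D) is proportional to P(D|w) P(w) and how P(D) normalizes it.
import numpy as np
from scipy.stats import norm

w_grid = np.linspace(-3.0, 3.0, 601)          # candidate values of the weight w
dw = w_grid[1] - w_grid[0]
prior = norm.pdf(w_grid, loc=0.0, scale=1.0)  # P(w): belief before seeing data

D = np.array([0.9, 1.1, 1.4])                 # assumed observations y_n = w + noise
noise_std = 0.5
# P(D|w): product of Gaussian likelihood terms, evaluated for every w on the grid
likelihood = np.prod(norm.pdf(D[:, None], loc=w_grid[None, :], scale=noise_std), axis=0)

unnormalized = likelihood * prior
evidence = np.sum(unnormalized) * dw          # P(D) = integral of P(D|w) P(w) dw
posterior = unnormalized / evidence           # P(w|D)

print("posterior mean of w:", np.sum(w_grid * posterior) * dw)
```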
4 / 57
Parametric or Nonparametric Bayesian

Parametric Bayesian model: prior on parameters, with a fixed or
bounded number of parameters.

Prior on parameters: a sparseness-generating prior → sparse model.
Examples: Bayesian neural networks, the Relevance Vector Machine,
the Probabilistic Classification Vector Machine (PCVM), etc.

Nonparametric Bayesian model: ∞-dimensional parameter space.

Prior on functions → very flexible models.
Not sparse and computationally intensive: training O(N³),
testing O(N²).
Examples: Gaussian Processes, Dirichlet Processes, etc.

This talk focuses on parametric/sparse Bayesian models.

5 / 57
What is sparse model?

In the estimated model f(X; w) = Xw, if many of the weights are zero
(i.e. wi = 0), the obtained model is referred to as a sparse model.

$$
\underbrace{\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{pmatrix}}_{f}
=
\underbrace{\begin{pmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} & \cdots & x_{2p} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
x_{N1} & x_{N2} & x_{N3} & x_{N4} & x_{N5} & \cdots & x_{Np}
\end{pmatrix}}_{X}
\cdot
\underbrace{\begin{pmatrix} 0 \\ 0 \\ 0 \\ w_4 \\ 0 \\ \vdots \\ w_p \end{pmatrix}}_{w}
$$

Sparsity → variable selection → model interpretability.

Sparsity → regularization → less overfitting & better prediction.
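A small sketch (illustrative only, with made-up numbers) of how a sparse weight vector turns prediction into a sum over a few selected columns of X:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 8
X = rng.normal(size=(N, p))      # design matrix (rows: samples, columns: variables)

w = np.zeros(p)                  # sparse weight vector: most entries are exactly zero
w[3], w[7] = 1.5, -0.8           # only two variables are selected

f = X @ w                        # f(X; w) = X w
# Equivalently, only the non-zero weights contribute:
idx = np.flatnonzero(w)
f_sparse = X[:, idx] @ w[idx]
assert np.allclose(f, f_sparse)  # identical predictions, cheaper and interpretable
```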

6 / 57
How to generate sparsity in sparse Bayesian learning?
A sparseness-generating prior encourages sparseness:

P(w) has its highest probability at w = 0.

The higher P(w) is at 0, the sparser the solution.

Examples: Gaussian prior; Gaussian prior with a Gamma hyperprior
(giving P(wi) ∝ 1/|wi|); Laplace prior; ...
[Figure: probability density functions of the Gaussian and Laplace priors,
both peaked at w = 0.]
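To make the peak-at-zero comparison concrete, the snippet below (an illustration, not from the slides) evaluates unit-variance Gaussian and Laplace densities at the origin and in the tail:

```python
import numpy as np
from scipy.stats import norm, laplace

# Unit-variance versions of both priors: Gaussian std 1, Laplace scale b with 2*b**2 = 1.
gauss = norm(loc=0.0, scale=1.0)
lap = laplace(loc=0.0, scale=1.0 / np.sqrt(2.0))

for w in (0.0, 1.0, 3.0):
    print(f"w = {w:3.1f}   Gaussian pdf = {gauss.pdf(w):.4f}   Laplace pdf = {lap.pdf(w):.4f}")
# The Laplace prior has a sharper peak at w = 0 and heavier far tails,
# which is why it encourages exact zeros more strongly than the Gaussian prior.
```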

7 / 57
A Regression Example: Parametric Bayesian Solution
Given a training set D = {(x_n, y_n)}_{n=1}^N, x_n ∈ R^p, y_n ∈ R.

Likelihood: training mean square error (MSE), assuming zero-mean
Gaussian noise:

$$
P(D \mid \mathbf{w}) = (2\pi\sigma^2)^{-N/2} \exp\!\Big( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big( f(\mathbf{x}_n; \mathbf{w}) - y_n \big)^2 \Big),
$$

Prior: regularization term:

$$
P(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{p} \mathcal{N}(w_i \mid 0, \alpha_i^{-1}),
$$

Posterior: the optimized weight vector:

$$
\max_{\mathbf{w}} \log P(\mathbf{w} \mid D) \;\propto\; \min_{\mathbf{w}} \sum_{n=1}^{N} \big( f(\mathbf{x}_n; \mathbf{w}) - y_n \big)^2 + \sum_{i=1}^{p} \alpha_i w_i^2
$$

Maximization of the posterior in Bayesian inference is equivalent to
regularized regression, with the prior as the regularization term.
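A minimal check of this equivalence (illustrative, assuming a linear model f(x; w) = xᵀw and a single shared precision α, which reduces the MAP problem to ridge regression):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 5
X = rng.normal(size=(N, p))
w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

sigma2 = 0.1 ** 2            # noise variance assumed in the Gaussian likelihood
alpha = 2.0                  # shared prior precision: p(w_i) = N(w_i | 0, 1/alpha)

# MAP estimate: maximize log P(D|w) + log P(w)
#   <=>  minimize sum_n (x_n^T w - y_n)^2 + sigma2 * alpha * ||w||^2
# (the slide absorbs sigma^2 into the alpha_i).
lam = sigma2 * alpha
w_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# This is exactly the closed form of ridge regression with penalty lam,
# so the Gaussian prior plays the role of an L2 regularization term.
print(np.round(w_map, 3))
```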
8 / 57
Relationship between Sparse Learning and Bayesian Inference

Sparse Learning

ŵ = arg min_w R(w) + λ g(w)

R(w): likelihood (cost) function, e.g. MSE, cross entropy, etc.

g(w): (prior) sparse regularization, e.g. l0, l1 (lasso).

The parameter λ needs to be tuned by cross validation.

Sparse Bayesian Learning (SBL)

arg max_w log P(w|D) ∝ min_w R(w) + Σ_{n=1}^N α_n g(w_n)

The parameters α_n play the role of the trade-off parameter λ.
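For instance, scikit-learn exposes both routes: LassoCV tunes the single trade-off λ by cross-validation, while ARDRegression (a sparse Bayesian / automatic relevance determination model) learns a separate α_n per weight by evidence maximization. A brief sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LassoCV, ARDRegression

rng = np.random.default_rng(2)
N, p = 200, 20
X = rng.normal(size=(N, p))
w_true = np.zeros(p)
w_true[[0, 5, 12]] = [2.0, -1.5, 0.7]          # only 3 informative features
y = X @ w_true + 0.3 * rng.normal(size=N)

lasso = LassoCV(cv=5).fit(X, y)                # lambda chosen by cross-validation
ard = ARDRegression().fit(X, y)                # per-weight alphas by evidence maximization

print("lasso non-zeros:", np.flatnonzero(np.abs(lasso.coef_) > 1e-6))
print("ARD   non-zeros:", np.flatnonzero(np.abs(ard.coef_) > 1e-6))
```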

9 / 57
The benefits of Bayesian Inference

Automatic model selection, i.e. regularization parameters αi,
(potential) kernel parameters, feature selection, ..., by maximizing
the model evidence P(D|α).

Expert knowledge or preferences about the model can be easily
incorporated through the prior distribution.

Probabilistic outputs with confidence intervals (covariance matrix).

10 / 57
Ignore the normalization term, or not?

P(w|D) = P(D|w) P(w|α) / P(D|α)

To simplify calculation, the normalization term
P(D|α) = ∫ P(D|w) P(w|α) dw is often ignored.

In fact, P(D|α) is crucial for automatic model selection, i.e. for
automatically choosing the best αn.

11 / 57
How to automatically select model in SBL?

For the best hyperparameters α after observing the data D, we need to
maximize the posterior P(α|D):

P(α|D) = P(D|α) P(α) / P(D)

If a uniform prior P(α) is adopted, then

P(α|D) ∝ P(D|α).

Maximizing the evidence P(D|α) therefore maximizes the posterior
P(α|D) of the hyperparameters.

12 / 57
Iteratively optimize posterior and evidence in SBL

1 Initialization: choose an initial value for the hyperparameters α.

2 Posterior maximization: update the optimal weight vector w by
maximizing the posterior of the weights P(w|D) with the previous α.

3 Evidence maximization: update the hyperparameters α by maximizing
the evidence P(D|α).

4 Loop steps (2) and (3) until convergence (see the sketch below).
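A compact sketch of this loop for the linear-Gaussian regression case, using the classical evidence-maximization updates in the style of Tipping01. The data, the fixed noise precision and the stopping rule are illustrative; a classification model would additionally need a Laplace or EP step, as discussed later.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 10
Phi = rng.normal(size=(N, p))                    # design / basis matrix
w_true = np.zeros(p)
w_true[[1, 4]] = [1.0, -2.0]
t = Phi @ w_true + 0.1 * rng.normal(size=N)

alpha = np.ones(p)                               # step 1: initial hyperparameters
beta = 100.0                                     # noise precision 1/sigma^2 (kept fixed here)

for _ in range(100):
    # step 2: posterior of w given the current alpha (Gaussian, closed form)
    A = np.diag(alpha)
    Sigma = np.linalg.inv(beta * Phi.T @ Phi + A)
    mu = beta * Sigma @ Phi.T @ t
    # step 3: evidence-maximization update of alpha (alpha_i = gamma_i / mu_i^2)
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha_new = np.minimum(gamma / (mu ** 2 + 1e-12), 1e12)  # huge alpha => weight pruned
    if np.max(np.abs(np.log(alpha_new) - np.log(alpha))) < 1e-6:
        break
    alpha = alpha_new

print("posterior mean weights:", np.round(mu, 3))
```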

13 / 57
What are the critical problems in SBL?

Choose proper prior and likelihood distributions for specific
problems.

Effective optimization approaches to maximize the posterior of the
parameters and the model evidence: gradient-based approaches,
coordinate descent, etc.

The posterior and the evidence P(D|α) = ∫ P(D|w) P(w|α) dw are
important but often intractable if the prior or likelihood is not
Gaussian!

Hidden-variable solutions: Expectation Maximization. Pros: simple
derivations and implementations; cons: sensitive to initialization,
local minima.

Integral approximation techniques for analytical solutions: Laplace
approximation, Variational Bayes, Expectation Propagation (EP).

14 / 57
Rethinking of two questions in SBL

Is the Gaussian prior appropriate for all problems?

Bayesian methods are most powerful when your prior adequately
captures your beliefs. An improper prior yields unreasonable
inferences.

The Gaussian prior has been used for several decades. Is it proper
for classification?

Does more sparsity mean better solutions?

More sparsity: a simpler model, but it might lack the freedom to
approximate the feature-label mapping.

15 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

16 / 57
Support Vector Machines: Margin Maximisation

The largest margin

SVM maximises the margin between the two classes and thereby tries to
minimise the generalisation error.

The training points that are nearest to the separating function are
called support vectors. The model is immune to the removal of any
non-support-vector data points.
17 / 57
Support Vector Machines

Formulation

SVM makes predictions based on the function:

$$
f(\mathbf{x}; \mathbf{w}) = \operatorname{sign}\Big( \sum_{n=1}^{N} y_n w_n K(\mathbf{x}, \mathbf{x}_n) + b \Big)
$$

x_n are the training examples.
K(x, x_n) is the kernel function.
y_n ∈ {−1, +1} is the label of x_n.
N is the total number of training examples.
w_n is a non-negative Lagrange multiplier: w_n is either zero or
positive.
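A sketch of this decision rule with a Gaussian (RBF) kernel; the multipliers w_n below are made-up placeholder values rather than the output of an actual SVM solver:

```python
import numpy as np

def rbf_kernel(x, z, sigma=0.5):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svm_predict(x, X_train, y_train, w, b, sigma=0.5):
    """f(x) = sign( sum_n y_n w_n K(x, x_n) + b ); the w_n >= 0 are the multipliers."""
    s = sum(y_n * w_n * rbf_kernel(x, x_n, sigma)
            for x_n, y_n, w_n in zip(X_train, y_train, w) if w_n > 0)
    return np.sign(s + b)

# Tiny hypothetical example: two support vectors, everything else has w_n = 0.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
y_train = np.array([-1, +1, +1])
w = np.array([0.7, 0.7, 0.0])      # non-negative Lagrange multipliers (illustrative values)
b = 0.0
print(svm_predict(np.array([0.2, 0.1]), X_train, y_train, w, b))   # -> -1.0
```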

18 / 57
Analysis of Support Vector Machines

Advantages

Good generalization

Sparse solution: some weights wn are 0.

Disadvantages

Non-probabilistic, hard binary decisions.

The kernel parameters and the parameter C (controlling the error
tolerance) need to be tuned by cross validation: time consuming.

19 / 57
Relevance Vector Machine
A Bayesian treatment of a generalized linear model

$$
f(\mathbf{x}; \mathbf{w}) = \sigma\Big( \sum_{n=1}^{N} w_n \phi_n(\mathbf{x}) + b \Big),
$$

where σ(·) is the sigmoid function, giving probabilistic outputs.
RVM is a Bayesian linear model with a sparse prior on the weights w:

$$
p(w_n \mid \alpha_n) = \mathcal{N}(w_n \mid 0, \alpha_n^{-1}),
$$

where αn is the inverse variance of the Gaussian.

[Figure: probability density function of the Gaussian prior, peaked at w = 0.]
20 / 57
Analysis of RVM

Advantages

probabilistic output

sparser than SVM

Disadvantages

Some training points that belong to the positive class (yn = +1) may
receive negative weights, and vice versa. The decision of the RVM is
then based on some untrustworthy vectors, and it is therefore
sensitive to the kernel parameter (even with well-selected kernel
parameters).

21 / 57
Some Discussions on Voting and Learning

In kernel methods, every point has an impact on the decision
boundary.

In SVM, every point either votes for the decision region of its own
class (according to its label) or does not vote at all.

In RVM, every point can vote either for or against the decision
region of its own class.

Voting for or/and Against?

"Any voting system permits some expression of disapproval, but these
are necessarily confused with expressions of choice or approval,
leading some to conclude that separating these expressions is best."
(Wikipedia)

Is this the same in machine learning?
22 / 57


Illustration

[Figure: 3-D illustration of the decision surface Kw + b over the input plane.]
23 / 57
Unstable RVM with respect to kernel parameter (Gaussian kernel)
[Figure: decision boundaries on a synthetic data set.
Top row (σ = 0.5): RVM, 7 vectors, error 9.9%; SVM (C = 10), 94 vectors, error 9.4%; PCVM, 5 vectors, error 9.4%.
Bottom row (σ = 0.3): RVM, 243 vectors, error 12.6%; SVM (C = 10), 98 vectors, error 9.7%; PCVM, 4 vectors, error 8.5%.]

The used vectors (wi ≠ 0) whose weights have opposite signs are shown
circled.

More redundant vectors with small weights of opposite sign may lead
to unstable solutions.
24 / 57
Unstable RVM with respect to kernel parameter

[Figure: decision boundaries on a second synthetic data set.
Top row (σ = 0.5): RVM, 16 vectors, error 11.63%; SVM (C = 10), 104 vectors, error 11.43%; PCVM, 15 vectors, error 11.55%.
Bottom row (σ = 0.7): RVM, 355 vectors, error 12.47%; SVM (C = 10), 101 vectors, error 11.04%; PCVM, 13 vectors, error 11.63%.]
25 / 57
Theoretical Analysis (Chen09)

Maximum a posteriori (MAP) analysis: PCVM with truncated priors has a
higher posterior than models with Gaussian priors.

Margin analysis: PCVM with truncated priors has a larger margin than
models with Gaussian priors, especially with localized basis
functions.

26 / 57
Discussions

SVM always assigns positive/negative weights to positive/negative
points. This principle is implemented in SVM by enforcing the
Lagrange multipliers to be non-negative.

How can we combine the advantages of RVM and SVM and discard the
unstable characteristics?

27 / 57
Probabilistic Classication Vector Machines

Combine the advantages of SVM and RVM:

$$
y(\mathbf{x}; \mathbf{w}) = \sigma\Big( \sum_{n=1}^{N} y_n w_n \phi_n(\mathbf{x}) + b \Big),
$$

Left-truncated Gaussian prior on wn, enforcing non-negative wn:

$$
p(w_n \mid \alpha_n) =
\begin{cases}
2\,\mathcal{N}(w_n \mid 0, \alpha_n^{-1}) & \text{if } w_n \ge 0 \\
0 & \text{otherwise}
\end{cases}
$$

Hyper-parameters: the precisions αn.
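A small sketch of this prior. For w_n ≥ 0 it is exactly twice the zero-mean Gaussian density, i.e. the half-normal distribution, so scipy's halfnorm can evaluate or sample it (illustrative code, not the PCVM implementation):

```python
import numpy as np
from scipy.stats import norm, halfnorm

alpha_n = 4.0                      # prior precision; variance of the parent Gaussian is 1/alpha_n
scale = 1.0 / np.sqrt(alpha_n)

w = np.linspace(0.0, 2.0, 5)
p_truncated = 2.0 * norm.pdf(w, loc=0.0, scale=scale)   # 2 N(w | 0, alpha_n^{-1}) for w >= 0
p_halfnorm = halfnorm.pdf(w, loc=0.0, scale=scale)      # same density, via the half-normal
assert np.allclose(p_truncated, p_halfnorm)

samples = halfnorm.rvs(scale=scale, size=5, random_state=0)   # non-negative draws of w_n
print(samples)
```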

28 / 57
Truncated Prior for PCVM

[Figure: probability density p(wi|αi) of the left-truncated Gaussian prior,
supported on wi ≥ 0.]

The mean of the truncated Gaussian prior is larger than zero.

It is less sparse than the RVM with a (hierarchical) Student-t prior.

Question: does more sparseness mean better generalization?

29 / 57
PCVM Formulation
Non-negative Prior

$$
p(\mathbf{w} \mid \boldsymbol{\alpha}) = \mathcal{N}(w_0 \mid 0, \alpha_0^{-1}) \prod_{i=1}^{N} 2\,\mathcal{N}(w_i \mid 0, \alpha_i^{-1}) \cdot \delta(w_i),
$$

where δ(·) is the indicator function 1_{x ≥ 0}(x).

Bernoulli Likelihood

$$
p(\mathbf{t} \mid \mathbf{w}) = \prod_{i=1}^{N} \sigma_i^{t_i} \, [1 - \sigma_i]^{1 - t_i},
$$

where σ_i = σ( Σ_{n=0}^{N} w_n φ_n(x_i) ), t = (t_1, t_2, ..., t_N)^T is the
vector of targets, and t_i = (y_i + 1)/2 ∈ {0, 1} is the probabilistic
target.
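A short sketch of evaluating this Bernoulli log-likelihood for a given weight vector. The basis matrix Φ and the weights below are made up, and the class labels are mapped to t_i = (y_i + 1)/2 as above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_log_likelihood(w, Phi, y):
    """log p(t|w) = sum_i [ t_i log(sigma_i) + (1 - t_i) log(1 - sigma_i) ]."""
    t = (y + 1) / 2.0                      # map labels {-1, +1} to targets {0, 1}
    sig = sigmoid(Phi @ w)                 # sigma_i = sigma( sum_n w_n phi_n(x_i) )
    eps = 1e-12                            # guard against log(0)
    return np.sum(t * np.log(sig + eps) + (1 - t) * np.log(1 - sig + eps))

rng = np.random.default_rng(4)
Phi = rng.normal(size=(6, 3))              # hypothetical basis matrix (6 points, 3 basis functions)
w = np.array([0.5, 0.0, -1.0])
y = np.array([+1, -1, +1, +1, -1, -1])
print(bernoulli_log_likelihood(w, Phi, y))
```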

30 / 57
Derivations

According to Bayes' theorem, the posterior is:

p(w|t) = p(t|w) p(w|α) / p(t|α).

The integrals needed to calculate the posterior p(w|t) and the model
evidence p(t|α) = ∫ p(t|w) p(w|α) dw are intractable because of the
truncated prior.

31 / 57
Solutions

Hidden variables

Expectation Maximization (EM): simple derivations, simultaneously
optimizes kernel parameters, but sensitive to initialization and may
converge to local minima (Chen09).

Integral Approximation

Laplace approximation: deterministic and fast, and the performance is
acceptable (verified by MCMC) (Chen13).

Expectation Propagation (EP): accurate but slow (Chen13).

Markov chain Monte Carlo (MCMC): the most accurate but very slow
(Chen13).

32 / 57
Case Study using Laplace Approximation

The most probable w, i.e. the posterior mode, can be obtained by
maximizing the following log-posterior:

$$
\begin{aligned}
Q &= \log \{ p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol{\alpha}) \} - \log p(\mathbf{t} \mid \boldsymbol{\alpha}) \\
  &= \sum_{i=1}^{N} \big[ t_i \log \sigma_i + (1 - t_i) \log(1 - \sigma_i) \big] - \frac{1}{2} \sum_{i=0}^{N} \alpha_i w_i^2 \\
  &\quad + \sum_{i=1}^{N} \log \delta(w_i) - \mathrm{const}.
\end{aligned}
$$

33 / 57
Posterior of weight vector

Analyzing the first and second gradients of the above equation, we
obtain the optimal values

$$
\mathbf{w}_{\mathrm{MAP}} = A^{-1} \big( \Phi^T (\mathbf{t} - \boldsymbol{\sigma}) + \mathbf{k} \big), \qquad
\Sigma_{\mathrm{MAP}} = \big( \Phi^T B \Phi + A + D \big)^{-1},
$$

where σ_i = σ( Σ_{n=0}^{N} y_n w_n φ_n(x_i) ),

A = diag(α_0, α_1, ..., α_N),

D = diag(0, d_1, ..., d_N)
  = diag(0, σ(βw_1)(1 − σ(βw_1))β², ..., σ(βw_N)(1 − σ(βw_N))β²),

k = [0, β(1 − σ(βw_1)), ..., β(1 − σ(βw_N))]^T is the (N + 1)-vector
that ensures the weights w_i remain non-negative.
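For intuition, here is a hedged sketch of the generic Laplace step for Bayesian logistic regression with a plain Gaussian prior: Newton iterations find w_MAP, and the inverse of the negative Hessian at w_MAP gives the Gaussian covariance. It deliberately omits the truncation terms D and k that the PCVM derivation above adds, and the data are synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approximation(Phi, t, alpha, n_iter=50):
    """Gaussian approximation N(w_MAP, Sigma) to the posterior of a logistic model
    with prior w_i ~ N(0, alpha_i^{-1}). Generic sketch, no truncated prior."""
    N, M = Phi.shape
    A = np.diag(alpha)
    w = np.zeros(M)
    for _ in range(n_iter):
        s = sigmoid(Phi @ w)
        grad = Phi.T @ (t - s) - A @ w                 # gradient of the log-posterior
        B = np.diag(s * (1.0 - s))
        H = Phi.T @ B @ Phi + A                        # negative Hessian of the log-posterior
        step = np.linalg.solve(H, grad)
        w = w + step                                   # Newton update towards w_MAP
        if np.linalg.norm(step) < 1e-8:
            break
    Sigma = np.linalg.inv(H)                           # Laplace covariance around w_MAP
    return w, Sigma

# Tiny illustrative run with random basis outputs and targets t_i in {0, 1}.
rng = np.random.default_rng(5)
Phi = rng.normal(size=(40, 4))
t = (rng.uniform(size=40) < sigmoid(Phi @ np.array([1.0, -2.0, 0.0, 0.5]))).astype(float)
w_map, Sigma = laplace_approximation(Phi, t, alpha=np.ones(4))
print(np.round(w_map, 3))
```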

34 / 57
Efficient PCVM by sequentially maximizing the model evidence

The model evidence L(α) = P(t|α) can be written as

L(α) = L(α_{−i}) + l(α_i),

where

L(α_{−i}): the model evidence with basis function φ_i deleted.

l(α_i): the contribution of α_i to the evidence when φ_i is included.

Analyzing each l(α_i) → sequentially maximize the model evidence →
incremental PCVM.

35 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

36 / 57
MCMC vs. Laplace Approximation

[Figure: PCA projections of the posterior over combination weights on the
Synth (m) and Heart (n) data sets, showing MCMC (Metropolis-Hastings)
sampling points, the Laplace mean, and the Gaussian ellipse contour given
by the Laplace approximation.]

Figure: The posteriors of the combination weights calculated by MCMC
(40,000 sampling points) and by the Laplace approximation.

37 / 57
MCMC, EP and Laplace Approximation

[Figure: generalization error versus CPU time of the posterior mean for the
Laplace approximation, EP, and HMC on the (a) Synth and (b) Heart data
sets.]

Figure: Comparison of the Laplace approximation, expectation propagation
and hybrid Monte Carlo (2,000,000 sampling points) in terms of
generalization error and CPU time.

38 / 57
MCMC, EP and Laplace Approximation

Table: Comparisons of MCMC, EP and Laplace approximation on four data sets.

          Cancer                              Diabetics
Methods   error  AUC    #vec  CPU time        error  AUC    #vec  CPU time
MCMC      26.61  71.94  12    669.1 s         23.17  82.86  23    764.1 s
EP        26.65  72.53   9      3.2 s         23.18  82.89  17    357.2 s
Laplace   26.71  72.03  16      0.2 s         23.11  83.12  22      1.1 s

          Heart                               Thyroid
Methods   error  AUC    #vec  CPU time        error  AUC    #vec  CPU time
MCMC      16.37  90.67  16    707.4 s          4.94  98.71  22    913.1 s
EP        16.65  90.91  13    254.7 s          5.16  98.63  10     61.2 s
Laplace   16.65  90.83  15      0.3 s          5.02  98.87  21      0.2 s

39 / 57
Synthetic Data Sets

[Figure: decision boundaries of SVM, RVM and PCVM on two synthetic data sets.
(a) Spiral: SVM   (b) Spiral: RVM   (c) Spiral: PCVM
(d) Bumpy: SVM    (e) Bumpy: RVM    (f) Bumpy: PCVM]

40 / 57
Synthetic Data Sets

[Figure: decision boundaries of SVM, RVM and PCVM on two further synthetic
data sets.
(g) Relevance: SVM   (h) Relevance: RVM   (i) Relevance: PCVM
(j) Overlap: SVM     (k) Overlap: RVM     (l) Overlap: PCVM]

PCVM can also handle predominantly linear data.

41 / 57
Setup for Benchmark Tests

Compared algorithms: PCVM, SVM, relevance vector machine (RVM) and
sparse multinomial logistic regression (SMLR).

Baseline algorithms: linear/quadratic discriminant analysis
(LDA/QDA) and k-Nearest Neighbor (kNN).

Parameter optimization by cross validation, including the kernel
parameters of SVM, RVM, EPCVM and SMLR.

SMLR stands for sparse multinomial logistic regression
(Krishnapuram05: "Sparse Multinomial Logistic Regression: Fast
Algorithms and Generalization Bounds", IEEE TPAMI, 27(6), 2005).

42 / 57
Summary of Benchmark Data Sets

Data       No. Train  No. Test  Positive %  Negative %  Dim
Abalone    2089       2088      50.18%      49.82%       8
Banana     2650       2650      44.83%      55.17%       2
Cancer      132        131      29.28%      70.72%       9
Diabetics   384        384      34.90%      65.10%       8
German      500        500      30.00%      70.00%      20
Heart       135        135      44.44%      55.56%      13
Image      1043       1043      56.95%      43.05%      18
Ringnorm   3700       3700      49.51%      50.49%      20
Splice     1496       1495      44.93%      55.07%      60
Thyroid     108        107      30.23%      69.77%       5
Titanic    1101       1100      58.33%      41.67%       3
Twonorm    3700       3700      50.04%      49.96%      20
Waveform   2500       2500      32.94%      67.06%      21

43 / 57
Benchmark Results

[Figure: normalized (m) error rate and (n) AUC versus sparsity degree for
PCVM, SMLR, SVM, RVM, kNN, QDA and LDA across the 13 benchmark data sets.]

x-axis: sparsity degree, i.e. the % of data points used in prediction.

y-axis: normalized performance across the 13 data sets.

PCVM is less sparse than RVM.

PCVM achieves the best performance in both error rate and AUC.

44 / 57
Scalability

[Figure: (o) CPU time and (p) error rate versus the number of training
points on the Adult data set for fast PCVM, SVMlight, SMLR and RVM.]

Figure: Comparison of the CPU time and the error rate of fast PCVM, SVM,
SMLR and RVM on the Adult data set.

45 / 57
Analysis

PCVM scales well with the number of training points without
compromising performance.

RVM and SMLR do not scale well as the number of data points grows.

SVMlight is the fastest algorithm, as it is optimised with the
sequential minimal optimization (SMO) algorithm and its optimizations
for large problems have already been implemented.

46 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

47 / 57
Rademacher Complexity Bound
Rademacher complexity measures the richness of a class of real-valued
functions.

(Meir03) Consider arbitrary scalars g > 0, r > 0. Then, for
δ ∈ (0, 1), with probability at least (1 − δ) over draws of training
sets, the following bound holds:

$$
P\big( y f(\mathbf{x}, q) < 0 \big) \;\le\; R_{\mathrm{emp}}[f, D]
+ \frac{2}{s} \sqrt{\frac{2\,\tilde{g}(q)}{N}}
+ \sqrt{\frac{\ln \log_r\!\big( \tilde{g}(q)/g \big) + \tfrac{1}{2} \ln \tfrac{1}{\delta}}{N}},
$$

where R_emp is the empirical loss,

$$
R_{\mathrm{emp}}[f, D] = \frac{1}{N} \sum_{n=1}^{N} \ell_s\big( y_n f(\mathbf{x}_n, q) \big),
\qquad
\tilde{g}(q) = r \cdot \max\big( \mathrm{KL}(q \,\|\, p),\, g \big),
$$

where KL(q‖p) is the Kullback-Leibler divergence from the posterior q
to the prior p over the parameters w.
48 / 57
KL Divergence between Prior and Posterior

The KL divergence is a non-symmetric measure of the difference
between two probability distributions.

The bound is related to R_emp[f, D] and KL(q‖p). Given the same
R_emp[f, D], the bound is tighter for smaller KL(q‖p).

The KL divergence from the normalized truncated posterior p(w|t) to
the truncated Gaussian prior p(w|α) is

$$
\mathrm{KL}(q \,\|\, p) = \frac{1}{A_0} \int_{0}^{\infty} \tilde{p}(\mathbf{w} \mid \mathbf{t}) \ln \frac{\tilde{p}(\mathbf{w} \mid \mathbf{t})}{p(\mathbf{w} \mid \boldsymbol{\alpha})} \, d\mathbf{w} - \ln A_0,
$$

where p̃ stands for the un-normalized posterior/prior, and A_0 is the
cumulative probability of the posterior p(w|t) over the non-negative
weights.

49 / 57
Kullback-Leibler Divergence Between Prior and Posterior

Adopting an independence assumption on the weight vector, we obtain

$$
\mathrm{KL}(q \,\|\, p) = \sum_{i,\, w_i \neq 0}
\left\{
\frac{1}{2}\left[ \frac{\alpha_i}{\hat{\alpha}_i} - 1 + \ln \frac{\hat{\alpha}_i}{\alpha_i} + \alpha_i w_i^2 \right]
+ \frac{(2\pi \hat{\alpha}_i)^{-1/2} (\alpha_i + \hat{\alpha}_i)\, w_i}{\mathrm{erfcx}\!\left( -w_i \sqrt{\hat{\alpha}_i / 2} \right)}
- \ln \mathrm{erfc}\!\left( -w_i \sqrt{\hat{\alpha}_i / 2} \right)
\right\},
$$

where erfcx(a) = e^{a²} erfc(a).


αi are the initial hyperparameters.

α̂i are the optimised hyperparameters.
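The per-weight terms can be evaluated directly with scipy's erfc/erfcx. The sketch below is a literal transcription of the expression as reconstructed above (so it inherits any error in that reconstruction) and uses arbitrary example values for w_i, α_i and α̂_i:

```python
import numpy as np
from scipy.special import erfc, erfcx

def kl_term(w_i, alpha_i, alpha_hat_i):
    """One summand of KL(q||p) as reconstructed above (only for w_i != 0)."""
    a = -w_i * np.sqrt(alpha_hat_i / 2.0)
    term1 = 0.5 * (alpha_i / alpha_hat_i - 1.0
                   + np.log(alpha_hat_i / alpha_i)
                   + alpha_i * w_i ** 2)
    term2 = ((2.0 * np.pi * alpha_hat_i) ** -0.5
             * (alpha_i + alpha_hat_i) * w_i / erfcx(a))
    term3 = -np.log(erfc(a))
    return term1 + term2 + term3

def kl_divergence(w, alpha, alpha_hat):
    return sum(kl_term(wi, ai, ahi)
               for wi, ai, ahi in zip(w, alpha, alpha_hat) if wi != 0.0)

# Illustrative values only: initial alpha_i = 0.5 as in the paper, arbitrary w and alpha_hat.
w = np.array([0.0, 0.8, 2.5])
alpha = np.full(3, 0.5)
alpha_hat = np.array([1.0, 3.0, 0.7])
print(kl_divergence(w, alpha, alpha_hat))
```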

Fixing the initial hyperparameters to αi = 0.5 (the value used in the
paper), we obtain:

50 / 57
KL divergence between Truncated Posterior and Gaussian prior

[Figure: surface of KL(q‖p) as a function of the weight wi and the
optimized hyperparameter α̂i.]

KL(q‖p) is much more sensitive to the weights wi than to the optimized
posterior hyperparameters α̂i.

Sparseness helps to minimize the KL(q‖p) divergence.

51 / 57
Sparseness and the Bound

g̃(q) = r · max( KL(q‖p), g ).

Minimizing the KL divergence does not necessarily lead to a minimal
g̃: a KL value lower than g does not help to further reduce g̃.

Minimizing the generalization bound means minimizing both the
empirical loss term (which needs sufficient, i.e. not too sparse,
model parameters) and the sparsity term (represented by KL(q‖p)
and g).

More sparseness may not be better; e.g. RVM is more sparse than SVM
and PCVM (the mean of the truncated normal distribution is not zero).

Adequate sparsity is preferred in sparse Bayesian learning.

52 / 57
Outline

1 Introduction

2 Gaussian Prior Improper for Classification Problems

3 Experimental Analysis

4 Analysis of Sparsity and Generalization

5 Conclusion

53 / 57
Conclusion

EPCVM makes Bayesian classification more stable with respect to
kernel parameters by addressing the weakness of the standard Gaussian
prior (used for decades).

The solution of EPCVM is fully Bayesian, using the Laplace
approximation and expectation propagation.

EPCVM can incrementally add basis functions to the model by
maximizing the model evidence, which makes EPCVM computationally more
efficient.

Theoretical analysis of EPCVM and a comprehensive empirical analysis.

54 / 57
For Further Reading 1

(Chen09) H. Chen, P. Tino, and X. Yao, "Probabilistic classification
vector machines", IEEE Transactions on Neural Networks, vol. 20,
pp. 901-914, 2009.

(Chen13) H. Chen, P. Tino, and X. Yao, "Efficient Probabilistic
Classification Vector Machine with Incremental Basis Function
Selection", IEEE Transactions on Neural Networks, 2013. Accepted.

(Tipping01) M. E. Tipping, "Sparse Bayesian learning and the
relevance vector machine", Journal of Machine Learning Research,
vol. 1, pp. 211-244, 2001.

55 / 57
For Further Reading 2

(Tipping03) M. E. Tipping and A. Faul, "Fast marginal likelihood
maximisation for sparse Bayesian models", in Proceedings of the Ninth
International Workshop on Artificial Intelligence and Statistics,
vol. 1, no. 3, 2003.

(Meir03) R. Meir and T. Zhang, "Generalization error bounds for
Bayesian mixture algorithms", Journal of Machine Learning Research,
vol. 4, pp. 839-860, 2003.

56 / 57
Demo and Thank you!

Many thanks for your attention!

57 / 57
