The document summarizes Vikas Raykar's presentation on Bayesian multiple instance learning and automatic feature selection. The key points are: 1. Raykar proposes a new algorithm called MIRVM (Multiple Instance Relevance Vector Machine) that extends logistic regression to handle multiple instance learning problems while performing joint feature selection and classification. 2. MIRVM models the probability of a bag being positive as 1 minus the probability that all instances in the bag are negative, allowing it to handle the multiple instance learning scenario. 3. Raykar evaluates MIRVM on several experiments and discusses how it can also be extended to multi-task learning problems.


Bayesian multiple instance learning: automatic feature selection and inductive transfer

Vikas Chandrakant Raykar
(joint work with Balaji Krishnapuram, Jinbo Bi, Murat Dundar, R. Bharat Rao)
Siemens Medical Solutions Inc., USA

ICML 2008, July 8, 2008


Outline of the talk

1. Multiple Instance Learning
2. Proposed algorithm
   Training Data
   Classifier form
   Model
   Estimator
   Regularization
   Optimization
3. Feature Selection
4. Experiments
5. Multi-task Learning


Binary Classification
Predict whether an example belongs to class '1' or class '0'.

Computer Aided Diagnosis
Given a region in a mammogram, predict whether it is cancer (1) or not (0).

Text Categorization
Given a text, predict whether it pertains to a given topic (1) or not (0).

Binary Classifier
Given a feature vector x ∈ R^d, predict the class label y ∈ {1, 0}.


Linear Binary Classifier

Given a feature vector x ∈ R^d and a weight vector w ∈ R^d,

    y = 1 if w⊤x > θ,
    y = 0 if w⊤x < θ.

The threshold θ determines the operating point of the classifier.
The ROC curve is obtained as θ is swept from −∞ to ∞.

Training/Learning a classifier implies:
Given training data D consisting of N examples, D = {x_i, y_i}_{i=1}^{N}, choose the weight vector w.
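To make the decision rule and the threshold sweep concrete, here is a minimal NumPy sketch (the feature matrix, weight vector, and labels are made-up toy data, not from the talk); it scores each example with w⊤x and traces out ROC points as θ decreases.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the threshold theta over the scores and record (FPR, TPR) pairs."""
    order = np.argsort(-scores)              # examples in decreasing score order
    labels = labels[order]
    tps = np.cumsum(labels)                  # true positives as theta decreases
    fps = np.cumsum(1 - labels)              # false positives as theta decreases
    tpr = tps / max(labels.sum(), 1)
    fpr = fps / max((1 - labels).sum(), 1)
    return fpr, tpr

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # toy feature vectors x in R^5
w = rng.normal(size=5)                       # toy weight vector
y_true = (X @ w + 0.5 * rng.normal(size=100) > 0).astype(int)

scores = X @ w                               # decision value w^T x per example
fpr, tpr = roc_points(scores, y_true)
print(fpr[:5], tpr[:5])
```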


Labels for the training data

Single Instance Learning
Every example x_i has a label y_i ∈ {0, 1}.

Multiple Instance Learning
A group of examples (a bag) x_i = {x_ij ∈ R^d}_{j=1}^{K_i} shares a common label.


Single vs. Multiple Instance Learning
[Figure: single instance vs. multiple instance learning.]


MIL applications

A natural framework for many applications, often found to be superior to a conventional supervised learning approach:
Drug Activity Prediction
Face Detection
Stock Selection
Content Based Image Retrieval
Text Classification
Protein Family Modeling
Computer Aided Diagnosis


Computer Aided Diagnosis as a MIL problem
[Figures: digital mammography and pulmonary embolism detection examples.]


Our notion of Bags

Bag
A bag contains many instances. All the instances in a bag share the same label.

Positive Bag
A bag is labeled positive if it contains at least one positive instance.
For a radiologist, a lesion is detected if at least one of the candidates that overlaps with it is detected.

Negative Bag
A negative bag means that all instances in the bag are negative.


MIL Illustration
[Figure: single instance learning vs. multiple instance learning in the (x1, x2) feature plane.]




Proposed algorithm
Key features

MIRVM: Multiple Instance Relevance Vector Machine
A logistic regression classifier that handles the MIL scenario.
Joint feature selection and classifier learning in a Bayesian paradigm.
Extension to multi-task learning.
Very fast.
Easy to use; no tuning parameters.


Training Data
Consists of N bags

Notation
We represent an instance as a feature vector x ∈ R^d.
A bag containing K instances is denoted by boldface x = {x_j ∈ R^d}_{j=1}^{K}.
The label of a bag is denoted by y ∈ {0, 1}.

Training Data
The training data D consists of N bags, D = {x_i, y_i}_{i=1}^{N}, where x_i = {x_ij ∈ R^d}_{j=1}^{K_i} is a bag containing K_i instances, all of which share the same label y_i ∈ {0, 1}.
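As a concrete, purely illustrative way to hold this training data in memory, one can store each bag as a (K_i, d) array of instance feature vectors paired with a single bag label; the numbers below are invented toy values, not data from the talk.

```python
import numpy as np

d = 3   # feature dimension

# D = {(x_i, y_i)}: each bag x_i is a (K_i, d) array of instances, y_i in {0, 1}.
training_data = [
    (np.array([[0.2, 1.1, -0.3],
               [0.7, 0.4,  0.9]]), 1),   # positive bag with K_1 = 2 instances
    (np.array([[-1.0, 0.1, 0.2]]), 0),   # negative bag with K_2 = 1 instance
]

for i, (bag, label) in enumerate(training_data, start=1):
    print(f"bag {i}: K_{i} = {bag.shape[0]} instances, label y_{i} = {label}")
```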


Classifier form
We consider linear classifiers.

Linear Binary Classifier
Acts on a given instance: f_w(x) = w⊤x.

    y = 1 if w⊤x > θ,
    y = 0 if w⊤x < θ.


Single Instance Model
Logistic regression

Link function
The probability for the positive class is modeled as a logistic sigmoid acting on the linear classifier f_w, i.e.,

    p(y = 1 | x) = σ(w⊤x),

where σ(z) = 1/(1 + e^{−z}).

We modify this for the multiple instance learning scenario.
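For reference, the single-instance link function in a few lines of Python (toy numbers only, for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5])          # toy weight vector
x = np.array([1.2, 0.3])           # toy instance

p_positive = sigmoid(w @ x)        # p(y = 1 | x) = sigma(w^T x)
print(p_positive)
```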


Multiple Instance Model
Logistic regression

Positive Bag
A bag is labeled positive if it contains at least one positive instance.

    p(y = 1 | x) = 1 − p(all instances are negative)
                 = 1 − ∏_{j=1}^{K} [1 − p(y = 1 | x_j)]
                 = 1 − ∏_{j=1}^{K} [1 − σ(w⊤x_j)],

where the bag x = {x_j}_{j=1}^{K} contains K instances.

Negative Bag
A negative bag means that all instances in the bag are negative.

    p(y = 0 | x) = ∏_{j=1}^{K} p(y = 0 | x_j) = ∏_{j=1}^{K} [1 − σ(w⊤x_j)].
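The bag-level probability above translates directly into code. A minimal sketch, assuming each bag is stored as a (K, d) NumPy array of instances (the weights and bag below are toy values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_bag_positive(w, bag):
    """p(y = 1 | bag) = 1 - prod_j [1 - sigma(w^T x_j)] for a (K, d) bag."""
    p_instance = sigmoid(bag @ w)            # sigma(w^T x_j) for each instance
    return 1.0 - np.prod(1.0 - p_instance)   # "at least one instance is positive"

w = np.array([0.8, -0.5])
bag = np.array([[1.2, 0.3],
                [-0.4, 0.9]])                # K = 2 instances
print(prob_bag_positive(w, bag))             # probability that the bag is positive
```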


Maximum Likelihood (ML) Estimator

ML estimate
Given the training data D, the ML estimate for w is given by

    ŵ_ML = arg max_w log p(D|w).

Log-likelihood
Assuming that the training bags are independent,

    log p(D|w) = Σ_{i=1}^{N} [ y_i log p_i + (1 − y_i) log(1 − p_i) ],

where p_i = 1 − ∏_{j=1}^{K_i} [1 − σ(w⊤x_ij)] is the probability that the i-th bag x_i is positive.
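A sketch of this bag-level log-likelihood; the clipping constant eps is a numerical-safety detail added here for illustration, not something stated on the slide, and the bags are toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, bags, labels, eps=1e-12):
    """log p(D | w) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    total = 0.0
    for bag, y in zip(bags, labels):
        p_i = 1.0 - np.prod(1.0 - sigmoid(bag @ w))   # p(bag is positive)
        p_i = np.clip(p_i, eps, 1.0 - eps)            # guard the logarithms
        total += y * np.log(p_i) + (1 - y) * np.log(1.0 - p_i)
    return total

bags = [np.array([[1.2, 0.3], [-0.4, 0.9]]),   # toy positive bag
        np.array([[-1.0, -0.2]])]              # toy negative bag
labels = np.array([1, 0])
print(log_likelihood(np.array([0.5, -0.1]), bags, labels))
```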


MAP estimator
Regularization

ML estimator can exhibit severe over-fitting, especially for high-dimensional data.

MAP estimator
Use a prior on w and then find the maximum a-posteriori (MAP) solution:

    ŵ_MAP = arg max_w p(w|D) = arg max_w [ log p(D|w) + log p(w) ].


Our prior

Gaussian Prior
Zero-mean Gaussian with inverse variance (precision) α_i:

    p(w_i | α_i) = N(w_i | 0, 1/α_i).

We assume that individual weights are independent:

    p(w) = ∏_{i=1}^{d} p(w_i | α_i) = N(w | 0, A⁻¹),

where A = diag(α_1, ..., α_d) are the hyper-parameters.


The final MAP Estimator
The optimization problem
Substituting the log-likelihood and the prior, we have

    ŵ_MAP = arg max_w L(w),

where

    L(w) = Σ_{i=1}^{N} [ y_i log p_i + (1 − y_i) log(1 − p_i) ] − (w⊤Aw)/2.

Newton-Raphson method

    w^{t+1} = w^t − η H⁻¹ g,

where g is the gradient vector, H is the Hessian matrix, and η is the step length.
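A sketch of the MAP step. The slides use a Newton-Raphson update with the analytic gradient g and Hessian H; as a simpler stand-in, this sketch maximizes the same penalized objective L(w) with SciPy's generic BFGS optimizer, so the optimizer choice and the toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_objective(w, bags, labels, alpha, eps=1e-12):
    """-L(w): negative penalized bag log-likelihood, prior precisions alpha = diag(A)."""
    ll = 0.0
    for bag, y in zip(bags, labels):
        p_i = np.clip(1.0 - np.prod(1.0 - sigmoid(bag @ w)), eps, 1 - eps)
        ll += y * np.log(p_i) + (1 - y) * np.log(1.0 - p_i)
    return -(ll - 0.5 * np.sum(alpha * w ** 2))       # w^T A w / 2 with diagonal A

def map_estimate(bags, labels, alpha, d):
    res = minimize(neg_objective, np.zeros(d), args=(bags, labels, alpha),
                   method="BFGS")
    return res.x, res.hess_inv    # w_MAP and an approximate posterior covariance

bags = [np.array([[1.2, 0.3], [-0.4, 0.9]]), np.array([[-1.0, -0.2]])]
labels = np.array([1, 0])
w_map, cov = map_estimate(bags, labels, alpha=np.ones(2), d=2)
print(w_map)
```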




Feature Selection
Choosing the hyper-parameters

We imposed a prior of the form p(w) = N(w | 0, A⁻¹), parameterized by d hyper-parameters A = diag(α_1, ..., α_d).
If we know the hyper-parameters, we can compute the MAP estimate.
As the precision α_k → ∞, the variance of w_k tends to zero, concentrating the prior sharply at zero.
Since posterior ∝ likelihood × prior, regardless of the evidence of the training data the posterior for w_k will also be sharply concentrated at zero.
That feature then has no effect on the classification result; it is effectively removed via feature selection.
Therefore, the discrete optimization problem corresponding to feature selection can instead be solved via an easier continuous optimization over the hyper-parameters.


Feature Selection
Choosing the hyper-parameters to maximize the marginal likelihood

Type-II marginal likelihood approach for prior selection:

    Â = arg max_A p(D|A) = arg max_A ∫ p(D|w) p(w|A) dw.

What hyper-parameters best describe the observed data?

This is not easy to compute. We use an approximation to the marginal likelihood via a Taylor series expansion around the MAP estimate.

Approximation to the log marginal likelihood log p(D|A):

    log p(D|A) ≈ log p(D|ŵ_MAP) − (1/2) ŵ_MAP⊤ A ŵ_MAP + (1/2) log|A| − (1/2) log|−H(ŵ_MAP, A)|.


Feature Selection
Choosing the hyper-parameters

Update rule for hyper-parameters
A simple update rule for the hyper-parameters can be written by equating the first derivative to zero:

    α_i^new = 1 / (w_i² + Σ_ii),

where Σ_ii is the i-th diagonal element of H⁻¹(ŵ_MAP, A).

Relevance Vector Machine for MIL
In an outer loop we update the hyper-parameters A.
In an inner loop we find the MAP estimate ŵ_MAP given A.
After a few iterations we find that the hyper-parameters for several features tend to infinity.
This means that we can simply remove those irrelevant features.
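Putting the inner and outer loops together, here is a self-contained sketch of the alternating scheme: an inner MAP fit of w given A, and an outer update α_i = 1/(w_i² + Σ_ii) that effectively prunes features whose precision diverges. The BFGS inner solver and its inverse-Hessian approximation used for Σ_ii are stand-ins assumed for illustration (the talk uses Newton-Raphson with the exact Hessian), and the pruning threshold and toy bags are invented.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_objective(w, bags, labels, alpha, eps=1e-12):
    """-L(w): negative penalized bag log-likelihood with prior precisions alpha."""
    ll = 0.0
    for bag, y in zip(bags, labels):
        p_i = np.clip(1.0 - np.prod(1.0 - sigmoid(bag @ w)), eps, 1 - eps)
        ll += y * np.log(p_i) + (1 - y) * np.log(1.0 - p_i)
    return -(ll - 0.5 * np.sum(alpha * w ** 2))

def mi_rvm(bags, labels, d, n_outer=20, prune_at=1e6):
    alpha = np.ones(d)                        # hyper-parameters A = diag(alpha)
    w = np.zeros(d)
    for _ in range(n_outer):
        # Inner loop: MAP estimate of w for the current hyper-parameters.
        res = minimize(neg_objective, w, args=(bags, labels, alpha), method="BFGS")
        w, sigma = res.x, np.diag(res.hess_inv)   # Sigma_ii from the inverse Hessian
        # Outer loop: re-estimate each precision, alpha_i = 1 / (w_i^2 + Sigma_ii).
        alpha = 1.0 / (w ** 2 + sigma)
    selected = np.where(alpha < prune_at)[0]  # features whose precision stayed finite
    return w, alpha, selected

bags = [np.array([[1.2, 0.3, 0.0], [-0.4, 0.9, 0.1]]),
        np.array([[-1.0, -0.2, 0.05]])]       # toy bags with d = 3 features
labels = np.array([1, 0])
w, alpha, selected = mi_rvm(bags, labels, d=3)
print("kept features:", selected)
```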


Benchmark Experiments

Datasets

    Dataset    Features   Positive examples   Positive bags   Negative examples   Negative bags
    Musk1      166        207                 47              269                 45
    Musk2      166        1017                39              5581                63
    Elephant   230        762                 100             629                 100
    Tiger      230        544                 100             676                 100


Experiments

Methods compared
MI RVM: the proposed method.
MI: the proposed method without feature selection.
RVM: the proposed method without MIL.
MI LR: MIL variant of logistic regression (Settles et al., 2008).
MI SVM: MIL variant of SVM (Andrews et al., 2002).
MI Boost: MIL variant of AdaBoost (Xu and Frank, 2004).


Experiments

Evaluation Procedure
10-fold stratified cross-validation.
We plot the Receiver Operating Characteristic (ROC) curve for the various algorithms.
The true positive rate is computed at the bag level.
The ROC curve is plotted by pooling the predictions of the algorithm across all folds.
We also report the area under the ROC curve (AUC).
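A sketch of this pooled evaluation, assuming bag-level scores are produced by some trained model. Scikit-learn's StratifiedKFold, roc_curve, and roc_auc_score are used here purely for convenience, and the fit_and_score argument (here a dummy scorer) is a hypothetical stand-in for training on one fold and scoring the held-out bags; the toy bags are invented.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, roc_auc_score

def pooled_bag_roc(bags, labels, fit_and_score, n_splits=10, seed=0):
    """Pool bag-level predictions across stratified folds, then compute one ROC/AUC."""
    labels = np.asarray(labels)
    pooled_scores = np.zeros(len(bags))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(np.zeros((len(bags), 1)), labels):
        train_bags = [bags[i] for i in train_idx]
        test_bags = [bags[i] for i in test_idx]
        # fit_and_score: trains on the training bags, returns one score per test bag.
        pooled_scores[test_idx] = fit_and_score(train_bags, labels[train_idx], test_bags)
    fpr, tpr, _ = roc_curve(labels, pooled_scores)          # bag-level ROC
    return fpr, tpr, roc_auc_score(labels, pooled_scores)

# Toy demonstration with a dummy scorer that ranks a bag by its best instance score.
rng = np.random.default_rng(0)
toy_labels = np.array([0, 1] * 20)                          # 20 bags per class
toy_bags = [rng.normal(loc=y, size=(3, 4)) for y in toy_labels]

def dummy_fit_and_score(train_bags, train_labels, test_bags):
    w = np.ones(4)                                          # fixed, untrained direction
    return np.array([np.max(bag @ w) for bag in test_bags])

fpr, tpr, auc = pooled_bag_roc(toy_bags, toy_labels, dummy_fit_and_score)
print("pooled bag-level AUC:", round(auc, 3))
```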


AUC Comparison

Area under the ROC curve

    Set        MIRVM   RVM     MIBoost   MILR    MISVM   MI
    Musk1      0.942   0.951   0.899     0.846   0.899   0.922
    Musk2      0.987   0.985   0.964     0.795   -       0.982
    Elephant   0.962   0.979   0.828     0.814   0.959   0.953
    Tiger      0.980   0.970   0.890     0.890   0.945   0.956

Observations
(1) The proposed MIRVM and RVM clearly perform better.
(2) For some datasets RVM is better, i.e., MIL does not help.
(3) Feature selection helps (MIRVM is better than MI).


ROC Comparison

[Figures: ROC curves (True Positive Rate vs. False Positive Rate), pooled across folds, for Musk1, Musk2, Tiger, and Elephant, comparing MIRVM, RVM, MIBoost, MILR, MISVM, and MI.]


Features selected

The average number of features selected

    Dataset    Number of features   Selected by RVM   Selected by MI RVM   Selected by MI Boost
    Musk1      166                  39                14                   33
    Musk2      166                  90                17                   32
    Elephant   230                  42                16                   33
    Tiger      230                  56                19                   37

Observation
Multiple instance learning (MIRVM) selects far fewer features than single instance learning (RVM).


PECAD Experiments
Selected 21 out of 134 features.

[Figure: PECAD bag-level FROC curve (Sensitivity vs. False Positives/Volume) comparing MI RVM, RVM, and MI Boost.]


Multi-task Learning
Learning multiple related classifiers.
There may be a shortage of training data for learning a classifier for any single task.
Multi-task learning can exploit information from the other datasets.
The classifiers share a common prior.
A separate classifier is trained for each task.
However, the optimal hyper-parameters of the shared prior are estimated from all the data sets simultaneously.
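The slides specify the structure of the multi-task extension (one weight vector per task, one shared prior whose hyper-parameters are estimated from all data sets) but not the exact pooled update, so the sketch below simply averages the single-task quantities w_i² + Σ_ii over tasks before inverting; that pooled rule, the BFGS inner solver, and the toy tasks are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_objective(w, bags, labels, alpha, eps=1e-12):
    """-L(w) for one task: negative bag log-likelihood plus the shared Gaussian penalty."""
    ll = 0.0
    for bag, y in zip(bags, labels):
        p_i = np.clip(1.0 - np.prod(1.0 - sigmoid(bag @ w)), eps, 1 - eps)
        ll += y * np.log(p_i) + (1 - y) * np.log(1.0 - p_i)
    return -(ll - 0.5 * np.sum(alpha * w ** 2))

def multi_task_mi_rvm(tasks, d, n_outer=15):
    """tasks: list of (bags, labels) pairs; one weight vector per task, one shared alpha."""
    alpha = np.ones(d)                       # shared prior precisions A = diag(alpha)
    ws = [np.zeros(d) for _ in tasks]
    for _ in range(n_outer):
        stats = np.zeros(d)
        for t, (bags, labels) in enumerate(tasks):
            # A separate MAP classifier is trained for each task under the shared prior.
            res = minimize(neg_objective, ws[t], args=(bags, labels, alpha),
                           method="BFGS")
            ws[t] = res.x
            stats += ws[t] ** 2 + np.diag(res.hess_inv)   # per-task w_i^2 + Sigma_ii
        # Assumed pooled update: average the per-task statistics over the T tasks.
        alpha = len(tasks) / stats
    return ws, alpha

task_a = ([np.array([[1.0, 0.2], [0.3, -0.4]]), np.array([[-0.9, 0.1]])],
          np.array([1, 0]))
task_b = ([np.array([[0.8, -0.1]]), np.array([[-1.1, 0.3]])],
          np.array([1, 0]))
ws, alpha = multi_task_mi_rvm([task_a, task_b], d=2)
print("shared precisions:", alpha)
```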


Multi-task Learning
[Figure: LungCAD nodule (solid and GGOs) detection.]


Multi-task Learning Experiments

[Figure: bag-level FROC curve (Nodule Level Sensitivity vs. False Positives/Volume) on the solid validation set, comparing multi-task and single-task learning.]
Conclusion

MIRVM: Multiple Instance Relevance Vector Machine
Joint feature selection and classifier learning in the MIL scenario.
MIL selects much sparser models.
More accurate and faster than some competing methods.
Extension to multi-task learning.
