Bayesian PDF
July 8, 2008
Outline of the talk
2 Proposed algorithm
Training Data
Classifier form
Model
Estimator
Regularization
Optimization
3 Feature Selection
4 Experiments
5 Multi-task Learning
Text Categorization
Given a text, predict whether it pertains to a given topic (1) or not (0).
Binary Classifier
Given a feature vector x ∈ R^d, predict the class label y ∈ {0, 1}.
Bag
A bag contains many instances.
All the instances in a bag share the same label.
Positive Bag
A bag is labeled positive if it contains at least one positive instance.
For a radiologist
A lesion is considered detected if at least one of the candidates that overlap with it is detected.
Negative Bag
A negative bag means that all instances in the bag are negative.
2 Proposed algorithm
Notation
We represent an instance as a feature vector x ∈ R^d.
A bag containing K instances is denoted in boldface as $\mathbf{x} = \{x_j \in \mathbb{R}^d\}_{j=1}^{K}$.
The label of a bag is denoted by y ∈ {0, 1}.
Training Data
The training data consists of N bags, $\mathcal{D} = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$, where $\mathbf{x}_i = \{x_{ij} \in \mathbb{R}^d\}_{j=1}^{K_i}$ is a bag containing $K_i$ instances.
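To make the notation concrete, here is a minimal sketch (not from the paper) of one way to hold such a dataset in NumPy: each bag is a (K_i, d) array of instance feature vectors, with a parallel array of bag labels. All names and sizes below are hypothetical.

```python
import numpy as np

d = 166                                   # e.g. the number of features in Musk1
rng = np.random.default_rng(0)

def make_bag(num_instances):
    """A bag is a (K_i, d) array: one row per instance x_ij in R^d."""
    return rng.normal(size=(num_instances, d))

bags = [make_bag(K) for K in (3, 5, 2)]   # x_i = {x_ij}, j = 1..K_i
labels = np.array([1, 0, 1])              # bag labels y_i in {0, 1}
```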
Link function
The probability of the positive class for an instance is modeled as a logistic sigmoid acting on the linear classifier $f_w(x) = w^\top x$, i.e., $p(y = 1|x) = \sigma(w^\top x)$.
Since a negative bag means that all instances in the bag are negative,
$p(y = 0|\mathbf{x}) = \prod_{j=1}^{K} p(y = 0|x_j) = \prod_{j=1}^{K}\left[1 - \sigma(w^\top x_j)\right].$
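A minimal sketch of this noisy-OR link, assuming the NumPy bag representation from the previous sketch; `bag_probability` is a hypothetical helper name, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_probability(w, bag):
    """p(y = 1 | bag) = 1 - prod_j [1 - sigmoid(w^T x_j)] for a (K, d) bag."""
    p_instance = sigmoid(bag @ w)             # p(y = 1 | x_j), shape (K,)
    return 1.0 - np.prod(1.0 - p_instance)    # noisy-OR over the K instances
```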
ML estimate
Given the training data $\mathcal{D}$, the maximum-likelihood (ML) estimate for w is given by $\widehat{w}_{ML} = \arg\max_{w} \log p(\mathcal{D}|w)$.
Log-likelihood
Assuming that the training bags are independent,
$\log p(\mathcal{D}|w) = \sum_{i=1}^{N}\left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right],$
where $p_i = 1 - \prod_{j=1}^{K_i}\left[1 - \sigma(w^\top x_{ij})\right]$ is the probability that the i-th bag $\mathbf{x}_i$ is positive.
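The log-likelihood can be evaluated directly from the bag probabilities; a sketch, reusing the hypothetical `bag_probability` above (the clipping is only a numerical guard, not part of the model):

```python
import numpy as np

def log_likelihood(w, bags, labels, eps=1e-12):
    """log p(D | w) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    p = np.array([bag_probability(w, bag) for bag in bags])
    p = np.clip(p, eps, 1.0 - eps)            # avoid log(0)
    return float(np.sum(labels * np.log(p) + (1 - labels) * np.log(1.0 - p)))
```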
MAP estimator
Use a prior on w and then find the maximum a-posteriori (MAP) solution.
Gaussian Prior
Each weight has a zero-mean Gaussian prior with inverse variance (precision) $\alpha_i$, i.e., $p(w) = \mathcal{N}(w\,|\,0, A^{-1})$ with $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_d)$.
The MAP estimate is $\widehat{w}_{MAP} = \arg\max_{w} L(w)$, where
$L(w) = \left[\sum_{i=1}^{N} y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] - \frac{w^\top A w}{2}.$
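As a sketch, the penalized objective is just the log-likelihood from the earlier sketch minus the Gaussian prior term, with A the diagonal precision matrix:

```python
import numpy as np

def penalized_objective(w, A, bags, labels):
    """L(w) = log p(D | w) - w^T A w / 2 (to be maximized over w)."""
    return log_likelihood(w, bags, labels) - 0.5 * w @ A @ w
```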
Newton-Raphson method
$w^{t+1} = w^{t} - \eta H^{-1} g,$
where g is the gradient vector, H is the Hessian matrix, and η is the step length.
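A minimal optimization sketch. Rather than coding the exact Newton-Raphson update above (which needs the analytic gradient and Hessian), this substitutes SciPy's quasi-Newton BFGS routine applied to the negative objective; both aim at the same MAP estimate.

```python
import numpy as np
from scipy.optimize import minimize

def fit_map(A, bags, labels, d):
    """Return an approximate w_MAP by maximizing L(w) with BFGS."""
    result = minimize(lambda w: -penalized_objective(w, A, bags, labels),
                      x0=np.zeros(d), method="BFGS")
    return result.x
```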
3 Feature Selection
Feature selection is done by choosing the precisions A that maximize the marginal likelihood; with a Laplace approximation around the MAP estimate,
$\log p(\mathcal{D}|A) \approx \log p(\mathcal{D}|\widehat{w}_{MAP}) - \frac{1}{2}\widehat{w}_{MAP}^\top A\,\widehat{w}_{MAP} + \frac{1}{2}\log|A| - \frac{1}{2}\log\left|-H(\widehat{w}_{MAP}, A)\right|.$
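A sketch of evaluating this Laplace-approximated evidence numerically, reusing `log_likelihood` and `penalized_objective` from the earlier sketches; the finite-difference Hessian is only a stand-in for the analytic H(w_MAP, A).

```python
import numpy as np
from scipy.optimize import approx_fprime

def numerical_hessian(f, w, eps=1e-4):
    """Symmetric finite-difference Hessian of a scalar function f at w."""
    d = w.size
    H = np.zeros((d, d))
    grad = lambda v: approx_fprime(v, f, eps)
    for k in range(d):
        step = np.zeros(d)
        step[k] = eps
        H[:, k] = (grad(w + step) - grad(w - step)) / (2.0 * eps)
    return (H + H.T) / 2.0

def log_evidence(w_map, A, bags, labels):
    """Laplace approximation to log p(D | A) at the MAP estimate."""
    H = numerical_hessian(lambda w: penalized_objective(w, A, bags, labels), w_map)
    return (log_likelihood(w_map, bags, labels)
            - 0.5 * w_map @ A @ w_map
            + 0.5 * np.linalg.slogdet(A)[1]       # (1/2) log |A|
            - 0.5 * np.linalg.slogdet(-H)[1])     # (1/2) log |-H|
```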
4 Experiments
Datasets
Dataset    Features   Positive examples   Positive bags   Negative examples   Negative bags
Musk1      166        207                 47              269                 45
Musk2      166        1017                39              5581                63
Elephant   230        762                 100             629                 100
Tiger      230        544                 100             676                 100
Methods compared
MIRVM    Proposed method.
MI       Proposed method without feature selection.
RVM      Proposed method without MIL.
MILR     MIL variant of Logistic Regression (Settles et al., 2008).
MISVM    MIL variant of SVM (Andrews et al., 2002).
MIBoost  MIL variant of AdaBoost (Xu and Frank, 2004).
Evaluation Procedure
10-fold stratified cross-validation.
We plot the Receiver Operating Characteristic (ROC) curve for the various algorithms.
The True Positive Rate is computed at the bag level.
The ROC curve is plotted by pooling the predictions of each algorithm across all folds.
We also report the area under the ROC curve (AUC).
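A sketch of the pooled, bag-level evaluation, assuming scikit-learn for the metrics; `fold_predictions` is a hypothetical list of per-fold (bag scores, bag labels) pairs produced by the cross-validation loop.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def pooled_bag_roc(fold_predictions):
    """Pool bag-level scores across all CV folds and compute one ROC/AUC."""
    scores = np.concatenate([s for s, _ in fold_predictions])
    labels = np.concatenate([y for _, y in fold_predictions])
    fpr, tpr, _ = roc_curve(labels, scores)
    return fpr, tpr, roc_auc_score(labels, scores)
```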
Observations
(1) The proposed method (MIRVM) and RVM clearly perform better than the other methods.
(2) For some datasets RVM is better, i.e., MIL does not help.
(3) Feature selection helps (MIRVM is better than MI).
[Figure: ROC curve (True Positive Rate vs. False Positive Rate) on Musk1 for MIRVM, RVM, MIBoost, MILR, MISVM, and MI.]
[Figure: ROC curve (True Positive Rate vs. False Positive Rate) on Musk2 for MIRVM, RVM, MIBoost, MILR, and MI.]
[Figure: ROC curve (True Positive Rate vs. False Positive Rate) on Tiger for MIRVM, RVM, MIBoost, MILR, MISVM, and MI.]
[Figure: ROC curve (True Positive Rate vs. False Positive Rate) on Elephant for MIRVM, RVM, MIBoost, MILR, MISVM, and MI.]
Observation
Multiple instance learning (MIRVM) selects far fewer features than single instance learning (RVM).
[Figure: FROC curve (Sensitivity vs. False Positives/Volume) for MIRVM, RVM, and MIBoost.]
5 Multi-task Learning
[Figure: FROC curve (Nodule Level Sensitivity vs. False Positives/Volume) comparing Multi Task Learning and Single Task Learning.]
Conclusion