(Babenko2009) Multiple Instance Learning - Algorithms and Applications
Boris Babenko
Dept. of Computer Science and Engineering
University of California, San Diego
Abstract
Traditional supervised learning requires a training data set that consists of inputs and corre-
sponding labels. In many applications, however, it is difficult or even impossible to accurately
and consistently assign labels to inputs. A relatively new learning paradigm called Multi-
ple Instance Learning allows the training of a classifier from ambiguously labeled data. This
paradigm has been receiving much attention in the last several years, and has many useful
applications in a number of domains (e.g. computer vision, computer audition, bioinformat-
ics, text processing). In this report we review several representative algorithms that have been
proposed to solve this problem. Furthermore, we discuss a number of existing and potential
applications, and how well the currently available algorithms address the problems presented
by these applications.
1 Introduction
In traditional supervised learning a training dataset, consisting of input and output/label pairs, is
used to construct a classifier that can predict outputs/labels for novel inputs [1, 2]. An enormous
amount of theoretical and practical work has gone into developing robust algorithms for this prob-
lem. However, the requirement of input/label pairs in the training data is surprisingly prohibitive in
certain applications, and more recent research has focused on developing more flexible paradigms
for learning. In this paper we focus on the Multiple Instance Learning (MIL) paradigm, which has
been emerging as a useful tool in a number of application domains. In this paradigm the data is
assumed to have some ambiguity in how the labels are assigned. In particular, rather than pro-
viding the learning algorithm with input/label pairs, labels are assigned to sets or bags of inputs.
Restricting the labels to be binary, the MIL assumption is that every positive bag contains at least
one positive input. The true input labels can be thought of as latent variables, because they are not
known during training.
Consider the following simple example (adapted from [3]) of a MIL problem, illustrated in
Fig. 1. There are several faculty members, and each owns a key chain that contains a few keys.
You know that some of these faculty members are able to enter a certain room, and some aren’t.
The task is then to predict whether a certain key or a certain key chain can get you into this room.
To solve this we need to find the key that all the “positive” key chains have in common. Note that
[Figure 1: Serge’s, Sanjoy’s, and Lawrence’s key chains; each key chain is a bag of keys.]
if we can correctly identify this key, we can also correctly classify an entire key chain - either it
contains the required key, or it doesn’t.
The spirit of MIL emerged as early as 1990 when Keeler et al. designed a neural network
algorithm that would find the best segmentation of a handwritten digit during training [4]. In other
words, the digit label was assigned to every possible (block) segmentation of a digit image, rather
than to a tightly cropped image. The idea was further developed, and the actual term was coined
by Dietterich et al. in [3]. The problem that inspired this work was that of drug discovery, where
some property of the molecule has to be predicted from its shape statistics. The ambiguity comes
into play because each molecule can twist and bend into several different distinct shapes, and it is
not known which shape is responsible for the particular property (see Figure 4). In Section 3 we
will see many more interesting applications that fit into the MIL paradigm.
Although MIL has received an increasing amount of attention in recent years, the problem is
still fairly undeveloped and there are many interesting open questions. The goal of this report is to
provide a review of some representative algorithms for MIL, as well as some interesting applica-
tions. Finally, we discuss how well the existing algorithms address the problems presented by the
applications and point out potential avenues for future research.
2 Algorithms
The purpose of this section is to formally define the MIL problem, and review some popular al-
gorithms for solving MIL. For the purpose of clarity, it is useful to divide all existing methods
into a hierarchy, but, as is usually the case, this is impossible to do as various methods overlap in
various ways. As with supervised learning, the training procedure for MIL will require two basic
ingredients: a cost function (e.g. 0/1 loss, likelihood), and a method of finding a classifier that opti-
mizes that cost function (e.g. gradient descent, heuristic search). We chose to structure this review
by splitting the methods up according to the former criterion, but it should be noted that there are
methods with different cost functions that are actually very similar in terms of their optimization
procedures. Next we define the MIL problem formally, review the most basic way of solving MIL as
well as a PAC algorithm and related theoretical results; finally, we conclude with a review of maximum
likelihood and maximum margin MIL algorithms.
$$y_i = \max_j \{ y_{ij} \} \qquad (2)$$
Table 1: Notation.

  $x_i \in \mathcal{X}$ : $i$th instance (supervised)
  $X_i = \{x_{i1}, x_{i2}, \ldots, x_{im}\}$ : $i$th bag with $m$ instances $x_{ij}$ (MIL)
  $y_i \in \mathcal{Y}$ : label of $i$th instance (supervised) or $i$th bag (MIL)
  $y_{ij}$ : true label of $j$th instance in $i$th bag
  $y' \equiv (2y - 1)$ : label mapped from $\{0, 1\}$ to $\{-1, +1\}$
  $n$ : number of instances (supervised) or number of bags (MIL)
  $m$ : number of instances per bag
  $d$ : dimensionality
  $h(x)$ : instance classifier
  $H(X)$ : bag classifier
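The MIL assumption in Eq. 2, that a bag's label is the max (equivalently, the logical OR) of its latent instance labels, can be stated in a few lines of code; this is a minimal illustration and the helper name is ours:

```python
def bag_label(instance_labels):
    """MIL assumption (Eq. 2): a bag is positive iff at least one of its
    instances is positive, i.e. y_i = max_j { y_ij }."""
    return max(instance_labels)

# A single positive instance makes the whole bag positive.
assert bag_label([0, 0, 1]) == 1
assert bag_label([0, 0, 0]) == 0
```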
Still the question remains: can a standard supervised learning algorithm perform comparably to a
MIL algorithm? Many papers have investigated this question, and have shown for a variety of
datasets and algorithms (both supervised and MIL) that using the MIL framework results in better
performance [3, 9, 10, 11]. MIL algorithms tend to outperform this “naive MIL” approach in practice
by a large margin. Nevertheless, this is an important scenario to consider, and we will see that an
approach similar to this leads to interesting theoretical results.
a noise-tolerant learning algorithm [16]. This is similar to the naive MIL approach we discussed
in the previous section. The key to this algorithm is that the positive instances will always have
a correct label. However, the negative instances may be labeled positive (if they came from a
positive bag). Therefore, the data set that we produce will have asymmetric label noise. With
this reduction, Blum shows that any concept class that is PAC learnable with noise is also PAC
learnable for MIL. The best sample complexity bound is also due to Blum. This algorithm is a
more sophisticated extension of the above, and uses the Statistical Query framework [16] (that was
used to prove PAC bounds for noise-tolerant learning by Kearns et al.). The sample complexity
bound of this MIL algorithm is $\tilde{O}(d^2 m / \epsilon^2)$. Note that this implies that learning becomes more
difficult as the error tolerance $\epsilon$ decreases, as the dimensionality $d$ increases, and as the bag size $m$ increases.
Although Blum’s algorithm achieved the best known PAC bounds for MIL, it is clearly not a
practical algorithm. For large bag sizes m we would essentially be throwing away vast amounts of
data. However, this algorithm is interesting to study because it highlights the relationship between
supervised learning and MIL. The algorithms we review in the sections below generally have
fewer known theoretical properties, but are more practical and have been shown to perform well on
difficult data.
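The reduction itself is trivial to state in code: every instance simply inherits its bag's label, producing a supervised dataset with one-sided label noise. This is a sketch; the function name is ours:

```python
def naive_mil_dataset(bags, bag_labels):
    """Reduce MIL to supervised learning: label each instance with its bag's
    label. True positive instances are always labeled correctly; negative
    instances drawn from positive bags may be mislabeled positive, giving
    asymmetric (one-sided) label noise."""
    X, y = [], []
    for bag, label in zip(bags, bag_labels):
        X.extend(bag)
        y.extend([label] * len(bag))
    return X, y

# Two bags: a positive bag of two instances and a negative singleton.
X, y = naive_mil_dataset([[(0.1,), (0.9,)], [(0.2,)]], [1, 0])
# y == [1, 1, 0]: both instances of the positive bag inherit the label 1.
```

A noise-tolerant learner would then be trained on (X, y) as if it were ordinary supervised data.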
To adapt this loss function for MIL we need to remove the dependence on instance labels, as these
are not known during training. Instead, we let $p_i \equiv P(y_i = 1 \mid X_i, h)$ be the posterior probability of
the bag, and $p_{ij} \equiv P(y_{ij} = 1 \mid x_{ij}, h)$ be the posterior probability of an instance. The equation for
log likelihood remains the same, but note that pi is now the bag probability. As before, the exact
definition of pij will depend on the classifier.
Finally, we need to connect the bag probability pi to the probabilities of its instances pij . A
natural way of defining this relationship is:
$$p_i = \max_j \{ p_{ij} \} \qquad (4)$$
[Figure 2: The max approximations MAX, NOR, ISR, GM (r = 5, 20), and LSE (r = 5, 20), plotted as $g_\ell(v_\ell)$ for the vector $(v, 1 - v)$.]
Intuitively, if vi is the unique max, then changing vi changes the max by the same amount, other-
wise changing vi does not affect the max.
A number of approximations for g have been proposed. We summarize the choices used here in
Table 2: a variant of log-sum-exponential (LSE) [17, 18], generalized mean (GM), noisy-or (NOR)
[10], and the ISR model [4, 10]. For experimental comparisons of these different choices within
MIL we refer the reader to [19]. In Fig. 2 we show the different models applied to (v, 1 v) for
v 2 [0, 1].
LSE and GM each have a parameter $r$ that controls their sharpness and accuracy; $g_\ell(v_\ell) \to v^*$
as $r \to \infty$ (note that large $r$ can lead to numerical instability). For LSE one can show that
$v^* - \log(m)/r \le g_\ell(v_\ell) \le v^*$ [17], and for GM that $(1/m)^{1/r}\, v^* \le g_\ell(v_\ell) \le v^*$, where $m = |v_\ell|$.
NOR and ISR are only defined over $[0, 1]$. Both have probabilistic interpretations and work well in
practice; however, these models are best suited for small $m$ because $g_\ell(v_\ell) \to 1$ as $m \to \infty$. Finally, all
models are exact for $m = 1$, and if $\forall \ell\; v_\ell \in [0, 1]$, then $0 \le g_\ell(v_\ell) \le 1$ for all models.
Table 2: The max approximations $g_\ell(v_\ell)$, their partial derivatives $\partial g / \partial v_i$, and their domains.

  LSE:  $g_\ell(v_\ell) = \frac{1}{r}\log\big(\frac{1}{m}\sum_\ell \exp(r v_\ell)\big)$ ; $\frac{\partial g}{\partial v_i} = \frac{\exp(r v_i)}{\sum_\ell \exp(r v_\ell)}$ ; domain $\mathbb{R}$
  GM:   $g_\ell(v_\ell) = \big(\frac{1}{m}\sum_\ell v_\ell^r\big)^{1/r}$ ; $\frac{\partial g}{\partial v_i} = g_\ell(v_\ell)\,\frac{v_i^{r-1}}{\sum_\ell v_\ell^r}$ ; domain $[0, 1]$
  NOR:  $g_\ell(v_\ell) = 1 - \prod_\ell (1 - v_\ell)$ ; $\frac{\partial g}{\partial v_i} = \frac{1 - g_\ell(v_\ell)}{1 - v_i}$ ; domain $[0, 1]$
  ISR:  $g_\ell(v_\ell) = \frac{\sum_\ell v'_\ell}{1 + \sum_\ell v'_\ell}$, where $v'_\ell = \frac{v_\ell}{1 - v_\ell}$ ; $\frac{\partial g}{\partial v_i} = \big(\frac{1 - g_\ell(v_\ell)}{1 - v_i}\big)^2$ ; domain $[0, 1]$
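To make the four models concrete, here is a direct transcription in Python; this is a sketch, vectors are plain lists, and the default for `r` is arbitrary:

```python
import math

def gm(v, r=5.0):
    """Generalized mean."""
    return (sum(x ** r for x in v) / len(v)) ** (1.0 / r)

def lse(v, r=5.0):
    """Log-sum-exponential softmax."""
    return math.log(sum(math.exp(r * x) for x in v) / len(v)) / r

def nor(v):
    """Noisy-or; defined for v in [0, 1]."""
    prod = 1.0
    for x in v:
        prod *= 1.0 - x
    return 1.0 - prod

def isr(v):
    """ISR model; defined for v in [0, 1)."""
    s = sum(x / (1.0 - x) for x in v)
    return s / (1.0 + s)

v = [0.2, 0.9, 0.1]
# All four approximate max(v) = 0.9: LSE and GM tighten as r grows,
# while NOR and ISR overshoot toward 1 as the bag grows.
```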
Now we are ready to go over a few specific maximum likelihood algorithms for MIL. For each,
we will define the classifier h and the instance probability pij in terms of h.
We can think of the above as a probabilistic version of linear regression. Note that the range of
the sigmoid is between 0 and 1, so plugging the dot product (w · xij ) into this function produces a
valid probability value. Another interpretation of logistic regression is that w is a normal vector to
a separating hyperplane, and the probability of an instance depends on the distance between it and
the hyperplane. While in standard logistic regression we would directly plug these probabilities
into the likelihood cost function, for MIL we must first compute the bag probabilities pi and then
plug the resulting values into the likelihood. If we use one of the max approximations discussed
above, this likelihood will be differentiable with respect to w. We can therefore use a standard
unconstrained optimization procedure (e.g. gradient descent) to find a good value for w. A similar,
though more sophisticated, model was proposed in [20].
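As an illustration, the whole pipeline fits in a short gradient-ascent loop. This sketch uses the noisy-or combination for the bag probability (the choice of max approximation is left open above), and all names and hyperparameters are ours:

```python
import math

def sigmoid(z):
    z = max(min(z, 35.0), -35.0)  # clamp for numerical stability
    return 1.0 / (1.0 + math.exp(-z))

def train_mil_logistic(bags, bag_labels, d, lr=0.2, iters=200):
    """Maximize the bag log-likelihood by gradient ascent, where
    p_ij = sigmoid(w . x_ij) and the bag probability uses the
    noisy-or model p_i = 1 - prod_j (1 - p_ij)."""
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for bag, y in zip(bags, bag_labels):
            p = [sigmoid(sum(wk * xk for wk, xk in zip(w, x))) for x in bag]
            p_bag = 1.0 - math.prod(1.0 - pj for pj in p)
            p_bag = min(max(p_bag, 1e-9), 1.0 - 1e-9)
            dL_dpbag = y / p_bag - (1 - y) / (1.0 - p_bag)
            for x, pj in zip(bag, p):
                # chain rule: dp_bag / dp_ij = (1 - p_bag) / (1 - p_ij)
                dL_dpij = dL_dpbag * (1.0 - p_bag) / (1.0 - pj)
                for k in range(d):
                    grad[k] += dL_dpij * pj * (1.0 - pj) * x[k]
        w = [wk + lr * gk / len(bags) for wk, gk in zip(w, grad)]
    return w

# Toy data: positive bags contain an instance with a positive feature.
bags = [[(1.0,), (-1.0,)], [(-1.0,)], [(-0.8,)]]
w = train_mil_logistic(bags, [1, 0, 0], d=1)
```

Note that the negative instance inside the positive bag receives a gradient pushing it down only through the bag probability, which is exactly the ambiguity MIL is meant to handle.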
2.4.3 Diverse Density
Maron et al. [21] proposed an algorithm called Diverse Density for MIL. Unlike many of the
other MIL algorithms, this method does not have a close relative in supervised learning (though it is
reminiscent of nearest neighbor). The classifier again consists of just one vector in the input space:
h = {w}, w 2 Rd , which we will call the target point. Rather than defining a hyperplane, however,
this vector specifies a point in the input space that is “most positive”. The probability of an instance
xij depends on the distance between it and this target point:
In other words, the positive region in the space is defined by a Gaussian centered at the point w.
Having defined this, we can again plug the instance probabilities into one of the max approxima-
tions, and then into the likelihood ([21] uses the NOR model). Gradient descent can be used to
find an optimal w, and Maron et al. suggest using points in the positive bags to initialize (because
presumably the vector w will lie close to at least one instance in every positive bag). Several ex-
tensions can be incorporated into this framework. First, we can easily include a weighting of the
features, rather than using basic Euclidean distance:
$$p_{ij} = \exp\Big\{ -\sum_k s_k (x_{ijk} - w_k)^2 \Big\} \qquad (10)$$
It is then straightforward to modify the gradient descent to find both w and s. Furthermore, rather
than having just one target point w, it is also straightforward to modify this algorithm to find
several target points. The instance probabilities can then be calculated by taking a max over all the
target points. This makes the classifier more powerful, and enables it to learn positive classes that
are more sophisticated (essentially a mixture of Gaussians).
The EM-DD [22] algorithm uses the same cost function as the above, but the optimization
technique is different. In this algorithm, the authors chose not to use a soft approximation of
max. As we mentioned before, this leads to a non-differentiable likelihood function. Zhang et al.
therefore propose the following simple heuristic algorithm to find the optimal target point w. In
the spirit of EM, the algorithm iterates over two steps: (1) use the current estimate of the classifier
to pick the most probable point from each positive bag, and (2) find a new target point w by
maximizing likelihood over all negative instances and the positive instances from the previous step
(can use any gradient descent variant to do this since the max is no longer a problem). In other
words, step (2) reduces to standard supervised learning, rather than MIL. We will see a similar
search heuristic used in one of the maximum margin algorithms in Section 2.5.
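The two-step loop is easy to write down. In this one-dimensional sketch we replace the gradient-descent step (2) with a coarse grid search to keep the code short; all names and constants are ours:

```python
import math

def p_inst(x, w):
    """Diverse Density instance probability: a Gaussian bump at the target w."""
    return math.exp(-(x - w) ** 2)

def em_dd_1d(pos_bags, neg_instances, w0, iters=10):
    """EM-DD heuristic in one dimension. Step (1): pick the most probable
    instance (the presumed witness) from each positive bag under the current
    target. Step (2): refit the target on the witnesses plus all negative
    instances; a grid search stands in for the gradient descent of [22]."""
    w = w0
    grid = [w0 + 0.01 * k for k in range(-300, 301)]
    for _ in range(iters):
        witnesses = [max(bag, key=lambda x: p_inst(x, w)) for bag in pos_bags]
        def neg_loglik(c):
            ll = sum(math.log(p_inst(x, c) + 1e-12) for x in witnesses)
            ll += sum(math.log(1.0 - p_inst(x, c) + 1e-12) for x in neg_instances)
            return -ll
        w_new = min(grid, key=neg_loglik)
        if w_new == w:  # witnesses and target stopped moving
            break
        w = w_new
    return w

# Positive bags share instances near 1.0; decoys and negatives lie elsewhere.
target = em_dd_1d([[0.9, 5.0], [1.1, -4.0]], [5.0, -4.0, 3.0], w0=0.0)
```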
2.4.4 Boosting
Boosting [23] is a popular, powerful tool in statistical machine learning. Many different boosting
algorithms for standard supervised learning exist in the literature, and these algorithms have been
shown to be very effective in many applications. The basic idea behind boosting is to take a simple
learning algorithm, train several simple classifiers, and combine them. Typically, the combination
is done via a weighted sum, and the classifier is defined as $h(x) = \text{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big)$, where $h_t \in \mathcal{H}$
is a “weak classifier” (sometimes called the “base classifier”), and $\alpha_t$ are positive weights.
Viola et al. proposed a boosting algorithm for solving MIL in [10]. They define the probability
of an instance as
$$p_{ij} = \sigma\Big( \sum_t \alpha_t h_t(x_{ij}) \Big) \qquad (11)$$

where $\sigma$ is the sigmoid function.
This is similar to posterior probability estimates for AdaBoost and LogitBoost [24]. They then
use the NOR and ISR max approximations to define the bag probabilities pi . Using the gradient
boosting framework [25, 26], they derive a boosting procedure that optimizes the bag likelihood.
Gradient boosting works by training one weak classifier at a time, and treating each of them as a
gradient descent step in function space. To do this, we first compute the gradient $\partial L / \partial h$ and then
find a weak classifier ht that is as close as possible to this gradient (note that it may be impossible
to move exactly in the direction of the gradient because we are limited in our choice of ht ). The
weight ↵t can then be found via line search.
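The heart of this procedure is computing, by the chain rule, how the bag likelihood responds to each instance's score; the next weak classifier is then fit against these per-instance weights. A sketch using the noisy-or bag model (all names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))

def instance_weights(scores, bag_label):
    """Gradient of the bag log-likelihood with respect to each instance's
    ensemble score s_ij = sum_t alpha_t h_t(x_ij), under p_ij = sigmoid(s_ij)
    and the noisy-or bag probability. These gradients are the per-instance
    weights the next weak classifier is trained against."""
    p = [sigmoid(s) for s in scores]
    p_bag = 1.0 - math.prod(1.0 - pj for pj in p)
    p_bag = min(max(p_bag, 1e-9), 1.0 - 1e-9)
    dL_dpbag = bag_label / p_bag - (1 - bag_label) / (1.0 - p_bag)
    # chain rule: dL/ds_ij = dL/dp_bag * dp_bag/dp_ij * dp_ij/ds_ij
    return [dL_dpbag * (1.0 - p_bag) / (1.0 - pj) * pj * (1.0 - pj)
            for pj in p]

# Instances of a positive bag get positive weight (scores pushed up);
# instances of a negative bag get negative weight.
```

Under noisy-or the weight of an instance in a positive bag grows with its current probability, so the boosting rounds gradually concentrate on the presumed witness.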
[Figure 3: Comparing the solutions of MI-SVM and mi-SVM for a simple 2D dataset. Markers distinguish negative instances, positive bags 1 and 2, and support vectors.]
or negative point and this hyperplane. Because the data can be scaled arbitrarily, it can be shown
that the following optimization is equivalent to maximizing the margin:
$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \qquad (12)$$
$$\text{s.t.} \quad y'_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
The above can be interpreted as follows: we assume that the margin is at least 1, and shrink the
size of the hyperplane normal w. The latter can also be seen as a form of regularization. Because
the data may actually not be separable, we also include slack variables ⇠i for each point xi . The
points that are closest to the hyperplane (the ones whose margin constraints are tight) are called support vectors.
the bag must be bigger than one. Incorporating slack variables, we arrive at the following program:
$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{ij} \xi_{ij} \qquad (13)$$
$$\text{s.t.} \quad (w \cdot x_{ij} + b) \le -1 + \xi_{ij}, \;\; \forall i \,|\, y_i = 0$$
$$\max_j (w \cdot x_{ij} + b) \ge 1 - \xi_{ij}, \;\; \forall i \,|\, y_i = 1$$
$$\xi_{ij} \ge 0$$
The second constraint in this optimization is not convex, which makes this a difficult problem to
solve. By introducing an extra variable s(i) for each bag, we can convert the above program into a
mixed integer program:
$$\min_{s(i)} \min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{ij} \xi_{ij} \qquad (14)$$
$$\text{s.t.} \quad (w \cdot x_{ij} + b) \le -1 + \xi_{ij}, \;\; \forall i \,|\, y_i = 0$$
$$(w \cdot x_{i s(i)} + b) \ge 1 - \xi_{ij}, \;\; \forall i \,|\, y_i = 1$$
$$\xi_{ij} \ge 0$$
The intuition of this program is that the variables s(i) are trying to locate the witness instance in
each bag. Note that the constraints involve just one instance per positive bag. This means that the
potentially negative instances in a positive bag will be ignored. This version of SVM for MIL is
called MI-SVM.
Mixed integer programs are difficult to solve; therefore, Andrews et al. propose a simple heuris-
tic algorithm for solving the above program. The heuristic has the flavor of an EM algorithm: first
the values of s(i) are guessed using the current classifier (by choosing the instance in every positive
bag with the largest margin), and then the SVM classifier is trained just as in regular supervised
learning. These two steps are repeated until convergence, when the values of s(i) stop changing.
Note that this optimization procedure is very similar to the one used in EM-DD, although the ob-
jective function is different.
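The alternation is easy to sketch. Below, a tiny Pegasos-style subgradient trainer stands in for the off-the-shelf SVM solver; instances carry an appended constant feature in place of an explicit bias, and all names and hyperparameters are ours:

```python
def train_linear_svm(X, y, lam=0.1, epochs=200):
    """Minimal Pegasos-style linear SVM; labels are +/-1 and the last
    feature of each instance acts as the bias term."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            lr = 1.0 / (lam * t)
            score = sum(wk * xk for wk, xk in zip(w, xi))
            w = [(1.0 - lr * lam) * wk for wk in w]
            if yi * score < 1.0:  # hinge loss is active: take a step
                w = [wk + lr * yi * xk for wk, xk in zip(w, xi)]
    return w

def mi_svm(pos_bags, neg_bags, rounds=10):
    """MI-SVM heuristic: alternately (1) guess the witness s(i) of each
    positive bag with the current classifier and (2) retrain a standard SVM
    on the witnesses plus every instance of every negative bag."""
    witnesses = [bag[0] for bag in pos_bags]  # arbitrary initialization
    negatives = [x for bag in neg_bags for x in bag]
    for _ in range(rounds):
        X = witnesses + negatives
        y = [1] * len(witnesses) + [-1] * len(negatives)
        w = train_linear_svm(X, y)
        new = [max(bag, key=lambda x: sum(wk * xk for wk, xk in zip(w, x)))
               for bag in pos_bags]
        if new == witnesses:  # witnesses stopped changing: converged
            break
        witnesses = new
    return w

# 1-D instances as (feature, 1.0); each positive bag holds a decoy at -0.5.
pos_bags = [[(-0.5, 1.0), (2.0, 1.0)], [(-0.5, 1.0), (2.5, 1.0)]]
neg_bags = [[(-2.0, 1.0), (-3.0, 1.0)]]
w = mi_svm(pos_bags, neg_bags)
```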
program is summarized below:
$$\min_{y_{ij}} \min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{ij} \xi_{ij} \qquad (15)$$
$$\text{s.t.} \quad y_{ij} = 0, \;\; \forall i \,|\, y_i = 0$$
$$\sum_j y_{ij} \ge 1, \;\; \forall i \,|\, y_i = 1$$
$$y'_{ij} (w \cdot x_{ij} + b) \ge 1 - \xi_{ij}$$
$$\xi_{ij} \ge 0$$
This variation of SVM for MIL is called mi-SVM. This is also a mixed integer program, and is also
difficult to solve. As before, Andrews et al. propose a heuristic algorithm that is similar to the one
we described for MI-SVM. The algorithm again iterates over two steps: use the current classifier
to compute instance labels yij , then use these to train a standard SVM. If, for a positive bag, none
of the instance labels comes out positive, the instance with the largest value of $(w \cdot x_{ij} + b)$ is
labeled positive.
Figure 3 shows a simple example that illustrates the differences between MI-SVM and mi-
SVM. The important difference is that in the former approach, the negative instances in the positive
bags are essentially ignored, and only one instance in each positive bag contributes to the optimiza-
tion of the hyperplane; in the latter, the negative instances in the positive bags, as well as multiple
positive instances from one bag can be support vectors.
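The label-imputation step, including the repair applied when a positive bag ends up with no positive instance, looks as follows; this is a sketch, where `score` is any current instance classifier and the names are ours:

```python
def impute_labels(bags, bag_labels, score):
    """mi-SVM label-imputation step. Instances of negative bags are labeled
    -1. In a positive bag, labels follow the sign of the current classifier;
    if none comes out positive, the instance with the largest score is
    flipped to +1 so the MIL constraint (at least one positive) holds."""
    imputed = []
    for bag, by in zip(bags, bag_labels):
        if by == 0:
            imputed.append([-1] * len(bag))
            continue
        s = [score(x) for x in bag]
        labels = [1 if sj > 0 else -1 for sj in s]
        if 1 not in labels:
            labels[s.index(max(s))] = 1  # force a witness
        imputed.append(labels)
    return imputed

# A toy scorer (hypothetical): positive above 1.0.
score = lambda x: x - 1.0
y = impute_labels([[0.5, 2.0], [0.2, 0.8], [3.0]], [1, 1, 0], score)
# → [[-1, 1], [-1, 1], [-1]]: the second bag's best instance was forced positive.
```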
is large enough. Adding the slack variables, we arrive at the following program:
$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{ij} \xi_{ij} \qquad (16)$$
$$\text{s.t.} \quad (w \cdot x_{ij} + b) \le -1 + \xi_{ij}, \;\; \forall i \,|\, y_i = 0$$
$$\sum_j (w \cdot x_{ij} + b) \ge (2 - m), \;\; \forall i \,|\, y_i = 1$$
$$|w \cdot x_{ij} + b| \ge 1$$
$$\xi_{ij} \ge 0$$
Although this program is not convex, Bunescu et al. show how to solve it with the Concave-Convex
Procedure (CCCP), which gives an exact solution unlike a heuristic algorithm. Furthermore, they show
that this algorithm works particularly well on problems with sparse positive bags (where there are
few positives in every positive bag).
An important note about this method is that it builds a bag classifier rather than an instance
classifier. However, due to the way this problem is structured, we could easily classify an instance
by pretending it is a bag of size 1.
3 Applications
In this section we will review a few interesting applications of Multiple Instance Learning. More
importantly, we will highlight the fundamental differences in the datasets these applications in-
volve. In particular, there appear to be two distinct types of applications for MIL. We borrow the
terminology of [31] and call these two types “polymorphism ambiguity” and “part-whole ambi-
guity”. In the former, an object can have multiple different appearances in input space and it is
not known which of these is responsible for the output/label; in the latter, an object can be broken
into several pieces/parts, each of which has a representation in input space, and labels are assigned
to the whole objects rather than the parts, though only one part is responsible for the label. We
continue with a few concrete examples, and in Section 4 we discuss how the differences between
these types of applications affect algorithm design.
Figure 4: Polymorphism Ambiguity. This is a figure from [3] that we reproduce for convenience;
it shows an example of one molecule taking on different shapes. Presumably, only one of these two
shapes would be responsible for certain behaviors of this molecule.
Figure 5: Part/whole Ambiguity. These are two examples of a bag being generated by splitting an
image into regions with (TOP) image segmentation, and (BOTTOM) sliding windows. Note that a
label that would be assigned to the entire image ((TOP) tiger and (BOTTOM) face) is attributed to
just one of the smaller regions.
Another set of applications comes from Bioinformatics, where biological sequences (either
DNA base pairs, or amino acids) need to be classified. A common approach is to use a sliding
window across the sequence to form a bag of overlapping subsequences [32, 9]. The assumption
is that only one of the subsequences is responsible for the behavior of the entire DNA sequence or
protein.
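Bag construction here is mechanical: a sliding window over the sequence produces the instances. A minimal sketch (names are ours):

```python
def sequence_to_bag(seq, window, stride=1):
    """Form a MIL bag of overlapping subsequences by sliding a window of
    fixed width across a DNA or amino-acid sequence."""
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]

bag = sequence_to_bag("ACGTAC", window=4)
# → ["ACGT", "CGTA", "GTAC"]; the bag label would say whether the whole
# sequence is positive, with presumably one window responsible.
```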
Finally, document/text classification has been investigated within a MIL framework [27]. A
document is generally split into paragraphs, or into overlapping pieces consisting of several para-
graphs. The features computed for each piece of the document are presumably less noisy than
features computed over the entire document; this can lead to more precise classification.
ment has been independent of this distinction. The datasets belonging to these two different types
of ambiguities may behave in very different ways, and it would be interesting to see how these
behaviors can be used to our advantage.
One important difference in behavior between these two ambiguities is the effect of bag size, m.
Recall from Section 2.3 that the best known sample complexity bound for MIL is $\tilde{O}(d^2 m / \epsilon^2)$ [15].
This implies that increasing the bag size m actually makes the problem harder. Of course this
bound would only hold if the data assumptions made by this PAC bound hold. In particular, there
are two important assumptions made: (1) the instances in a bag are drawn independently, and (2)
each positive bag contains at least one positive instance. In the case of polymorphism ambiguity,
these seem to be reasonable assumptions, and the implied behavior tends to be consistent with
reality. For example, when dealing with molecules that have a huge number of possible shapes,
finding the one that matters becomes difficult. Note that in this scenario the user does not have any
control over the bag size - it depends purely on the data and is fixed. Furthermore, if a molecule is
labeled positive, then the bag definitely contains a positive instance.
For part/whole ambiguity, however, these assumptions do not hold and the data may behave in
a completely opposite manner! Consider the case of object detection with sliding windows gener-
ating a bag. Sampling an image densely with sliding windows gives us a higher chance of cropping
out the “perfect” bounding box around the face in an image. Therefore, in this application, a larger
bag may actually make learning easier. Note that in this type of application, the instances are not
at all independent. The other key difference is that here we are actually not guaranteed to have a
positive instance in a positive bag. For example, consider the case of using image segmentation to
generate a bag - what if the segmentation performs poorly and returns nonsense segments? Note
that in this scenario the user does have control over the bag size, as there is usually a way of deter-
mining how many parts the whole object is broken into to generate a bag. The only way to ensure
that the MIL assumption holds in this application is to generate bags exhaustively: take every pos-
sible segment in an image, crop out every possible bounding box, etc. Of course, this would be
prohibitive in terms of resources as the amount of data explodes – in some cases the number of
parts for one object is actually infinite.
In the future, it would be interesting to develop algorithms that take these observations into
account. In particular, an algorithm specifically tailored to the part/whole ambiguity datasets would
be useful in many vision and audition applications. Furthermore, the theoretical properties of these
types of datasets should be explored by making more appropriate assumptions.
Acknowledgements
The author would like to thank Serge Belongie, Sanjoy Dasgupta and Lawrence Saul for their
feedback.
References
[1] Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)
[2] Bishop, C.: Pattern Recognition and Machine Learning (Information Science and Statistics).
Springer-Verlag, New York, NY, USA (2006)
[3] Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple-instance problem
with axis-parallel rectangles. A.I. (1997)
[4] Keeler, J.D., Rumelhart, D.E., Leow, W.K.: Integrated segmentation and recognition of hand-
printed numerals. In: NIPS. (1990)
[5] Ray, S., Page, D.: Multiple instance regression. In: ICML. (2001) 425–432
[6] Zhou, Z., Zhang, M.: Multi-Instance Multi-Label Learning with Application to Scene Clas-
sification. In: NIPS. (2006)
[7] Wang, J., Zucker, J.: Solving the multiple-instance problem: A lazy learning approach. In:
ICML. (2000) 1119–1125
[8] Gartner, T., Flach, P., Kowalczyk, A., Smola, A.: Multi-instance kernels. In: ICML. (2002)
179–186
[9] Ray, S., Craven, M.: Supervised versus multiple instance learning: an empirical comparison.
ICML (2005) 697–704
[10] Viola, P., Platt, J.C., Zhang, C.: Multiple instance boosting for object detection. In: NIPS.
(2005)
[11] Bunescu, R., Mooney, R.: Multiple instance learning for sparse positive bags. In: ICML.
(2007) 105–112
[12] Auer, P., Long, P., Srinivasan, A.: Approximating Hyper-Rectangles: Learning and Pseudo-
random Sets. Journal of Computer and System Sciences 57(3) (1998) 376–388
[13] Long, P., Tan, L.: PAC Learning Axis-aligned Rectangles with Respect to Product Distribu-
tions from Multiple-Instance Examples. Machine Learning 30(1) (1998) 7–21
[14] Valiant, L.: A theory of the learnable. Communications of the ACM 27(11) (1984) 1134–
1142
[15] Blum, A., Kalai, A.: A Note on Learning from Multiple-Instance Examples. Machine Learn-
ing 30(1) (1998) 23–29
[16] Kearns, M.: Efficient noise-tolerant learning from statistical queries. Journal of the ACM
(JACM) 45(6) (1998) 983–1006
[17] Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge Univ. Press (2004)
[18] Ramon, J., De Raedt, L.: Multi instance neural networks. ICML, Workshop on Attribute-
Value and Relational Learning (2000)
[19] Babenko, B., Dollár, P., Tu, Z., Belongie, S.: Simultaneous learning and alignment: Multi-
instance and multi-pose learning. In: Faces in Real-Life Images. (2008)
[20] Saul, L.K., Rahim, M.G., Allen, J.B.: A statistical model for robust integration of narrowband
cues in speech. Computer Speech and Language 15 (2001) 175–194
[21] Maron, O., Lozano-Perez, T.: A framework for multiple-instance learning. In: NIPS. (1998)
[22] Zhang, Q., Goldman, S.: EM-DD: An improved multiple-instance learning technique. In:
NIPS. Volume 14. (2002) 1073–1080
[23] Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences 55 (1997) 119–139
[24] Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of
boosting. Technical report, Stanford University (1998)
[25] Mason, L., Baxter, J., Bartlett, P., Frean, M.: Boosting algorithms as gradient descent. NIPS
12 (2000) 512–518
[26] Friedman, J.: Greedy function approximation: A gradient boosting machine. Annals of
Statistics 29(5) (2001) 1189–1232
[27] Andrews, S., Hofmann, T., Tsochantaridis, I.: Multiple instance learning with generalized
support vector machines. A.I. (2002) 943–944
[28] Andrews, S.: Learning from Ambiguous Examples. PhD thesis, Brown University (2007)
[29] Chen, Y., Wang, J.: Image categorization by learning and reasoning with regions. JMLR 5
(2004) 913–939
[30] Bi, J., Chen, Y., Wang, J.: A sparse support vector machine approach to region-based image
categorization. In: CVPR. (2005)
[31] Andrews, S., Hofmann, T.: Multiple Instance Learning via Disjunctive Programming Boost-
ing. In: NIPS. (2004)
[32] Zhang, Y., Chen, Y., Ji, X.: Motif Discovery as a Multiple-Instance Problem. (2006) 805–809