
Generalized Majorization-Minimization

Sobhan Naderi 1 Kun He 2 Reza Aghajani 3 Stan Sclaroff 4 Pedro Felzenszwalb 5

* The paper was written when Kun He was at Boston University. 1 Google Research, 2 Facebook Reality Labs, 3 University of California San Diego, 4 Boston University, 5 Brown University. Correspondence to: Sobhan Naderi <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Non-convex optimization is ubiquitous in machine learning. Majorization-Minimization (MM) is a powerful iterative procedure for optimizing non-convex functions that works by optimizing a sequence of bounds on the function. In MM, the bound at each iteration is required to touch the objective function at the optimizer of the previous bound. We show that this touching constraint is unnecessary and overly restrictive. We generalize MM by relaxing this constraint, and propose a new optimization framework, named Generalized Majorization-Minimization (G-MM), that is more flexible. For instance, G-MM can incorporate application-specific biases into the optimization procedure without changing the objective function. We derive G-MM algorithms for several latent variable models and show empirically that they consistently outperform their MM counterparts in optimizing non-convex objectives. In particular, G-MM algorithms appear to be less sensitive to initialization.

1. Introduction

Non-convex optimization is ubiquitous in machine learning. For example, data clustering (MacQueen, 1967; Arthur & Vassilvitskii, 2007), training classifiers with latent variables (Yu & Joachims, 2009; Felzenszwalb et al., 2010; Pirsiavash & Ramanan, 2014; Azizpour et al., 2015), and training visual object detectors from weakly labeled data (Song et al., 2014; Rastegari et al., 2015; Ries et al., 2015) all lead to non-convex optimization problems.

Majorization-Minimization (MM) (Hunter et al., 2000) is an optimization framework for designing well-behaved optimization algorithms for non-convex functions. MM algorithms work by iteratively optimizing a sequence of easy-to-optimize surrogate functions that bound the objective. Two of the most successful instances of MM algorithms are Expectation-Maximization (EM) (Dempster et al., 1977) and the Concave-Convex Procedure (CCP) (Yuille & Rangarajan, 2003). However, both have a number of drawbacks in practice, such as sensitivity to initialization and lack of uncertainty modeling for latent variables. This has been noted in (Neal & Hinton, 1998; Felzenszwalb et al., 2010; Parizi et al., 2012; Kumar et al., 2012; Ping et al., 2014).

We propose a new procedure, Generalized Majorization-Minimization (G-MM), for non-convex optimization. Our approach is inspired by MM, but we generalize the bound construction process to allow for a set of valid bounds to be used, while still maintaining algorithmic convergence. This generalization gives us more freedom in bound selection and can be used to design better optimization algorithms. In training latent variable models and in clustering problems, MM algorithms such as CCP and k-means are known to be sensitive to the initial values of the latent variables or cluster memberships. We refer to this problem as stickiness of the algorithm to the initial latent values. Our experimental results show that G-MM leads to methods that tend to be less sticky to initialization. We demonstrate the benefit of using G-MM on multiple problems, including k-means clustering and applications of Latent Structural SVMs to image classification with latent variables.

1.1. Related Work

One of the most popular and well studied iterative methods for non-convex optimization is the EM algorithm (Dempster et al., 1977). EM is best understood in the context of maximum likelihood estimation in the presence of missing data, or latent variables. EM is a bound optimization algorithm: in each E-step, a lower bound on the likelihood is constructed, and the M-step maximizes this bound.

Countless efforts have been made to extend the EM algorithm since its introduction. In (Neal & Hinton, 1998) it is shown that, while both steps in EM involve optimizing some functions, it is not necessary to fully optimize the functions in each step; in fact, each step only needs to "make progress". This relaxation can potentially avoid sharp local minima and even speed up convergence.
The Majorization-Minimization (MM) framework (Hunter et al., 2000) generalizes EM by optimizing a sequence of surrogate functions (bounds) on the original objective function. The Concave-Convex Procedure (CCP) (Yuille & Rangarajan, 2003) is a widely-used instance of MM where the surrogate function is obtained by linearizing the concave part of the objective. Many successful learning algorithms employ CCP, e.g. the Latent SVM (Felzenszwalb et al., 2010). Other instances of MM algorithms include iterative scaling (Pietra et al., 1997) and non-negative matrix factorization (Lee & Seung, 1999). Another related line of research concerns Difference-of-Convex (DC) programming (Tao, 1997), which can be shown to reduce to CCP under certain conditions. Convergence properties of such general "bound optimization" algorithms have been discussed in (Salakhutdinov et al., 2002).

Despite widespread success, MM (and CCP in particular) has a number of drawbacks, some of which have motivated our work. In practice, CCP often exhibits stickiness to initialization, which necessitates expensive initialization or multiple trials (Parizi et al., 2012; Song et al., 2014; Cinbis et al., 2016). In optimizing latent variable models, CCP lacks the ability to incorporate application-specific information without making modifications to the objective function, such as prior knowledge or side information (Xing et al., 2002; Yu, 2012), latent variable uncertainty (Kumar et al., 2012; Ping et al., 2014), and posterior regularization (Ganchev et al., 2010). Our framework addresses these drawbacks. Our key observation is that we can relax the constraint enforced by MM that requires the bounds to touch the objective function, and this relaxation gives us the ability to better avoid sensitivity to initialization, and to incorporate side information.

A closely related work to ours is the "pseudo-bound" optimization framework of (Tang et al., 2014). It generalizes CCP using bounds that may intersect the objective function. In contrast, our framework uses valid bounds, but only relaxes the touching requirement. Also, the pseudo-bound optimization framework is specific to binary energies in MRFs (although it was recently generalized to multi-label energies in (Tang et al., 2019)), and it restricts the form of surrogate functions to parametric max-flow.

The generalized variants of EM proposed and analyzed by (Neal & Hinton, 1998) and (Gunawardana & Byrne, 2005) are related to our work when we restrict our attention to probabilistic models and the EM algorithm. EM can be viewed as a bound optimization procedure where the likelihood function involves both the model parameters θ and a distribution q over the latent variables, denoted by F(θ, q). Choosing q to be the posterior leads to a lower bound on F that is tight at the current estimate of θ. Generalized versions of EM, such as those given by (Neal & Hinton, 1998), use distributions other than the posterior in an alternating optimization of F. This fits into our framework, as we use the exact same objective function and only change the bound construction step (which amounts to picking the distribution q in EM). We propose both stochastic and deterministic strategies for bound construction, and demonstrate that they lead to higher quality solutions and less sensitivity to initialization than other EM-like methods.

2. Proposed Optimization Framework

We consider minimization of functions that are bounded from below. The extension to maximization is trivial. Let F(w) : R^d → R be a lower-bounded function that we wish to minimize. We propose an iterative procedure that generates a sequence of solutions w_1, w_2, ... until it converges. The solution at iteration t ≥ 1 is obtained by minimizing an upper bound b_t(w) to the objective function, i.e. w_t = argmin_w b_t(w). The bound at iteration t is chosen from a set of "valid" bounds B_t (see Figure 1). In practice, we take the members of B_t from a family F of functions that upper-bound F and can be optimized efficiently, such as quadratic functions, or quadratic functions with linear constraints. F must be rich enough so that B_t is never empty. Algorithm 1 gives the outline of the approach.

This general scheme is used in both MM and G-MM. However, as we shall see in the rest of this section, MM and G-MM have key differences in the way they measure progress and the way they construct new bounds.

2.1. Progress Measure

MM measures progress with respect to the objective values. To guarantee progress over time, MM requires that the bound at iteration t must touch the objective function at the previous solution, leading to the following constraint:

    MM constraint:  b_t(w_{t-1}) = F(w_{t-1}).    (1)

This touching constraint, together with the fact that w_t minimizes b_t, leads to F(w_t) ≤ F(w_{t-1}). That is, the value of the objective function is non-increasing over time. However, it can make it hard to avoid local minima, and it eliminates the possibility of using bounds that do not touch the objective function but may have other desirable properties.

In G-MM, we measure progress with respect to the bound values. This allows us to relax the touching constraint of MM, stated in (1), and require instead that

    G-MM constraints:  b_1(w_0) = F(w_0)  and  b_t(w_{t-1}) ≤ b_{t-1}(w_{t-1}).    (2)

Note that the G-MM constraints are weaker than MM: since b_{t-1} is an upper bound on F, (1) implies (2).
Algorithm 1  G-MM optimization

    input: w_0, η, ε
    1: v_0 := F(w_0)
    2: for t := 1, 2, ... do
    3:   select b_t ∈ B_t = B(w_{t-1}, v_{t-1}) as in (3)
    4:   w_t := argmin_w b_t(w)
    5:   d_t := b_t(w_t) − F(w_t)
    6:   v_t := b_t(w_t) − η d_t
    7:   if d_t < ε break
    8: end for
    output: w_t

Figure 1. Optimization of F using MM (red) and G-MM (blue). In MM the bound b_2 has to touch F at w_1. In G-MM we only require that b_2 be below b_1 at w_1, leading to several choices B_2.

While the MM constraint implies that the sequence {F(w_t)}_t is decreasing, G-MM only requires {b_t(w_t)}_t to be decreasing.

2.2. Bound Construction

This section describes line 3 of Algorithm 1. To construct a bound at iteration t, G-MM considers a "valid" subset of upper bounds B_t ⊆ F. To guarantee convergence, we restrict B_t to bounds that are below a threshold v_{t-1} at the previous solution w_{t-1}:

    B_t = B(w_{t-1}, v_{t-1}),    B(w, v) = {b ∈ F | b(w) ≤ v}.    (3)

Initially, we set v_0 = F(w_0) to ensure that the first bound touches F. For t ≥ 1, we set v_t = ηF(w_t) + (1 − η)b_t(w_t) for some hyperparameter η ∈ (0, 1], which we call the progress coefficient. This guarantees making at least ηd_t progress, where d_t = b_t(w_t) − F(w_t) is the gap between the bound and the true objective value at w_t. Small values of η allow for gradual exploratory progress while large values of η greedily select bounds that guarantee immediate progress. When η = 1 all valid bounds touch F at w_{t-1}, corresponding to the MM requirement. Note that all the bounds b ∈ B_t satisfy (2).

We consider two scenarios for selecting a bound from B_t. In the first scenario we define a bias function g : B_t × R^d → R that takes a bound b ∈ B_t and a current solution w ∈ R^d and returns a scalar indicating the goodness of the bound. We then select the bound with the largest bias value, i.e. b_t = argmax_{b ∈ B_t} g(b, w_{t-1}). In the second scenario we propose to choose a bound from B_t at random. Thus, we have both a deterministic (the 1st scenario) and a stochastic (the 2nd scenario) bound construction mechanism.

2.3. Generalization over MM

MM algorithms, such as EM and CCP, are special cases of G-MM that use a specific bias function g(b, w) = −b(w). Note that b_t = argmax_{b ∈ B_t} −b(w_{t-1}) touches F at w_{t-1}, assuming B_t includes such a bound. Also, by definition, b_t makes maximum progress with respect to the previous bound value b_{t-1}(w_{t-1}). By choosing bounds that maximize progress, MM algorithms tend to rapidly converge to a nearby local minimum. For instance, at iteration t, the CCP bound for latent SVMs is obtained by fixing the latent variables in the concave part of F to the maximizers of the score of the model from iteration t−1, making w_t attracted to w_{t-1}. Similarly, in the E-step of EM, the posterior distribution of the latent variables is computed with respect to w_{t-1} and, in the M-step, the model is updated to "match" these fixed posteriors. This explains one reason why MM algorithms are observed to be sticky to initialization.

G-MM offers a more flexible bound construction scheme than MM. In Section 5 we show empirically that picking a valid bound randomly, i.e. b_t ∼ U(B_t), is less sensitive to initialization and leads to better results compared to CCP and EM. We also show that using good bias functions can further improve performance of the learned models.
3. Convergence of G-MM

We show that, under general assumptions, the sequence {w_t}_t of bound minimizers converges, and Algorithm 1 stops after finitely many steps (Theorem 1). With additional assumptions, we also prove that the limit of this sequence is a stationary point of F (Theorem 2). We believe stronger convergence properties depend on the structure of the function F, the family of bounds F, and the bound selection strategy, and should be investigated separately for each specific problem. We prove Theorems 1 and 2 in the supplementary material.

Theorem 1. Suppose F is a lower-bounded, continuous function with compact sublevel sets, and F is a family of lower-bounded and m-strongly convex functions. Then the sequence of minimizers {w_t}_t converges (i.e. the limit w† = lim_{t→∞} w_t exists), and the gap d_t = b_t(w_t) − F(w_t) converges to 0.

Theorem 2. Suppose the assumptions in Theorem 1 hold. In addition, let F be continuously differentiable, and F be a family of smooth functions such that ∀b ∈ F, MI ⪰ ∇²b(w) ⪰ mI for some m, M ∈ (0, ∞), where I is the identity matrix. Then ∇F(w†) = 0, namely, G-MM converges to a stationary point of F.

4. Derived Optimization Algorithms

G-MM is applicable to a variety of non-convex optimization problems, but for simplicity and ease of exposition, we primarily focus on latent variable models where bound construction naturally corresponds to imputing latent variables in the model. In this section we derive G-MM algorithms for two widely used families of models, namely k-means and the Latent Structural SVM. Note that the training objectives of these two problems are non-differentiable and, therefore, Theorem 2 does not apply to them. However, note that the theorem only gives a sufficient condition and is not necessary for convergence of the algorithms. In fact, in all our experiments we observe that G-MM converges (e.g. see Table 4), and these algorithms significantly outperform their MM counterparts (see Section 5).

4.1. k-means Clustering

Let {x_1, ..., x_n} denote n points and w = (µ_1, ..., µ_k) denote k cluster centers. We assign a cluster to each point, denoted by z_i ∈ {1, ..., k}, ∀i ∈ {1, ..., n}. The objective function in k-means clustering is defined as follows,

    F(w) = Σ_{i=1}^n min_{z_i ∈ {1,...,k}} ||x_i − µ_{z_i}||²,    w = (µ_1, ..., µ_k).    (4)

Bound construction: We obtain a convex upper bound on F by fixing the latent variables (z_1, ..., z_n) to certain values instead of minimizing over these variables. Such bounds are quadratic convex functions of (µ_1, ..., µ_k),

    F = { Σ_{i=1}^n ||x_i − µ_{z_i}||²  |  ∀i, z_i ∈ {1, ..., k} }.    (5)

The k-means algorithm is an instance of MM methods. The algorithm repeatedly assigns each example to its nearest center to construct a bound, and then updates the centers by optimizing the bound. We can set g(b, w) = −b(w) in G-MM to obtain the k-means algorithm. We can also define g differently to obtain a G-MM algorithm that exhibits other desired properties. For instance, a common issue in clustering is cluster starvation. One can discourage starvation by defining g accordingly.

We select a bound from B_t uniformly at random by starting from an initial configuration z = (z_1, ..., z_n) that corresponds to a valid bound in B_t (e.g. the k-means solution). We then do a random walk on a graph whose nodes are latent configurations defining valid bounds. The neighbors of a latent configuration z are other latent configurations that can be obtained by changing the value of one of the n latent variables in z.

Bound optimization: Optimization of b ∈ F can be done in closed form by setting µ_j to be the mean of all examples assigned to cluster j:

    µ_j = (Σ_{i ∈ I_j} x_i) / |I_j|,    I_j = {1 ≤ i ≤ n | z_i = j}.    (6)

4.2. Latent Structural SVM

A Latent Structural SVM (LS-SVM) (Yu & Joachims, 2009) defines a structured output classifier with latent variables. It extends the Structural SVM (Joachims et al., 2009) by introducing latent variables.

Let {(x_1, y_1), ..., (x_n, y_n)} denote a set of labeled examples with x_i ∈ X and y_i ∈ Y. We assume that each example x_i has an associated latent value z_i ∈ Z. Let φ(x, y, z) : X × Y × Z → R^d denote a feature map. A vector w ∈ R^d defines a classifier ŷ : X → Y,

    ŷ(x) = argmax_y (max_z w · φ(x, y, z)).    (7)

The LS-SVM training objective is defined as follows,

    F(w) = (λ/2)||w||² + (1/n) Σ_{i=1}^n [ max_{y,z} (w · φ(x_i, y, z) + ∆(y, y_i)) − max_z w · φ(x_i, y_i, z) ],    (8)

where λ is a hyper-parameter that controls regularization and ∆(y, y_i) is a non-negative loss function that penalizes the prediction y when the ground truth label is y_i.

Bound construction: As in the case of k-means, a convex upper bound on the LS-SVM objective can be obtained by imputing latent variables. Specifically, for each example x_i, we fix z_i ∈ Z, and replace the maximization in the last term of the objective with a linear function w · φ(x_i, y_i, z_i). This forms a family of convex piecewise quadratic bounds,

    F = { (λ/2)||w||² + (1/n) Σ_{i=1}^n [ max_{y,z} (w · φ(x_i, y, z) + ∆(y, y_i)) − w · φ(x_i, y_i, z_i) ]  |  ∀i, z_i ∈ Z }.    (9)

The CCP algorithm for LS-SVM selects the bound b_t defined by z_i^t = argmax_{z_i} w_{t-1} · φ(x_i, y_i, z_i). This particular choice is a special case of G-MM with g(b, w) = −b(w). To generate random bounds from B_t we use the same approach as in the case of k-means clustering. We perform a random walk in a graph where the nodes are latent configurations leading to valid bounds, and the edges connect latent configurations that differ in a single latent variable.

Bound optimization: Optimization of b ∈ F corresponds to a convex quadratic program and can be solved using different techniques, including gradient based methods (e.g. SGD) and the cutting-plane method (Joachims et al., 2009). We use the cutting-plane method in our experiments.
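The two derivations above share the same recipe: impute latent variables to obtain a convex bound, test the bound against the threshold v_{t-1} as in (3), and then optimize it. The NumPy sketch below instantiates this recipe for the k-means case of Section 4.1, where bound optimization has the closed form (6). The forgy-style initialization, the length of the random walk, and the handling of empty clusters are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(X, centers):
    # k-means objective (4): each point pays the squared distance to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).sum()

def bound_value(X, centers, z):
    # Bound from family (5): squared distances to the *assigned* centers.
    return ((X - centers[z]) ** 2).sum()

def optimize_bound(X, z, prev_centers):
    # Closed-form bound minimizer (6): each center becomes the mean of its points.
    centers = prev_centers.copy()
    for j in range(len(centers)):
        members = X[z == j]
        if len(members):                 # an empty cluster keeps its previous center
            centers[j] = members.mean(axis=0)
    return centers

def random_valid_assignment(X, centers, v_prev, steps=300):
    # Start from the nearest-center assignment (the k-means/MM bound, always valid)
    # and random-walk over single-point reassignments that keep b(w_prev) <= v_prev.
    z = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
    for _ in range(steps):
        cand = z.copy()
        cand[rng.integers(len(X))] = rng.integers(len(centers))
        if bound_value(X, centers, cand) <= v_prev:
            z = cand
    return z

def gmm_kmeans(X, k, eta=0.1, eps=1e-8, max_iter=500):
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # forgy init
    v = objective(X, centers)                                      # v_0 = F(w_0)
    for _ in range(max_iter):
        z = random_valid_assignment(X, centers, v)                 # bound construction
        centers = optimize_bound(X, z, centers)                    # bound optimization
        b, f = bound_value(X, centers, z), objective(X, centers)
        v = b - eta * (b - f)                                      # progress threshold
        if b - f < eps:
            break
    return centers, objective(X, centers)

# Toy data: four well-separated 2-D blobs.
X = rng.normal(scale=0.3, size=(200, 2)) + rng.integers(0, 4, size=(200, 1)) * 3.0
print(round(gmm_kmeans(X, k=4)[1], 3))
```

With eta = 1 the only valid assignments are the nearest-center ones, and the procedure reduces to standard k-means.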
4.3. Bias Function for Multi-fold MIL

The multi-fold MIL algorithm (Cinbis et al., 2016) was introduced for training latent SVMs for weakly supervised object localization, to deal with stickiness issues in training with CCP. It modifies how latent variables are updated during training. (Cinbis et al., 2016) divides the training set into K folds, and updates the latent variables in each fold using a model trained on the other K − 1 folds. This algorithm does not have a formal convergence guarantee. By defining a suitable bias function, we can derive a G-MM algorithm that mimics the behavior of multi-fold MIL and yet is convergent.

Consider training an LS-SVM. Let S = {1, ..., n} and let I ⊆ S denote a subset of S. Also, let z_i ∈ Z denote the latent variable associated with training example (x_i, y_i), and z_I^t denote the fixed latent values for training examples indexed by I in iteration t. We denote the model trained on {(x_i, y_i) | i ∈ I} with latent variables fixed to z_I^t in the last maximization of (8) by w(I, z_I^t).

We assume access to a loss function ℓ(w, x, y, z). For example, for the binary latent SVM where y ∈ {−1, 1}, ℓ is the hinge loss: ℓ(w, x, y, z) = max{0, 1 − y w · φ(x, z)}.

We first consider the Leave-One-Out (LOO) setting, i.e. K = n, and call the algorithm of (Cinbis et al., 2016) LOO-MIL in this case. The update rule of LOO-MIL in iteration t is to set

    z_i^t = argmin_{z ∈ Z} ℓ(w(S\i, z_{S\i}^{t-1}), x_i, y_i, z),    ∀i ∈ S.    (10)

After updating the latent values for all training examples, the model w is retrained by optimizing the resulting bound.

Now let us construct a G-MM bias function that mimics the behavior of LOO-MIL. Recall from (9) that each bound b ∈ B_t is associated with a joint latent configuration z(b) = (z_1, ..., z_n). We use the following bias function:

    g(b, w) = − Σ_{i ∈ S} ℓ(w(S\i, z_{S\i}^{t-1}), x_i, y_i, z_i).    (11)

Note that picking a bound according to (11) is equivalent to the LOO-MIL update rule of (10), except that in (11) only valid bounds are considered; that is, bounds that make at least η-progress.

For the general multi-fold case (i.e. K < n), the bias function can be derived similarly.
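To make the structure of (10) and (11) concrete, here is a rough Python sketch of the LOO-MIL-style bias for the binary latent SVM case. The feature map, the synthetic data, and in particular the crude stand-in used for retraining w(S\i, z_{S\i}) are our own assumptions; a faithful implementation would solve the convex bound (9) for each leave-one-out subset.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, z):
    # Hypothetical feature map for a binary latent SVM: example x carries one
    # candidate feature vector per latent value z, and phi(x, z) selects it.
    return x[z]

def hinge(w, x, y, z):
    # l(w, x, y, z) = max{0, 1 - y * w . phi(x, z)}, as in Section 4.3.
    return max(0.0, 1.0 - y * float(w @ phi(x, z)))

def train_on(examples, latents):
    # Stand-in for w(I, z_I): retraining on the examples in I with latents fixed.
    # A real implementation would optimize the convex bound (9); here we use a
    # simple class-mean direction purely to keep the sketch self-contained.
    return np.mean([y * phi(x, z) for (x, y), z in zip(examples, latents)], axis=0)

def loo_mil_bias(data, z_prev, z_cand):
    # Bias function (11): minus the sum of leave-one-out losses. Example i is
    # scored by a model trained on all other examples with their latents from
    # the previous iteration, evaluated at example i's candidate latent value.
    total = 0.0
    for i, (x, y) in enumerate(data):
        rest = [j for j in range(len(data)) if j != i]
        w_loo = train_on([data[j] for j in rest], [z_prev[j] for j in rest])
        total += hinge(w_loo, x, y, z_cand[i])
    return -total

# Tiny synthetic example: 6 examples, |Z| = 2 candidate 3-D feature vectors each.
data = [(rng.normal(size=(2, 3)), y) for y in (+1, -1, +1, -1, +1, -1)]
z_prev = [0] * len(data)
for z_cand in ([0] * 6, [1] * 6):
    print(z_cand, round(loo_mil_bias(data, z_prev, z_cand), 3))
```

In G-MM this bias is maximized only over the valid set B_t, which is what distinguishes the resulting algorithm from LOO-MIL itself: candidate configurations whose bounds do not make at least η-progress are never considered.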
5. Experiments

We evaluate G-MM and MM algorithms on k-means clustering and LS-SVM training on various datasets. Recall from (3) that the progress coefficient η defines the set of valid bounds B_t in each step. CCP and standard k-means bounds correspond to setting η = 1, thus taking maximally large steps towards a local minimum of the true objective.

5.1. k-means Clustering

We conduct experiments on four clustering datasets: Norm-25 (Arthur & Vassilvitskii, 2007), D31 (Veenman et al., 2002), Cloud (Arthur & Vassilvitskii, 2007), and GMM-200. See the references for details about these datasets. GMM-200 was created by us and has 10000 samples taken from a 2-D Gaussian mixture model with 200 mixture components (50 samples per component). All the mixture components have unit variance and their means are placed on a 70×70 square uniformly at random, while making sure the distance between any two centers is at least 2.5.

We compare results from three different initializations: forgy selects k training examples uniformly at random without replacement to define initial cluster centers, random partition assigns training samples to cluster centers randomly, and k-means++ uses the algorithm in (Arthur & Vassilvitskii, 2007). In each experiment we run the algorithm 50 times and report the mean, standard deviation, and the best objective value (4). Table 1 shows the results using k-means (hard-EM) and G-MM. We note that the variance of the solutions found by G-MM is typically smaller than that of k-means. Moreover, the best and the average solutions found by G-MM are always better than (or the same as) those found by k-means. This trend generalizes over different initialization schemes as well as different datasets.

Although random partition seems to be a bad initialization for k-means on all datasets, G-MM recovers from it. In fact, on the D31 and GMM-200 datasets, G-MM initialized by random partition performs better than when it is initialized by other methods (including k-means++). Also, the variance of the best solutions (across different initialization methods) in G-MM is smaller than that of k-means. These results suggest that the G-MM optimization is less sticky to initialization than k-means.

Figure 2 shows the effect of the progress coefficient on the quality of the solution found by G-MM. Different initialization schemes are color coded. The solid line indicates the average objective over 50 trials, the shaded area covers one standard deviation from the average, and the dashed line indicates the best solution over the 50 trials. Smaller progress coefficients allow for more extensive exploration and, hence, smaller variance in the quality of the solutions. On the other hand, when the progress coefficient is large, G-MM is more sensitive to initialization (i.e. is more sticky) and, thus, the quality of the solutions over multiple runs is more diverse. However, despite the greater diversity, the best solution is worse when the progress coefficient is large. G-MM reduces to k-means if we set the progress coefficient to 1 (i.e. the largest possible value).

                                     forgy                     random partition          k-means++
    dataset    k    opt. method     avg ± std       best       avg ± std        best     avg ± std      best
    Norm-25    25   k-means         1.9e5 ± 2e5     7.0e4      5.8e5 ± 3e5      2.2e5    5.3e3 ± 9e3    1.5
                    G-MM            9.7e3 ± 1e4     1.5        2.0e4 ± 0        2.0e4    4.5e3 ± 8e3    1.5
    D31        31   k-means         1.69 ± 0.03     1.21       52.61 ± 47.06    4.00     1.55 ± 0.17    1.10
                    G-MM            1.43 ± 0.15     1.10       1.21 ± 0.05      1.10     1.45 ± 0.14    1.10
    Cloud      50   k-means         1929 ± 429      1293       44453 ± 88341    3026     1237 ± 92      1117
                    G-MM            1465 ± 43       1246       1470 ± 8         1444     1162 ± 95      1067
    GMM-200    200  k-means         2.25 ± 0.10     2.07       11.20 ± 0.63     9.77     2.12 ± 0.07    1.99
                    G-MM            2.04 ± 0.09     1.90       1.85 ± 0.02      1.80     1.98 ± 0.06    1.89

Table 1. Comparison of G-MM and k-means on four clustering datasets and three initialization methods; forgy initializes cluster centers to random examples, random partition assigns each data point to a random cluster center, and k-means++ implements the algorithm from (Arthur & Vassilvitskii, 2007). The mean, standard deviation, and best objective values out of 50 random trials are reported. k-means and G-MM use the exact same initialization in each trial. G-MM consistently converges to better solutions.

Figure 2. Effect of the progress coefficient η (x-axis) on the quality of the solutions found by G-MM (y-axis) on two clustering datasets: (a) D31 and (b) Cloud. The quality is measured by the objective function in (4). Lower values are better. The average (solid line), the best (dashed line), and the variance (shaded area) over 50 trials are shown in the plots and different initializations are color coded.

5.2. Latent Structural SVM for Image Classification and Object Detection

We consider the problem of training an LS-SVM classifier on the mammals dataset (Heitz et al., 2009). The dataset contains images of six mammal categories with image-level annotation. Locations of the objects in these images are not provided and are therefore treated as latent variables in the model. Specifically, let x be an image and y be a class label (y ∈ {1, ..., 6} in this case), and let z be the latent location of the object in the image. We define φ(x, y, z) to be a feature function with 6 blocks, one block for each category. It extracts features from location z of image x, places them in the y-th block of the output, and fills the rest with zeros. We use the following multi-class classification rule:

    y(x) = argmax_{y,z} w · φ(x, y, z),    w = (w_1, ..., w_6).    (12)

In this experiment we use a setup similar to that in (Kumar et al., 2012): we use Histogram of Oriented Gradients (HOG) for the image feature φ, and the 0-1 classification loss for ∆. We set λ = 0.4 in (8). We report 5-fold cross-validation performance. Three initialization strategies are considered for the latent object locations: image center, top-left corner, and random locations. The first is a reasonable initialization since most objects are at the center in this dataset; the second initialization strategy is somewhat adversarial.

We try a stochastic as well as a deterministic bound construction method. For the stochastic method, in each iteration t we uniformly sample a subset of examples S_t from the training set, and update their latent variables using z_i^t = argmax_{z_i} w_{t-1} · φ(x_i, y_i, z_i). Other latent variables are kept the same as in the previous iteration. We increase the size of S_t across iterations.

For the deterministic method, we use the bias function that we described in Section 4.3. This is inspired by the multi-fold MIL idea (Cinbis et al., 2016) and is shown to reduce stickiness to initialization, especially in high dimensions. We set the number of folds to K = 10 in our experiments.

Table 2 shows results on the mammals dataset. Both variants of G-MM consistently outperform CCP in terms of training objective and test error. We observed that CCP rarely updates the latent locations, under all initializations. On the other hand, both variants of G-MM significantly alter the latent locations, thereby avoiding the local minima close to the initialization. Figure 3 visualizes this for the top-left initialization. Since objects rarely occur at the top-left corner in the mammals dataset, a good model is expected to significantly update the latent locations. Averaged over five cross-validation folds, about 90% of the latent variables were updated in G-MM after training whereas this measure was 2.4% for CCP. This is consistent with the better training objectives and test errors of G-MM.
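A sketch of the stochastic bound-construction step described above. The subset-growth schedule and the array layout of the features are assumptions on our part; in the full algorithm the resulting latent configuration must still pass the validity test (3) before the bound is used.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_latent_update(w, feats, z_prev, t, growth=0.1):
    # feats[i] has shape (|Z|, d): one feature vector phi(x_i, y_i, z) per candidate
    # latent value z of example i. In iteration t we re-impute the latents of a
    # random subset S_t and keep every other latent at its previous value.
    n = len(feats)
    size = min(n, max(1, int(round(growth * t * n))))   # |S_t| grows across iterations
    subset = rng.choice(n, size=size, replace=False)
    z = list(z_prev)
    for i in subset:
        # z_i^t = argmax_z  w . phi(x_i, y_i, z)  (the CCP choice, applied to S_t only)
        z[i] = int(np.argmax(feats[i] @ w))
    return z

# Toy usage: 8 training examples, 4 candidate latent values each, 5-D features.
feats = [rng.normal(size=(4, 5)) for _ in range(8)]
w = rng.normal(size=5)
z = [0] * len(feats)
for t in range(1, 4):
    z = stochastic_latent_update(w, feats, z, t)
    print(t, z)
```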
                        center                       top-left                      random
    Opt. Method         objective     test error     objective     test error      objective     test error
    CCP                 1.21 ± 0.03   22.9 ± 9.7     1.35 ± 0.03   42.5 ± 4.6      1.47 ± 0.03   31.8 ± 2.6
    G-MM random         0.79 ± 0.03   17.5 ± 3.9     0.91 ± 0.02   31.4 ± 10.1     0.85 ± 0.03   19.6 ± 9.2
    G-MM biased         0.64 ± 0.02   16.8 ± 3.2     0.70 ± 0.02   18.9 ± 5.0      0.65 ± 0.02   14.6 ± 5.4

Table 2. LS-SVM results on the mammals dataset (Heitz et al., 2009). We report the mean and standard deviation of the training objective (8) and test error over five folds. Three strategies for initializing latent object locations are tried: image center, top-left corner, and random location. "G-MM random" uses random bounds, and "G-MM biased" uses a bias function inspired by multi-fold MIL (Cinbis et al., 2016). Both variants consistently and significantly outperform the CCP baseline.

Figure 3. Latent location changes after learning, in relative image coordinates, for all five cross-validation folds, for the top-left initialization on the mammals dataset. Left to right: CCP, "G-MM random", "G-MM biased" (K=10). Each cross represents a training image; cross-validation folds are color coded differently. Averaged over five folds, CCP only alters 2.4% of all latent locations, leading to very bad performance. "G-MM random" and "G-MM biased" alter 86.2% and 93.6% on average, respectively, and perform much better.

5.3. Latent Structural SVM for Scene Recognition

We implement the reconfigurable model of (Parizi et al., 2012) (called RBoW) to do scene classification on the MIT-Indoor dataset (Quattoni & Torralba, 2009), which has images from 67 indoor scene categories. We segment each image into a 10×10 regular grid and treat the grid cells as image regions. We train a model with 200 shared parts. Any part can be used to describe the data in a region. We use the activations of the 4096 neurons at the penultimate layer of the pre-trained hybrid ConvNet of (Zhou et al., 2014) to extract features from image regions and use PCA to reduce the dimensionality of the features to 240.

The RBoW model is an instance of LS-SVM models. The latent variables are the assignments of parts to image regions and the output structure is the multi-valued category label prediction. LS-SVMs are known to be sensitive to initialization (a.k.a. the stickiness issue). To cope with this issue, (Parizi et al., 2012) uses a generative version of the model to initialize the training of the discriminative model. Generative models are typically less sticky but perform worse in practice. To validate the hypothesis regarding stickiness of LS-SVMs we train models with several initialization strategies.

Initializing training entails the assignment of parts to image regions, i.e. setting the z_i's in (9) to define the first bound. To this end we first discover 200 parts that capture discriminative features in the training data. We then run graph cut on each training image to obtain part assignments to image regions. Each cell in the 10×10 image grid is a node in the graph. Two nodes in the graph are connected if their corresponding cells in the image grid are next to each other. Unary terms in the graph cut are the dot product scores between the feature vector extracted from an image region and a part filter, plus the corresponding region-to-part assignment score. Pairwise terms in the graph cut implement a Potts model that encourages coherent labelings. Specifically, the penalty of labeling two neighboring nodes differently is λ, and it is zero otherwise. λ controls the coherency of the initial assignments. We experiment using λ ∈ {0, 0.25, 0.5, 1}. We also experiment with random initialization, which corresponds to assigning the z_i's randomly. This is the simplest form of initialization and does not require discovering initial part filters.

We do G-MM optimization using both random and biased bounds. For the latter we use a bias function g(b, w) that measures coherence of the labeling from which the bound was constructed. Recall from (9) that each bound b ∈ B_t corresponds to a labeling of the image regions. We denote the labeling corresponding to the bound b by z(b) = (z_1, ..., z_n), where z_i = (z_{i,1}, ..., z_{i,100}) specifies part assignments for all the 100 regions in the i-th image. Also, let E(z_i) denote a function that measures coherence of the labeling z_i. In fact, E(z_i) is the Potts energy function on a graph whose nodes are z_{i,1}, ..., z_{i,100}. The graph respects a 4-connected neighborhood system (recall that z_{i,r} corresponds to the r-th cell in the 10×10 grid defined on the i-th image). If two neighboring nodes z_{i,r} and z_{i,s} get different labels, the energy E(z_i) increases by 1. For biased bounds we use the following bias function, which favors bounds that correspond to more coherent labelings:

    g(b, w) = − Σ_{i=1}^n E(z_i),    z(b) = (z_1, ..., z_n).    (13)
                        Random                  λ = 0.00          λ = 0.25          λ = 0.50          λ = 1.00
    Opt. Method         Acc.% ± std    O.F.     Acc.%    O.F.     Acc.%    O.F.     Acc.%    O.F.     Acc.%    O.F.
    CCP                 41.94 ± 1.1    15.20    40.88    14.81    43.99    14.77    45.60    14.72    46.62    14.70
    G-MM random         47.51 ± 0.7    14.89    43.38    14.71    44.41    14.70    47.12    14.66    49.88    14.58
    G-MM biased         49.34 ± 0.9    14.55    44.83    14.63    48.07    14.51    53.68    14.33    56.03    14.32

Table 3. Performance of LS-SVM trained with CCP and G-MM on the MIT-Indoor dataset. We report classification accuracy (Acc.%) and the training objective value (O.F.). Columns correspond to different initialization schemes. "Random" assigns random parts to regions. λ controls the coherency of the initial part assignments: λ = 1 (λ = 0) corresponds to the most (least) coherent case. "G-MM random" uses random bounds and "G-MM biased" uses the bias function of (13). η = 0.1 in all the experiments. Coherent initializations lead to better models in general, but they require discovering good initial parts. G-MM outperforms CCP, especially with random initialization. "G-MM biased" performs the best.

Table 3 compares performance of models trained using CCP and G-MM with random and biased bounds. For G-MM with random bounds we repeat the experiment five times and report the average over these five trials. Also, for random initialization, we do five trials using different random seeds and report the mean and standard deviation of the results. G-MM does better than CCP under all initializations. It also converges to a solution with lower training objective value than CCP. Our results show that picking bounds uniformly at random from the set of valid bounds is slightly (but consistently) better than committing to the CCP bound. We get a remarkable boost in performance when we use a reasonable prior over bounds (i.e. the bias function of (13)). With λ = 1, CCP attains an accuracy of 46.6%, whereas G-MM attains 49.9% and 56.0% accuracy with random and biased bounds, respectively. Moreover, G-MM is less sensitive to initialization.

5.4. Running Time

G-MM bounds make a fraction of the progress that can be made in each bound construction step. Therefore, we would expect G-MM to require more steps to converge when compared to MM. We report the number of iterations in MM and G-MM in Table 4. The results for G-MM depend on the value of the progress coefficient η, which is set to match the experiments in the paper; η = 0.02 for the clustering experiment (Section 5.1) and η = 0.10 for the scene recognition experiment (Section 5.3).

The overhead of the bound construction step depends on the application. For example, in the scene recognition experiment, optimizing the bounds takes orders of magnitude longer than sampling them (a couple of hours vs. a few seconds). In the clustering experiment, however, the optimization step is solved in closed form whereas sampling a bound involves performing a random walk on a large graph, which can take a couple of minutes to run.

    experiment            setup          MM              G-MM random      G-MM biased
    scene recognition     λ = 0.0        145             107              87
                          λ = 1.0        65              69               138
    data clustering       forgy          35.76 ± 7.8     91.52 ± 4.4
    (GMM-200)             rand. part.    114.98 ± 12.9   241.89 ± 2.1
                          k-means++      32.92 ± 5.8     80.78 ± 2.9
    data clustering       forgy          37.18 ± 12.1    87.68 ± 15.4
    (Cloud)               rand. part.    65.14 ± 18.7    138.64 ± 5.9
                          k-means++      21.3 ± 4.1      44.12 ± 10.7

Table 4. Comparison of the number of iterations that MM and G-MM take to converge in the scene recognition and the data clustering experiments with different initializations. The numbers reported for the clustering experiments are the average and standard deviation over 50 trials.

6. Conclusion

We introduced Generalized Majorization-Minimization (G-MM), an iterative bound optimization framework that generalizes Majorization-Minimization (MM). Our key observation is that MM enforces an overly-restrictive touching constraint when constructing bounds, which is inflexible and can lead to sensitivity to initialization. By adopting a different measure of progress, G-MM relaxes this constraint, allowing more freedom in bound construction. Specifically, we propose deterministic and stochastic ways of selecting bounds from a set of valid ones. This generalized bound construction process tends to be less sensitive to initialization, and enjoys the ability to directly incorporate rich application-specific priors and constraints, without modifications to the objective function. In experiments with several latent variable models, G-MM algorithms are shown to significantly outperform their MM counterparts.

Future work includes applying G-MM to a wider range of problems and theoretical analysis, such as convergence rates. We also note that, although G-MM is more conservative than MM in moving towards nearby local minima, it still requires making progress in every step. Another interesting research direction is to enable G-MM to occasionally pick bounds that do not make progress with respect to the solution of the previous bound, thereby making it possible to get out of local minima, while still maintaining the convergence guarantees of the method.
References

Arthur, D. and Vassilvitskii, S. K-means++: The advantages of careful seeding. In Symposium on Discrete Algorithms (SODA), 2007.
Azizpour, H., Arefiyan, M., Naderi, S., and Carlsson, S. Spotlight the negatives: A generalized discriminative latent model. In British Machine Vision Conference (BMVC), 2015.
Cinbis, R. G., Verbeek, J., and Schmid, C. Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, January 2016.
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
Ganchev, K., Graça, J., Gillenwater, J., and Taskar, B. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 11:2001–2049, 2010.
Gunawardana, A. and Byrne, W. Convergence theorems for generalized alternating minimization procedures. The Journal of Machine Learning Research, 6:2049–2073, 2005.
Heitz, G., Elidan, G., Packer, B., and Koller, D. Shape-based object localization for descriptive classification. International Journal of Computer Vision (IJCV), 2009.
Hunter, D., Lange, K., and Yang, I. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 2000.
Joachims, T., Finley, T., and Yu, C.-N. J. Cutting-plane training of structural SVMs. Machine Learning, 2009.
Kumar, M. P., Packer, B., and Koller, D. Modeling latent variable uncertainty for loss-based learning. In International Conference on Machine Learning (ICML), 2012.
Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
MacQueen, J. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, 1967.
Neal, R. and Hinton, G. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 1998.
Parizi, S. N., Oberlin, J., and Felzenszwalb, P. Reconfigurable models for scene recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
Pietra, S. D., Pietra, V. D., and Lafferty, J. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
Ping, W., Liu, Q., and Ihler, A. Marginal structured SVM with hidden variables. In International Conference on Machine Learning (ICML), 2014.
Pirsiavash, H. and Ramanan, D. Parsing videos of actions with segmental grammars. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
Quattoni, A. and Torralba, A. Recognizing indoor scenes. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
Rastegari, M., Hajishirzi, H., and Farhadi, A. Discriminative and consistent similarities in instance-level multiple instance learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Ries, C., Richter, F., and Lienhart, R. Towards automatic bounding box annotations from weakly labeled images. Multimedia Tools and Applications, 2015.
Salakhutdinov, R., Roweis, S., and Ghahramani, Z. On the convergence of bound optimization algorithms. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 509–516. Morgan Kaufmann Publishers Inc., 2002.
Song, H. O., Girshick, R., Jegelka, S., Mairal, J., Harchaoui, Z., and Darrell, T. On learning to localize objects with minimal supervision. In International Conference on Machine Learning (ICML), 2014.
Tang, M., Ayed, I. B., and Boykov, Y. Pseudo-bound optimization for binary energies. In European Conference on Computer Vision (ECCV). Springer, 2014.
Tang, M., Marin, D., Ayed, I. B., and Boykov, Y. Kernel cuts: Kernel and spectral clustering meet regularization. International Journal of Computer Vision (IJCV). Springer, 2019.
Tao, P. D. Convex analysis approach to DC programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.
Veenman, C. J., Reinders, M., and Backer, E. A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2002.
Xing, E. P., Jordan, M. I., Russell, S. J., and Ng, A. Y. Distance metric learning with application to clustering with side-information. In Becker, S., Thrun, S., and Obermayer, K. (eds.), Advances in Neural Information Processing Systems (NIPS), pp. 521–528. MIT Press, 2002.
Yu, C.-N. Transductive learning of structural SVMs via prior knowledge constraints. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1367–1376, 2012.
Yu, C.-N. J. and Joachims, T. Learning structural SVMs with latent variables. In International Conference on Machine Learning (ICML). ACM, 2009.
Yuille, A. and Rangarajan, A. The concave-convex procedure. Neural Computation, 2003.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems (NIPS), 2014.
