Differentiable Deep Clustering With Cluster Size Constraints
then one could train a supervised model by minimizing over θ an empirical risk of the form

    R(θ, Y) = (1/n) Σ_{i=1}^n ℓ(c_θ(x_i), y_i).

Since we are considering an unsupervised setting, Y is not available. A solution is thus to jointly optimize the above criterion over θ and Y to learn both a class assignment and a classifier:

    (θ̂, Ŷ) ∈ argmin_{θ∈Θ, Y∈[1,K]^n} R(θ, Y).    (1)

An obvious computational difficulty is that this problem involves the discrete variable Y. Besides, some kind of regularization is required in this double optimization task to prevent trivial solutions; adding constraints on Y is crucial to prevent empty or overpopulated clusters. Joulin et al. (2010) propose a convex relaxation of (1) in the case of linear regression with the squared loss ℓ(u, v) = (u − v)² for binary problems (K = 2). In that case, the objective function is quadratic in Y and they use the standard semidefinite programming (SDP) relaxation for the matrix Y Y^⊤ to approximate a minimum.

A different approach is used by Chang et al. (2017), who recast the clustering problem as a binary classification problem: given two data points, do they belong to the same cluster? The resulting algorithm, Deep Adaptive Clustering (DAC), can be summarized as follows: each data point is mapped by f_θ to a vector in the unit ball of R^K, which represents its probability of belonging to each class. These probabilities are then compared with the cosine distance, which defines the similarity of the two points. Two points are assumed to belong to the same class if this similarity is large enough.

While the reconstruction loss of an auto-encoder (with encoder f_θ and decoder g_ϕ),

    ℓ_r^{ae}(θ) = min_ϕ Σ_{i=1}^n ‖x_i − g_ϕ(f_θ(x_i))‖²,    (2)

is the standard choice for the representation loss, these methods vary mostly in their choice of clustering loss, auto-encoder model, and optimization strategy (in particular to prevent trivial solutions).

Song et al. (2013) are among the earliest to learn a representation for clustering by tweaking the objective function of a standard auto-encoder. They formulate the problem as minimizing the combined loss (P) with the objective of k-means as the clustering loss:

    ℓ_c^{km}(θ) = min_{µ_1,...,µ_K ∈ R^p} Σ_{i=1}^n min_{j∈[1,K]} ‖f_θ(x_i) − µ_j‖².    (3)

To optimize this objective over the encoder parameters θ, the decoder parameters ϕ and the cluster centers µ_1, ..., µ_K, they alternate one epoch of stochastic gradient descent over (θ, ϕ) with one update of the cluster centers and assignments.
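To make the structure of such combined objectives concrete, here is a minimal NumPy sketch of the two terms above for a fixed embedding; the functions encode and decode are placeholders standing in for f_θ and g_ϕ (this is our illustration, not code from the paper).

    import numpy as np

    def reconstruction_loss(X, encode, decode):
        # Eq. (2): squared reconstruction error of the auto-encoder.
        Z = encode(X)            # embeddings f_theta(x_i), shape (n, p)
        X_rec = decode(Z)        # reconstructions g_phi(f_theta(x_i))
        return np.sum((X - X_rec) ** 2)

    def kmeans_clustering_loss(Z, centers):
        # Eq. (3): each point is charged the squared distance to its closest center.
        # Z: (n, p) embeddings; centers: (K, p) cluster centers.
        sq_dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, K)
        return sq_dists.min(axis=1).sum()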
While most state-of-the-art methods rely on clustering objectives that are strongly linked to k-means, Joint Unsupervised LEarning (JULE) (Yang et al., 2016) uses a clustering loss based on "agglomerative clustering". Starting from clusters consisting of datapoints, the training alternates between a few steps of agglomerative clustering, i.e., merging similar clusters, and a backward pass during which the network parameters are updated to minimize the clustering loss. Although this method has a more flexible geometry, it requires building an affinity graph of the dataset after each update and is thus computationally heavy.
Xie et al. (2016) propose Deep Embedded Clustering (DEC), which starts with a pre-training phase using only the reconstruction loss ℓ_r(θ, ϕ) and then improves the clustering ability of the representation by optimizing f_θ in a self-supervised manner. Their clustering loss is the Kullback-Leibler divergence between the soft-assignments q_ik of each point i to each cluster k and a target distribution obtained by squaring (and renormalizing) the soft-assignments, which pushes the embedding towards harder assignments. There are several variants of DEC using more sophisticated auto-encoders and training techniques, such as Guo et al. (2017). The DEPICT algorithm (Dizaji et al., 2017) similarly minimizes a KL divergence to sharpen the assignments, but also introduces a classifier h_β that outputs a probability distribution h_β(z) over the K classes (typically a neural network with a softmax activation at the last layer). The clustering loss thus measures how well the data can be discriminated into K different classes.
The clustering loss in the Deep Clustering Network (DCN) (Yang et al., 2017) is the objective of k-means in the representation space. However, minimizing the total loss L jointly over θ, ϕ, µ (cluster centers) and π (cluster assignments) is challenging. Thus, Yang et al. (2017) alternate optimization in (θ, ϕ) for fixed (µ, π), which becomes a variant of auto-encoder training, and in (µ, π) for fixed (θ, ϕ). The Deep k-means (DKM) algorithm (Fard et al., 2018) uses the same loss as DCN but relaxes the assignment problem by replacing the cluster assignments with soft-assignments in the k-means objective. This results in a clustering loss that can be jointly minimized over θ and µ using stochastic gradient descent (SGD), and leads to state-of-the-art performance in deep clustering (Fard et al., 2018). The latter is the approach closest to ours, as we also propose a fully differentiable objective based on k-means.
Clustering and optimal transport. There is a link between k-means clustering and optimal transport, which was first noticed by Pollard (1982) and studied in more detail by Canas and Rosasco (2012). Roughly speaking, optimal transport is equivalent to a constrained formulation of k-means in which the cluster sizes are prescribed. This framework makes sense in a setting where the proportion of each class in a dataset is known, but no information is available at the individual level. Cuturi and Doucet (2014) introduced an entropic regularization of that problem which allows for an efficient solver.

Contributions. Following Cuturi and Doucet (2014), we exploit the connection between optimal transport and k-means, and rely on entropic regularization to derive a fully differentiable clustering loss that can be used in (P) and directly optimized with SGD. We give an insight on the effect of regularization in the cluster assignment problem, and show that the soft k-means loss introduced by Fard et al. (2018) can be interpreted as an optimal transport loss with only one marginal constraint. The constraints on cluster sizes that naturally occur with optimal transport allow us to enforce a prior on cluster sizes without relying on additional terms in the optimization problem. This leads to better clustering performance on benchmark datasets.

2 Clustering with Optimal Transport

Cluster assignment as an optimal transport problem. Consider n sample points {x_1, ..., x_n} ⊂ R^d embedded in the representation space via f_θ : R^d → R^p, and K clusters in that representation space with centers {µ_1, ..., µ_K} ⊂ R^p. We want to assign samples to clusters so that:

(i) each sample is assigned to exactly one cluster,

(ii) each cluster k = 1, ..., K contains exactly n_k points,

(iii) the total distance (in the representation space) between cluster centers and their assigned samples is minimal.

The mathematical formulation of the above problem reads as follows:

    ℓ_c^{OT} = min_{π ∈ {0, 1/n}^{n×K}} Σ_{i=1}^n Σ_{k=1}^K ‖f_θ(x_i) − µ_k‖² π_{k,i}    (OT)

    s.t.  π 1_K = (1/n) 1_n,    (c1)
          π^⊤ 1_n = w,    (c2)

where w = (n_1/n, ..., n_K/n) ∈ Δ_K is the vector of cluster proportions.

The above problem is known as optimal transport between the discrete measures α := (1/n) Σ_{i=1}^n δ_{f_θ(x_i)} and β := Σ_{k=1}^K (n_k/n) δ_{µ_k}. If we remove the constraint on cluster sizes (c2), it boils down to the objective function of the k-means problem with cluster centers {µ_1, ..., µ_K} (Pollard, 1982).
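For a fixed embedding and fixed centers, (OT) is a small linear program that can be solved exactly, for instance with the POT (Python Optimal Transport) package; the sketch below is our illustration under that assumption (POT is not mentioned in the paper).

    import numpy as np
    import ot  # POT: Python Optimal Transport (assumed installed)

    def exact_assignment(Z, centers, w):
        # Z: (n, p) embeddings f_theta(x_i); centers: (K, p); w: cluster proportions, sums to 1.
        n = Z.shape[0]
        C = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # cost C_ik = ||f_theta(x_i) - mu_k||^2
        a = np.full(n, 1.0 / n)                                    # marginal (c1): each sample carries mass 1/n
        pi = ot.emd(a, w, C)                                       # exact transport plan with column marginal w, i.e. (c2)
        # When the n_k are integers, the optimal plan has entries in {0, 1/n}:
        # sample i is assigned to cluster k iff pi[i, k] == 1/n.
        return pi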
Algorithm 1  Sinkhorn's Algorithm for Reg. OT
 1: Parameters: ε; n_iter
 2: Input: (f_θ(x_i))_{i=1...n}; (µ_k)_{k=1...K}; w
 3: C_ik = ‖f_θ(x_i) − µ_k‖²  ∀ i, k
 4: M = exp(−C/ε)
 5: Initialize b ← 1_K
 6: for j = 1, 2, ..., n_iter do
 7:     a ← ((1/n) 1_n) ⊘ (M b)        (⊘: elementwise division)
 8:     b ← w ⊘ (M^⊤ a)
 9: Return π_ik = a_i M_ik b_k  ∀ i, k

Algorithm 2  OT-based Deep Clustering
 1: Parameters: K, n_pre-train, n_epochs, m
 2: Input: dataset (x_1, ..., x_n), cluster proportions w
 3: Initialize f_θ (encoder) and g_ϕ (decoder) with random weights
 4: Initialize centers µ with k-means on the embedded images (f_θ(x_1), ..., f_θ(x_n))
 5: for i = 1 to n_pre-train do  (pre-training)
 6:     for j = 1 to n/m do
 7:         D_j = (x_1^(j), ..., x_m^(j)) batch of size m
 8:         Compute loss ℓ_r^{ae}(θ, ϕ)
 9:         Update θ, ϕ with a gradient step
10: for i = 1 to n_epochs do  (training)
11:     for j = 1 to n/m do
12:         D_j = (x_1^(j), ..., x_m^(j)) batch of size m
13:         Compute π(f_θ(D_j), µ, w) with Sinkhorn's algorithm
14:         Compute loss ℓ_r^{ae}(θ, ϕ) + ℓ_c^{OTε}(θ, µ)
15:         Update θ, ϕ and µ with a gradient step
16: for i = 1 to n do  (final clustering)
17:     Assign x_i to cluster k_i = argmin_k ‖f_θ(x_i) − µ_k‖²
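A direct NumPy transcription of Algorithm 1 (our own sketch; variable names are chosen for readability):

    import numpy as np

    def sinkhorn_assignment(Z, centers, w, eps=1e-2, n_iter=100):
        # Z: (n, p) embeddings f_theta(x_i); centers: (K, p); w: target cluster proportions (sums to 1).
        n = Z.shape[0]
        C = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # C_ik = ||f_theta(x_i) - mu_k||^2
        M = np.exp(-C / eps)
        b = np.ones(C.shape[1])
        for _ in range(n_iter):
            a = (1.0 / n) / (M @ b)     # rescale rows to match marginal (c1)
            b = w / (M.T @ a)           # rescale columns to match marginal (c2)
        return a[:, None] * M * b[None, :]   # pi_ik = a_i * M_ik * b_k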
Entropic regularization of optimal transport. Solving optimal transport is computationally expensive, as it requires solving a large linear program; a common workaround in the literature is to regularize the problem with entropy (Cuturi, 2013). The regularized problem then reads as follows:

    ℓ_c^{OTε} = min_{π ∈ [0,1]^{n×K}} Σ_{i=1}^n Σ_{k=1}^K ‖f_θ(x_i) − µ_k‖² π_{k,i} + ε π_{k,i} (log(π_{k,i}) − 1)    (OTε)

    s.t.  π 1_K = (1/n) 1_n,    (c1)
          π^⊤ 1_n = w.    (c2)

The addition of entropy makes it possible to solve the problem with a much faster iterative method, Sinkhorn's algorithm, whose iterations are summarized in Algorithm 1. Although this fast solver is the main reason why regularized optimal transport became routinely used in machine learning tasks, recent papers have exploited the fact that it also leads to a differentiable loss, whose gradients can easily be computed by backpropagation through the Sinkhorn iterations (Genevay et al., 2018; Salimans et al., 2018).
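To illustrate this differentiability, here is a PyTorch sketch (ours, not the authors' code) in which the Sinkhorn iterations are unrolled so that autograd provides gradients of ℓ_c^{OTε} with respect to both the embeddings and the centers; for very small ε a log-domain implementation would be preferable for numerical stability.

    import torch

    def ot_clustering_loss(emb, centers, w, eps=1e-2, n_iter=50):
        # emb: (n, p) tensor of embeddings f_theta(x_i); centers: (K, p) tensor; w: (K,) tensor of cluster proportions.
        n = emb.shape[0]
        C = torch.cdist(emb, centers) ** 2      # C_ik = ||f_theta(x_i) - mu_k||^2
        M = torch.exp(-C / eps)
        b = torch.ones_like(w)
        for _ in range(n_iter):                 # differentiable Sinkhorn iterations
            a = (1.0 / n) / (M @ b)
            b = w / (M.t() @ a)
        pi = a[:, None] * M * b[None, :]
        # Objective of (OT_eps): transport cost plus entropic term.
        return (pi * C).sum() + eps * (pi * (torch.log(pi) - 1)).sum()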
It is known that a linear program attains its optimum at a vertex of the constraint polytope, which is why the optimal transport problem is equivalent to its relaxation to the simplex. The addition of entropy moves the solution away from the optimal vertex, towards the center of the constraint polytope, thus yielding smoother assignments (Peyré et al., 2019). This is formalized in the proposition below.

Proposition 1. Consider the regularized optimal transport problem (OTε) and its optimal assignment π_ε.

When ε → 0:

• π_ε → π (the solution of (OT)),

• ℓ_c^{OTε} → ℓ_c^{OT}.

When ε → ∞:

• π_ε → (1/n) 1_n w^⊤ (i.e., each point is assigned to all clusters according to the global proportions w),

• ℓ_c^{OTε} → (1/n) Σ_{i=1}^n Σ_{k=1}^K w_k ‖f_θ(x_i) − µ_k‖².

Proof. The proposition is an adaptation of Theorem 1 in (Genevay et al., 2018) to our clustering setting.

The choice of ε is crucial: as ε gets smaller – i.e., as we get closer to 'true' optimal transport – Sinkhorn's algorithm requires more iterations to converge (see e.g. Peyré et al., 2019), so that a better approximation of optimal transport comes at a heavy computational price. However, it has recently been proved that when approximating optimal transport from samples – which is typically the case in machine learning – it is actually beneficial to use an ε that is not too small, in order to avoid the curse of dimensionality from which optimal transport suffers (Genevay et al., 2019).
Link with soft-assignments in k-means. The optimal transport formulation includes two marginal constraints, one being that each sample is assigned to exactly one cluster and the other being the cluster sizes. The latter constraint can be omitted to obtain an objective which is that of k-means. When regularizing the optimal transport problem with only one marginal constraint, we thus get a differentiable k-means objective.
Proposition 2. Consider the variant of entropy-regularized optimal transport with only one marginal constraint (i.e., no prior on cluster sizes):

    min_{π ∈ [0,1]^{n×K}} Σ_{i=1}^n Σ_{k=1}^K ‖f_θ(x_i) − µ_k‖² π_{k,i} + ε π_{k,i} (log(π_{k,i}) − 1)

    s.t.  π 1_K = (1/n) 1_n.

Then the optimal assignment π* is given by

    π*_{k,i} = exp(−‖f_θ(x_i) − µ_k‖²/ε) / ( n Σ_{k'=1}^K exp(−‖f_θ(x_i) − µ_{k'}‖²/ε) ).    (7)
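Formula (7) is a row-wise softmax of the negative squared distances scaled by ε, divided by n; the following NumPy sketch (ours) makes this explicit.

    import numpy as np

    def soft_assignment(Z, centers, eps):
        # Eq. (7): pi*_{k,i} = exp(-||z_i - mu_k||^2/eps) / (n * sum_k' exp(-||z_i - mu_k'||^2/eps)).
        sq_dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, K)
        logits = -sq_dists / eps
        logits -= logits.max(axis=1, keepdims=True)                       # for numerical stability
        expd = np.exp(logits)
        return expd / (Z.shape[0] * expd.sum(axis=1, keepdims=True))      # each row sums to 1/n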
Plugging these soft-assignments into the transport cost recovers the soft k-means loss of Fard et al. (2018), which can therefore be interpreted as an optimal transport loss with only one marginal constraint. In contrast, we use as clustering loss ℓ_c^{OTε}, the objective function of regularized optimal transport, which allows us to:

(i) enforce a prior on the cluster proportions,

(ii) obtain a loss that is differentiable with respect to both the cluster centers (µ_k)_k and the embedding parameters θ.

The clustering problem (P) becomes:

    min_{θ,ϕ,µ}  ℓ_r^{ae}(θ, ϕ) + ℓ_c^{OTε}(θ, µ).    (11)
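As a concrete (hypothetical) rendering of one training step of Algorithm 2 for problem (11) in PyTorch: encoder and decoder stand for f_θ and g_ϕ, centers is the tensor of µ_k, and ot_clustering_loss is the sketch given earlier; all names are ours, not the authors'. For simplicity the plan is differentiated through the unrolled Sinkhorn iterations, as discussed above.

    import torch

    # encoder, decoder: torch.nn.Module for f_theta and g_phi (decoder output shaped like x);
    # centers: (K, p) tensor with requires_grad=True; w: (K,) cluster proportions; x: a batch of inputs.
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()) + [centers], lr=1e-3)

    def training_step(x):
        z = encoder(x)                              # embeddings f_theta(x_i)
        recon = ((decoder(z) - x) ** 2).sum()       # reconstruction loss, Eq. (2)
        clust = ot_clustering_loss(z, centers, w)   # entropic OT clustering loss
        loss = recon + clust                        # objective (11)
        optimizer.zero_grad()
        loss.backward()                             # gradients w.r.t. theta, phi and mu
        optimizer.step()
        return loss.item()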
3 Experiments

Experimental setup. The encoder is a convolutional network with layers of respective depths 32, 64, 128, respective kernel sizes 5, 5, 3, and a common stride of 2, followed by a fully connected layer to a latent space of dimension 10. In both cases (MNIST and CIFAR10) we use batches of size m = 300. The gradient updates are made with the Adam algorithm and the standard learning rate from TensorFlow (0.001) with step decay, as there should not be parameter tuning in an unsupervised setting. The algorithm used for training is summarized in Algorithm 2. We also run k-means on raw pixels, to give an idea of how much structure is induced in the data by the embedding f_θ. Following usual guidelines regarding regularization for optimal transport, we set ε = 10^−2, which gives a good enough approximation of optimal transport without requiring too many Sinkhorn iterations. We show in Fig. 2 that the final performance is robust to the choice of ε as long as it is not too large.
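One possible PyTorch rendering of the encoder described above (input channels, padding and activations are not specified in the text and are assumptions; shown for 3-channel 32×32 inputs):

    import torch.nn as nn

    encoder = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(128 * 4 * 4, 10),   # spatial size 32 -> 16 -> 8 -> 4, then latent dimension 10
    )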
Evaluation. After each epoch, we evaluate the different methods by computing the accuracy obtained when matching clusters to classes through the following formula:

    accuracy = max_{σ∈S} (1/n) Σ_{i=1}^n 1_{y_i = σ(k_i)},

where y_i and k_i are respectively the class label and the cluster index associated to x_i, and S is the set of permutations of {1, ..., K}. For the 'AE + k-means' method, the cluster assignment is made by running k-means on the embedded data, while for both 'soft k-means' and 'OT', since these methods also learn the cluster centers (µ_1, ..., µ_K), we assign the point x_i to the cluster k_i = argmin_k ‖f_θ(x_i) − µ_k‖². The optimal matching between clusters and classes is computed with the Hungarian algorithm, as is standard in the literature.
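The accuracy above can be computed with the Hungarian algorithm as implemented in SciPy; a small self-contained sketch (ours):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(y_true, y_pred, K):
        # Contingency table: cont[c, y] counts points with cluster index c and class label y.
        cont = np.zeros((K, K), dtype=np.int64)
        for c, y in zip(y_pred, y_true):
            cont[c, y] += 1
        # The Hungarian algorithm finds the cluster-to-class permutation maximizing the matches.
        rows, cols = linear_sum_assignment(-cont)
        return cont[rows, cols].sum() / len(y_true)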
The evolution of accuracy during training for all three methods (auto-encoder + k-means, soft k-means, optimal transport) is plotted in Fig. 1. The curves are averaged over 50 runs. The final accuracies are reported in Table 1, along with the standard deviations. We can see that the optimal transport loss achieves superior accuracy and, unlike soft k-means, mostly does not need to rely on pre-training to reach good performance; without pre-training, soft k-means is only slightly above the baseline. To assess the statistical significance of the superiority of optimal transport in this framework, we further run a Welch's t-test over the final accuracies of the 50 runs. Without pre-training, we find that optimal transport is significantly better than soft k-means (p-value of 0.0067 for CIFAR10 and 10^−12 for MNIST). With pre-training, optimal transport is still significantly better than soft k-means on MNIST at the 10% level (p-value of 0.10), but this is not the case for CIFAR10 (p-value of 0.33). Note that for both datasets, pre-training did not yield significantly better performance for optimal transport (p-values > 0.2), while it significantly improves soft k-means (p-values < 0.05).

Figure 1: Accuracy on MNIST (left) and CIFAR10 (right).

Table 1: Average accuracy from clustering on the CIFAR and MNIST datasets (over 50 runs), with standard deviation and max/min accuracy over the runs (second line); columns: k-means, AE + k-means, soft k-means, soft k-means (p), OT, OT (p). (p) means 'with pre-training'.

Fig. 2 displays the accuracy after 200 epochs for each method, as a function of ε. We see that the competitive advantage of OT over soft k-means is robust to the choice of ε as long as it is not too large. Incidentally, the methods with pre-training are more robust to large values of ε. Note that these curves are averaged over only 5 runs and thus cannot be regarded as statistically significant; they merely serve as evidence of the robustness of the method to the chosen parameter.
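Welch's t-test (a two-sample t-test that does not assume equal variances) is available in SciPy; for instance, with the 50 per-run final accuracies of two methods stored in arrays acc_ot and acc_skm (hypothetical names):

    from scipy.stats import ttest_ind

    # equal_var=False selects Welch's t-test (unequal variances).
    t_stat, p_value = ttest_ind(acc_ot, acc_skm, equal_var=False)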
4 Conclusion and discussion

In this paper we propose a new fully differentiable framework for deep clustering, based on regularized optimal transport, which generalizes the recently proposed approach of Fard et al. (2018) based on soft k-means. Its main advantage over competing methods is its ability to naturally enforce a prior on class proportions. This significantly improves performance on datasets with well balanced classes, without relying on pre-training of the embedding.

In our experiments we observed a benefit over soft k-means in situations where the classes are balanced. An interesting direction to explore is to extend the application of our method to cases where the prior knowledge on cluster sizes is not uniform. This may be relevant, for example, when an expert provides a rough estimate of the proportions of the different classes, such as the proportion of cancer cells in a histopathological image. While our formulation lends itself naturally to non-uniform cluster proportions, we observed in preliminary experiments that it performs poorly if no care is taken to ensure that the cluster size constraint (w in Algorithm 2) is coherent with the set of cluster centers (the vectors µ_k in Algorithm 2). We found in particular that this is often poorly achieved by the initialization of the centers via k-means (step 4 in Algorithm 2). For instance, consider the case where we have two clusters – say images of ones and images of twos – and we know that we should have 20% of the former and 80% of the latter. If the k-means initialization of the centers puts the first center in the middle of the twos and the second center in the middle of the ones, the algorithm will try to enforce an 80% proportion of ones and a 20% proportion of twos. Generally speaking, to ensure that we are enforcing the cluster proportions properly, some sort of matching has to be done before the learning phase between the indexes of the clusters and the indexes of the classes. This could be done in a supervised way, by using one example from each class to initialize the centers. This extension falls in the framework of one-shot learning, with additional knowledge on the class proportions.

Another extension of our method would be to relax the strict constraint on cluster proportions to a soft constraint, using for example unbalanced optimal transport with a relaxed version of the Sinkhorn algorithm which penalizes the marginal constraints instead of enforcing them exactly (Chizat et al., 2018). This may be particularly relevant when small batches are considered, as one would not expect the composition of each batch to perfectly reflect the overall composition.

Finally, we note that our formulation of the clustering problem with optimal transport is closely linked to that of Dulac-Arnold et al. (2019), who propose an algorithm to learn a classifier from label proportions in mini-batches. The main difference is that instead of using f_θ to parametrize an embedding, the authors directly use it to predict the probability of belonging to class k: the last layer of f_θ is a softmax, and the term ‖f_θ(x_i) − µ_k‖² is thus replaced by f_θ(x_i)_k. Besides, they loosen the marginal constraint prescribing the cluster proportions by using unbalanced optimal transport. The latter could also be implemented in our proposed method, as it consists in using a variant of Sinkhorn's algorithm (Chizat et al., 2018). However, the performance they report for large batch sizes is lower than what we report for the fully unsupervised task in our experiments. This would make our optimal transport approach a good candidate for the learning with label proportions problem.

References
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013.

Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492–2500, 2012.

Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017.

Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563–2609, 2018.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.

K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5747–5756, October 2017. doi: 10.1109/ICCV.2017.612.

G. Dulac-Arnold, N. Zeghidour, M. Cuturi, L. Beyer, and J.-P. Vert. Deep multi-class learning from label proportions. Technical Report 1905.12909, arXiv, 2019.

Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier. Deep k-means: Jointly clustering with k-means and learning representations. arXiv preprint arXiv:1806.10069, 2018.

Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, 2018.

Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Naha, Okinawa, Japan, 2019.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pages 1753–1759. AAAI Press, 2017. ISBN 978-0-9992411-0-3.

A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, pages 3581–3589, 2014.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

David Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–205, 1982.

Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. arXiv preprint arXiv:1803.05573, 2018.

Chunfeng Song, Feng Liu, Yongzhen Huang, Liang Wang, and Tieniu Tan. Auto-encoder based data clustering. In José Ruiz-Shulcloper and Gabriella Sanniti di Baja, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 117–124, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 478–487. JMLR.org, 2016.

Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3861–3870, Sydney, Australia, August 2017. PMLR.

Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. doi: 10.1109/cvpr.2016.556.