Differentiable Deep Clustering With Cluster Size Constraints
then one could train a supervised model by minimizing over θ an empirical risk of the form

    R(θ, Y) = (1/n) Σ_{i=1}^n ℓ(c_θ(x_i), y_i).

Since we are considering an unsupervised setting, Y is not available. A solution is thus to jointly optimize the above criterion over θ and Y to learn both a class assignment and a classifier:

    (θ̂, Ŷ) ∈ argmin_{θ∈Θ, Y∈[1,K]^n} R(θ, Y).    (1)

An obvious computational difficulty is that this problem involves the discrete variable Y. Besides, some kind of regularization is required in this double optimization task to prevent trivial solutions; adding constraints on Y is crucial to prevent empty or overpopulated clusters. Joulin et al. (2010) propose a convex relaxation of (1) in the case of linear regression with the squared loss ℓ(u, v) = (u − v)² for binary problems (K = 2). In that case, the objective function is quadratic in Y and they use the standard semidefinite programming (SDP) relaxation for the matrix Y Y^⊤ to approximate a minimum.

A different approach is used by Chang et al. (2017), who recast the clustering problem as a binary classification problem: given two data points, do they belong to the same cluster? The resulting algorithm, Deep Adaptive Clustering (DAC), can be summarized as follows: each data point is mapped by f_θ to a vector in the unit ball of R^K, which represents its probability of belonging to each class. These probabilities are then compared with the cosine distance, which defines the similarity of the two points. Two points are assumed to belong to the same class if this similarity is large enough.

While the reconstruction loss of an auto-encoder (with encoder f_θ and decoder g_ϕ),

    ℓ_r^{ae}(θ) = min_ϕ Σ_{i=1}^n ‖x_i − g_ϕ(f_θ(x_i))‖²,    (2)

is the standard choice for the representation loss, these methods vary mostly in their choice of clustering loss, auto-encoder model, and optimization strategy (in particular to prevent trivial solutions).

Song et al. (2013) are among the earliest to learn a representation for clustering by tweaking the objective function of a standard auto-encoder. They formulate the problem as minimizing the combined loss (P) with the objective of k-means as the clustering loss:

    ℓ_c^{km}(θ) = min_{µ_1,...,µ_K ∈ R^p} Σ_{i=1}^n min_{j∈[1,K]} ‖f_θ(x_i) − µ_j‖².    (3)

To optimize this objective over the encoder parameters θ, the decoder parameters ϕ and the cluster centers µ_1, ..., µ_K, they alternate one epoch of stochastic gradient descent over (θ, ϕ) with one update of the cluster centers and assignments.
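To make the structure of such combined objectives concrete, here is a minimal NumPy sketch of the two terms above for a fixed embedding; the functions encode and decode are placeholders standing in for f_θ and g_ϕ (this is our illustration, not code from the paper).

    import numpy as np

    def reconstruction_loss(X, encode, decode):
        # Eq. (2): squared reconstruction error of the auto-encoder.
        Z = encode(X)            # embeddings f_theta(x_i), shape (n, p)
        X_rec = decode(Z)        # reconstructions g_phi(f_theta(x_i))
        return np.sum((X - X_rec) ** 2)

    def kmeans_clustering_loss(Z, centers):
        # Eq. (3): each point is charged the squared distance to its closest center.
        # Z: (n, p) embeddings; centers: (K, p) cluster centers.
        sq_dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, K)
        return sq_dists.min(axis=1).sum()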
While most state-of-the-art methods rely on clustering objectives that are strongly linked to k-means, Joint Unsupervised LEarning (JULE) (Yang et al., 2016) uses a clustering loss based on "agglomerative clustering". Starting from clusters consisting of datapoints, the training alternates between a few steps of agglomerative clustering, i.e., merging similar clusters, and a backward pass during which the network parameters are updated to minimize the clustering loss. Although this method has a more flexible geometry, it requires building an affinity graph of the dataset after each update and is thus computationally heavy.
Xie et al. (2016) propose Deep Embedded Clustering (DEC), which starts with a pre-training phase using only the reconstruction loss ℓ_r(θ, ϕ) and then improves the clustering ability of the representation by optimizing f_θ in a self-supervised manner. Their clustering loss is the Kullback-Leibler divergence between the soft-assignments q_ik of each point i to each cluster k and a target distribution obtained by squaring (and renormalizing) the soft-assignments, which pushes the embedding towards harder assignments. There are several variants of DEC using more sophisticated auto-encoders and training techniques, such as Guo et al. (2017). The DEPICT algorithm (Dizaji et al., 2017) similarly minimizes a KL divergence to sharpen the assignments, but also introduces a classifier h_β that outputs a probability distribution h_β(z) over the K classes (typically a neural network with a softmax activation at the last layer). The clustering loss thus measures how well the data can be discriminated into K different classes.
The clustering loss in the Deep Clustering Network (DCN) (Yang et al., 2017) is the objective of k-means in the representation space. However, minimizing the total loss L jointly over θ, ϕ, µ (cluster centers) and π (cluster assignments) is challenging. Thus, Yang et al. (2017) alternate optimization in (θ, ϕ) for fixed (µ, π), which becomes a variant of auto-encoder training, and in (µ, π) for fixed (θ, ϕ). The Deep k-means (DKM) algorithm (Fard et al., 2018) uses the same loss as DCN but relaxes the assignment problem by replacing the cluster assignments with soft-assignments in the k-means objective. This results in a clustering loss that can be jointly minimized over θ and µ using stochastic gradient descent (SGD), and leads to state-of-the-art performance in deep clustering (Fard et al., 2018). The latter is the approach closest to ours, as we also propose a fully differentiable objective based on k-means.
Clustering and optimal transport. There is a link between k-means clustering and optimal transport, which was first noticed by Pollard (1982) and studied in more detail by Canas and Rosasco (2012). Roughly speaking, optimal transport is equivalent to a constrained formulation of k-means in which the cluster sizes are prescribed. This framework makes sense in a setting where the proportion of each class in a dataset is known, but no information is available at the individual level. Cuturi and Doucet (2014) introduced an entropic regularization of that problem which allows for an efficient solver.

Contributions. Following Cuturi and Doucet (2014), we exploit the connection between optimal transport and k-means, and rely on entropic regularization to derive a fully differentiable clustering loss that can be used in (P) and directly optimized with SGD. We give an insight on the effect of regularization in the cluster assignment problem, and show that the soft k-means loss introduced by Fard et al. (2018) can be interpreted as an optimal transport loss with only one marginal constraint. The constraints on cluster sizes that naturally occur with optimal transport allow us to enforce a prior on cluster sizes without relying on additional terms in the optimization problem. This leads to better clustering performance on benchmark datasets.

2 Clustering with Optimal Transport

Cluster assignment as an optimal transport problem. Consider n sample points {x_1, ..., x_n} ⊂ R^d embedded in the representation space via f_θ : R^d → R^p, and K clusters in that representation space with centers {µ_1, ..., µ_K} ⊂ R^p. We want to assign samples to clusters so that:

(i) each sample is assigned to exactly one cluster,

(ii) each cluster k = 1, ..., K contains exactly n_k points,

(iii) the total distance (in the representation space) between cluster centers and their assigned samples is minimal.

The mathematical formulation of the above problem reads as follows:

    ℓ_c^{OT} = min_{π ∈ {0, 1/n}^{n×K}} Σ_{i=1}^n Σ_{k=1}^K ‖f_θ(x_i) − µ_k‖² π_{k,i}    (OT)

    s.t.  π 1_K = (1/n) 1_n,    (c1)
          π^⊤ 1_n = w,    (c2)

where w = (n_1/n, ..., n_K/n) ∈ Δ_K is the vector of cluster proportions.

The above problem is known as optimal transport between the discrete measures α := (1/n) Σ_{i=1}^n δ_{f_θ(x_i)} and β := Σ_{k=1}^K (n_k/n) δ_{µ_k}. If we remove the constraint on cluster sizes (c2), it boils down to the objective function of the k-means problem with cluster centers {µ_1, ..., µ_K} (Pollard, 1982).
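For a fixed embedding and fixed centers, (OT) is a small linear program that can be solved exactly, for instance with the POT (Python Optimal Transport) package; the sketch below is our illustration under that assumption (POT is not mentioned in the paper).

    import numpy as np
    import ot  # POT: Python Optimal Transport (assumed installed)

    def exact_assignment(Z, centers, w):
        # Z: (n, p) embeddings f_theta(x_i); centers: (K, p); w: cluster proportions, sums to 1.
        n = Z.shape[0]
        C = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # cost C_ik = ||f_theta(x_i) - mu_k||^2
        a = np.full(n, 1.0 / n)                                    # marginal (c1): each sample carries mass 1/n
        pi = ot.emd(a, w, C)                                       # exact transport plan with column marginal w, i.e. (c2)
        # When the n_k are integers, the optimal plan has entries in {0, 1/n}:
        # sample i is assigned to cluster k iff pi[i, k] == 1/n.
        return pi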
Algorithm 1  Sinkhorn's Algorithm for Reg. OT
 1: Parameters: ε; n_iter
 2: Input: (f_θ(x_i))_{i=1...n}; (µ_k)_{k=1...K}; w
 3: C_ik = ‖f_θ(x_i) − µ_k‖²  ∀ i, k
 4: M = exp(−C/ε)
 5: Initialize b ← 1_K
 6: for j = 1, 2, ..., n_iter do
 7:     a ← ((1/n) 1_n) ⊘ (M b)        (⊘: elementwise division)
 8:     b ← w ⊘ (M^⊤ a)
 9: Return π_ik = a_i M_ik b_k  ∀ i, k

Algorithm 2  OT-based Deep Clustering
 1: Parameters: K, n_pre-train, n_epochs, m
 2: Input: dataset (x_1, ..., x_n), cluster proportions w
 3: Initialize f_θ (encoder) and g_ϕ (decoder) with random weights
 4: Initialize centers µ with k-means on the embedded images (f_θ(x_1), ..., f_θ(x_n))
 5: for i = 1 to n_pre-train do  (pre-training)
 6:     for j = 1 to n/m do
 7:         D_j = (x_1^(j), ..., x_m^(j)) batch of size m
 8:         Compute loss ℓ_r^{ae}(θ, ϕ)
 9:         Update θ, ϕ with a gradient step
10: for i = 1 to n_epochs do  (training)
11:     for j = 1 to n/m do
12:         D_j = (x_1^(j), ..., x_m^(j)) batch of size m
13:         Compute π(f_θ(D_j), µ, w) with Sinkhorn's algorithm
14:         Compute loss ℓ_r^{ae}(θ, ϕ) + ℓ_c^{OTε}(θ, µ)
15:         Update θ, ϕ and µ with a gradient step
16: for i = 1 to n do  (final clustering)
17:     Assign x_i to cluster k_i = argmin_k ‖f_θ(x_i) − µ_k‖²
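A direct NumPy transcription of Algorithm 1 (our own sketch; variable names are chosen for readability):

    import numpy as np

    def sinkhorn_assignment(Z, centers, w, eps=1e-2, n_iter=100):
        # Z: (n, p) embeddings f_theta(x_i); centers: (K, p); w: target cluster proportions (sums to 1).
        n = Z.shape[0]
        C = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # C_ik = ||f_theta(x_i) - mu_k||^2
        M = np.exp(-C / eps)
        b = np.ones(C.shape[1])
        for _ in range(n_iter):
            a = (1.0 / n) / (M @ b)     # rescale rows to match marginal (c1)
            b = w / (M.T @ a)           # rescale columns to match marginal (c2)
        return a[:, None] * M * b[None, :]   # pi_ik = a_i * M_ik * b_k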
Entropic regularization of optimal transport. Solving optimal transport is computationally expensive, as it requires solving a large linear program; a common workaround in the literature is to regularize the problem with entropy (Cuturi, 2013). The regularized problem then reads as follows:

    ℓ_c^{OTε} = min_{π ∈ [0,1]^{n×K}} Σ_{i=1}^n Σ_{k=1}^K ‖f_θ(x_i) − µ_k‖² π_{k,i} + ε π_{k,i} (log(π_{k,i}) − 1)    (OTε)

    s.t.  π 1_K = (1/n) 1_n,    (c1)
          π^⊤ 1_n = w.    (c2)

The addition of entropy makes it possible to solve the problem with a much faster iterative method, Sinkhorn's algorithm, whose iterations are summarized in Algorithm 1. Although this fast solver is the main reason why regularized optimal transport became routinely used in machine learning tasks, recent papers have exploited the fact that it also leads to a differentiable loss, whose gradients can easily be computed by backpropagation through the Sinkhorn iterations (Genevay et al., 2018; Salimans et al., 2018).
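To illustrate this differentiability, here is a PyTorch sketch (ours, not the authors' code) in which the Sinkhorn iterations are unrolled so that autograd provides gradients of ℓ_c^{OTε} with respect to both the embeddings and the centers; for very small ε a log-domain implementation would be preferable for numerical stability.

    import torch

    def ot_clustering_loss(emb, centers, w, eps=1e-2, n_iter=50):
        # emb: (n, p) tensor of embeddings f_theta(x_i); centers: (K, p) tensor; w: (K,) tensor of cluster proportions.
        n = emb.shape[0]
        C = torch.cdist(emb, centers) ** 2      # C_ik = ||f_theta(x_i) - mu_k||^2
        M = torch.exp(-C / eps)
        b = torch.ones_like(w)
        for _ in range(n_iter):                 # differentiable Sinkhorn iterations
            a = (1.0 / n) / (M @ b)
            b = w / (M.t() @ a)
        pi = a[:, None] * M * b[None, :]
        # Objective of (OT_eps): transport cost plus entropic term.
        return (pi * C).sum() + eps * (pi * (torch.log(pi) - 1)).sum()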
It is known that a linear program attains its optimum at a vertex of the constraint polytope, which is why the optimal transport problem is equivalent to its relaxation to the simplex. The addition of entropy moves the solution away from the optimal vertex, towards the center of the constraint polytope, thus yielding smoother assignments (Peyré et al., 2019). This is formalized in the proposition below.

Proposition 1. Consider the regularized optimal transport problem (OTε) and its optimal assignment π_ε.

When ε → 0:

• π_ε → π (the solution of (OT)),

• ℓ_c^{OTε} → ℓ_c^{OT}.

When ε → ∞:

• π_ε → (1/n) 1_n w^⊤ (i.e., each point is assigned to all clusters according to the global proportions w),

• ℓ_c^{OTε} → (1/n) Σ_{i=1}^n Σ_{k=1}^K w_k ‖f_θ(x_i) − µ_k‖².

Proof. The proposition is an adaptation of Theorem 1 in (Genevay et al., 2018) to our clustering setting.

The choice of ε is crucial: as ε gets smaller – i.e., as we get closer to 'true' optimal transport – Sinkhorn's algorithm requires more iterations to converge (see e.g. Peyré et al., 2019), so that a better approximation of optimal transport comes at a heavy computational price. However, it has recently been proved that when approximating optimal transport from samples – which is typically the case in machine learning – it is actually beneficial to use an ε that is not too small, in order to avoid the curse of dimensionality from which optimal transport suffers (Genevay et al., 2019).
Link with soft-assignments in k-means. The optimal transport formulation includes two marginal constraints, one being that each sample is assigned to exactly one cluster and the other being the cluster sizes. The latter constraint can be omitted to obtain an objective which is that of k-means. When regularizing the optimal transport problem with only one marginal constraint, we thus get a differentiable k-means objective.
Proposition 2. Consider the variant of entropy-regularized optimal transport with only one marginal constraint (i.e., no prior on cluster sizes):

    min_{π ∈ [0,1]^{n×K}} Σ_{i=1}^n Σ_{k=1}^K ‖f_θ(x_i) − µ_k‖² π_{k,i} + ε π_{k,i} (log(π_{k,i}) − 1)

    s.t.  π 1_K = (1/n) 1_n.

Then the optimal assignment π* is given by

    π*_{k,i} = exp(−‖f_θ(x_i) − µ_k‖²/ε) / ( n Σ_{k'=1}^K exp(−‖f_θ(x_i) − µ_{k'}‖²/ε) ).    (7)
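Formula (7) is a row-wise softmax of the negative squared distances scaled by ε, divided by n; the following NumPy sketch (ours) makes this explicit.

    import numpy as np

    def soft_assignment(Z, centers, eps):
        # Eq. (7): pi*_{k,i} = exp(-||z_i - mu_k||^2/eps) / (n * sum_k' exp(-||z_i - mu_k'||^2/eps)).
        sq_dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, K)
        logits = -sq_dists / eps
        logits -= logits.max(axis=1, keepdims=True)                       # for numerical stability
        expd = np.exp(logits)
        return expd / (Z.shape[0] * expd.sum(axis=1, keepdims=True))      # each row sums to 1/n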
Plugging these soft-assignments into the transport cost recovers the soft k-means loss of Fard et al. (2018), which can therefore be interpreted as an optimal transport loss with only one marginal constraint. In contrast, we use as clustering loss ℓ_c^{OTε}, the objective function of regularized optimal transport, which allows us to:

(i) enforce a prior on the cluster proportions,

(ii) obtain a loss that is differentiable with respect to both the cluster centers (µ_k)_k and the embedding parameters θ.

The clustering problem (P) becomes:

    min_{θ,ϕ,µ}  ℓ_r^{ae}(θ, ϕ) + ℓ_c^{OTε}(θ, µ).    (11)
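As a concrete (hypothetical) rendering of one training step of Algorithm 2 for problem (11) in PyTorch: encoder and decoder stand for f_θ and g_ϕ, centers is the tensor of µ_k, and ot_clustering_loss is the sketch given earlier; all names are ours, not the authors'. For simplicity the plan is differentiated through the unrolled Sinkhorn iterations, as discussed above.

    import torch

    # encoder, decoder: torch.nn.Module for f_theta and g_phi (decoder output shaped like x);
    # centers: (K, p) tensor with requires_grad=True; w: (K,) cluster proportions; x: a batch of inputs.
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()) + [centers], lr=1e-3)

    def training_step(x):
        z = encoder(x)                              # embeddings f_theta(x_i)
        recon = ((decoder(z) - x) ** 2).sum()       # reconstruction loss, Eq. (2)
        clust = ot_clustering_loss(z, centers, w)   # entropic OT clustering loss
        loss = recon + clust                        # objective (11)
        optimizer.zero_grad()
        loss.backward()                             # gradients w.r.t. theta, phi and mu
        optimizer.step()
        return loss.item()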
3 Experiments

Experimental setup. The encoder is a convolutional network with layers of respective depths 32, 64, 128, respective kernel sizes 5, 5, 3, and a common stride of 2, followed by a fully connected layer to a latent space of dimension 10. In both cases (MNIST and CIFAR10) we use batches of size m = 300. The gradient updates are made with the Adam algorithm and the standard learning rate from TensorFlow (0.001) with step decay, as there should not be parameter tuning in an unsupervised setting. The algorithm used for training is summarized in Algorithm 2. We also run k-means on raw pixels, to give an idea of how much structure is induced in the data by the embedding f_θ. Following usual guidelines regarding regularization for optimal transport, we set ε = 10^−2, which gives a good enough approximation of optimal transport without requiring too many Sinkhorn iterations. We show in Fig. 2 that the final performance is robust to the choice of ε as long as it is not too large.
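One possible PyTorch rendering of the encoder described above (input channels, padding and activations are not specified in the text and are assumptions; shown for 3-channel 32×32 inputs):

    import torch.nn as nn

    encoder = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(128 * 4 * 4, 10),   # spatial size 32 -> 16 -> 8 -> 4, then latent dimension 10
    )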
Evaluation. After each epoch, we evaluate the different methods by computing the accuracy obtained when matching clusters to classes through the following formula:

    accuracy = max_{σ∈S} (1/n) Σ_{i=1}^n 1_{y_i = σ(k_i)},

where y_i and k_i are respectively the class label and the cluster index associated to x_i, and S is the set of permutations of {1, ..., K}. For the 'AE + k-means' method, the cluster assignment is made by running k-means on the embedded data, while for both 'soft k-means' and 'OT', since these methods also learn the cluster centers (µ_1, ..., µ_K), we assign the point x_i to the cluster k_i = argmin_k ‖f_θ(x_i) − µ_k‖². The optimal matching between clusters and classes is computed with the Hungarian algorithm, as is standard in the literature.
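The accuracy above can be computed with the Hungarian algorithm as implemented in SciPy; a small self-contained sketch (ours):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(y_true, y_pred, K):
        # Contingency table: cont[c, y] counts points with cluster index c and class label y.
        cont = np.zeros((K, K), dtype=np.int64)
        for c, y in zip(y_pred, y_true):
            cont[c, y] += 1
        # The Hungarian algorithm finds the cluster-to-class permutation maximizing the matches.
        rows, cols = linear_sum_assignment(-cont)
        return cont[rows, cols].sum() / len(y_true)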
The evolution of accuracy during training for all three methods (auto-encoder + k-means, soft k-means, optimal transport) is plotted in Fig. 1. The curves are averaged over 50 runs. The final accuracies are reported in Table 1, along with the standard deviations. We can see that the optimal transport loss achieves superior accuracy and, unlike soft k-means, mostly does not need to rely on pre-training to reach good performance; without pre-training, soft k-means is only slightly above the baseline. To assess the statistical significance of the superiority of optimal transport in this framework, we further run a Welch's t-test over the final accuracies of the 50 runs. Without pre-training, we find that optimal transport is significantly better than soft k-means (p-value of 0.0067 for CIFAR10 and 10^−12 for MNIST). With pre-training, optimal transport is still significantly better than soft k-means on MNIST at the 10% level (p-value of 0.10), but this is not the case for CIFAR10 (p-value of 0.33). Note that for both datasets, pre-training did not yield significantly better performance for optimal transport (p-values > 0.2), while it significantly improves soft k-means (p-values < 0.05).

Figure 1: Accuracy on MNIST (left) and CIFAR10 (right).

Table 1: Average accuracy from clustering on the CIFAR and MNIST datasets (over 50 runs), with standard deviation and max/min accuracy over the runs (second line); columns: k-means, AE + k-means, soft k-means, soft k-means (p), OT, OT (p). (p) means 'with pre-training'.

Fig. 2 displays the accuracy after 200 epochs for each method, as a function of ε. We see that the competitive advantage of OT over soft k-means is robust to the choice of ε as long as it is not too large. Incidentally, the methods with pre-training are more robust to large values of ε. Note that these curves are averaged over only 5 runs and thus cannot be regarded as statistically significant; they merely serve as evidence of the robustness of the method to the chosen parameter.
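Welch's t-test (a two-sample t-test that does not assume equal variances) is available in SciPy; for instance, with the 50 per-run final accuracies of two methods stored in arrays acc_ot and acc_skm (hypothetical names):

    from scipy.stats import ttest_ind

    # equal_var=False selects Welch's t-test (unequal variances).
    t_stat, p_value = ttest_ind(acc_ot, acc_skm, equal_var=False)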
4 Conclusion and discussion

In this paper we propose a new fully differentiable framework for deep clustering, based on regularized optimal transport, which generalizes the recently proposed approach of Fard et al. (2018) based on soft k-means. Its main advantage over competing methods is its ability to naturally enforce a prior on class proportions. This significantly improves performance on datasets with well balanced classes, without relying on pre-training of the embedding.

In our experiments we observed a benefit over soft k-means in situations where the classes are balanced. An interesting direction to explore is to extend the application of our method to cases where the prior knowledge on cluster sizes is not uniform. This may be relevant, for example, when an expert provides a rough estimate of the proportions of the different classes, such as the proportion of cancer cells in a histopathological image. While our formulation lends itself naturally to non-uniform cluster proportions, we observed in preliminary experiments that it performs poorly if no care is taken to ensure that the cluster size constraint (w in Algorithm 2) is coherent with the set of cluster centers (the vectors µ_k in Algorithm 2). We found in particular that this is often poorly achieved by the initialization of the centers via k-means (step 4 in Algorithm 2). For instance, consider the case where we have two clusters – say images of ones and images of twos – and we know that we should have 20% of the former and 80% of the latter. If the k-means initialization of the centers puts the first center in the middle of the twos and the second center in the middle of the ones, the algorithm will try to enforce an 80% proportion of ones and a 20% proportion of twos. Generally speaking, to ensure that we are enforcing the cluster proportions properly, some sort of matching has to be done before the learning phase between the indexes of the clusters and the indexes of the classes. This could be done in a supervised way, by using one example from each class to initialize the centers. This extension falls in the framework of one-shot learning, with additional knowledge on the class proportions.

Another extension of our method would be to relax the strict constraint on cluster proportions to a soft constraint, using for example unbalanced optimal transport with a relaxed version of the Sinkhorn algorithm which penalizes the marginal constraints instead of enforcing them exactly (Chizat et al., 2018). This may be particularly relevant when small batches are considered, as one would not expect the composition of each batch to perfectly reflect the overall composition.

Finally, we note that our formulation of the clustering problem with optimal transport is closely linked to that of Dulac-Arnold et al. (2019), who propose an algorithm to learn a classifier from label proportions in mini-batches. The main difference is that instead of using f_θ to parametrize an embedding, the authors directly use it to predict the probability of belonging to class k: the last layer of f_θ is a softmax, and the term ‖f_θ(x_i) − µ_k‖² is thus replaced by f_θ(x_i)_k. Besides, they loosen the marginal constraint prescribing the cluster proportions by using unbalanced optimal transport. The latter could also be implemented in our proposed method, as it consists in using a variant of Sinkhorn's algorithm (Chizat et al., 2018). However, the performance they report for large batch sizes is lower than what we report for the fully unsupervised task in our experiments. This would make our optimal transport approach a good candidate for the learning with label proportions problem.

References
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013.

Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492–2500, 2012.

Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017.

Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563–2609, 2018.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.

K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5747–5756, October 2017. doi: 10.1109/ICCV.2017.612.

G. Dulac-Arnold, N. Zeghidour, M. Cuturi, L. Beyer, and J.-P. Vert. Deep multi-class learning from label proportions. Technical Report 1905.12909, arXiv, 2019.

Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier. Deep k-means: Jointly clustering with k-means and learning representations. arXiv preprint arXiv:1806.10069, 2018.

Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, 2018.

Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Naha, Okinawa, Japan, 2019.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pages 1753–1759. AAAI Press, 2017. ISBN 978-0-9992411-0-3.

A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, pages 3581–3589, 2014.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

David Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–205, 1982.

Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. arXiv preprint arXiv:1803.05573, 2018.

Chunfeng Song, Feng Liu, Yongzhen Huang, Liang Wang, and Tieniu Tan. Auto-encoder based data clustering. In José Ruiz-Shulcloper and Gabriella Sanniti di Baja, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 117–124, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 478–487. JMLR.org, 2016.

Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3861–3870, Sydney, Australia, August 2017. PMLR.

Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. doi: 10.1109/cvpr.2016.556.