Systems Engineering Institute, Xi’an Jiaotong University

Deep Online Probability Aggregation Clustering

Yuxuan Yan, Na Lu (corresponding author), Ruofan Yan
Abstract

Combining machine clustering with deep models has shown remarkable superiority in deep clustering. Such methods split the data processing pipeline into two alternating phases: feature clustering and model training. However, this alternating schedule may lead to instability and a heavy computational burden. To tackle these problems, we propose a centerless clustering algorithm called Probability Aggregation Clustering (PAC), which can be easily deployed in online deep clustering. PAC circumvents the cluster center and aligns the probability space and distribution space by formulating clustering as an optimization problem with a novel objective function. Based on the computation mechanism of PAC, we propose a general online probability aggregation module to perform stable and flexible feature clustering over mini-batch data and further construct a deep visual clustering framework, Deep PAC (DPAC). Extensive experiments demonstrate that DPAC remarkably outperforms the state-of-the-art deep clustering methods. The code is available at https://github.com/aomandechenai/Deep-Probability-Aggregation-Clustering

Keywords:
Deep Online Clustering · Unsupervised Learning · Fuzzy Clustering

1 Introduction

Clustering analysis [3] is a widely explored area of unsupervised learning that aims to group unlabeled samples into clusters sharing common characteristics. Conventional machine clustering is favored by many researchers for its interpretability and stable optimization. In recent years, deep clustering has received more attention due to its powerful representation extraction capabilities. Previous deep clustering models [55, 56, 24, 8] directly combine deep networks with machine clustering and use designed loss functions to guide both representation learning and clustering. For example, DeepCluster [9] and PCL [34] decouple representation learning and clustering and leverage the offline pseudo labels of K-means (KM) to cluster images. Unfortunately, these offline methods typically require running standard KM over the entire dataset multiple times, which incurs high time and space costs. Besides, simply clustering mini-batches instead of the whole dataset to obtain online clustering causes collapse and degradation. To address these problems, researchers have proposed two dominant solutions: batch clustering and contrastive clustering.

Batch clustering [57, 20, 30, 38] focuses on modifying conventional machine clustering algorithms [59] to fit the data flow of deep models, and is highly extensible. For example, Online Deep Clustering (ODC) [57] decomposes the standard KM process into batch clustering with memory banks and optimizes the clustering and the network side by side (online) to facilitate stable learning. CoKe [42] proposes a moving average strategy to reassign pseudo labels and introduces Constrained K-means [7] into training to enforce a minimal cluster size and avoid collapse. Most existing batch clustering approaches focus on center-based machine clustering algorithms, such as KM and fuzzy c-means (FCM) [6], which require specially designed center update rules. Moreover, center-based machine clustering is sensitive to the cluster centers [22, 4]. Random initialization of cluster centers introduces instability into subsequent training. Partitioning based on nearest centers cannot provide fine-grained discriminative hyperplanes between clusters, which harms clustering performance.

Recently, contrastive clustering [46, 36, 49, 60] has achieved significant success in online deep clustering. Contrastive methods perform online clustering by exploring multi-view correlations of data. Formally, each instance is augmented into two views using random data augmentation to build a contrastive framework, and clustering is performed by minimizing a designed contrastive loss. For example, PICA [28] proposes a cluster-level contrastive loss on top of a contrastive framework to perform online deep clustering. However, building contrastive approaches requires considerable manual design, including data augmentation, hyperparameter settings, and model architecture. Contrastive models often need thousands of epochs to converge. Besides, they assume balanced clusters (i.e., each cluster has the same number of samples), which requires additional regularization terms to constrain optimization and avoid collapse (i.e., a few clusters absorbing the majority of instances). In essence, contrastive clustering methods leverage the nearest-neighbor relationships of augmented instances in the semantic space to train the classifier without supervision. Such semantic nearest-neighbor learning only uses a portion of the data and its augmented versions, and thus fails to capture the global cluster relationship [13] and to encode the spatial distribution of the embeddings.

In this work, considering the adverse effect of cluster centers, we first introduce a novel objective function that quantifies intra-cluster distances without cluster centers. Furthermore, inspired by fuzzy c-means, we formulate a concise optimization program by incorporating a fuzzy weighting exponent into the objective function. We then build a centerless machine clustering algorithm called Probability Aggregation Clustering (PAC). In the optimization program of PAC, the probability of a sample belonging to a cluster is aggregated iteratively across samples using distance information. Unlike KM, which assigns instances via cluster centers, PAC directly outputs probabilities, which is more stable and easier to deploy in deep models. We therefore extend PAC to the online probability aggregation module (OPA), a simple plug-in component for online deep clustering tasks. OPA seamlessly combines the calculation process of PAC with loss computation. It overcomes the disadvantages of both batch and contrastive clustering and implements efficient clustering. Besides, OPA does not impose any constraints on the size of clusters, which mitigates the suboptimal solutions introduced by balanced clustering and yields more flexible partitioning. It computes clustering codes from batches of data and updates the network with a KL divergence loss, which avoids complicated clustering steps and trains the model in a supervised manner. Based on the above, a deep image clustering model, Deep PAC (DPAC), is established, which ensures stable learning, global clustering, and superior performance. The major contributions of this work include:

  • A novel centerless partition clustering method PAC is proposed to implement clustering by exploring the potential relation between sample distribution and assignment probability.

  • An online deep clustering module OPA is developed based on PAC, which encodes spatial distances into online clustering without introducing many hyper-parameters or components. It drops the cluster size constraints to perform flexible partitioning.

  • A simple end-to-end unsupervised deep clustering framework DPAC is established for stable and efficient clustering. DPAC achieves significant performance on five challenging image benchmarks compared with the state-of-the-art approaches.

2 Related Work

2.0.1 Deep Clustering:

Deep clustering methods [12, 18, 46] combine representation learning with clustering through deep models. ProPos [29] proposes the prototype scattering loss to make full use of K-means pseudo labels. DeepDPM [43] is a density-based approach that does not require a preset number of classes. Different from the above, recent deep clustering methods assume that the output is uniform. SwAV [10] and SeLa [5] adopt a balanced cluster discrimination task via the Sinkhorn-Knopp algorithm. SCAN [50] leverages K-nearest-neighbor information to group samples. Its loss maximizes the agreement of assignments among neighbors, which inevitably needs an additional balanced cluster constraint to avoid trivial solutions. SeCu [41] employs a global entropy constraint to relax the balanced constraint to a lower-bound size constraint that limits the minimal size of clusters.

2.0.2 Machine Clustering:

Machine clustering [11, 33, 27] tries to decompose the data into a set of disjoint clusters by machine learning algorithms. FCM [6] obtains soft cluster assignments by alternately updating the fuzzy partition matrix and the cluster centers. Many modified methods [37, 51, 33] aim at improving the performance and robustness of center-based clustering. In addition, nonparametric methods [21, 23] have received increasing attention in recent years. FINCH [44] performs hierarchical agglomerative clustering based on first-neighbor relations without requiring a specific number of clusters. However, the complex clustering procedures involved in these algorithms hinder their easy deployment in neural networks.

3 Method

The following sections present the theoretical basis of our approach. We first derive a novel objective function and analyze how the proposed objective function relates to existing methods. Second, we present a scalable centerless clustering algorithm PAC. Finally, we extend PAC to a novel online clustering module OPA, and construct a novel online deep clustering model DPAC to learn the semantic knowledge of unlabeled data.

3.1 Objective Function

Let $\boldsymbol{X}=\{\boldsymbol{x}_{1},\boldsymbol{x}_{2},\cdots,\boldsymbol{x}_{N}\}$ be an $N$-point dataset, where $\boldsymbol{x}_{i}\in\mathbb{R}^{D\times 1}$ is the $i$-th $D$-dimensional instance. The clustering algorithm aims to divide $\boldsymbol{X}$ into $K$ mutually disjoint clusters, where $2\leq K<N$, $K\in\mathbb{N}$. $\boldsymbol{P}=[p_{i,k}]_{N\times K}$ is the soft partition matrix, where $p_{i,k}$ is the probability that sample $\boldsymbol{x}_{i}$ belongs to cluster $k$. It satisfies $\boldsymbol{P}\in\{\Gamma^{N\times K}\,|\,\gamma_{i,k}\in[0,1],\forall i,k;\quad\sum_{k=1}^{K}\gamma_{i,k}=1,\forall i;\quad 0<\sum_{i=1}^{N}\gamma_{i,k}<N,\forall k\}$. The cluster of $\boldsymbol{x}_{i}$ is predicted by $\hat{p_{i}}=\arg\max\limits_{k}p_{i,k},\ 1\leq k\leq K$.

Different from the existing classical center-based methods [6, 53], we utilize the inner product of probability vectors instead of cluster centers to indicate cluster relations between samples. Formally, we multiply the inner products by the corresponding distance measurements to quantify the global intra-cluster distance of the data. The objective function $J_{pac}$ is defined as:

$$J_{pac}=\sum_{i=1}^{N}\sum_{j=1}^{N}\boldsymbol{p}_{i}^{\mathsf{T}}\boldsymbol{p}_{j}\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2}, \tag{1}$$

where $\boldsymbol{p}_{i}=[p_{i,1},p_{i,2},\ldots,p_{i,K}]^{\mathsf{T}}$ is the probability vector. $\boldsymbol{p}_{i}^{\mathsf{T}}\boldsymbol{p}_{j}\in[0,1]$ can be regarded as the probability weight for $\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2}$. By minimizing Eq. 1, $\boldsymbol{p}_{i}^{\mathsf{T}}\boldsymbol{p}_{j}$ becomes negatively related to $\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2}$, meaning that the probabilities of an instance are consistent with those of nearby samples but not with those of distant samples.
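As a small illustration of Eq. 1, the sketch below evaluates the objective for two hard assignments of a four-point toy dataset. The function names `pairwise_sq_dists` and `j_pac` and the toy data are assumptions introduced for this example only; they are not taken from the authors' code.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances d[i, j] = ||x_i - x_j||^2 for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def j_pac(P, X):
    """Eq. 1: sum_i sum_j (p_i^T p_j) * ||x_i - x_j||^2 (hypothetical helper)."""
    return float(np.sum((P @ P.T) * pairwise_sq_dists(X)))

# Toy data (assumed): two well-separated pairs of points on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
# Assignment that groups nearby points together.
P_good = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
# Assignment that groups distant points together.
P_bad = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)

j_good = j_pac(P_good, X)  # 4.0
j_bad = j_pac(P_bad, X)    # 400.0
```

The assignment that places nearby points in the same cluster gives the smaller objective value, which is the behaviour that minimizing Eq. 1 is meant to encourage.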

3.2 Relation to Existing Methods

We provide a new perspective to further understand the proposed objective function. We summarize the difference between our method and Spectral Clustering (SC) [51] and SCAN [50]. The minimization problem for $J_{pac}$ can be rewritten as:

$$\min\limits_{\boldsymbol{P}\in\Gamma^{N\times K}}Tr(\boldsymbol{P}^{\mathsf{T}}\widetilde{\boldsymbol{D}}_{x}\boldsymbol{P}), \tag{2}$$

where $\widetilde{\boldsymbol{D}}_{x}$ is the distance matrix with $\widetilde{d}_{i,j}=\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2}$. Obviously, $\widetilde{d}_{i,j}$ can be replaced by many other distance measures. We use the $L_{2}$ distance as the default distance measure in the following experiments. The graph partitioning problem of SC is formulated as:

$$\min\limits_{\boldsymbol{H}\in\mathbb{R}^{N\times K}}Tr(\boldsymbol{H}^{\mathsf{T}}\widetilde{\boldsymbol{L}}_{x}\boldsymbol{H}),\quad\mathrm{s.t.}\ \boldsymbol{H}^{\mathsf{T}}\boldsymbol{H}=\boldsymbol{I}, \tag{3}$$

where $\widetilde{\boldsymbol{L}}_{x}$ is the Laplacian matrix of the graph. The indicator matrix $\boldsymbol{H}$ contains arbitrary real values subject to the orthogonality constraint. The semantic clustering loss in SCAN can be reformulated as:

$$\begin{aligned}&\max\limits_{\boldsymbol{P}\in\Gamma^{N\times K}}\sum_{i=1}^{N}\sum_{j\in\mathcal{N}_{i}}\log\boldsymbol{p}_{i}^{\mathsf{T}}\boldsymbol{p}_{j}-\lambda\mathcal{H}(\boldsymbol{P})\\ \Leftrightarrow\ &\max\limits_{\boldsymbol{P}\in\Gamma^{N\times K}}Tr(\boldsymbol{P}^{\mathsf{T}}\widetilde{\boldsymbol{A}}_{x}\boldsymbol{P})-\lambda\mathcal{H}(\boldsymbol{P}),\end{aligned} \tag{4}$$

where $\mathcal{H}(\boldsymbol{P})=\sum_{k=1}^{K}\frac{\sum_{i=1}^{N}p_{i,k}}{N}\log\frac{\sum_{i=1}^{N}p_{i,k}}{N}$, $\mathcal{N}_{i}$ is the $K$-nearest-neighbor set of instance $i$, and $\widetilde{\boldsymbol{A}}_{x}$ is the adjacency matrix, with $\widetilde{a}_{i,j}=1$ when $j\in\mathcal{N}_{i}$ and $\widetilde{a}_{i,j}=0$ otherwise. $\lambda$ is a hyper-parameter. The second term $\mathcal{H}(\boldsymbol{P})$ in Eq. 4 denotes the balance constraint on clusters. Compared with Eq. 3, Eq. 2 transforms the partitioning problem in Euclidean space into the graph-cut problem. Different from the balanced partitioning in Eq. 4, we convert the maximization problem into a minimization problem to efficiently avoid trivial solutions. The intrinsic constraints of the probability matrix $\boldsymbol{P}$ enable $J_{pac}$ to cluster directly without orthogonality or balance constraints. Therefore, DPAC does not require additional clustering regularization terms [50, 46, 35] to avoid collapse and performs more flexible cluster assignment. Moreover, unlike methods that only use neighbors to group samples, $J_{pac}$ introduces distance information into the optimization to obtain a global clustering.
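The step from Eq. 1 to Eq. 2 is stated without proof. As a short check that uses only the symbols already defined, the trace can be expanded entry by entry:

```latex
\begin{aligned}
Tr(\boldsymbol{P}^{\mathsf{T}}\widetilde{\boldsymbol{D}}_{x}\boldsymbol{P})
  &= \sum_{k=1}^{K}\bigl(\boldsymbol{P}^{\mathsf{T}}\widetilde{\boldsymbol{D}}_{x}\boldsymbol{P}\bigr)_{k,k}
   = \sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{j=1}^{N} p_{i,k}\,\widetilde{d}_{i,j}\,p_{j,k} \\
  &= \sum_{i=1}^{N}\sum_{j=1}^{N}\Bigl(\sum_{k=1}^{K} p_{i,k}\,p_{j,k}\Bigr)\widetilde{d}_{i,j}
   = \sum_{i=1}^{N}\sum_{j=1}^{N}\boldsymbol{p}_{i}^{\mathsf{T}}\boldsymbol{p}_{j}\,
     \|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2}
   = J_{pac}.
\end{aligned}
```

The same expansion with $\widetilde{\boldsymbol{A}}_{x}$ in place of $\widetilde{\boldsymbol{D}}_{x}$ gives $\sum_{i}\sum_{j\in\mathcal{N}_{i}}\boldsymbol{p}_{i}^{\mathsf{T}}\boldsymbol{p}_{j}$, that is, the neighbor sum of Eq. 4 without the logarithm.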

3.3 Probability Aggregation Clustering

The proposed Eq. 2 is a constrained optimization problem. Inspired by FCM, we incorporate the fuzzy weighting exponent $m$ into the objective function and obtain a scalable machine clustering algorithm based on the Lagrange method. The new objective function with $m$ can be formulated as:

$$\tilde{J}_{pac}=\sum_{i=1}^{N}\sum_{j=1}^{N}\varphi(i,j)\tilde{d}_{i,j},\quad\text{with }\varphi(i,j)=\sum_{k=1}^{K}p_{i,k}^{m}p_{j,k}, \tag{5}$$

where $m\in(1,+\infty)$. The corresponding Lagrange function is:

$$\tilde{L}_{pac}=\sum_{i=1}^{N}\sum_{j\neq i}\varphi(i,j)\tilde{d}_{i,j}+\sum_{i=1}^{N}\lambda_{i}\Bigl(1-\sum_{k=1}^{K}p_{i,k}\Bigr)-\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{i,k}p_{i,k}, \tag{6}$$

where $\lambda_{\cdot}$ and $\gamma_{\cdot,\cdot}$ are the Lagrange multipliers for the sum constraint and the non-negativity constraint on $\boldsymbol{P}$, respectively. The partial derivative of $\tilde{L}_{pac}$ with respect to $p_{i,k}$ should be equal to zero at the minimum:

$$\frac{\partial\tilde{L}_{pac}}{\partial p_{i,k}}=2\sum_{j\neq i}mp_{i,k}^{m-1}p_{j,k}\tilde{d}_{i,j}-\lambda_{i}-\gamma_{i,k}=0. \tag{7}$$

According to the Karush-Kuhn-Tucker conditions, we have $1-\sum_{k=1}^{K}p_{i,k}=0,\ \gamma_{i,k}p_{i,k}=0,\ \gamma_{i,k}\geq 0,\ \forall i,k$. For soft clustering, the endpoints are generally unreachable during optimization. Therefore, we only consider the case $p_{i,k}\in(0,1)$, $\gamma_{i,k}=0$. Let $\alpha=1/(m-1)$; it follows from Eq. 7 that $p_{i,k}=\lambda_{i}^{\alpha}(2m\sum_{j\neq i}p_{j,k}\tilde{d}_{i,j})^{-\alpha}$. Considering the sum constraint, this becomes $\lambda_{i}^{\alpha}\sum_{k=1}^{K}(2m\sum_{j\neq i}p_{j,k}\tilde{d}_{i,j})^{-\alpha}=\sum_{k=1}^{K}p_{i,k}=1$. Solving for $\lambda_{i}$ and substituting it back into Eq. 7, we finally obtain:

$$p_{i,k}=\frac{s_{i,k}^{-\alpha}}{\sum_{r=1}^{K}s_{i,r}^{-\alpha}},\quad\text{with }s_{i,k}=\sum_{j\neq i}p_{j,k}\tilde{d}_{i,j}. \tag{8}$$

Treating one element $p_{i,k}$ as the variable and all remaining elements as constants, $\boldsymbol{P}$ can be iteratively updated with Eq. 8. $s_{i,k}$ aggregates the probabilities and distances into a score for assigning $\boldsymbol{x}_{i}$ to cluster $k$; by Eq. 8, a smaller score yields a larger probability. In other words, PAC solves $p_{i,k}$ through all other instances instead of cluster centers. PAC only needs $\boldsymbol{P}$ to be initialized following an approximately uniform distribution, that is, $p_{i,k}\approx 1/K$. Therefore, PAC circumvents the delicate cluster center initialization problem caused by disparate data distributions in the feature space [4]. The detailed steps of PAC are summarized in Algorithm 1.
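Before turning to the algorithm, the elimination of $\lambda_{i}$ that leads from Eq. 7 to Eq. 8 can be written out explicitly. The following is a sketch of the intermediate steps under the stated assumption $\gamma_{i,k}=0$, using $s_{i,k}=\sum_{j\neq i}p_{j,k}\tilde{d}_{i,j}$:

```latex
\begin{aligned}
2m\,p_{i,k}^{\,m-1}s_{i,k}=\lambda_{i}
  \;&\Rightarrow\; p_{i,k}=\lambda_{i}^{\alpha}\,(2m\,s_{i,k})^{-\alpha},
  \qquad \alpha=\tfrac{1}{m-1},\\
\sum_{r=1}^{K}p_{i,r}=1
  \;&\Rightarrow\; \lambda_{i}^{\alpha}=\frac{1}{\sum_{r=1}^{K}(2m\,s_{i,r})^{-\alpha}},\\
p_{i,k}
  &=\frac{(2m\,s_{i,k})^{-\alpha}}{\sum_{r=1}^{K}(2m\,s_{i,r})^{-\alpha}}
   =\frac{s_{i,k}^{-\alpha}}{\sum_{r=1}^{K}s_{i,r}^{-\alpha}}.
\end{aligned}
```

The common factor $(2m)^{-\alpha}$ cancels between numerator and denominator, which is why neither $\lambda_{i}$ nor the constant $2m$ appears in Eq. 8.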

Input: dataset $\boldsymbol{X}$; weighting exponent $m$; cluster number $K$; initialization $\boldsymbol{P}$.
while not converged do
    for $i \leftarrow 1$ to $N$ do
        for $k \leftarrow 1$ to $K$ do
            $p_{i,k} \leftarrow$ Eq. 8
        end for
    end for
end while
Output: clustering result $\boldsymbol{P}$
Algorithm 1 PAC Program
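The loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `pac`, the convergence test, and the small random perturbation used for the near-uniform initialization are assumptions, and $\alpha = 1/(m-1)$ is inferred from the exponent used in Algorithm 2. Because $s_{i,r}$ depends only on rows $j \neq i$, updating a whole row $\boldsymbol{p}_i$ at once gives the same result as the inner loop over $k$.

```python
import numpy as np

def pac(D, K, m=1.03, max_sweeps=100, tol=1e-6, seed=0):
    """Centerless PAC sweeps following Eq. 8 (illustrative sketch).

    D : (N, N) non-negative pairwise distance matrix, assumed to be given.
    """
    N = D.shape[0]
    alpha = 1.0 / (m - 1.0)              # assumption: alpha = 1 / (m - 1)
    rng = np.random.default_rng(seed)
    # Approximately uniform initialization, p_{i,k} close to 1/K.
    P = 1.0 / K + 1e-3 * rng.random((N, K))
    P /= P.sum(axis=1, keepdims=True)
    off_diag = D.copy()
    np.fill_diagonal(off_diag, 0.0)      # enforces the j != i restriction
    for _ in range(max_sweeps):
        P_prev = P.copy()
        for i in range(N):
            s = off_diag[i] @ P          # s_{i,k} for every k, shape (K,)
            # p_{i,k} is proportional to s_{i,k}^(-alpha); use log space
            # because alpha is large (about 33.3 for m = 1.03).
            log_w = -alpha * np.log(s + 1e-12)
            w = np.exp(log_w - log_w.max())
            P[i] = w / w.sum()
        if np.abs(P - P_prev).max() < tol:
            break
    return P
```

Any non-negative dissimilarity can be passed as `D`; for instance, squared Euclidean distances `((X[:, None] - X[None]) ** 2).sum(-1)` are one hypothetical choice.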

3.4 Online Probability Aggregation

A deep neural network $f$ maps data $\boldsymbol{I}_i$ to a feature vector $\hat{\boldsymbol{x}}_i=f(\boldsymbol{I}_i)$, and a classifier $h$ maps $\hat{\boldsymbol{x}}_i$ to a $K$-dimensional class probability $\hat{\boldsymbol{p}}_i$. We propose a novel online clustering module, OPA, which combines the optimization process of PAC with loss computation to generate pseudo labels step by step. Specifically, with $B$ denoting the size of the current mini-batch, OPA alternates between two steps:

3.4.1 Target Computation:

Sec. 3.3 presents the optimization procedure for a single variable; we extend it to matrix form to handle multiple variables. Given the current model $h\circ f$, the clustering score $\boldsymbol{S}\in\mathbb{R}_{+}^{B\times K}$ is calculated by:

$$\boldsymbol{S}=\widetilde{\boldsymbol{D}}_{\hat{x}}\hat{\boldsymbol{P}}. \tag{9}$$

The target clustering code $\boldsymbol{Q}\in\Gamma^{B\times K}$ can be obtained by normalizing $\boldsymbol{S}$: $q_{i,k}=s_{i,k}^{-\alpha}/\sum_{r=1}^{K}s_{i,r}^{-\alpha}$. We call the operation in Eq. 9 online probability aggregation. The probability outputs from the classifier are aggregated by matrix multiplication to compute the corresponding scores, which not only incorporates historical partitioning knowledge but also encodes distance information.

3.4.2 Self-labeling:

Given the current target clustering code $\boldsymbol{Q}$, the whole model $h\circ f$ is updated by minimizing the following KL divergence:

$$KL(\boldsymbol{Q}\parallel\hat{\boldsymbol{P}})=\sum_{i=1}^{N}\sum_{k=1}^{K}q_{i,k}\log\frac{q_{i,k}}{\hat{p}_{i,k}} \tag{10}$$

Rather than directly using $J_{pac}$ in Eq. 1 as the clustering loss, OPA trains the model in a supervised way instead of solving the clustering problem in Eq. 2 exactly. The pseudo code of OPA is given in Algorithm 2. It involves only a mini-batch matrix multiplication and an element-wise power, so the computational cost of OPA is comparable to that of an ordinary loss function.

Input: distance matrix $D$; probability matrix $P$; weighting exponent $m$.
    S = torch.matmul(D.detach(), P)    # aggregate probability
    S = torch.pow(S, -1 / (m - 1))     # scale up
    Q = S / S.sum(1).view(-1, 1)       # normalize each row to sum to 1
Output: (Q * log Q - Q * log P).sum(1).mean()    # KL divergence loss
Algorithm 2 Pseudo code for OPA in PyTorch style
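For readers who want to check the forward computation of Algorithm 2 without a deep learning framework, the sketch below mirrors it in NumPy. It is an illustration under assumptions: the name `opa_loss` and the `eps` guard are hypothetical, `D` is assumed to have a zero diagonal so that the product matches the $j\neq i$ sum of Eq. 8, and the power is evaluated in log space because $1/(m-1)\approx 33.3$ for $m=1.03$ can overflow or underflow when applied directly. NumPy has no automatic differentiation, so only the loss value is computed.

```python
import numpy as np

def opa_loss(D, P, m=1.03, eps=1e-12):
    """Return (Q, loss) for one mini-batch; a NumPy stand-in for Algorithm 2."""
    S = D @ P                                      # aggregate probability, (B, K)
    log_w = (-1.0 / (m - 1.0)) * np.log(S + eps)   # log of S^(-1/(m-1))
    log_w -= log_w.max(axis=1, keepdims=True)      # stabilize before exponentiating
    Q = np.exp(log_w)
    Q /= Q.sum(axis=1, keepdims=True)              # each row of Q sums to 1
    # Row-wise KL(Q || P), averaged over the mini-batch.
    kl = (Q * (np.log(Q + eps) - np.log(P + eps))).sum(axis=1).mean()
    return Q, kl
```

Subtracting the row maximum of `log_w` leaves `Q` unchanged after normalization, since it rescales every entry of a row by the same factor.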

3.5 Deep Probability Aggregation Clustering

With the proposed loss function, we construct an online deep clustering framework, DPAC, which has two heads: contrastive learning and online clustering. Let $\hat{\boldsymbol{I}}^{1}_{i}$ and $\hat{\boldsymbol{I}}^{2}_{i}$ denote two-view features of $\hat{\boldsymbol{I}}_{i}$ generated by random image augmentation. We reformulate the standard contrastive loss in SimCLR [13] as a weighted contrastive loss (WCL) to mitigate the semantic distortion caused by negative samples. The weighted contrastive loss $\ell(\hat{\boldsymbol{X}}^{1},\hat{\boldsymbol{X}}^{2},\hat{\boldsymbol{P}})$ is defined as:

$$-\sum_{i=1}^{N}\log\frac{\exp(\hat{\boldsymbol{z}}_{i}^{1\mathsf{T}}\hat{\boldsymbol{z}}_{i}^{2}/\tau)}{\sum_{j\neq i}\hat{w}_{i,j}\exp(\hat{\boldsymbol{z}}_{i}^{1\mathsf{T}}\hat{\boldsymbol{z}}_{j}^{1}/\tau)+\sum_{j=1}^{N}\hat{w}_{i,j}\exp(\hat{\boldsymbol{z}}_{i}^{1\mathsf{T}}\hat{\boldsymbol{z}}_{j}^{2}/\tau)}, \tag{11}$$

where $\tau$ is the temperature hyper-parameter and $\hat{\boldsymbol{z}}_{i}=g(\hat{\boldsymbol{x}}_{i})/\|g(\hat{\boldsymbol{x}}_{i})\|$ is the normalized feature produced by the projector $g$. The gate coefficient $\hat{w}_{i,j}=(1-\hat{\boldsymbol{p}}_{i}^{\mathsf{T}}\hat{\boldsymbol{p}}_{j})$ filters out the negative samples that belong to the same cluster as $\hat{\boldsymbol{x}}_{i}$.
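A direct NumPy transcription of Eq. 11 may help make the role of the gate $\hat{w}_{i,j}$ concrete. This is a sketch under assumptions rather than the released implementation: the function name is hypothetical, `Z1` and `Z2` are assumed to be already L2-normalized projections of the two views, and no numerical safeguards are included (for example, the denominator vanishes if every gate in a row is zero).

```python
import numpy as np

def weighted_contrastive_loss(Z1, Z2, P, tau=0.5):
    """Eq. 11 for normalized projections Z1, Z2 of shape (N, d) and probabilities P of shape (N, K)."""
    N = Z1.shape[0]
    W = 1.0 - P @ P.T                    # gate w_{i,j} = 1 - p_i^T p_j
    sim_11 = np.exp(Z1 @ Z1.T / tau)     # view 1 against view 1
    sim_12 = np.exp(Z1 @ Z2.T / tau)     # view 1 against view 2
    mask = 1.0 - np.eye(N)               # drops j = i in the first sum
    denom = (W * mask * sim_11).sum(axis=1) + (W * sim_12).sum(axis=1)
    pos = np.diag(sim_12)                # exp(z_i^1 . z_i^2 / tau)
    return -np.log(pos / denom).sum()
```

With a uniform $\hat{\boldsymbol{P}}$, every gate equals $1-1/K$, so the value differs from the unweighted loss only by the constant $N\log(1-1/K)$ and the gradients with respect to the features are unchanged.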

In the pre-training step, due to the lack of cluster information, $\hat{\boldsymbol{P}}$ is set to the uniform distribution, $\hat{p}_{i,j}=1/K$, $\forall i,j$, and DPAC is pre-trained with the pairwise contrastive loss $\frac{1}{2}[\ell(\hat{\boldsymbol{X}}^{1},\hat{\boldsymbol{X}}^{2},\hat{\boldsymbol{P}})+\ell(\hat{\boldsymbol{X}}^{2},\hat{\boldsymbol{X}}^{1},\hat{\boldsymbol{P}})]$. Then, in the clustering step, the whole model is updated by minimizing the sum of the contrastive and clustering losses:

$$\min_{\boldsymbol{\theta}_{f,g,h}}\frac{1}{2}[\ell(\hat{\boldsymbol{X}}^{1},\hat{\boldsymbol{X}}^{2},\hat{\boldsymbol{P}})+\ell(\hat{\boldsymbol{X}}^{2},\hat{\boldsymbol{X}}^{1},\hat{\boldsymbol{P}})]+KL(\boldsymbol{Q}\parallel\hat{\boldsymbol{P}}^{1}), \tag{12}$$

$$\min_{\boldsymbol{\theta}_{f,g,h}}\frac{1}{2}[\ell(\hat{\boldsymbol{X}}^{1},\hat{\boldsymbol{X}}^{2},\hat{\boldsymbol{P}})+\ell(\hat{\boldsymbol{X}}^{2},\hat{\boldsymbol{X}}^{1},\hat{\boldsymbol{P}})]+\frac{1}{N}Tr(\hat{\boldsymbol{P}}^{1\mathsf{T}}\widetilde{\boldsymbol{D}}_{\hat{x}}\hat{\boldsymbol{P}}^{1}), \tag{13}$$

where $\boldsymbol{\theta}_{f,g,h}$ are the parameters of the neural network $f$, the projector $g$, and the classifier $h$, respectively. Eq. 12 is the deep clustering method based on OPA described in Sec. 3.4. Eq. 13 is the deep clustering method that directly minimizes $J_{pac}$ from Sec. 3.1. The overall training procedure is shown in Algorithm 3. Moreover, for a fair comparison in subsequent experiments, we also implement a self-labeling fine-tuning operation as in [36, 47] to further improve the clustering performance.
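The clustering term of Eq. 13 can be read as a pairwise sum, $\frac{1}{N}\sum_{i,j}\tilde{d}_{i,j}\,\hat{\boldsymbol{p}}_{i}^{\mathsf{T}}\hat{\boldsymbol{p}}_{j}$: each distance is weighted by the probability that the two samples fall in the same cluster. The short NumPy check below illustrates this identity; the function names are hypothetical and the snippet only evaluates the term, without any training logic.

```python
import numpy as np

def jpac_term(D, P):
    """(1/N) * Tr(P^T D P), the clustering term of Eq. 13."""
    N = P.shape[0]
    return np.trace(P.T @ D @ P) / N

def jpac_term_pairwise(D, P):
    """The same quantity written as (1/N) * sum_{i,j} d_{i,j} * p_i^T p_j."""
    N = P.shape[0]
    return (D * (P @ P.T)).sum() / N
```

Both functions return the same value for any `D` and `P` of matching shapes, which makes explicit that the term penalizes distant pairs that share cluster probability mass.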

Input: image set $\boldsymbol{I}$; clustering epochs $E$; batch size $B$; weighting exponent $m$.
for $epoch \leftarrow 1$ to $E$ do
    Sample a mini-batch $\{\boldsymbol{I}_{i}\}_{i=1}^{B}$ and apply augmentations to obtain $\{\boldsymbol{I}_{i}^{1},\boldsymbol{I}_{i}^{2}\}_{i=1}^{B}$;
    Get $\{\hat{\boldsymbol{x}}_{i},\hat{\boldsymbol{x}}_{i}^{1},\hat{\boldsymbol{x}}_{i}^{2},\hat{\boldsymbol{p}}_{i},\hat{\boldsymbol{p}}_{i}^{1}\}_{i=1}^{B}$ through forward propagation;
    if OPA is chosen as the optimization objective then
        Compute clustering codes $\{\boldsymbol{q}_{i}\}_{i=1}^{B}$ by Algorithm 2 with $\{\hat{\boldsymbol{x}}_{i},\hat{\boldsymbol{p}}_{i}\}_{i=1}^{B}$;
        Compute overall loss $\mathcal{L}$ by Eq. 12 with $\{\hat{\boldsymbol{x}}_{i}^{1},\hat{\boldsymbol{x}}_{i}^{2},\hat{\boldsymbol{p}}_{i},\hat{\boldsymbol{p}}_{i}^{1},\boldsymbol{q}_{i}\}_{i=1}^{B}$;
    end if
    if $J_{pac}$ is chosen as the optimization objective then
        Compute overall loss $\mathcal{L}$ by Eq. 13 with $\{\hat{\boldsymbol{x}}_{i}^{1},\hat{\boldsymbol{x}}_{i}^{2},\hat{\boldsymbol{p}}_{i},\hat{\boldsymbol{p}}_{i}^{1}\}_{i=1}^{B}$;
    end if
    Update $\boldsymbol{\theta}_{f}$, $\boldsymbol{\theta}_{g}$, $\boldsymbol{\theta}_{h}$ through gradient descent to minimize $\mathcal{L}$;
end for
Output: deep clustering model $h\circ f$
Algorithm 3 Training algorithm for DPAC

4 Experiment

4.0.1 Dataset:

Four real-world datasets and five widely used natural image datasets are used to evaluate the clustering ability of PAC and DPAC. The details of the datasets are summarized in Tab. 1. For CIFAR-100, we use its 20 super-classes rather than its 100 classes as the ground truth. For STL-10, its 100,000 unlabeled images are additionally used in the pre-training step of DPAC. ImageNet-10 and ImageNet-Dogs are subsets of ImageNet-1k. Clustering accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI) are adopted to compare the clustering results.
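The text does not spell out how ACC is computed. A common convention, assumed here, is to score predictions under the best one-to-one mapping between predicted clusters and ground-truth classes. The sketch below does this by brute force over permutations, which is only practical for small $K$; for larger $K$ an assignment solver such as the Hungarian algorithm is normally used instead. The function name is hypothetical.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(y_true, y_pred, K):
    """ACC under the best one-to-one cluster-to-class mapping (brute force, small K only)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # counts[c, t] = number of samples in predicted cluster c with true class t
    counts = np.zeros((K, K), dtype=int)
    for c, t in zip(y_pred, y_true):
        counts[c, t] += 1
    # Try every assignment of clusters to classes and keep the best match count.
    best = max(
        sum(counts[c, perm[c]] for c in range(K))
        for perm in permutations(range(K))
    )
    return best / len(y_true)
```

Because cluster indices are arbitrary, a prediction that swaps two labels consistently still scores 1.0 under this definition.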

Table 1: Dataset settings for our experiments.

| Dataset | Sample | Class | Size | Dataset | Sample | Class | Dimension |
|---|---|---|---|---|---|---|---|
| CIFAR-10 [32] | 60,000 | 10 | 32×32 | Coil-100 [39] | 7,200 | 100 | 49,152 |
| CIFAR-100 [32] | 60,000 | 20 | 32×32 | Isolet [16] | 7,796 | 26 | 617 |
| STL-10 [15] | 13,000 | 10 | 224×224 | Pendigits [2] | 10,992 | 10 | 16 |
| ImageNet-10 [12] | 13,000 | 10 | 224×224 | MNIST [19] | 10,000 | 10 | 784 |
| ImageNet-Dogs [12] | 19,500 | 15 | 224×224 | | | | |

4.1 Probability Aggregation Clustering

4.1.1 Hyperparameter and Method Setting

The effectiveness of the proposed PAC is verified by comparing it with multiple clustering methods on nine datasets. The weighting exponent $m$ of PAC is set to 1.03 for all datasets. The threshold value of RCC [45] is set to 1. The weighting exponent $m$ of FCM is set to 1.1 for real-world datasets and 1.05 for natural image datasets. We predefine $K$ for all algorithms except FINCH [44]. All algorithms are initialized randomly and run 10 times. The mean and standard deviation over the 10 runs are reported as the comparison results.

4.1.2 Algorithm Scalability

The clustering results on the real-world datasets, which vary in the number of samples, classes, and dimensions, are summarized in Tab. 2. PAM and RCC time out on Coil-100 due to its high dimensionality. PAC outperforms all the compared clustering algorithms on Coil-100 and Isolet but is not as effective on MNIST and Pendigits as RCC, which is specially designed for entangled data. The robustness and performance of PAC surpass those of center-based methods by a large margin. Moreover, we also provide clustering results on neural network features in Tab. 3 to explore the ability of PAC to handle data extracted by neural networks. RCC experiences extreme performance degradation on such features, so we exclude it from the comparison. PAC also performs well on neural network features. The improvement is not significant on CIFAR-100 and ImageNet-Dogs. One possible explanation is that these datasets contain only subtle differences between object classes, making the pretrained representations hard to distinguish.

Table 2: Clustering results (Avg±Std) and average time (s) of PAC on real-world datasets. The best and second-best results are shown in bold and underlined, respectively. Metric: ACC (%).

| Method | Coil-100 | Isolet | Pendigits | MNIST | Time: Coil-100 | Time: Isolet | Time: Pendigits | Time: MNIST |
|---|---|---|---|---|---|---|---|---|
| KM [53] | 56.4±1.7 | 52.7±4.5 | 67.0±4.7 | 53.0±3.6 | 98.1 | 0.2 | 0.05 | 0.07 |
| PAM [33] | N/A | 55.5±0.0 | 75.6±2.5 | 47.2±1.7 | N/A | 341.9 | 141.6 | 124.0 |
| FCM [6] | 61.6±1.2 | 55.8±2.3 | 70.5±2.1 | 56.6±2.6 | 2001.5 | 8.6 | 0.9 | 0.6 |
| SC [51] | 58.2±0.7 | 53.5±2.5 | 62.4±4.2 | 54.6±2.2 | 11.7 | 3.4 | 5.8 | 6.2 |
| SPKF [27] | 59.7±1.3 | 55.2±2.0 | 71.4±4.4 | 53.9±2.7 | 101.6 | 0.6 | 0.07 | 0.2 |
| RCC [45] | N/A | 15.3±0.0 | 79.6±0.0 | 65.7±0.0 | N/A | 122.8 | 6.9 | 6.9 |
| FINCH [44] | 56.4±0.0 | 47.5±0.0 | 62.7±0.0 | 57.9±0.0 | 15.1 | 0.5 | 0.05 | 0.05 |
| PAC | 65.1±1.5 | 61.8±0.0 | 78.0±0.0 | 59.7±3.6 | 5179.0 | 249.6 | 153.6 | 423.4 |

Table 3: Clustering results (Avg±Std) of PAC on deep features. Metric: ACC (%).

| Method | CIFAR-10 | CIFAR-100 | STL-10 | ImageNet-10 | ImageNet-Dogs |
|---|---|---|---|---|---|
| KM [53] | 76.8±6.8 | 41.8±1.7 | 66.8±4.3 | 76.8±6.8 | 41.8±1.7 |
| PAM [33] | 77.8±2.5 | 41.0±1.1 | 64.3±4.8 | 79.9±4.6 | 52.6±3.1 |
| FCM [6] | 75.9±2.1 | 42.3±0.7 | 66.6±4.7 | 75.9±2.1 | 42.3±0.7 |
| SC [51] | 83.5±0.0 | 40.0±1.1 | 63.8±2.9 | 82.9±1.3 | 47.6±1.4 |
| SPKF [27] | 75.9±5.7 | 42.9±1.9 | 65.8±5.5 | 80.6±7.6 | 49.1±3.8 |
| FINCH [44] | 49.2±0.0 | 32.0±0.0 | 42.9±0.0 | 52.6±0.0 | 43.8±0.0 |
| PAC | 87.1±0.0 | 43.8±0.7 | 74.9±2.6 | 95.8±0.0 | 47.3±3.9 |

4.1.3 Parameter Sensitivity Analysis

We evaluate the sensitivity of both FCM and PAC to the parameter $m$ on Pendigits. Fig. 1 reports the average ACC for different values of $m$. The results indicate that, compared with FCM, PAC has a narrower optimal range of $m$ and a smaller variance in its results, suggesting that it is not sensitive to the parameter $m$.

Figure 1: The effect of weighting exponent $m$ in PAC and FCM.

4.1.4 Time Complexity Analysis

The average running time of each algorithm is listed in Tab. 2. The computational complexity of PAC is analyzed in this section. It takes $\mathcal{O}(N)$ time to calculate $\sum_{j\neq i}p_{j,k}\tilde{d}_{i,j}$ in Eq. 8, and PAC updates the entire $\boldsymbol{P}$ in $NK$ iterations. The time complexity of PAC is therefore $\mathcal{O}(N^{2}K)$, which is quadratic in $N$.
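As a rough worked instance, take the MNIST row of Tab. 1 ($N=10{,}000$, $K=10$) and count one multiply-add per term of the sum in Eq. 8. This cost model is an assumption made for illustration, and constant factors and the number of sweeps are ignored:

```latex
\underbrace{NK}_{\text{updates per sweep}} \times \underbrace{\mathcal{O}(N)}_{\text{cost of } s_{i,k}}
= \mathcal{O}(N^{2}K),
\qquad
N^{2}K = (10^{4})^{2}\times 10 = 10^{9}.
```

One sweep over $\boldsymbol{P}$ therefore costs on the order of $10^{9}$ multiply-adds on this dataset, in addition to the cost of storing or recomputing the $N\times N$ distance matrix.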

4.2 Deep Probability Aggregation Clustering

Table 4: Performance comparison of deep clustering methods on five benchmarks. The best and second-best results are shown in bold and underlined, respectively. Each cell lists NMI / ACC / ARI (%). Temi incorporates extra ImageNet-1k data to pretrain the model, so we exclude it from the comparison. Superscript 1 denotes online deep clustering methods, while superscript 2 denotes offline deep clustering methods. Cluster const. denotes cluster size constraint.

| Method | Cluster const. | CIFAR-10 | CIFAR-100 | STL-10 | ImageNet-10 | ImageNet-Dogs |
|---|---|---|---|---|---|---|
| PICA$^{1}$ [28] | | 59.1 / 69.6 / 51.2 | 31.0 / 33.7 / 17.1 | 61.1 / 71.3 / 53.1 | 80.2 / 87.0 / 76.1 | 35.2 / 35.2 / 20.1 |
| PCL$^{2}$ [34] | | 80.2 / 87.4 / 76.6 | 52.8 / 52.6 / 36.3 | 41.0 / 71.8 / 67.0 | 84.1 / 90.7 / 82.2 | 44.0 / 41.2 / 29.9 |
| IDFD$^{2}$ [48] | | 71.1 / 81.5 / 66.3 | 42.6 / 42.5 / 26.4 | 64.3 / 75.6 / 57.5 | 89.8 / 95.4 / 90.1 | 54.6 / 59.1 / 41.3 |
| NNM$^{1}$ [18] | | 74.8 / 84.3 / 70.9 | 48.4 / 47.7 / 31.6 | 69.4 / 80.8 / 65.0 | - / - / - | - / - / - |
| CC$^{1}$ [35] | | 70.5 / 79.0 / 63.7 | 43.1 / 42.9 / 26.6 | 76.4 / 85.0 / 72.6 | 85.9 / 89.3 / 82.2 | 44.5 / 42.9 / 27.4 |
| GCC$^{1}$ [58] | | 76.4 / 85.6 / 72.8 | 47.2 / 47.2 / 30.5 | 68.4 / 78.8 / 63.1 | 84.2 / 90.1 / 82.2 | 49.0 / 52.6 / 36.2 |
| TCC$^{1}$ [46] | | 79.0 / 90.6 / 73.3 | 47.9 / 49.1 / 31.2 | 73.2 / 81.4 / 68.9 | 84.8 / 89.7 / 82.5 | 55.4 / 59.5 / 41.7 |
| SPICE$^{1}$ [40] | | 73.4 / 83.8 / 70.5 | 44.8 / 46.8 / 29.4 | 81.7 / 90.8 / 81.2 | 82.8 / 92.1 / 83.6 | 57.2 / 64.6 / 47.9 |
| SeCu$^{1}$ [41] | | 79.9 / 88.5 / 78.2 | 51.6 / 51.6 / 36.0 | 70.7 / 81.4 / 65.7 | - / - / - | - / - / - |
| Temi$^{2}$ [1] | | 82.9 / 90.0 / 80.7 | 59.8 / 57.8 / 42.5 | 93.6 / 96.7 / 93.0 | - / - / - | - / - / - |
| DPAC$_{1}^{J_{pac}}$ (Eq. 13) | | 81.2 / 89.0 / 79.1 | 48.3 / 50.2 / 34.4 | 81.8 / 89.7 / 80.0 | 90.1 / 96.0 / 91.1 | 51.9 / 53.9 / 38.9 |
| DPAC$_{1}^{opa}$ (Eq. 12) | | 82.7 / 90.7 / 81.2 | 52.9 / 51.6 / 36.2 | 84.5 / 92.6 / 84.7 | 90.8 / 96.2 / 91.8 | 60.2 / 65.5 / 50.0 |
| With self-labeling fine-tuning (†): | | | | | | |
| SCAN$^{2}$ [50] | | 79.7 / 88.3 / 77.2 | 48.6 / 50.7 / 33.3 | 69.8 / 80.9 / 64.6 | - / - / - | - / - / - |
| SPICE$^{1}$ [40] | | 86.5 / 92.6 / 85.2 | 56.7 / 53.8 / 38.7 | 87.2 / 93.8 / 87.0 | 90.2 / 95.9 / 91.2 | 62.7 / 67.5 / 52.6 |
| TCL$^{1}$ [36] | | 81.9 / 88.7 / 78.0 | 52.9 / 53.1 / 35.7 | 79.9 / 86.8 / 75.7 | 87.5 / 89.5 / 83.7 | 62.3 / 64.4 / 51.6 |
| SeCu$^{1}$ [41] | | 86.1 / 93.0 / 85.7 | 55.2 / 55.1 / 39.7 | 73.3 / 83.6 / 69.3 | - / - / - | - / - / - |
| DPAC$_{1}^{opa}$ | | 87.0 / 93.4 / 86.6 | 54.2 / 55.5 / 39.3 | 86.3 / 93.4 / 86.1 | 92.5 / 97.0 / 93.5 | 66.7 / 72.6 / 59.8 |

4.2.1 Implementation Details

ResNet-34 [26] is used as the backbone network in DPAC to ensure a fair comparison. We employ the architecture of SimCLR [14] with an MLP clustering classifier as the model architecture. DPAC uses the image transformations of SimCLR as one augmentation view and randomly selects four transformations from RandAugment [17] as the other view. We keep a consistent set of hyperparameters ($m=1.03$, $\tau=0.5$) across all benchmarks. The model is trained for 1,000 epochs in the pre-training step and 200 epochs in the clustering step. For self-labeling fine-tuning, we use a linear classifier and train the model as in [36]. The threshold is set to 0.95 for each dataset to select sufficient pseudo labels from the clustering classifier outputs. Adam [31] with a constant learning rate of $1\times10^{-4}$ and a weight decay of $1\times10^{-4}$ is employed. The batch size is set to 240, and the experiments are run on a single NVIDIA 4090 24G GPU.

4.2.2 Comparison with the State of the Art

The comparison of DPAC with existing methods is presented in Tab. 4, where methods with additional cluster size constraints are marked. We have the following observations: (1) DPAC significantly surpasses the performance of SimCLR+PAC in Tab. 3 across all benchmarks. The accuracy of DPAC exceeds that of PAC by more than 10% on the CIFAR-100, STL-10, and ImageNet-Dogs benchmarks, which demonstrates the semantic learning ability of DPAC. (2) Compared with DPAC$^{J_{pac}}$, DPAC$^{opa}$ has better performance. We attribute this to the fact that the self-labeling manner of OPA alleviates the intrinsic bias introduced by the objective function of feature clustering. (3) Compared with deep clustering methods based on offline K-means, such as IDFD [48] and PCL [34], DPAC has superior performance on all benchmarks due to the stable learning offered by the online manner. (4) Compared with the online contrastive clustering methods CC [35], TCC [46], and TCL [36], DPAC incorporates global spatial information to achieve a fine-grained partitioning of cluster boundaries. (5) Compared with balanced clustering methods and SeCu [41], which imposes a minimal cluster size constraint, DPAC omits the clustering regularization term, is more concise, and outputs more flexible cluster assignments. (6) DPAC$^{opa}$† demonstrates the remarkable extensibility of our approach, showcasing the potential for integration with diverse deep modules.

Table 5: Further analysis for DPAC.

(a) Comparison of different contrastive frameworks on CIFAR-10.

| Method | ACC |
|---|---|
| SimCLR+OPA | 89.7 |
| MoCo+OPA | 86.5 |
| DPAC$^{opa}$ | 90.8 |

(b) Comparison of AE-based clustering methods on MNIST.

| Method | NMI | ACC |
|---|---|---|
| DEC [55] | 86.7 | 88.1 |
| IDEC [24] | 86.7 | 88.1 |
| EDESC [8] | 86.2 | 91.3 |
| SSC [52] | 95.0 | 98.2 |
| AE+OPA | 90.3 | 95.4 |

(c) Effect of the clustering regularization (CR) term on CIFAR-10. Metric: ACC (%).

| Method | w/ CR | w/o CR |
|---|---|---|
| SCAN [50] | 85.7 | 0.1 |
| CC [35] | 79.2 | 68.7 |
| GCC [58] | 85.6 | 68.0 |
| DPAC$^{J_{pac}}$ | 88.0 | 89.0 |
| DPAC$^{opa}$ | 0.1 | 90.7 |

4.2.3 Contrastive Framework Analysis

We further analyze our DPAC model from different perspectives. First, we study the effect of the proposed contrastive learning. We replace the weighted contrastive loss in Eq. 13 with the standard contrastive loss and denote the result as SimCLR+OPA. We also perform OPA on top of MoCo [25]. The conventional contrastive loss treats corresponding augmented samples as positive pairs and all others as negative pairs. This ignores the latent semantic structure among negative pairs and leads to the class collision issue [54]. Tab. 5(a) shows that our weighted contrastive loss alleviates the class collision problem and encodes cluster knowledge into contrastive representation learning.
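For reference, the conventional contrastive loss discussed above can be sketched in plain Python. This is an illustrative NT-Xent-style loss as commonly used with SimCLR, not the weighted loss of Eq. 13; the function names `cosine` and `nt_xent` are hypothetical, and the temperature default simply reuses $\tau=0.5$ from the implementation details.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a: List[Vector], view_b: List[Vector], tau: float = 0.5) -> float:
    """Standard contrastive loss over two augmented views of the same batch.

    view_a[i] and view_b[i] form the positive pair; every other embedding in
    the concatenated batch is treated as a negative, regardless of its class.
    """
    z = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        logits = [cosine(z[i], z[k]) / tau for k in range(2 * n) if k != i]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        total += log_denominator - cosine(z[i], z[pos]) / tau
    return total / (2 * n)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]   # positives agree with their anchors
    swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives are orthogonal to anchors
    print(nt_xent(anchors, aligned), nt_xent(anchors, swapped))
```

Because every non-positive sample enters the denominator as a negative, two images of the same semantic class are pushed apart, which is the class collision effect that a cluster-aware weighting is meant to reduce.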

4.2.4 Pretext Task Analysis

We study the effect of different pretext tasks combined with DPAC. An autoencoder (AE) is used as the architecture to demonstrate the generality of our module. The clustering results on MNIST are shown in Tab. 5(b), which demonstrate that OPA can be combined with other self-supervised approaches. In particular, compared with the center-based IDEC [24] and SSC [52], OPA does not require K-means to initialize the cluster layer and has higher scalability.

4.2.5 Balanced Constraint Analysis

We study the impact of balanced constraints in different deep clustering methods. Most existing online deep clustering methods [46, 35, 58] introduce an average entropy as a clustering regularization (CR) term to balance the cluster distribution. The results of the clustering regularization experiments are shown in Tab. 5(c). Without (w/o) the CR term, SCAN assigns all samples to a single cluster, and CC and GCC fall into suboptimal solutions. Moreover, if the CR term carries too much weight in the total loss, it degrades the clustering performance of these methods. Notably, DPAC avoids collapse without the CR term. The performance of DPAC$^{J_{pac}}$ with (w/) the CR term becomes worse, which demonstrates the advantage of unconstrained clustering: there is no trade-off between avoiding trivial solutions and performance. DPAC$^{opa}$ with the CR term yields a uniform distribution with no predictive power. The reason is that the constraint imposed by the CR term is so strong that the classifier cannot accumulate enough optimization information for OPA.
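To make the role of the CR term concrete, the sketch below computes an entropy-based regularizer of the kind described above: the negative entropy of the batch-averaged cluster assignment. The exact form and weighting differ between the cited methods, so the names `mean_assignment_entropy` and `cr_term` and the sign convention are assumptions made for illustration.

```python
import math
from typing import List


def mean_assignment_entropy(probs: List[List[float]]) -> float:
    """Entropy of the cluster distribution averaged over a batch.

    It is maximal (log K) when the K clusters are used equally often and
    zero when every sample is assigned to the same cluster.
    """
    n = len(probs)
    k = len(probs[0])
    mean = [sum(row[j] for row in probs) / n for j in range(k)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


def cr_term(probs: List[List[float]]) -> float:
    """Regularizer to be minimized: the negative of the entropy above."""
    return -mean_assignment_entropy(probs)


if __name__ == "__main__":
    balanced = [[1.0, 0.0], [0.0, 1.0]]   # both clusters used equally
    collapsed = [[1.0, 0.0], [1.0, 0.0]]  # trivial single-cluster solution
    print(cr_term(balanced), cr_term(collapsed))
```

Minimizing this term pushes the batch toward equal cluster usage, which prevents collapse but also biases the solution toward balanced clusters. That bias is the trade-off the unconstrained DPAC variant avoids.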

4.2.6 Hyperparameter Analysis

As listed in Algorithm 2, the weight exponent $m$ is the key hyperparameter of OPA. The quantity $\alpha = 1/(m-1)$ is the power applied to $\boldsymbol{s}_{i,k}$; it sharpens the clustering score in Eq. 9 so that the cluster assignments become distinguishable. The larger $m$ becomes, the weaker the sharpening effect, so the model tends toward uniform assignments and clustering may fail due to insufficient scaling. The performance of OPA under different settings of $m$ is reported in Tab. 6. As features become less separable, the optimal range of $m$ narrows. Therefore, we suggest setting $m$ close to 1 to obtain a universal hyperparameter setting ($m=1.03$ for all datasets).

Table 6: Hyperparameter analysis of the weight exponent $m$ in OPA. Metric: ACC (%).

| Weight exponent $m$ | 1.01 | 1.04 | 1.07 | 1.1 | 1.13 | 1.16 | 1.2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\alpha = 1/(m-1)$ | 100.0 | 25.0 | 14.3 | 10.0 | 7.7 | 6.3 | 5.0 |
| CIFAR-10 | 90.8 | 90.3 | 90.1 | 89.3 | 89.3 | 89.3 | 10.0 |
| STL-10 | 92.4 | 92.1 | 92.0 | 91.8 | 90.7 | 10.0 | 10.0 |
| CIFAR-100 | 50.2 | 51.1 | 51.0 | 5.0 | 5.0 | 5.0 | 5.0 |
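The relation between $m$ and the sharpening power can be illustrated numerically. The sketch below assumes that sharpening amounts to raising non-negative scores to the power $\alpha = 1/(m-1)$ and renormalizing them; the precise form of Eq. 9 is not reproduced here, and the helper names `alpha_from_m` and `sharpen` are hypothetical.

```python
from typing import List


def alpha_from_m(m: float) -> float:
    """Sharpening power alpha = 1 / (m - 1), defined for m > 1."""
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    return 1.0 / (m - 1.0)


def sharpen(scores: List[float], m: float) -> List[float]:
    """Raise non-negative scores to the power alpha and renormalize.

    Scores are divided by their maximum first so that large powers do not
    underflow; this rescaling does not change the normalized result.
    """
    alpha = alpha_from_m(m)
    top = max(scores)
    powered = [(s / top) ** alpha for s in scores]
    total = sum(powered)
    return [p / total for p in powered]


if __name__ == "__main__":
    scores = [0.40, 0.35, 0.25]  # illustrative clustering scores for one sample
    for m in (1.01, 1.04, 1.2):
        rounded = [round(p, 4) for p in sharpen(scores, m)]
        print(m, round(alpha_from_m(m), 1), rounded)
```

With $m$ close to 1 the largest score dominates almost entirely, whereas a larger $m$ leaves the assignment closer to the raw scores. This matches the trend in Tab. 6, where larger $m$ eventually fails to separate clusters.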

4.2.7 Superiority of Online Clustering

We also run an offline clustering version of DPAC to compare online and offline clustering strategies. We adopt KM, FCM, and PAC to compute offline codes of all samples for Eq. 10 every 1, 10, and 200 epochs. The performance and training time are reported in Tab. 7. The performance of KM and FCM gradually deteriorates as the update frequency decreases, whereas DPAC$^{opa}$ achieves better performance at a lower time cost.

We recorded the accumulated errors during DPAC + offline PAC training to analyze the error accumulation issue. Offline PAC was run every 10 epochs. As depicted in Fig. 2, errors (samples that the network classifies correctly but offline clustering classifies incorrectly) are introduced by offline clustering every 10 epochs and continue to accumulate throughout training. This demonstrates that our OPA module effectively mitigates the performance degeneration and error accumulation issues, enabling stable and efficient clustering.
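The error definition above can be expressed as a simple count. The sketch assumes that cluster indices have already been mapped to ground-truth class labels (for example by a one-to-one matching step, which is not shown); the function name `count_introduced_errors` is hypothetical.

```python
from typing import Sequence


def count_introduced_errors(
    true_labels: Sequence[int],
    network_preds: Sequence[int],
    offline_codes: Sequence[int],
) -> int:
    """Count samples the network gets right but offline clustering gets wrong.

    All three sequences are aligned per sample and use the same label space.
    """
    return sum(
        1
        for y, p, c in zip(true_labels, network_preds, offline_codes)
        if p == y and c != y
    )


if __name__ == "__main__":
    y_true = [0, 1, 2, 1]
    y_net = [0, 1, 2, 0]      # network is wrong only on the last sample
    y_offline = [0, 2, 2, 1]  # offline clustering is wrong on the second sample
    print(count_introduced_errors(y_true, y_net, y_offline))
```

Evaluating this count each time the offline codes are refreshed and summing the results over training would give an accumulated-error curve of the kind discussed above.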

Table 7: Comparison of online and offline DPAC on STL-10. Metrics: training time (hours) / ACC (%). The columns give the number of offline clustering runs.

| Method | 200 runs | 20 runs | 1 run |
| --- | --- | --- | --- |
| DPAC + offline KM | 6.3 / 73.7 | 3.0 / 72.0 | 2.0 / 69.3 |
| DPAC + offline FCM | 8.5 / 78.7 | 3.3 / 77.5 | 2.0 / 68.4 |
| DPAC + offline PAC | 52.2 / 83.7 | 6.7 / 87.5 | 2.4 / 81.3 |
| DPAC$^{opa}$ | 2.0 / 92.6 (single entry spanning all columns) | | |

Figure 2: Training process and error accumulation of online and offline DPAC on STL-10.

5 Conclusion

A novel centerless machine clustering method, PAC, is proposed from a new perspective. It addresses the shortcomings of center-based clustering approaches and is well suited for integration with deep models. A theoretical model and an iterative optimization solution for PAC have been developed. PAC performs clustering through sample probability aggregation, which makes computation on a subset of samples possible. Building on this, an online deep clustering framework, DPAC, has been developed; it imposes no constraints on cluster size and can perform more flexible clustering. Experiments on several benchmarks verify the effectiveness of our proposal.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. U22B2036).

References

  • [1] Adaloglou, N., Michels, F., Kalisch, H., Kollmann, M.: Exploring the limits of deep image clustering using pretrained models. arXiv preprint arXiv:2303.17896 (2023)
  • [2] Alimoglu, F., Alpaydin, E.: Combining multiple representations and classifiers for pen-based handwritten digit recognition. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. vol. 2, pp. 637–640. IEEE (1997)
  • [3] Arabie, P., Hubert, J., Soete, D.: Complexity theory: An introduction. Clustering and classification p. 199 (1996)
  • [4] Arthur, D., Vassilvitskii, S., et al.: k-means++: The advantages of careful seeding. In: SODA. vol. 7, pp. 1027–1035 (2007)
  • [5] Asano, Y.M., Rupprecht, C., Vedaldi, A.: Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371 (2019)
  • [6] Bezdek, J.C., Ehrlich, R., Full, W.: FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10(2-3), 191–203 (1984)
  • [7] Bradley, P.S., Bennett, K.P., Demiriz, A.: Constrained k-means clustering. Microsoft Research, Redmond 20(0),  0 (2000)
  • [8] Cai, J., Fan, J., Guo, W., Wang, S., Zhang, Y., Zhang, Z.: Efficient deep embedded subspace clustering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1–10 (2022)
  • [9] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Proceedings of the European conference on computer vision (ECCV). pp. 132–149 (2018)
  • [10] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems 33, 9912–9924 (2020)
  • [11] Celebi, M.E.: Partitional clustering algorithms. Springer (2014)
  • [12] Chang, J., Wang, L., Meng, G., Xiang, S., Pan, C.: Deep adaptive image clustering. In: Proceedings of the IEEE international conference on computer vision. pp. 5879–5887 (2017)
  • [13] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PMLR (2020)
  • [14] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems 33, 22243–22255 (2020)
  • [15] Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. pp. 215–223. JMLR Workshop and Conference Proceedings (2011)
  • [16] Cole, R., Muthusamy, Y., Fanty, M.: The ISOLET spoken letter database. Oregon Graduate Institute of Science and Technology, Department of Computer … (1990)
  • [17] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. pp. 702–703 (2020)
  • [18] Dang, Z., Deng, C., Yang, X., Wei, K., Huang, H.: Nearest neighbor matching for deep clustering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13693–13702 (2021)
  • [19] Deng, L.: The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine 29(6), 141–142 (2012)
  • [20] Deshmukh, A.A., Regatti, J.R., Manavoglu, E., Dogan, U.: Representation learning for clustering via building consensus. Machine Learning 111(12), 4601–4638 (2022)
  • [21] Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD. vol. 96, pp. 226–231 (1996)
  • [22] Fränti, P., Sieranoja, S.: How much can k-means be improved by using better initialization and repeats? Pattern Recognition 93, 95–112 (2019)
  • [23] Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007)
  • [24] Guo, X., Gao, L., Liu, X., Yin, J.: Improved deep embedded clustering with local structure preservation. In: IJCAI. vol. 17, pp. 1753–1759 (2017)
  • [25] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9729–9738 (2020)
  • [26] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [27] Hornik, K., Feinerer, I., Kober, M., Buchta, C.: Spherical k-means clustering. Journal of statistical software 50, 1–22 (2012)
  • [28] Huang, J., Gong, S., Zhu, X.: Deep semantic clustering by partition confidence maximisation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8849–8858 (2020)
  • [29] Huang, Z., Chen, J., Zhang, J., Shan, H.: Learning representation for clustering via prototype scattering and positive sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7509–7524 (2022)
  • [30] Jiao, Y., Xie, N., Gao, Y., Wang, C.C., Sun, Y.: Fine-grained fashion representation learning by online deep clustering. In: European Conference on Computer Vision. pp. 19–35. Springer (2022)
  • [31] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [32] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  • [33] Van der Laan, M., Pollard, K., Bryan, J.: A new partitioning around medoids algorithm. Journal of Statistical Computation and Simulation 73(8), 575–584 (2003)
  • [34] Li, J., Zhou, P., Xiong, C., Hoi, S.C.: Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966 (2020)
  • [35] Li, Y., Hu, P., Liu, Z., Peng, D., Zhou, J.T., Peng, X.: Contrastive clustering. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 8547–8555 (2021)
  • [36] Li, Y., Yang, M., Peng, D., Li, T., Huang, J., Peng, X.: Twin contrastive learning for online clustering. International Journal of Computer Vision 130(9), 2205–2221 (2022)
  • [37] Lin, W.C., Ke, S.W., Tsai, C.F.: Cann: An intrusion detection system based on combining cluster centers and nearest neighbors. Knowledge-based systems 78, 13–21 (2015)
  • [38] Nassar, I., Hayat, M., Abbasnejad, E., Rezatofighi, H., Haffari, G.: Protocon: Pseudo-label refinement via online clustering and prototypical consistency for efficient semi-supervised learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11641–11650 (2023)
  • [39] Nene, S.A., Nayar, S.K., Murase, H., et al.: Columbia object image library (coil-20) (1996)
  • [40] Niu, C., Shan, H., Wang, G.: Spice: Semantic pseudo-labeling for image clustering. IEEE Transactions on Image Processing 31, 7264–7278 (2022)
  • [41] Qian, Q.: Stable cluster discrimination for deep clustering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16645–16654 (2023)
  • [42] Qian, Q., Xu, Y., Hu, J., Li, H., Jin, R.: Unsupervised visual representation learning by online constrained k-means. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16640–16649 (2022)
  • [43] Ronen, M., Finder, S.E., Freifeld, O.: Deepdpm: Deep clustering with an unknown number of clusters. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9861–9870 (2022)
  • [44] Sarfraz, S., Sharma, V., Stiefelhagen, R.: Efficient parameter-free clustering using first neighbor relations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8934–8943 (2019)
  • [45] Shah, S.A., Koltun, V.: Robust continuous clustering. Proceedings of the National Academy of Sciences 114(37), 9814–9819 (2017)
  • [46] Shen, Y., Shen, Z., Wang, M., Qin, J., Torr, P., Shao, L.: You never cluster alone. Advances in Neural Information Processing Systems 34, 27734–27746 (2021)
  • [47] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems 33, 596–608 (2020)
  • [48] Tao, Y., Takagi, K., Nakata, K.: Clustering-friendly representation learning via instance discrimination and feature decorrelation. arXiv preprint arXiv:2106.00131 (2021)
  • [49] Tsai, T.W., Li, C., Zhu, J.: Mice: Mixture of contrastive experts for unsupervised image clustering. In: International conference on learning representations (2021)
  • [50] Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., Van Gool, L.: Scan: Learning to classify images without labels. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X. pp. 268–285. Springer (2020)
  • [51] Von Luxburg, U.: A tutorial on spectral clustering. Statistics and computing 17, 395–416 (2007)
  • [52] Wang, H., Lu, N., Luo, H., Liu, Q.: Self-supervised clustering with assistance from off-the-shelf classifier. Pattern Recognition 138, 109350 (2023)
  • [53] Wang, J., Wang, J., Song, J., Xu, X.S., Shen, H.T., Li, S.: Optimized cartesian k-means. IEEE Transactions on Knowledge and Data Engineering 27(1), 180–192 (2014)
  • [54] Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: International Conference on Machine Learning. pp. 9929–9939. PMLR (2020)
  • [55] Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: International conference on machine learning. pp. 478–487. PMLR (2016)
  • [56] Yang, B., Fu, X., Sidiropoulos, N.D., Hong, M.: Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In: international conference on machine learning. pp. 3861–3870. PMLR (2017)
  • [57] Zhan, X., Xie, J., Liu, Z., Ong, Y.S., Loy, C.C.: Online deep clustering for unsupervised representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6688–6697 (2020)
  • [58] Zhong, H., Wu, J., Chen, C., Huang, J., Deng, M., Nie, L., Lin, Z., Hua, X.S.: Graph contrastive clustering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9224–9233 (2021)
  • [59] Zhong, S.: Efficient online spherical k-means clustering. In: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. vol. 5, pp. 3180–3185. IEEE (2005)
  • [60] Znalezniak, M., Rola, P., Kaszuba, P., Tabor, J., Śmieja, M.: Contrastive hierarchical clustering. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 627–643. Springer (2023)