Approximating Dasgupta Cost in Sublinear Time from a Few Random Seeds

Kapralov, Michael; Kumar, Akash; Lattanzi, Silvio; Mousavifar, Aida; Wrzos-Kaminska, Weronika

Abstract:Testing graph cluster structure has been a central object of study in property testing since the foundational work of Goldreich and Ron [STOC'96] on expansion testing, i.e. the problem of distinguishing between a single cluster (an expander) and a graph that is far from a single cluster. More generally, a $(k, \epsilon)$-clusterable graph $G$ is a graph whose vertex set admits a partition into $k$ induced expanders, each with outer conductance bounded by $\epsilon$. A recent line of work initiated by Czumaj, Peng and Sohler [STOC'15] has shown how to test whether a graph is close to $(k, \epsilon)$-clusterable, and to locally determine which cluster a given vertex belongs to with misclassification rate $\approx \epsilon$, but no sublinear time algorithms for learning the structure of inter-cluster connections are known. As a simple example, can one locally distinguish between the `cluster graph' forming a line and a clique?
In this paper, we consider the problem of testing the hierarchical cluster structure of $(k, \epsilon)$-clusterable graphs in sublinear time. Our measure of hierarchical clusterability is the well-established Dasgupta cost, and our main result is an algorithm that approximates Dasgupta cost of a $(k, \epsilon)$-clusterable graph in sublinear time, using a small number of randomly chosen seed vertices for which cluster labels are known. Our main result is an $O(\sqrt{\log k})$ approximation to Dasgupta cost of $G$ in $\approx n^{1/2+O(\epsilon)}$ time using $\approx n^{1/3}$ seeds, effectively giving a sublinear time simulation of the algorithm of Charikar and Chatziafratis [SODA'17] on clusterable graphs. To the best of our knowledge, ours is the first result on approximating the hierarchical clustering properties of such graphs in sublinear time.

Subjects:	Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:2207.02581 [cs.DS]
	(or arXiv:2207.02581v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2207.02581

Computer Science > Data Structures and Algorithms

Title:Approximating Dasgupta Cost in Sublinear Time from a Few Random Seeds

Submission history

Access Paper:

References & Citations

1 blog link

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators