A-BIRCH: Automatic Threshold Estimation for the BIRCH Clustering Algorithm
Boris Lorbeer, Ana Kosareva, Bersant Deva, Dženan Softić, Peter Ruppel, Axel
Küpper
Abstract Clustering algorithms have recently been regaining attention with the availability of large datasets and the rise of parallelized computing architectures. However, most clustering algorithms do not scale well with increasing dataset sizes and require proper parametrization for correct results. In this paper we present A-BIRCH, an approach for automatic threshold estimation for the BIRCH clustering algorithm using Gap Statistic. This approach renders the global clustering step of BIRCH unnecessary and does not require prior knowledge of the expected number of clusters. This is achieved by analyzing a small representative subset of the data to extract attributes such as the cluster radius and the minimal cluster distance. These attributes are then used to compute a threshold that results, with high probability, in the correct clustering of elements. For the analysis of the representative subset we parallelized Gap Statistic to improve performance and ensure scalability.
1 Introduction
Clustering is an unsupervised learning method that groups a set of given data points
into well separated subsets. Two prominent examples of clustering algorithms are
k-means [9] and the EM algorithm [4]. This paper addresses two issues with clus-
tering: (1) clustering algorithms usually do not scale well and (2) most algorithms
require the number of clusters (cluster count) as input. The first issue is becoming
more and more important. For applications that need to cluster, e.g., millions of documents, huge image or video databases, or terabytes of IoT sensor data, scalability is essential.
Boris Lorbeer, Ana Kosareva, Bersant Deva, Peter Ruppel, Axel Küpper
Service-centric Networking, Telekom Innovation Laboratories, Technische Universität Berlin
e-mail: {lorbeer|ana.kosareva|bersant.deva|peter.ruppel|axel.kuepper}@tu-berlin.de
Dženan Softić
Service-centric Networking, Telekom Innovation Laboratories, Technische Universität Berlin
e-mail: [email protected]
The second issue severely reduces the applicability of clustering in situations where the cluster count is very difficult to predict, such as data exploration,
feature engineering, and document clustering.
An important clustering method is BIRCH [17], which is one of the fastest clus-
tering algorithms available. It outperforms most of the other clustering algorithms
by up to two orders of magnitude. Thus, BIRCH already solves the first issue men-
tioned above. However, to achieve sufficient clustering quality, BIRCH requires the
cluster count as input, therefore failing to solve the second issue. This paper de-
scribes a method to use BIRCH without having to provide the cluster count, yet
preserving cluster quality and speed. We achieve this by omitting the global clus-
tering step and carefully choosing the threshold parameter of BIRCH. Following an
idea by Bach and Jordan [7], we propose to learn this parameter from representative
data. Our approach targets datasets drawn from two-dimensional isotropic Gaussian distributions, which are typical when dealing with, for example, geospatial data.
2 Related Work
Clustering algorithms usually do not scale well because they often have a complexity of O(N²) or O(NM), where N is the number of data points and M is the cluster
count. Scalability is typically achieved by parallelization of the algorithm in com-
pute clusters, e.g., Mahout’s k-means clustering [11] or Spark’s distributed versions
of k-means, EM clustering, and power iteration [10]. Other parallelization attempts
use the GPU. This has been done for k-means [16], EM clustering [8], and oth-
ers. The bottleneck here is the relatively slow connection between host and device
memory if the data does not fit into device memory.
The second issue we are concerned with is the identification of the cluster count.
A standard approach is to use one of the clustering algorithms that require the cluster count as an input parameter and to run it for each count k within an interval of likely values. Then, the “elbow method” [14] is used to determine the optimal
number k. For probabilistic models, one can apply information criteria like AIC [1]
or BIC [13] to rate the different clustering results, see, for example, [18]. But all
those methods increase the computation time considerably, especially if there is not
enough prior information to keep the range of possible cluster counts small. Some
clustering algorithms find the number of clusters directly, without having to be run for all possible counts. Two of the more well-known examples are
DBSCAN [5] and Gap Statistic [15]. There are also some attempts to improve the
clustering quality of BIRCH by changing the algorithm itself, e.g. with non-constant
thresholds [6], with two different global thresholds [2], or by using DBSCAN on
each CF level to reduce noise [3]. However, while sometimes improving the quality,
those approaches slow BIRCH down and still require the cluster count as input.
3 BIRCH
We briefly describe BIRCH, mainly to fix notation. For details, see [17]. BIRCH
requires three parameters: the branching factor Br, the threshold T , and the cluster
count k. While the data points are entered into BIRCH, a height-balanced tree, the
CF tree, of hierarchical clusters is built. Each node contains the most important information about the corresponding cluster, the cluster features (CF). From those, the cluster center C = (1/n) ∑_{i=1}^{n} x_i, where {x_i}_{i=1}^{n} are the elements of the cluster, and the cluster radius R = sqrt((1/n) ∑_{i=1}^{n} ||x_i − C||²) can be computed for each cluster. Every new point starts at the root
and recursively walks down the tree entering the subcluster with the nearest center.
When adding a point at a leaf, a new cluster is created if absorbing the point would increase the radius R of the nearest cluster beyond the threshold T; otherwise the point is added to that cluster.
If the creation of a new cluster leads to more than Br child nodes of the parent, the
parent is split. To ensure that the tree stays balanced, the nodes above might need to
be split recursively. Once all points are submitted to BIRCH, the centers of the leaf
clusters are, in the global clustering phase, entered into k-means with the cluster
count k. This last step improves the cluster quality by merging neighboring clusters.
In this paper, we will refer to the BIRCH algorithm as full-BIRCH and to BIRCH
without its global clustering phase as tree-BIRCH.
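As an illustration of the cluster features and the threshold test described above (a minimal sketch, not the original BIRCH implementation; class and method names are ours), the following Python code maintains the cluster features (n, linear sum, sum of squared norms) of a single leaf cluster and decides whether a new point can be absorbed:

import numpy as np

class ClusterFeature:
    """Cluster features (CF) of one leaf cluster: count n, linear sum LS, squared-norm sum SS."""
    def __init__(self, point):
        self.n = 1
        self.ls = np.array(point, dtype=float)    # linear sum of the points
        self.ss = float(np.dot(point, point))     # sum of squared norms

    def center(self):
        return self.ls / self.n                   # C = (1/n) * sum_i x_i

    def radius(self):
        # R = sqrt((1/n) * sum_i ||x_i - C||^2), computable from n, LS, SS alone
        c = self.center()
        return np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0))

    def radius_if_added(self, point):
        n, ls = self.n + 1, self.ls + point
        ss = self.ss + float(np.dot(point, point))
        c = ls / n
        return np.sqrt(max(ss / n - np.dot(c, c), 0.0))

    def try_absorb(self, point, threshold):
        """Absorb the point if the resulting radius stays within the threshold T."""
        if self.radius_if_added(point) <= threshold:
            self.n += 1
            self.ls += point
            self.ss += float(np.dot(point, point))
            return True
        return False    # otherwise the caller creates a new leaf cluster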
Tree-BIRCH is very fast. It clusters 100,000 points into 1000 clusters in 4 seconds on a 2.9 GHz Intel Core i7, using scikit-learn [12]. The k-means implementation of the same library needs over two minutes to complete the same task on the same architecture. Furthermore, tree-BIRCH does not require the cluster count
as input, which in full-BIRCH is only needed for the global clustering phase. How-
ever, tree-BIRCH usually suffers from bad clustering quality. Therefore, this paper
focuses on improving the clustering quality of tree-BIRCH.
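In scikit-learn, tree-BIRCH corresponds to running Birch with the global clustering step disabled (n_clusters=None). A minimal usage sketch with illustrative parameter values (not the exact benchmark setup above; the choice of the threshold is discussed in the next section):

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic 2D data: isotropic Gaussian blobs (illustrative values only).
# Standard deviation 1/sqrt(2) gives cluster radius R = 1 in two dimensions.
X, _ = make_blobs(n_samples=100_000, centers=1000, cluster_std=1.0 / np.sqrt(2),
                  center_box=(-150, 150), random_state=0)

# tree-BIRCH: n_clusters=None skips the global clustering phase,
# so no cluster count has to be provided.
tree_birch = Birch(threshold=1.9, branching_factor=2000, n_clusters=None)
labels = tree_birch.fit_predict(X)
print("clusters found:", len(np.unique(labels)))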
4 Concept
Fig. 1: Cluster splitting (A); cluster combining, where the combining cluster is circled (B); depiction of cluster radius and cluster distance (C). The different shapes and colors of the observations correspond to the clusters they belong to.
Tree-BIRCH can make two kinds of errors: cluster splitting (Figure 1A) and cluster combining (Figure 1B). Clusters with a large ratio of cluster distance (the distance between cluster centers) to cluster radius are less likely to produce such errors (Figure 1C). In the next
section, we derive a condition for the error probability to be less than one percent.
Also, a formula for an appropriate threshold is provided. Note that there is, in fact,
another source of splitting. This happens if a cluster overlaps with two regions be-
longing to two non-leaf nodes in the CF-tree. To avoid this, we choose the branching factor Br to be larger than the highest possible cluster count, so that the root never splits and no intermediate non-leaf nodes arise.
In this paper it is presumed that the clusters are samples from two-dimensional
isotropic Gaussian distributions of roughly the same variance. To find conditions
for tree-BIRCH to work well, we first need to determine the common radius R of
the clusters and the minimum cluster distance Dmin . It is presumed that there exists a
small but representative subset of the data that has the same cluster radius and mini-
mum cluster distance Dmin as the full dataset. On this small dataset, Gap Statistic is
applied to obtain the cluster count k. This k in turn is given to k-means to produce a
clustering of the subset, which finally yields the two values R and Dmin .
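A possible sketch of this estimation step in Python (the gap_statistic helper is assumed to return the cluster count k of the subset and is not part of scikit-learn; the subset is a NumPy array of shape (m, 2)):

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

def estimate_radius_and_distance(subset, gap_statistic):
    """Estimate the common cluster radius R and the minimal cluster distance Dmin
    from a small representative subset of the data."""
    k = gap_statistic(subset)                       # cluster count of the subset via Gap Statistic
    km = KMeans(n_clusters=k, n_init=10).fit(subset)
    centers = km.cluster_centers_

    # R: root mean squared distance of the points to their cluster center,
    # averaged over the clusters (the same radius definition as in BIRCH).
    radii = [np.sqrt(np.mean(np.sum((subset[km.labels_ == j] - centers[j]) ** 2, axis=1)))
             for j in range(k)]
    R = float(np.mean(radii))

    # Dmin: smallest pairwise distance between the estimated cluster centers.
    D_min = float(pdist(centers).min()) if k > 1 else float("inf")
    return R, D_min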
For the determination of R and Dmin one could also use any other clustering al-
gorithm that finds the cluster centers and radii without requiring the cluster count
k as an input. However, Gap Statistic is chosen here due to its high precision. Our
approach is heuristic. In each case, tree-BIRCH is run often enough to deduce es-
timates of cluster splitting and combining probabilities with sufficiently small 95%
confidence intervals (with 10,000 repetitions of each tree-BIRCH clustering, the
confidence interval of our error estimate at 0.01 is roughly 0.01 ± 0.002).
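The quoted interval follows from the normal approximation of the binomial proportion, as the following quick check shows (values taken from the sentence above):

import math

p_hat, n_runs = 0.01, 10_000                     # estimated error rate, number of repetitions
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n_runs)
print(f"95% confidence interval: {p_hat} +/- {half_width:.4f}")   # roughly 0.01 +/- 0.0020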
[Plots of Figure 2 omitted in this extraction; recoverable panel titles: “threshold = 1.6, error = 1.0%” and “threshold = 1.7, error = 0.5%”; axes: probability of cluster counts over the total number of data points (a) and error probability over the threshold (b).]
Fig. 2: (a) For each pair (n, T ) of total number n of objects, running from 100 to
10,000, and threshold T ∈ {1.5, 1.6, 1.7}, we sampled n elements from a Gaussian
of radius R = 1, and applied tree-BIRCH 10,000 times to compute the probabilities
for each of the cluster counts 1, 2, and 3. Every count different from 1 is an error.
(b) For each threshold T, we sampled 500 points from a single Gaussian of radius R = 1, applied tree-BIRCH, and recorded whether it returned the correct number of clusters. This was repeated 10,000 times for each T to obtain an estimate of the error probability.
According to the results presented in Figure 2a, there is no indication that the
number of objects in the cluster impacts the error probability. However, the error
probability is clearly affected by the threshold parameter; the error drops below one
percent when the threshold value is greater than or equal to 1.6 · R.
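This single-cluster experiment can be reproduced with a short script such as the following sketch (assuming scikit-learn's Birch; a 2D isotropic Gaussian with standard deviation σ has BIRCH radius R = √2 · σ, so σ = 1/√2 yields R = 1):

import numpy as np
from sklearn.cluster import Birch

def splitting_error_rate(threshold, n_points=500, repetitions=1000, seed=0):
    """Fraction of runs in which tree-BIRCH splits a single Gaussian cluster of radius R = 1."""
    rng = np.random.default_rng(seed)
    sigma = 1.0 / np.sqrt(2)            # 2D isotropic Gaussian with radius R = 1
    errors = 0
    for _ in range(repetitions):
        X = rng.normal(0.0, sigma, size=(n_points, 2))
        labels = Birch(threshold=threshold, branching_factor=50,
                       n_clusters=None).fit_predict(X)
        errors += int(len(np.unique(labels)) != 1)   # any count other than 1 is an error
    return errors / repetitions

for T in (1.4, 1.5, 1.6, 1.7, 1.8):
    print(T, splitting_error_rate(T))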
[Plots of Figure 3 omitted in this extraction; recoverable information: (a) probabilities of the found cluster counts (one to four) over the total number of data points; (b) error probability over the threshold for cluster distances 5.0, 5.5, 6.5, and 7.0.]
Fig. 3: (a) For each pair (n, T ) of total number n of objects, running from 100 to
10,000, and threshold T ∈ {1.5, 2.0, 3.0}, we sampled 10,000 times n elements from
a mixture of two Gaussians of distance 6.0, and each time applied tree-BIRCH to
compute the probabilities for each of the cluster counts 1, 2, 3, and 4. Every count
different from 2 is an error. (b) For each pair (D, T ) of cluster distance D and thresh-
old T, we sampled 10,000 times 500 elements from a mixture of two Gaussians of
radius R = 1 and distance D. Each time we applied tree-BIRCH with the threshold
set to T and computed the error probabilities.
A threshold that is too small leads to frequent cluster splitting, which results in higher cluster counts and a higher error. This can also be deduced from Figure 3a, where for the small threshold T = 1.5 we see many cluster counts of three and four. With T = 2.0, less splitting occurs and the error probability decreases. If the threshold grows further (T = 3.0), cluster combining occurs more frequently, which increases the frequency of cluster count three and therefore increases the error probability. The fact that the curves in Figure 3b drop below one percent later than the curve in Figure 2b is due to cluster combining, which was not possible with just one cluster. For D ≥ 6.0 there are thresholds for which the error probability drops below one percent. The minimum is located near T = 1.9. From this observation it can be inferred that, if Dmin ≥ 6.0 · R, a choice of
T = 1.9 · R (1)
would ensure that for each pair of neighboring clusters in the dataset, the error prob-
ability would be less than one percent.
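The resulting rule can be stated as a small helper function (a sketch of the heuristic above, not a library function):

def abirch_threshold(R, D_min):
    """Threshold heuristic derived above; assumes well separated clusters."""
    if D_min < 6.0 * R:
        raise ValueError("Dmin < 6 * R: the one percent error bound is not guaranteed")
    return 1.9 * R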
[Plot of Figure 4 omitted in this extraction; error probability over the threshold, one curve per cluster distance D ∈ {8.0, 10.5, 13.0, 15.5, 18.0, 20.5, 23.0, 25.5}.]
Fig. 4: For each pair (D, T) of cluster distance D and threshold T, we sampled 10,000 times 500 elements from a mixture of two Gaussians of radius R = 1 and
distance D. Each time we applied tree-BIRCH to compute the error probabilities.
6 Evaluation
Fig. 5: The datasets A, B, C and D contain 3, 10, 100 and 200 clusters, respectively.
Each cluster consists of 1000 elements, the radius of the clusters is R = 1, and the
Dmin is in all cases larger than 6: 6.001 in A, 7.225 in B, 6.025 in C, and 6.410 in D.
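Synthetic datasets of this kind can be generated, for example, as follows (a sketch that reproduces only the construction principle, not the exact datasets A to D: isotropic 2D Gaussians with radius R = 1, i.e., standard deviation 1/√2, and pairwise center distances of at least 6):

import numpy as np

def make_dataset(n_clusters, points_per_cluster=1000, d_min=6.0, seed=0):
    """Isotropic 2D Gaussian clusters with radius R = 1 and pairwise center distances >= d_min."""
    rng = np.random.default_rng(seed)
    sigma = 1.0 / np.sqrt(2)                  # radius R = 1 for a 2D isotropic Gaussian
    box = 10.0 * np.sqrt(n_clusters)          # sampling box grows with the cluster count
    centers = []
    while len(centers) < n_clusters:          # rejection sampling of well separated centers
        c = rng.uniform(-box, box, size=2)
        if all(np.linalg.norm(c - o) >= d_min for o in centers):
            centers.append(c)
    return np.vstack([rng.normal(c, sigma, size=(points_per_cluster, 2)) for c in centers])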
The results show that the parallelized implementation of Gap Statistic with Spark
is scalable as the computation times decrease linearly with an increasing number
of worker nodes. Although the Gap Statistic phase is considered computationally
expensive, it increases the correctness of BIRCH significantly and does not require any prior knowledge of the dataset.
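For completeness, a minimal sketch of how the reference simulations of Gap Statistic can be distributed with Spark (assuming a pyspark SparkContext sc and using scikit-learn's k-means inertia as the within-cluster dispersion W_k; this illustrates the idea and is not the exact implementation evaluated above):

import numpy as np
from sklearn.cluster import KMeans

def log_wk(data, k, seed=0):
    # log of the within-cluster dispersion W_k, taken here as the k-means inertia
    return np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data).inertia_)

def gap_statistic_spark(sc, X, k_max=10, n_refs=20):
    mins, maxs, shape = X.min(axis=0), X.max(axis=0), X.shape

    def reference_log_wk(task):
        k, b = task                                  # one reference simulation per task
        rng = np.random.default_rng(b)
        ref = rng.uniform(mins, maxs, size=shape)    # uniform reference dataset
        return k, log_wk(ref, k, seed=b)

    tasks = [(k, b) for k in range(1, k_max + 1) for b in range(n_refs)]
    ref_logs = (sc.parallelize(tasks)                # distribute the k_max * n_refs simulations
                  .map(reference_log_wk)
                  .groupByKey().mapValues(list)
                  .collectAsMap())

    gaps = {k: np.mean(ref_logs[k]) - log_wk(X, k) for k in range(1, k_max + 1)}
    sds = {k: np.std(ref_logs[k]) * np.sqrt(1.0 + 1.0 / n_refs) for k in range(1, k_max + 1)}

    # smallest k with gap(k) >= gap(k+1) - s_{k+1} (Tibshirani et al. [15])
    for k in range(1, k_max):
        if gaps[k] >= gaps[k + 1] - sds[k + 1]:
            return k
    return k_max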
7 Conclusion
References
1. Akaike, H.: Information Theory and an Extension of the Maximum Likelihood Principle, pp.
199–213. Springer New York (1998)
2. Burbeck, K., Nadjm-Tehrani, S.: Adaptive real-time anomaly detection with incremental clus-
tering. Information Security Technical Report 12(1), pp. 56–67 (2007)
3. Dash, M., Liu, H., Xu, X.: ’1 + 1 > 2’: Merging distance and density based clustering. In: Proceedings of the Seventh International Conference on Database Systems for Advanced Applications, pp. 32–39 (2001)
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39(1), pp.
1–38 (1977)
5. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters
in large spatial databases with noise. In: E. Simoudis, J. Han, U.M. Fayyad (eds.) Second
International Conference on Knowledge Discovery and Data Mining, pp. 226–231. AAAI
Press (1996)
6. Ismael, N., Alzaalan, M., Ashour, W.: Improved multi threshold birch clustering algorithm.
International Journal of Artificial Intelligence and Applications for Smart Devices 2(1), pp.
1–10 (2014)
7. Jordan, M.I., Bach, F.R.: Learning spectral clustering. In: Advances in Neural
Information Processing Systems 16. MIT Press (2003)
8. Kumar, N.S.L.P., Satoor, S., Buck, I.: Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA. In: 11th IEEE International Conference on High Performance
Computing and Communications, pp. 103–109 (2009)
9. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1,
pp. 281–297. University of California Press (1967)
10. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Freeman, J., Tsai,
D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., Talwalkar,
A.: MLlib: Machine learning in Apache Spark. CoRR (2015)
11. Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning Publications Co.
(2011)
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine Learning in Python. Journal of Machine
Learning Research 12, pp. 2825–2830 (2011)
13. Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics 6(2), pp. 461–464
(1978)
14. Sugar, C.: Techniques for clustering and classification with applications to medical problems. Department of Statistics, Stanford University (1998)
15. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the
gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2),
pp. 411–423 (2001)
16. Zechner, M., Granitzer, M.: Accelerating k-means on the graphics processor via CUDA. In: First
International Conference on Intensive Applications and Services, pp. 7–15 (2009)
17. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: a new data clustering algorithm and its
applications. Data Mining and Knowledge Discovery 1(2), pp. 141–182 (1997)
18. Zhou, B., Hansen, J.: Unsupervised audio stream segmentation and clustering via the Bayesian information criterion. In: Proceedings of ICSLP-2000: International Conference on Spoken Language Processing, pp. 714–717 (2000)