Birch Clustering
GAYATHRI PRASAD S
BIRCH (Balanced Iterative Reducing
and Clustering using Hierarchies)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a
hierarchical clustering algorithm that provides a memory-efficient clustering method
for large datasets. Clustering is performed on compact summaries of the data rather
than on every individual point, and the data typically needs to be scanned only once.
The BIRCH algorithm builds a Clustering Feature (CF) tree for a given dataset. Each
CF summarizes a sub-cluster and holds only the statistics needed for clustering, so
the method does not need to keep the entire dataset in memory.
BIRCH actually complements other clustering algorithms by virtue of the fact that
different clustering algorithms can be applied to the summary produced by BIRCH.
BIRCH can only deal with metric attributes. A metric attribute is one whose values
can be represented by explicit coordinates in a Euclidean space (no categorical
variables).
Clustering Feature (CF)
BIRCH attempts to minimize the memory requirements of large datasets by
summarizing the information contained in dense regions as Clustering Feature
(CF) entries.
Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS),
where ‘N’ is the number of data points in the cluster, ‘LS’ is the linear sum of the
data points and ‘SS’ is the sum of the squared data points in the cluster. Because
these three quantities are additive, a CF entry can be composed of other CF entries.
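The additivity of the (N, LS, SS) triple can be illustrated with a minimal Python sketch. The class name CF and its methods are hypothetical names chosen for this illustration, and SS is assumed to be stored as a single scalar (the sum of squared norms); other implementations may store it differently.

```python
import numpy as np
from dataclasses import dataclass


@dataclass(eq=False)
class CF:
    """Hypothetical Clustering Feature: the triple (N, LS, SS)."""
    n: int           # N: number of points summarized
    ls: np.ndarray   # LS: linear sum of the points (a vector)
    ss: float        # SS: sum of squared norms of the points (assumed scalar)

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x.copy(), float(x @ x))

    def merge(self, other):
        # Merging two sub-clusters only needs component-wise addition.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root mean squared distance to the centroid: sqrt(SS/N - ||centroid||^2).
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))


# Two single-point CFs merged into one CF, without revisiting the raw points.
pair = CF.from_point([0.0, 0.0]).merge(CF.from_point([2.0, 0.0]))
print(pair.n, pair.centroid(), pair.radius())
```

The centroid and radius are recovered from the triple alone, which is why the raw points do not have to be kept once they are summarized.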
CF Tree
The CF-tree is a very compact representation of the dataset because each entry in a leaf node
is not a single data point but a subcluster.
Each non-leaf node contains at most B entries. Here, a single entry contains a pointer
to a child node and a CF that is the sum of the CFs in the child (subclusters of
subclusters).
A leaf node contains at most L entries, and each entry is a CF (subclusters of data points).
All entries in a leaf node must satisfy a threshold requirement: the diameter (or radius,
depending on the implementation) of each leaf entry has to be less than the threshold T.
The larger the threshold, the smaller the CF tree.
In addition, every leaf node has two pointers, prev and next, which are used to chain all leaf
nodes together for efficient scans.
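The threshold requirement on leaf entries can be sketched as follows. This is a deliberately simplified illustration, not the full algorithm: it keeps a flat list of leaf entries instead of a tree (so B, L and node splitting are ignored), it uses the radius rather than the diameter as the size measure, and the function names radius and insert_point are hypothetical.

```python
import numpy as np


def radius(n, ls, ss):
    # Root mean squared distance of the summarized points to their centroid.
    c = ls / n
    return float(np.sqrt(max(ss / n - c @ c, 0.0)))


def insert_point(entries, x, threshold):
    """Absorb x into the closest entry if the merged entry stays below the
    threshold; otherwise start a new entry. `entries` is a list of
    (N, LS, SS) tuples and is modified in place."""
    x = np.asarray(x, dtype=float)
    if entries:
        # The closest entry is the one whose centroid is nearest to x.
        dists = [np.linalg.norm(ls / n - x) for n, ls, ss in entries]
        i = int(np.argmin(dists))
        n, ls, ss = entries[i]
        merged = (n + 1, ls + x, ss + float(x @ x))
        if radius(*merged) < threshold:
            entries[i] = merged
            return
    entries.append((1, x.copy(), float(x @ x)))


entries = []
for p in [[0, 0], [0.2, 0], [5, 5], [5.1, 5]]:
    insert_point(entries, p, threshold=0.5)

print([n for n, ls, ss in entries])  # number of points held by each entry
```

With a larger threshold, more points are absorbed into existing entries, so fewer entries are created, which is the reason a larger threshold gives a smaller tree.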
Parameters
•threshold : The threshold is the maximum radius a sub-cluster in a leaf node
of the CF tree may have. The sub-cluster obtained by merging a new sample with its
closest sub-cluster must have a radius smaller than the threshold; otherwise a new
sub-cluster is started.
•branching_factor : This parameter specifies the maximum number of CF
sub-clusters in each node. If a new data instance arrives such that the
number of sub-clusters exceeds the branching factor, that node is split
into two nodes, with the sub-clusters redistributed between them.
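The two parameters above share their names with the arguments of scikit-learn's Birch estimator, so a short usage sketch is given here under the assumption that scikit-learn is the intended library. The synthetic dataset and the specific parameter values are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic numeric data: 300 two-dimensional points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# threshold and branching_factor control the CF tree; n_clusters asks for a
# final global clustering step on the sub-clusters found in the leaves.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print("leaf sub-clusters:", len(model.subcluster_centers_))
print("final labels:", np.unique(labels))
```

Lowering threshold generally produces more leaf sub-clusters (a finer summary), while raising it produces fewer.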
The main advantage of BIRCH is that it can cluster the data in a single scan; in
addition, the CF tree helps improve the quality of the clusters.
Its main limitation is that it works only with numeric (vector) data.