Birch Clustering

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a memory-efficient hierarchical clustering algorithm designed for large datasets, which summarizes data using Clustering Features (CF) to minimize memory requirements. It constructs a CF-tree that allows for efficient clustering without scanning all data points, and it is best suited for numeric attributes. While it offers advantages like a single scan and improved cluster quality, it is limited to metric attributes only.

Birch Clustering
GAYATHRI PRASAD S
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a
hierarchical clustering algorithm that provides a memory-efficient clustering
method for large datasets. Clustering is performed without repeatedly scanning
all points in the dataset. The BIRCH algorithm builds a Clustering Feature (CF)
tree for the dataset; each CF summarizes a sub-cluster, holding only the
necessary statistics rather than the raw points. Thus the method does not need
to keep the entire dataset in memory.

BIRCH also complements other clustering algorithms, since a different clustering
algorithm can be applied to the summary it produces.
BIRCH can only deal with metric attributes. A metric attribute is one whose values
can be represented by explicit coordinates in a Euclidean space (no categorical
variables).
Clustering Feature (CF)
BIRCH attempts to minimize the memory requirements of large datasets by
summarizing the information contained in dense regions as Clustering Feature
(CF) entries.

Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS),
where N is the number of data points in the cluster, LS is the linear sum of the
data points, and SS is the squared sum of the data points in the cluster. CF
entries are additive: the CF of a merged cluster is the element-wise sum of the
CFs of its parts, so a CF entry can be composed of other CF entries.
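As a small sketch (not from the slides), the triple above can be built and merged directly; the centroid and radius of a sub-cluster are recoverable from the CF alone, which is why BIRCH never needs the raw points. The function names here are illustrative, not part of any library.

```python
import numpy as np

def make_cf(points):
    """Build a Clustering Feature (N, LS, SS) from an array of points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ls = pts.sum(axis=0)       # linear sum: one component per dimension
    ss = float((pts ** 2).sum())  # squared sum: a single scalar
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CFs are additive: merging two sub-clusters just adds the triples."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_radius(cf):
    """Centroid and radius derived from the CF, without the raw points."""
    n, ls, ss = cf
    centroid = ls / n
    # radius^2 = mean squared norm minus squared norm of the centroid
    radius = np.sqrt(max(ss / n - centroid @ centroid, 0.0))
    return centroid, radius
```

For example, the CF of the points (1,1), (2,2), (3,3) is (3, [6, 6], 28), and merging the CF of the first point with the CF of the last two reproduces it exactly.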
CF Tree
The CF-tree is a very compact representation of the dataset because each entry in a leaf node
is not a single data point but a subcluster.

Each non-leaf node contains at most B entries. Here, a single entry contains a pointer
to a child node and a CF equal to the sum of the CFs in that child (sub-clusters of
sub-clusters).

A leaf node contains at most L entries, and each entry is a CF (subclusters of data points).

All entries in a leaf node must satisfy a threshold requirement: the diameter of
each leaf entry has to be less than the threshold. The larger the threshold, the
smaller the CF-tree.

Each node must fit in a memory page.

In addition, every leaf node has two pointers, prev and next, which are used to chain all leaf
nodes together for efficient scans.
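The threshold test above can be performed without the raw points, because the diameter of a sub-cluster is recoverable from its CF triple. The sketch below (illustrative names, with CFs represented as `(N, LS, SS)` tuples as defined earlier) shows the decision a leaf makes when a new point arrives: absorb it into the closest entry if the merged diameter stays under the threshold, otherwise reject it so the caller can start a new entry or split the node.

```python
import numpy as np

def diameter(cf):
    """Average pairwise distance of a sub-cluster, from its CF alone:
    D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1))."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * ls @ ls) / (n * (n - 1)), 0.0))

def try_absorb(cf, point, threshold):
    """Tentatively merge `point` into the leaf entry `cf`.
    Accept only if the merged entry's diameter stays below the
    threshold; otherwise return the entry unchanged."""
    p = np.asarray(point, dtype=float)
    merged = (cf[0] + 1, cf[1] + p, cf[2] + p @ p)
    if diameter(merged) < threshold:
        return merged, True
    return cf, False
```

A nearby point is absorbed; a far-away point is rejected, which is what forces the tree to grow new leaf entries for outlying regions.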
Parameters
threshold : The radius of the sub-cluster obtained by merging a new sample with
its closest sub-cluster must be smaller than this value; otherwise a new
sub-cluster is started. It is an upper bound on sub-cluster size, not a count of
data points.
•branching_factor : This parameter specifies the maximum number of CF sub-
clusters in each (internal) node. If a new data instance arrives such that the
number of sub-clusters surpasses the branching factor, that node is split into
two nodes with the sub-clusters redistributed between them.

•n_clusters : The number of clusters to be returned after the entire BIRCH
algorithm is complete, i.e., the number of clusters after the final clustering
step. If set to None, the final clustering step is not performed and the
intermediate sub-clusters are returned.
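The three parameters map directly onto scikit-learn's `Birch` estimator. A minimal usage sketch, assuming synthetic data from `make_blobs` (the specific dataset and parameter values here are illustrative):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Hypothetical dataset: 500 points drawn around 4 well-separated centers.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

# threshold bounds the sub-cluster radius, branching_factor bounds node
# fan-out, and n_clusters drives the final global clustering step.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = model.fit_predict(X)
```

Setting `n_clusters=None` instead would skip the final step and return one label per CF-tree leaf sub-cluster, which is useful when BIRCH is used as a pre-clustering stage for another algorithm.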
Features
The BIRCH algorithm is best suited to cases where the amount of data is large
and the number of clusters is also relatively large.

It requires only a single scan of the data, and the CF-tree summary improves
the quality of the resulting clusters.

Its main limitation is that it works only on numeric (vector) data.
Thank You
