Data Mining Using Conceptual Clustering
Clustering
By Trupti Kadam
What is Data Mining?
• Many definitions
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
• The COBWEB algorithm considers merging the
two existing child nodes with the highest
scores.

[Figure: Merge — children A and B of parent P are combined under a single new node N, which becomes their parent]
• The COBWEB algorithm considers splitting
the existing child node with the highest
score.
[Figure: Split — node N is removed and its children A and B are promoted to become children of parent P]
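Both operators can be sketched on a bare tree-node structure (a minimal illustration, not the slides' implementation; the `Node` class and the function signatures are assumptions):

```python
class Node:
    """A concept node holding a list of child nodes."""
    def __init__(self, children=None):
        self.children = children if children is not None else []

def merge(parent, a, b):
    """Merge children a and b of parent: create a new node n,
    make a and b its children, and put n in their place."""
    n = Node(children=[a, b])
    parent.children = [c for c in parent.children if c not in (a, b)]
    parent.children.append(n)
    return n

def split(parent, p):
    """Split child p of parent: remove p and promote its
    children to be direct children of parent."""
    parent.children.remove(p)
    parent.children.extend(p.children)
    return parent
```

Note that `split` is the inverse of `merge`: splitting the node produced by a merge restores the original children.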
The COBWEB Algorithm

Cobweb(N, I)

Input: The current node N in the concept hierarchy.
       An unclassified (attribute-value) instance I.
Results: A concept hierarchy that classifies the instance.
Top-level call: Cobweb(Top-node, I).
Variables: C, P, Q, and R are nodes in the hierarchy.
           W, X, Y, and Z are clustering (partition) scores.

If N is a terminal node,
Then Create-new-terminals(N, I)
     Incorporate(N, I).
Else Incorporate(N, I).
     For each child C of node N,
          Compute the score for placing I in C.
     Let P be the node with the highest score W.
     Let Q be the node with the second highest score.
     Let X be the score for placing I in a new node R.
     Let Y be the score for merging P and Q into one node.
     Let Z be the score for splitting P into its children.
     If W is the best score,
     Then Cobweb(P, I) (place I in category P).
     Else if X is the best score,
     Then initialize R’s probabilities using I’s values
          (place I by itself in the new category R).
     Else if Y is the best score,
     Then let O be Merge(P, Q, N).
          Cobweb(O, I).
     Else if Z is the best score,
     Then Split(P, N).
          Cobweb(N, I).
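The scores above are conventionally computed with the category-utility function. The slides do not define it, so the following is a sketch of the standard formulation for nominal attributes, with each cluster represented as a list of attribute-value dicts (these representation choices are assumptions):

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a partition: the mean, over clusters, of
    P(C_k) * (expected attribute predictions correct given C_k
              - expected correct without the partition)."""
    instances = [inst for c in clusters for inst in c]
    n = len(instances)
    attrs = list(instances[0].keys())

    def expected_correct(insts):
        # sum over attributes i and values j of P(A_i = V_ij)^2
        total = 0.0
        for a in attrs:
            counts = Counter(inst[a] for inst in insts)
            total += sum((cnt / len(insts)) ** 2 for cnt in counts.values())
        return total

    baseline = expected_correct(instances)
    return sum(len(c) / n * (expected_correct(c) - baseline)
               for c in clusters) / len(clusters)
```

A partition that groups identical instances together scores higher than one that mixes values, which is exactly the property COBWEB's placement, merge, and split decisions exploit.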
• Limitations of COBWEB
– The assumption that the attributes are independent of
each other is often too strong, because correlations may
exist
– Not suitable for clustering large databases: the tree can
become skewed, and maintaining the probability
distributions is expensive
ITERATE
The algorithm has three primary steps:
1. Derive a classification tree using category utility
as a criterion function for grouping instances.
2. Extract a good initial partition of data from the
classification tree as a starting point to focus the
search for desirable groupings or clusters.
3. Iteratively redistribute data objects among the
groupings to achieve maximally separable clusters.
Derivation of classification tree

Extraction of a good initial partition
The initial partition structure is extracted by
comparing the CU values of classes (nodes)
along a path in the classification tree. For any
path from root to leaf of a classification tree, this
value initially increases, and then drops.
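This rise-then-drop rule can be sketched as a tree walk that descends while some child improves on the current node's CU and cuts where the score peaks (a minimal sketch; the `CNode` class with a precomputed `cu` score per node is an assumption, not from the slides):

```python
class CNode:
    """A classification-tree node with a precomputed CU score (assumed)."""
    def __init__(self, cu, children=()):
        self.cu = cu
        self.children = list(children)

def extract_partition(node, partition=None):
    """Descend while some child scores higher than the current node;
    cut (emit the node into the partition) where CU stops rising."""
    if partition is None:
        partition = []
    if any(c.cu > node.cu for c in node.children):
        for c in node.children:
            extract_partition(c, partition)
    else:
        partition.append(node)  # CU has peaked on this path: cut here
    return partition
```

The emitted nodes form the initial partition that the redistribution step then refines.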
Iteratively redistribute data objects
• The iterative redistribution operator is applied
to maximize the cohesion measure for
individual classes in the partition.
• The redistribution operator assigns object d to
the class k for which the category match measure
CM_dk is maximum.
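The redistribution loop can be sketched as follows. Since the slides do not give the CM_dk formula, the score here is a simplified stand-in (the fraction of the object's attribute values that match the candidate cluster's most frequent value), not ITERATE's actual measure:

```python
from collections import Counter

def category_match(obj, cluster):
    """Simplified stand-in for CM_dk: fraction of obj's attribute
    values equal to the cluster's modal (most frequent) value."""
    if not cluster:
        return 0.0
    score = 0
    for a, v in obj.items():
        mode, _ = Counter(inst[a] for inst in cluster).most_common(1)[0]
        if v == mode:
            score += 1
    return score / len(obj)

def redistribute(clusters, max_iters=10):
    """Repeatedly move each object to the class with the highest
    category-match score until assignments stabilize."""
    for _ in range(max_iters):
        moved = False
        for i, cluster in enumerate(clusters):
            for obj in list(cluster):
                # Score obj against every class, excluding obj itself
                # from its current class so it cannot vote for itself.
                scores = [category_match(obj, [o for o in c if o is not obj])
                          for c in clusters]
                best = max(range(len(clusters)), key=lambda j: scores[j])
                if best != i and scores[best] > scores[i]:
                    cluster.remove(obj)
                    clusters[best].append(obj)
                    moved = True
        if not moved:
            break
    return clusters
```

Requiring a strictly better score before moving an object is what lets the loop terminate once the classes are maximally separated.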
Evaluating Cluster Partitions
• To assess the result of a clustering
operation, we adopt a measure
known as cohesion, which measures the
degree of intraclass similarity between objects
in the same class.
• The increase in predictability for an object d
assigned to cluster k, M_dk, is
defined as
THANK YOU