Data Mining Using Conceptual Clustering


By Trupti Kadam
What is Data Mining?

• Many definitions:
  – Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Origins of Data Mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to:
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data

[Figure: data mining at the intersection of statistics, machine learning/AI/pattern recognition, and database systems]
Clustering Definition
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
Conceptual Clustering
• Unsupervised and spontaneous: it categorizes or postulates concepts without a teacher.
• Conceptual clustering forms a classification tree, with all initial observations in the root. New children can be created using a single attribute (usually too weak), all attribute combinations, information metrics, and so on. Each node in the tree is a class.
• The method must decide the quality of a class partition and its significance (distinguishing real structure from noise).
• Many models use search to discover hierarchies that fulfill some heuristic within and/or between clusters, such as similarity or cohesiveness.
Concept Under CC
Concept Hierarchy
• Suppose we choose 6 as the threshold value for similarity. The algorithm produces five distinct clusters, (1,2), (3,4), (5,6,7,8), (5,6), and (5,7,8), after deleting redundant ones, and a hierarchy is formed as follows:
[Figure: the resulting concept hierarchy over clusters (1,2), (3,4), (5,6,7,8), (5,6), and (5,7,8)]
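To make the thresholding idea concrete, here is a minimal Python sketch (not from the slides). The slides' exact similarity data and algorithm are not reproduced, so the similarity matrix below is hypothetical and the grouping rule is plain single-link merging; it cannot reproduce the overlapping sub-clusters (5,6) and (5,7,8) above, but it does show how raising the threshold yields nested, hierarchical groupings.

    # hypothetical similarity matrix; unlisted pairs are assumed dissimilar
    points = [1, 2, 3, 4, 5, 6, 7, 8]
    sim = {(1, 2): 8, (3, 4): 8, (5, 6): 7, (5, 7): 6, (5, 8): 6, (7, 8): 7}

    def cluster_at_threshold(points, sim, threshold):
        parent = {p: p for p in points}          # union-find forest
        def find(p):
            while parent[p] != p:
                parent[p] = parent[parent[p]]    # path halving
                p = parent[p]
            return p
        for (a, b), s in sim.items():
            if s >= threshold:
                parent[find(a)] = find(b)        # merge the two groups
        groups = {}
        for p in points:
            groups.setdefault(find(p), []).append(p)
        return sorted(groups.values())

    print(cluster_at_threshold(points, sim, 6))  # [[1, 2], [3, 4], [5, 6, 7, 8]]
    print(cluster_at_threshold(points, sim, 7))  # finer: [[1, 2], [3, 4], [5, 6], [7, 8]]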
The COBWEB Conceptual Clustering Algorithm
• The COBWEB algorithm was developed by machine learning researchers in the 1980s for clustering objects in an object-attribute data set.
• COBWEB yields a clustering dendrogram, called a classification tree, that characterizes each cluster with a probabilistic description.
• When given a new instance, COBWEB considers the overall quality of either placing the instance in an existing category or modifying the hierarchy.
• The criterion COBWEB uses for evaluating the quality of a classification is called category utility.
Category utility
• Category utility was developed in research on human categorization (Gluck and Corter, 1985).
• It attempts to maximize both the probability that two objects in the same category have attribute values in common and the probability that objects in different categories have different attribute values.
• Manhattan distance or Euclidean distance can also be used to measure cohesion among clusters.
Category utility

• For a single cluster C_k, category utility is

    CU(C_k) = P(C_k) \sum_i \sum_j [ P(A_i = V_{ij} | C_k)^2 - P(A_i = V_{ij})^2 ]

  where P(C_k) represents the size (prior probability) of cluster C_k, P(A_i = V_{ij}) represents the probability of attribute A_i taking on value V_{ij} over the entire set, and P(A_i = V_{ij} | C_k) is its conditional probability of taking the same value in class C_k.
• To evaluate an entire partition made up of K clusters, we use the average CU over the K clusters:

    CU = (1/K) \sum_{k=1}^{K} CU(C_k)
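As an illustration (not from the slides), the following minimal Python sketch computes CU for a fixed partition of objects with nominal attributes; the toy animal data is invented.

    from collections import Counter

    def category_utility(partition):
        """CU = (1/K) * sum_k P(C_k) * sum_{i,j} [P(A_i=V_ij|C_k)^2 - P(A_i=V_ij)^2]
        where partition is a list of clusters, each a list of attribute dicts."""
        objects = [obj for cluster in partition for obj in cluster]
        n = len(objects)
        attrs = {a for obj in objects for a in obj}
        # marginal value counts over the entire data set
        marginal = {a: Counter(obj[a] for obj in objects) for a in attrs}
        total = 0.0
        for cluster in partition:
            p_k = len(cluster) / n
            within = {a: Counter(obj[a] for obj in cluster) for a in attrs}
            inner = 0.0
            for a in attrs:
                for v, count in marginal[a].items():
                    p_cond = within[a][v] / len(cluster)
                    p_marg = count / n
                    inner += p_cond ** 2 - p_marg ** 2
            total += p_k * inner
        return total / len(partition)

    # toy data: animals described by two nominal attributes
    data = [{"cover": "fur", "legs": "four"}, {"cover": "fur", "legs": "four"},
            {"cover": "feathers", "legs": "two"}, {"cover": "feathers", "legs": "two"}]
    print(category_utility([data[:2], data[2:]]))                      # 0.5 (coherent split)
    print(category_utility([[data[0], data[2]], [data[1], data[3]]]))  # 0.0 (mixed split)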
The Classification Tree Generated by the COBWEB Algorithm
• COBWEB performs a hill-climbing search of the space of possible taxonomies (trees), using category utility to evaluate and select possible categorizations:
  – It initializes the taxonomy to a single category whose features are those of the first example.
  – For each subsequent example, the algorithm begins with the root category and moves through the tree.
  – At each level it uses category utility to evaluate the taxonomies produced by:
    1. Placing the example in the best existing category
    2. Adding a new category containing the example
    3. Merging two existing categories and adding the example to the merged category
    4. Splitting an existing category and placing the example into the best category in the tree
• Insertion means that the new object is placed into one of the existing child nodes. The COBWEB algorithm evaluates the CU value of inserting the new object into each of the existing child nodes and selects the one with the highest score.
• The COBWEB algorithm also considers creating a new child node specifically for the new object.
• The COBWEB algorithm further considers merging the two existing child nodes with the highest and second-highest scores, as shown below:

P P
… … … Merge … …

A B
N

A B
• Finally, the COBWEB algorithm considers splitting the existing child node with the highest score, as shown below:

P P
… … Split … … …

A B
N

A B
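As a small sketch (hypothetical Node structure, not the slides' code), merging and splitting amount to simple operations on the attribute-value count tables, matching the two diagrams above.

    from collections import Counter, defaultdict

    class Node:
        def __init__(self):
            self.count = 0
            self.av = defaultdict(Counter)   # attribute -> value -> count
            self.children = []

    def merge(parent, a, b):
        """Merge children a and b of parent into one node n (counts add) and
        make a and b the children of n, as in the merge diagram."""
        n = Node()
        n.count = a.count + b.count
        for child in (a, b):
            for attr, vals in child.av.items():
                n.av[attr].update(vals)
        n.children = [a, b]
        parent.children = [c for c in parent.children if c not in (a, b)] + [n]
        return n

    def split(parent, n):
        """Replace child n of parent by n's own children, as in the split diagram."""
        parent.children = [c for c in parent.children if c is not n] + n.children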
The COBWEB Algorithm

Cobweb(N, I)

Input: The current node N in the concept hierarchy, and an unclassified (attribute-value) instance I.
Results: A concept hierarchy that classifies the instance.
Top-level call: Cobweb(Top-node, I).
Variables: C, P, Q, and R are nodes in the hierarchy. W, X, Y, and Z are clustering (partition) scores.

If N is a terminal node,
  Then Create-new-terminals(N, I)
       Incorporate(N, I).
Else Incorporate(N, I).
  For each child C of node N,
    Compute the score for placing I in C.
  Let P be the node with the highest score W.
  Let Q be the node with the second-highest score.
  Let X be the score for placing I in a new node R.
  Let Y be the score for merging P and Q into one node.
  Let Z be the score for splitting P into its children.
  If W is the best score,
    Then Cobweb(P, I) (place I in category P).
  Else if X is the best score,
    Then initialize R's probabilities using I's values (place I by itself in the new category R).
  Else if Y is the best score,
    Then let O be Merge(P, Q, N).
         Cobweb(O, I).
  Else if Z is the best score,
    Then Split(P, N).
         Cobweb(N, I).
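Below is a compressed Python sketch of this insert loop for nominal attributes. It is an illustration under simplifying assumptions (merge and split omitted, no normalization of CU by attribute count), not Fisher's reference implementation.

    import copy
    from collections import Counter, defaultdict

    class Node:
        """A concept node: attribute-value counts plus children."""
        def __init__(self):
            self.count = 0
            self.av = defaultdict(Counter)   # attribute -> value -> count
            self.children = []

        def incorporate(self, inst):
            self.count += 1
            for a, v in inst.items():
                self.av[a][v] += 1

        def withdraw(self, inst):            # undo a trial incorporate
            self.count -= 1
            for a, v in inst.items():
                self.av[a][v] -= 1

    def partition_cu(parent):
        """Average category utility of parent's current children."""
        total = 0.0
        for c in parent.children:
            inner = sum((c.av[a][v] / c.count) ** 2 - (n / parent.count) ** 2
                        for a, vals in parent.av.items() for v, n in vals.items())
            total += (c.count / parent.count) * inner
        return total / len(parent.children)

    def cobweb(node, inst):
        """Insert inst under node, choosing between the best existing child and
        a new singleton child by CU (merge and split are omitted here)."""
        if not node.children and node.count > 0:
            # terminal node: keep its old concept as one leaf, inst as another
            node.children = [copy.deepcopy(node)]
            leaf = Node(); leaf.incorporate(inst)
            node.children.append(leaf)
            node.incorporate(inst)
            return
        node.incorporate(inst)
        if not node.children:                # empty root receiving its first instance
            return
        best, best_cu = None, float("-inf")
        for child in node.children:          # score placing inst in each child
            child.incorporate(inst)
            score = partition_cu(node)
            child.withdraw(inst)
            if score > best_cu:
                best, best_cu = child, score
        new = Node(); new.incorporate(inst)  # score a brand-new singleton child
        node.children.append(new)
        if partition_cu(node) >= best_cu:
            return                           # keep the new singleton child
        node.children.pop()
        cobweb(best, inst)                   # otherwise descend into the best child

    root = Node()
    for inst in [{"cover": "fur"}, {"cover": "feathers"},
                 {"cover": "fur"}, {"cover": "feathers"}]:
        cobweb(root, inst)
    print([c.count for c in root.children])  # [2, 2]: fur and feathers concepts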
• Limitations of COBWEB
  – The assumption that the attributes are independent of each other is often too strong, because correlations may exist.
  – It is not suitable for clustering large databases: the tree can become skewed, and maintaining the probability distributions is expensive.
ITERATE
The algorithm has three primary steps:
1. Derive a classification tree using category utility as a criterion function for grouping instances.
2. Extract a good initial partition of the data from the classification tree as a starting point to focus the search for desirable groupings or clusters.
3. Iteratively redistribute data objects among the groupings to achieve maximally separable clusters.
Derivation of classification tree
The initial partition structure is extracted by comparing the CU values of classes or nodes along a path in the classification tree. For any path from root to leaf of a classification tree, this value initially increases and then drops.
Extraction of a good initial partition
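The slide's figure is not reproduced. As a rough illustration of the rule just described, here is a small Python sketch (hypothetical TNode structure with precomputed CU values) that cuts each root-to-leaf path where CU peaks.

    class TNode:
        """A classification-tree node with a precomputed CU score (hypothetical)."""
        def __init__(self, cu, children=()):
            self.cu = cu
            self.children = list(children)

    def initial_partition(node):
        """Return the nodes where CU peaks along each root-to-leaf path:
        keep a node if none of its children improves on its CU."""
        if not node.children or node.cu >= max(c.cu for c in node.children):
            return [node]
        return [n for c in node.children for n in initial_partition(c)]

    # toy tree: CU rises from the root, peaks one level down, then drops
    tree = TNode(0.10, [
        TNode(0.40, [TNode(0.25), TNode(0.30)]),   # CU peaks here
        TNode(0.20, [TNode(0.12), TNode(0.15)]),   # and here
    ])
    print([n.cu for n in initial_partition(tree)])  # [0.4, 0.2]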
Iteratively redistribute data objects
• The iterative redistribution operator is applied to maximize the cohesion measure for individual classes in the partition.
• The redistribution operator assigns each object d to the class k for which the category match measure CM_dk is maximum, as sketched below.
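A minimal sketch of this redistribution loop follows. The slides do not reproduce the exact CM_dk formula, so the match score below is a stand-in assumption: the increase in predictability of d's attribute values inside cluster k, i.e. d's contribution to the cluster's category-utility term.

    from collections import Counter

    def match(obj, cluster, marginal, n):
        """Stand-in for CM_dk: increase in predictability of obj's values in cluster."""
        if not cluster:
            return float("-inf")
        score = 0.0
        for a, v in obj.items():
            p_cond = sum(1 for o in cluster if o.get(a) == v) / len(cluster)
            p_marg = marginal[a][v] / n
            score += p_cond ** 2 - p_marg ** 2
        return score

    def redistribute(objects, assignment, k, rounds=10):
        """Move each object to the cluster with the best match score until stable."""
        n = len(objects)
        attrs = {a for o in objects for a in o}
        marginal = {a: Counter(o.get(a) for o in objects) for a in attrs}
        for _ in range(rounds):
            moved = False
            for i, obj in enumerate(objects):
                # score obj against every cluster, leaving obj itself out
                clusters = [[o for j, o in enumerate(objects)
                             if assignment[j] == c and j != i] for c in range(k)]
                best = max(range(k),
                           key=lambda c: match(obj, clusters[c], marginal, n))
                if best != assignment[i]:
                    assignment[i], moved = best, True
            if not moved:
                break
        return assignment

    # hypothetical data: redistribution fixes an initially impure assignment
    objs = [{"cover": "fur"}, {"cover": "fur"},
            {"cover": "feathers"}, {"cover": "feathers"}]
    print(redistribute(objs, [0, 0, 0, 1], k=2))  # moves the misplaced bird: [0, 0, 1, 1]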
Evaluating Cluster Partitions
• To assess the result of a clustering operation, we adopt a measure known as cohesion, which measures the degree of similarity between objects in the same class (intraclass similarity).
• The increase in predictability for an object d assigned to cluster k, M_dk, measures how much more predictable d's attribute values become once d is placed in cluster k.
THANK YOU
