
Unit 5
Mining Frequent Patterns and Cluster Analysis
Frequent patterns :
• Frequent patterns are patterns (e.g., itemsets, subsequences, or
substructures) that appear frequently in a data set.
• For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset.
• A subsequence, such as buying first a PC, then a digital camera, and then
a memory card, if it occurs frequently in a shopping history database, is a
(frequent) sequential pattern.
• A substructure can refer to different structural forms, such as subgraphs,
subtrees, or sublattices, which may be combined with itemsets or
subsequences. If a substructure occurs frequently, it is called a (frequent)
structured pattern.
• Finding frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.
• It helps in data classification, clustering, and other data mining tasks.
• Thus, frequent pattern mining has become an important data mining task and a
focused theme in data mining research.
• Frequent pattern mining searches for recurring relationships in a given data set.
• Frequent pattern mining is important for the discovery of interesting associations and
correlations between itemsets in transactional and relational databases.
Market Basket Analysis :
• A typical example of frequent itemset mining is market basket analysis. This
process analyzes customer buying habits by finding associations between the
different items that customers place in their “shopping baskets”.
• The discovery of these associations can help retailers to develop marketing
strategies by gaining insight into which items are frequently purchased
together by customers.
• Example : If customers are buying milk, how likely are they to also buy bread (and
what kind of bread) on the same trip to the supermarket?
• This information can lead to increased sales by helping retailers do selective
marketing and plan their shelf space.
The figure illustrates a customer's buying behavior.

Fig. : Market basket analysis.

Benefits of Market Basket Analysis :
• Store Layout: You can organize or set up your store according to market basket
analysis in order to increase revenue.
• Marketing Messages: Market basket analysis increases the efficiency of
marketing messages, whether they are delivered by phone, email, social media, etc. You
can suggest the next best option to customers by using market basket
analysis data.
• Maintain Inventory: If you have done market basket analysis, then you know
which products your customers are likely to buy in the future, and you
can maintain your inventory accordingly. You can also predict customers' future
purchases over a period of time on the basis of market basket
analysis data.
• Content Placement: Market basket analysis is used by online
retailers to display the content that a customer is likely to view next.
It helps to engage customers on your website. Market basket analysis
helps to increase traffic on your website and to get better conversion rates.
• Recommendation Engines: Market basket analysis is the basis for creating
recommendation engines. A recommendation engine is software that
analyzes, identifies, and recommends content to users in which they are
interested.
Application of Market Basket Analysis :
Market basket analysis is applied to various fields of the retail sector in order to
boost sales and generate revenue by identifying the needs of the customers
and making purchase suggestions to them.
• Cross Selling: Cross-selling is a sales technique in which the seller
suggests a related product to a customer after a purchase.
Market basket analysis helps the retailer to understand consumer behavior
and then go for cross-selling.
• Product Placement: It refers to placing complementary goods (pen and
paper) and substitute goods (tea and coffee) together so that the customer
notices the goods and buys them together. Market basket
analysis helps the retailer to identify the goods that a customer is likely to
purchase together.
• Affinity Promotion: Affinity promotion is a method of promotion that designs
promotional events based on associated products. In affinity promotion, market
basket analysis is a useful way to prepare and analyze questionnaire data.
• Fraud Detection: Market basket analysis is also applied to fraud detection.
Based on credit card data, it may be possible to detect certain purchase
behavior that can be associated with fraud.
• Customer Behavior: Market basket analysis helps to understand customer
behavior under different conditions. It allows the retailer to identify the
relationship between two products that people tend to buy and hence helps
to understand the customer behavior towards a product or service.
Basic Concepts :
• Itemset
• Support
• Confidence
• Frequent Itemsets
• Closed Itemsets
• Association Rules

o Support and confidence are two measures of rule interestingness.
o They respectively reflect the usefulness and certainty of discovered rules.
o Association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. These thresholds can be set by
users or domain experts.
Itemset :
• An itemset is a set of items.
• A transaction t is an itemset with associated transaction ID, t = (tid, I), where I is
the set of items of the transaction.
• A transaction t = (tid, I) contains itemset X if X ⊆ I

Support :
• Support indicates how useful a rule is, i.e., how frequently it applies in the database.
• The support of itemset X in database D is the number of transactions in D that
contain it: sup(X, D) = |{t ∈ D : t contains X}|
• Support_count(X) : the number of transactions in which X appears. If X is A ∪ B,
then it is the number of transactions in which both A and B are present.
• Support(A => B) = Support_count(A ∪ B) / total number of transactions.
• A support of 5% means that 5% of all transactions in the database follow the rule.
Confidence:
• The confidence or strength for an association rule A => B is the ratio of the
number of transactions that contain A U B to the number of transactions that
contain A.
• Consider a rule A => B, it is a measure of ratio of the number of tuples containing
both A and B to the number of tuples containing A.

tuples_containing_both_A_and_B
• Confidence A => B =
tuples_containing_A
• A confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter. (A short computation sketch follows.)
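As a rough illustration of the two measures above, the following Python sketch computes support and confidence for one rule over a small, made-up transaction list (the item names and data are assumptions for demonstration only).

```python
# Minimal sketch (assumed toy data) of computing support and confidence
# for the rule {milk} => {bread} from a list of transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "butter"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

A, B = {"milk"}, {"bread"}
sup_AB = support_count(A | B, transactions)           # transactions containing A and B
support = sup_AB / len(transactions)                  # fraction of all transactions
confidence = sup_AB / support_count(A, transactions)  # sup(A ∪ B) / sup(A)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```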
Frequent itemset :
• An itemset X is frequent if X's support is not less than a minimum support threshold.
• A frequent itemset is a set of items that appears in at least a pre-specified number of transactions.
• Frequent itemsets are typically used to generate Association rules.

Closed itemset :
An itemset is closed if none of its immediate supersets has the same support as the
itemset.
Consider two itemsets X and Y: if every item of X is in Y and there is at least one item of Y
that is not in X, then Y is a proper super-itemset of X.
Itemset X is closed if no such proper super-itemset has the same support count as X.
If X is both closed and frequent, then it is called a closed frequent itemset.
Association Rule :
The rules that satisfy both a minimum support threshold (min_sup) and a minimum
confidence threshold (min_conf) are called strong association rules.
Frequent Itemset Mining Methods :

Apriori Algorithm : Finding Frequent Itemsets by Confined Candidate Generation
• The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent
itemsets in a dataset for Boolean association rules.
• The algorithm is named Apriori because it uses prior knowledge of frequent
itemset properties.
• It applies an iterative, level-wise search in which frequent k-itemsets are
used to find candidate (k+1)-itemsets.
• To improve the efficiency of level-wise generation of frequent itemsets, an
important property called the Apriori property is used, which reduces the
search space.
Apriori Property –
• All non-empty subsets of a frequent itemset must also be frequent.
• The key concept of the Apriori algorithm is the anti-monotonicity of the support measure.
• Anti-monotonicity - If a set is infrequent, then all of its supersets are also
infrequent.
• Apriori assumes that :
• All subsets of a frequent itemset must be frequent (Apriori property).
• If an itemset is infrequent, all its supersets will be infrequent (a small pruning
sketch follows).
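The pruning implied by the Apriori property can be expressed directly in code. This is only a minimal sketch; the helper name has_infrequent_subset and the sample 2-itemsets below are illustrative assumptions, not part of the original example.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori pruning: a k-candidate is dropped if any of its
    (k-1)-subsets is not among the frequent (k-1)-itemsets."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

# Hypothetical frequent 2-itemsets (L2)
L2 = {frozenset({"I1", "I2"}), frozenset({"I1", "I3"}), frozenset({"I2", "I3"})}
print(has_infrequent_subset(frozenset({"I1", "I2", "I3"}), L2))  # False -> keep candidate
print(has_infrequent_subset(frozenset({"I1", "I2", "I4"}), L2))  # True  -> prune candidate
```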
Consider the following dataset to find frequent itemsets.

• minimum support count is 2


• minimum confidence is 60%
Step-1: K=1
(I) Create a table containing the support count of each item
present in the dataset – called C1 (the candidate set).

(II) Compare each candidate itemset's support_count with the
minimum_support_count (here min_support = 2).
If the support_count of a candidate itemset is less than
min_support, remove it. This gives us
itemset L1.
Step-2: K=2
(I)
o Generate candidate set C2 using L1 (this is called the join step). The condition for
joining Lk-1 and Lk-1 is that the itemsets should have (k-2) elements in common.
o Check whether all subsets of each candidate itemset are frequent; if not, remove that
itemset.
o (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this
for each itemset.)
o Now find the support count of these itemsets by searching the dataset.
(II)
o Compare the candidate (C2) support counts with the minimum support count (here
min_support = 2).
o If the support_count of a candidate itemset is less than min_support, remove it.
o This gives us itemset L2.
Step-3:
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1
is that the itemsets should have (k-2) elements in common. So here, for L2, the first
element should match.
• So the itemsets generated by joining L2 are
{I1, I2, I3} {I1, I2, I5} {I1, I3, I5} {I2, I3, I4} {I2, I4, I5} {I2, I3, I5}
• Check whether all subsets of these itemsets are frequent and, if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which
are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly
check every itemset.)
• Find the support count of the remaining itemsets by searching the dataset.
(II)
o Compare the candidate (C3) support counts with the minimum support count (here min_support = 2).
o If the support_count of a candidate itemset is less than min_support, remove it.
o This gives us itemset L3.
Step-4:
(I)
o Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1
(K=4) is that they should have (K-2) elements in common. So here, for L3, the first
2 elements (items) should match.
o Check whether all subsets of these itemsets are frequent (here the itemset formed
by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not
frequent). So there is no itemset in C4.
o We stop here because no further frequent itemsets are found. (A compact level-wise
sketch of the whole procedure follows.)
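Putting the join, prune, and support-counting steps together, a compact level-wise Apriori sketch might look as follows. Since the example's dataset table is not reproduced above, the transactions below are an assumed reconstruction chosen to be consistent with the support counts used in the worked example (min_support_count = 2); the function name apriori is illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise Apriori sketch: L1 -> C2 -> L2 -> ... until no candidates survive."""
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    current = [frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) >= min_support_count]
    all_frequent = list(current)
    k = 2
    while current:
        # Join step: union pairs of (k-1)-itemsets that produce a k-itemset
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        prev = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Support counting by scanning the database
        current = [c for c in candidates
                   if sum(1 for t in transactions if c <= t) >= min_support_count]
        all_frequent.extend(current)
        k += 1
    return all_frequent

# Assumed illustrative data, consistent with the support counts above.
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]
for itemset in apriori(db, 2):
    print(sorted(itemset))
```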
Generation of strong association rule :
For that we need to calculate confidence of each rule.
Confidence –
Confidence(A => B) = (tuples containing both A and B) / (tuples containing A)
  = sup(A ∪ B) / sup(A)

So here, by taking an example of any frequent itemset, we will show the rule
generation.
Frequent Itemset {I1, I2, I3} //from L3
So the rules can be :
• [I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
• [I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
• [I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
• [I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
• [I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
• [I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if the minimum confidence is 60%, then no rule can be considered a strong association
rule.
Frequent Itemset {I1, I2, I5} //from L3
So the rules can be :
• [I1^I2]=>[I5] //confidence = sup(I1^I2^I5)/sup(I1^I2) = 2/4*100=50%
• [I1^I5]=>[I2] //confidence = sup(I1^I2^I5)/sup(I1^I5) = 2/2*100=100%
• [I2^I5]=>[I1] //confidence = sup(I1^I2^I5)/sup(I2^I5) = 2/2*100=100%
• [I1]=>[I2^I5] //confidence = sup(I1^I2^I5)/sup(I1) = 2/6*100=33%
• [I2]=>[I1^I5] //confidence = sup(I1^I2^I5)/sup(I2) = 2/7*100=28%
• [I5]=>[I1^I2] //confidence = sup(I1^I2^I5)/sup(I5) = 2/2*100=100%
So if the minimum confidence is 60%, then the following rules can be considered strong
association rules (a small rule-generation sketch in code follows this list).
[I1^I5]=>[I2] confidence = 100%
[I2^I5]=>[I1] confidence = 100%
[I5]=>[I1^I2] confidence = 100%
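The rule-generation step above can be sketched in a few lines: enumerate the non-empty proper subsets of a frequent itemset as antecedents and keep the rules whose confidence meets the threshold. The support counts are taken from the worked example; the function name and structure are illustrative assumptions.

```python
from itertools import combinations

def strong_rules(itemset, support_count, min_conf):
    """Generate A => B rules from a frequent itemset; keep those whose
    confidence sup(itemset) / sup(A) meets the minimum confidence."""
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            A = frozenset(antecedent)
            B = itemset - A
            conf = support_count[itemset] / support_count[A]
            if conf >= min_conf:
                rules.append((set(A), set(B), conf))
    return rules

# Support counts taken from the worked example above.
counts = {
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
}
for A, B, conf in strong_rules(frozenset({"I1", "I2", "I5"}), counts, 0.60):
    print(f"{sorted(A)} => {sorted(B)}  confidence={conf:.0%}")
```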
Limitations Of Apriori Algorithm :
• Using Apriori needs the generation of candidate itemsets. These itemsets may
be large in number if the number of items in the database is huge.
• Apriori needs multiple scans of the database to check the support of each
itemset generated, and this leads to high costs.
Improving the Efficiency of Apriori :
Many variations of the Apriori algorithm have been proposed that focus on
improving the efficiency of the original algorithm.
Some variations are as follows:
• Hash-based technique (hashing itemsets into corresponding buckets; a small
sketch of this idea follows the list)
• Transaction reduction (reducing the number of transactions scanned in
future iterations)
• Partitioning (partitioning the data to find candidate itemsets)
• Sampling (mining on a subset of the given data)
• Dynamic itemset counting (adding candidate itemsets at different
points during a scan)
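As a rough sketch of the hash-based idea, each 2-itemset of every transaction is hashed into a bucket during the first scan; a candidate 2-itemset whose bucket count is below the minimum support cannot be frequent and can be pruned. The bucket count, hash function, and data below are arbitrary assumptions, not a prescribed implementation.

```python
from itertools import combinations
from collections import Counter

def hash_bucket_counts(transactions, n_buckets=7):
    """Hash every 2-itemset of every transaction into one of n_buckets
    and count occurrences per bucket (hash-based candidate pruning idea)."""
    buckets = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_support_count, n_buckets=7):
    # A candidate 2-itemset can be frequent only if its bucket count reaches min_support.
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_support_count

db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}]
buckets = hash_bucket_counts(db)
print(may_be_frequent(("I1", "I2"), buckets, min_support_count=2))
```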
A Pattern-Growth Approach for Mining Frequent Itemsets :
The limitations of the Apriori pattern mining method can be overcome using the FP
growth algorithm.

Frequent Pattern Growth Algorithm :


• This algorithm is an improvement to the Apriori method. A frequent pattern is
generated without the need for candidate generation.
• FP growth algorithm represents the database in the form of a tree called a
frequent pattern tree or FP tree.
• This tree structure will maintain the association between the itemsets.
• The database is fragmented using one frequent item. This fragmented part is
called “pattern fragment”.
• The itemsets of these fragmented patterns are analyzed. Thus with this method,
the search for frequent itemsets is reduced comparatively.
FP Tree :
• Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of
the database.
• The purpose of the FP tree is to mine the most frequent pattern.
• Each node of the FP tree represents an item of the itemset.
• The root node represents null while the lower nodes represent the itemsets.
• The associations of the nodes with the lower nodes, that is, of the itemsets with the
other itemsets, are maintained while forming the tree.
Frequent Pattern Algorithm Steps :
The frequent pattern growth method finds the frequent patterns without candidate
generation.
The steps followed to mine the frequent patterns using the frequent pattern growth
algorithm are:
1. The first step is to scan the database to find the occurrences of the itemsets in
the database. This step is the same as the first step of Apriori.
The count of 1-itemsets in the database is called support count or frequency of
1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree.
The root is represented by null.
3. The next step is to scan the database again and examine the transactions.
Examine the first transaction and find out the itemset in it.
The itemset with the max count is taken at the top, the next itemset with lower
count and so on.
It means that the branch of the tree is constructed with transaction itemsets in
descending order of count.
4. The next transaction in the database is examined. The itemsets are ordered in
descending order of count. If any itemset of this transaction is already present in
another branch (for example in the 1st transaction), then this transaction's branch
shares a common prefix starting from the root.
This means that the common itemset is linked to the new node of another
itemset in this transaction.
5. Also, the count of an itemset is incremented as it occurs in the transactions.
Both the common node and new node counts are increased by 1 as they are
created and linked according to the transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is
examined first, along with the links of the lowest nodes.
The lowest node represents a frequent pattern of length 1. From this, traverse
the path in the FP Tree. This path or paths are called the conditional pattern base.
A conditional pattern base is a sub-database consisting of prefix paths in the FP
tree occurring with the lowest node (suffix).
7. Construct a Conditional FP Tree, which is formed by a count of itemsets in the
path. The itemsets meeting the threshold support are considered in the
Conditional FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree. (A minimal sketch
of the FP-tree node structure and transaction insertion follows.)
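Steps 2 to 5 above (building the tree from ordered transactions, sharing common prefixes and incrementing counts) can be sketched with a small node class. The class and function names are assumptions for illustration; the transactions used are the ordered ones from the example below.

```python
class FPNode:
    """One node of an FP tree: an item, its count, and child links."""
    def __init__(self, item=None, parent=None):
        self.item = item          # None for the root ("null")
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode

def insert_transaction(root, ordered_items):
    """Walk/extend a branch for one ordered transaction, incrementing counts
    on shared prefixes and creating new child nodes where paths diverge."""
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
        child.count += 1
        node = child

def show(node, depth=0):
    """Print the tree with indentation showing parent-child links."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# Ordered transactions from the FP-growth example below (items already sorted
# by descending support, infrequent items removed).
root = FPNode()
for t in [["I2", "I1", "I3"], ["I2", "I3", "I4"], ["I4"],
          ["I2", "I1", "I4"], ["I2", "I1", "I3"], ["I2", "I1", "I3", "I4"]]:
    insert_transaction(root, t)
show(root)
```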
https://youtu.be/VB8KWm8MXss
Example Of FP-Growth Algorithm
Support threshold = 50%, Confidence = 60%

Transaction List of items


T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution:
Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3
1. Count of each item

Item Support_Count
I1 4
I2 5
I3 4
I4 4
I5 2

Consider only the items at or above the minimum support count.
Here the support count of I5 is 2, which is less than the minimum support count of 3, so remove it.

2. Sort the items in descending order of support count.

Item Support_Count
I2 5
I1 4
I3 4
I4 4

3.
• Compare the sorted items with the dataset.
• Arrange the itemset of each transaction in descending order of support count.
• If an item has a support count less than min_sup, remove it.

Transaction  List of items      Ordered Frequent items
T1           I1, I2, I3         I2, I1, I3
T2           I2, I3, I4         I2, I3, I4
T3           I4, I5             I4
T4           I1, I2, I4         I2, I1, I4
T5           I1, I2, I3, I5     I2, I1, I3
T6           I1, I2, I3, I4     I2, I1, I3, I4
4. Build FP Tree
1. Consider the root node null.

Null

2. The first scan of Transaction T1: I2, I1, I3 contains three items {I2:1}, {I1:1}, {I3:1},
where I2 is linked as a child to root, I1 is linked to I2 and I3 is linked to I1.

Null
  I2 : 1
    I1 : 1
      I3 : 1
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to the root, I3 is linked to
I2, and I4 is linked to I3. But this branch shares the I2 node, as I2 is
already used in T1.
4. Increment the count of I2 by 1; I3 is linked as a child to I2 and I4 is linked as
a child to I3. The counts are {I2:2}, {I3:1}, {I4:1}.

Null
  I2 : 2
    I1 : 1
      I3 : 1
    I3 : 1
      I4 : 1
5. T3: I4. A new branch with I4 linked to Null as a child is created.

Null
  I2 : 2
    I1 : 1
      I3 : 1
    I3 : 1
      I4 : 1
  I4 : 1
6. T4 : I2, I1, I4. The sequence will be I2, I1, and I4. I2 is already linked to the
root node, hence it will be incremented by 1. Similarly I1 will be
incremented by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2},
{I4:1}.

Null
  I2 : 3
    I1 : 2
      I3 : 1
      I4 : 1
    I3 : 1
      I4 : 1
  I4 : 1
7. T5 : I2, I1, I3. The sequence will be I2, I1, I3. Thus {I2:4}, {I1:3}, {I3:2}.

Null
  I2 : 4
    I1 : 3
      I3 : 2
      I4 : 1
    I3 : 1
      I4 : 1
  I4 : 1
8. T6 : I2, I1, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.

Null
  I2 : 5
    I1 : 4
      I3 : 3
        I4 : 1
      I4 : 1
    I3 : 1
      I4 : 1
  I4 : 1
If the total count of each item in the tree equals its support count, the FP tree is correct.
Mining frequent patterns from FP tree :
• Use the FP tree and recursively grow frequent pattern paths (process items in
ascending order of support count).
• For each item the conditional pattern base is constructed and the conditional FP
tree is also constructed.

Item Conditional Pattern Base Conditional FP tree


I4 {I2, I1, I3 : 1}, {I2, I1 : 1}, {I2,I3 : 1} {I2:3, I1:2, I3:2}
I3 {I2, I1 : 3}, {I2 : 1} {I2:4, I1:3}
I1 {I2 : 4} {I2 : 4}
I2 {} {}
Frequent Pattern Generation :

Item Frequent Patterns


I4 {I2, I4 : 3}
I3 {I2, I3 : 4}, {I1, I3 : 3}
I1 {I2, I1 : 4}
Generation of Association Rules :
Minimum support count = 3
Minimum confidence = 50%

For frequent pattern – I2, I4


I2 => I4 3/5 = 0.6 Confidence = 60%
I4 => I2 3/4 = 0.75 Confidence = 75%

For frequent pattern – I2, I3


I2 => I3 4/5 = 0.8 Confidence = 80%
I3 => I2 4/4 = 1 Confidence = 100%

For frequent pattern – I1, I3


I1 => I3 3/4 = 0.75 Confidence = 75%
I3 => I1 3/4 = 0.75 Confidence = 75%

For frequent pattern – I2, I1


I2 => I1 4/5 = 0.8 Confidence = 80%
I1 => I2 4/4 = 1 Confidence = 100%
Advantages Of FP Growth Algorithm :
• This algorithm needs to scan the database only twice when compared to Apriori
which scans the transactions for each iteration.
• The pairing of items is not done in this algorithm and this makes it faster.
• The database is stored in a compact version in memory.
• It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm :
• FP Tree is more cumbersome and difficult to build than Apriori.
• It may be expensive.
• When the database is large, the algorithm may not fit in the shared memory.
Clustering :
• Clustering is the process of grouping a set of data objects into multiple groups
or clusters so that objects within a cluster have high similarity.
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
• Dissimilarities and similarities are assessed based on the attribute values
describing the objects and often involve distance measures.
• The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
• Clustering as a data mining tool has its roots in many application areas such as
biology, security, business intelligence, and Web search.
Applications of Cluster Analysis :
• Clustering analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base.
And they can characterize their customer groups based on the purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies.
• Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a city
according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as detection of credit
card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Requirements of Clustering in Data Mining :
The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large
databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable
of being applied to any kind of data, such as interval-based (numerical),
categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be bounded to
only distance measures that tend to find spherical clusters of small sizes.
• High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.
• Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and may lead to
poor quality clusters.
• Interpretability − The clustering results should be interpretable,
comprehensible, and usable.
Clustering Methods :
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method :
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k'
partitions of the data. Each partition will represent a cluster and k ≤ n. It means that it will classify
the data into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
• For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
• Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to another.
General Characteristics of Partitioning methods :
– Find mutually exclusive clusters of spherical shape
– Distance-based
– May use mean or medoid (etc.) to represent cluster center
– Effective for small- to medium-size data sets
Partitioning methods of clustering :
Partitioning methods construct a partition of a database of n objects into a set of
k clusters.

Different partitioning methods :

Global optimal method - exhaustively enumerate all partitions.
Heuristic methods - the k-means and k-medoids algorithms.
k-means : Each cluster is represented by the centre of the cluster.
k-medoids : Or PAM (Partitioning Around Medoids) : Each cluster is
represented by one of the objects in the cluster.
(A minimal k-means sketch follows.)
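A minimal k-means sketch, assuming 2-D points, Euclidean distance, and a fixed number of iterations; the data and parameter values are illustrative assumptions.

```python
import random

def kmeans(points, k, iterations=10):
    """Minimal k-means sketch: assign each point to the nearest center,
    then recompute each center as the mean of its assigned points."""
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the nearest center by squared Euclidean distance
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster ends up empty
        centers = [tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                   if cluster else centers[i]
                   for i, cluster in enumerate(clusters)]
    return centers, clusters

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5), (1.2, 0.8), (8.5, 9.0)]
centers, clusters = kmeans(data, k=2)
print(centers)
```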
Hierarchical Methods :
• This method creates a hierarchical decomposition of the given set of data objects.
• Hierarchical methods can be classified on the basis of how the hierarchical
decomposition is formed.
• Hierarchical methods suffer from the fact that once a merging or splitting is done, it
can never be undone.
• There are two approaches of Hierarchical methods −
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach :
• This approach is known as the bottom-up approach.
• It starts with each object forming a separate group.
• It keeps merging the objects or groups that are close to one another.
• It keeps doing this until all of the groups are merged into one or until the
termination condition holds. (A small code sketch appears after the general
characteristics below.)
Divisive Approach :
• This approach is known as the top-down approach.
• It starts with all of the objects in the same cluster.
• In each successive iteration, a cluster is split into smaller clusters.
• This continues until each object is in its own cluster or the termination condition holds.
General Characteristics of Hierarchical methods :
– Clustering is a hierarchical decomposition (i.e., multiple levels)
– Cannot correct erroneous merges or splits
– May incorporate other techniques like microclustering or consider object “linkages”
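A small agglomerative (bottom-up) clustering sketch, assuming SciPy is available: linkage performs the successive merges and fcluster cuts the resulting dendrogram into a chosen number of clusters. The data and the single-link choice are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points (assumed data): two visually separated groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [1.5, 2.0],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

# Agglomerative (bottom-up) merging using single-link distance between groups.
Z = linkage(X, method="single")

# Cut the dendrogram so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```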
Density-based Method :
• This method is based on the notion of density.
• The basic idea is to continue growing the given cluster as long as the density
(number of objects or data points) in the neighborhood exceeds some threshold.
• Example : for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points (illustrated in the sketch
after the general characteristics below).
• Such a method can be used to filter out noise or outliers and discover clusters of
arbitrary shape.
• Density-based methods can divide a set of objects into multiple exclusive
clusters, or a hierarchy of clusters.
• Typically, density-based methods consider exclusive clusters only, and do not
consider fuzzy clusters.
• Moreover, density-based methods can be extended from full space to subspace
clustering.
General Characteristics of Density-based methods :
– Can find arbitrarily shaped clusters
– Clusters are dense regions of objects in space that are separated by low-density regions
– Cluster density : Each point must have a minimum number of points within its
“neighborhood”
– May filter out outliers
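A short density-based sketch, assuming scikit-learn is available: DBSCAN grows clusters where each point's eps-neighborhood contains at least min_samples points and labels low-density points as noise (-1). The data and parameter values below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (assumed toy data).
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [4.5, 15.0]])   # outlier

# eps = neighborhood radius, min_samples = minimum points per neighborhood.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # the isolated point is labeled -1 (noise)
```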
Grid-based methods :
Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.
All the clustering operations are performed on the grid structure (i.e., on the quantized
space).
The main advantage of this approach is its fast processing time.
The grid-based method is independent of the number of data objects and depends only
on the number of cells in each dimension of the quantized space (a small quantization
sketch follows the characteristics below).
General Characteristics of Grid-based methods :
– Use a multiresolution grid data structure
– Fast processing time (typically independent of the number of
data objects, yet dependent on grid size)
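A tiny sketch of the quantization idea: each point is mapped to a grid cell by integer division of its coordinates by a cell size, and all further work operates on per-cell counts rather than on individual objects. The cell size and data are assumed for illustration.

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Quantize 2-D points into grid cells and count objects per cell."""
    cells = Counter()
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))] += 1
    return cells

points = [(1.0, 1.0), (1.4, 1.2), (8.2, 8.1), (8.4, 8.6), (8.0, 8.9)]
print(grid_cells(points, cell_size=2.0))
# Dense cells (counts above a threshold) would then be merged into clusters.
```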
