
Unit 4 - Part 1


Unit-4

Frequent Itemsets and Clustering: Mining frequent itemsets, market basket modelling, Apriori algorithm, handling large data sets in main memory, limited pass algorithms, counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering methods, clustering in non-Euclidean space, clustering for streams and parallelism.
Frequent Itemsets and Clustering
• Frequent means happening often or occurring regularly.
• A set of items together is called an itemset. If any itemset has k-items it is called a k-itemset.
• An itemset consists of one or more items. An itemset that occurs frequently is called a frequent itemset.
• A set of items is called frequent if it satisfies minimum threshold values for support and confidence.
• Support measures how often the items appear together in a single transaction, across all transactions.
• Confidence measures how often the items in a rule's consequent are purchased in transactions that already contain its antecedent.
• Thus frequent itemset mining is a data mining technique to identify the items that often occur together.
• For Example, Bread and butter, Laptop and Antivirus software, etc.
• Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data set frequently.
• For example, a set of items, such as milk and bread, that appears frequently together in a transaction data set is a frequent itemset.

• A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it
occurs frequently in a shopping history database, is a (frequent) sequential pattern.

• A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices,
which may be combined with itemsets or subsequences.

• If a substructure occurs frequently, it is called a (frequent) structured pattern.

• Finding such frequent patterns plays an essential role in mining associations, correlations, and many
other interesting relationships among data.
• Applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click
stream) analysis, and DNA sequence analysis.

• Frequent itemset mining leads to the discovery of associations and correlations among items in
large transactional or relational data sets, which help in many business decision-making processes,
on customer shopping behavior analysis.

• Example of frequent itemset mining is market basket analysis.


• It analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets”
• help retailers develop marketing strategies
Market basket analysis
• We consider the universe as the set of items available at the store.
• Each item is a Boolean variable representing the presence or absence of that item.
• Each basket is then represented by a Boolean vector of values assigned to these variables.
• The buying patterns found in the baskets can be represented in the form of association rules.
• Association rules are interesting if they satisfy both minimum support and minimum confidence.
• computer 🡪 antivirus software [support = 2%, confidence = 60%]
• A support of 2% for this association rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.
• A confidence of 60% means that 60% of the customers who purchased a computer also bought the software.
• Typically, association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold
• Such thresholds can be set by users or domain experts.
Basic Concepts: Frequent Patterns and
Association Rules
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

• Itemset X = {x1, …, xk}
• Find all the rules X 🡪 Y with minimum support and confidence
  • support, s: probability that a transaction contains X ∪ Y
  • confidence, c: conditional probability that a transaction containing X also contains Y

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A 🡪 D (60%, 100%)
D 🡪 A (60%, 75%)
Association Rule
1. What is an association rule?
• An implication expression of the form X → Y, where X and Y
are itemsets and X∩Y=∅
• Example:
{Milk, Diaper} → {Banana}
2. What is association rule mining?
• To find all the strong association rules
• An association rule r is strong if
• Support(r) ≥ min_sup
• Confidence(r) ≥ min_conf
• Rule Evaluation Metrics
  • Support (s): fraction of transactions that contain both X and Y
  • Confidence (c): measures how often items in Y appear in transactions that contain X
Definition: Frequent Itemset
• Itemset
• A collection of one or more items
• Example: {Bread, Milk, Diaper}
• k-itemset
• An itemset that contains k items
• Support count (σ)
• # transactions containing an itemset
• E.g. σ({Bread, Milk, Diaper}) = 2
• Support (s)
• Fraction of transactions containing an itemset
• E.g. s({Bread, Milk, Diaper}) = 2/5
• Frequent Itemset
• An itemset whose support is greater than or equal to a
min_sup threshold
Association Rule Mining Task
• An association rule r is strong if
• Support(r) ≥ min_sup
• Confidence(r) ≥ min_conf
• Given a transactions database D, the goal of association
rule mining is to find all strong rules
• Two-step approach:
1. Frequent Itemset Identification
– Find all itemsets whose support ≥ min_sup
2. Rule Generation
– From each frequent itemset, generate all confident
rules whose confidence ≥ min_conf
Example of Support and Confidence
To calculate the support and
confidence of rule
{Milk, Diaper} → {Banana}
• # of transactions: 5
• # of transactions containing
{Milk, Diaper, Banana}: 2
• Support: 2/5=0.4
• # of transactions containing
{Milk, Diaper}: 3
• Confidence: 2/3=0.67
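
To make the support and confidence arithmetic above concrete, here is a minimal Python sketch. The five transactions are hypothetical stand-ins (the slide's original table is not reproduced in these notes), and the helper names support and confidence are our own:

```python
# Hypothetical transactions (the slide's original table is not reproduced in these notes).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Banana", "Eggs"},
    {"Milk", "Diaper", "Banana", "Coke"},
    {"Bread", "Milk", "Diaper", "Banana"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Milk", "Diaper", "Banana"}))        # 2/5 = 0.4
print(confidence({"Milk", "Diaper"}, {"Banana"}))   # 2/3 ≈ 0.67
```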
Rule Generation
Suppose min_sup=0.3, min_conf=0.6,
Support({Banana, Diaper, Milk})=0.4
All candidate rules:
{Banana} → {Diaper, Milk} (s=0.4, c=0.67)
{Diaper} → {Banana, Milk} (s=0.4, c=0.5)
{Milk} → {Banana, Diaper} (s=0.4, c=0.5)
{Banana, Diaper} → {Milk} (s=0.4, c=0.67)
{Banana, Milk} → {Diaper} (s=0.4, c=0.67)
{Diaper, Milk} → {Banana} (s=0.4, c=0.67)

All non-empty proper subsets of {Banana, Diaper, Milk}:
{Banana}, {Diaper}, {Milk}, {Banana, Diaper}, {Banana, Milk}, {Diaper, Milk}

Strong rules (confidence ≥ 0.6):
{Banana} → {Diaper, Milk} (s=0.4, c=0.67)
{Banana, Diaper} → {Milk} (s=0.4, c=0.67)
{Banana, Milk} → {Diaper} (s=0.4, c=0.67)
{Diaper, Milk} → {Banana} (s=0.4, c=0.67)
Frequent Itemset Identification: the Itemset Lattice
(Figure: the itemset lattice from level 0, the empty set, down to level 5, enumerating all subsets of the items.)

Given I items, there are 2^I − 1 candidate itemsets!
Frequent Itemset Identification: Brute-Force Approach
• Brute-force approach:
• Set up a counter for each itemset in the lattice
• Scan the database once, for each transaction T,
• check for each itemset S whether T⊇ S
• if yes, increase the counter of S by 1

• Output the itemsets with a counter ≥ (min_sup × N)
• Complexity ~ O(N·M·w), which is expensive since M = 2^I − 1 (N = number of transactions, M = number of candidate itemsets, w = average transaction width)!
Step 1: Define Minimum Support Threshold
Assume a minimum support threshold of 2. This means that any itemset must appear in
at least 2 transactions to be considered frequent.
Step 2: Generate All Possible Itemsets
The brute-force approach examines every possible subset of items in the dataset. Here
are all the possible itemsets:
•Single items: {A}, {B}, {C}, {D}
•Pairs: {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}
•Triples: {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}
•Quadruple: {A, B, C, D}
• Step 3: Count the Occurrences of Each Itemset
• Now, count how many times each itemset appears in the dataset:
• Single items:
• {A}: 4
• {B}: 4
• {C}: 4
• {D}: 3
• Pairs:
• {A, B}: 3
• {A, C}: 3
• {A, D}: 2
• {B, C}: 3
• {B, D}: 3
• {C, D}: 3
• Triples:
• {A, B, C}: 2
• {A, B, D}: 2
• {A, C, D}: 1
• {B, C, D}: 2
• Quadruple:
• {A, B, C, D}: 1
• Step 4: Identify Frequent Itemsets
• Based on the minimum support threshold of 2, the frequent itemsets
are:
• Single items: {A}, {B}, {C}, {D}
• Pairs: {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}
• Triples: {A, B, C}, {A, B, D}, {B, C, D}
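
A brute-force enumeration of this kind can be sketched in a few lines of Python. The items A–D follow the example, but the transaction list below is a hypothetical placeholder, since the original transaction table is not reproduced in these notes:

```python
from itertools import combinations

ITEMS = ["A", "B", "C", "D"]
MIN_SUPPORT_COUNT = 2

# Hypothetical transactions; the original example's table is not shown in these notes.
transactions = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "D"},
                {"B", "C", "D"}, {"A", "B", "C", "D"}]

# Enumerate every non-empty itemset (2^I - 1 candidates) and count its occurrences.
frequent = {}
for k in range(1, len(ITEMS) + 1):
    for candidate in combinations(ITEMS, k):
        count = sum(set(candidate) <= t for t in transactions)
        if count >= MIN_SUPPORT_COUNT:
            frequent[candidate] = count

print(frequent)   # all itemsets appearing in at least MIN_SUPPORT_COUNT transactions
```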
How to Get an Efficient Method?
• The complexity of a brute-force method is O(M·N·w)
  • M = 2^I − 1, where I is the number of items
• How to get an efficient method?
• Reduce the number of candidate itemsets
• Check the supports of candidate itemsets efficiently
Anti-Monotone Property
• Any subset of a frequent itemset must also be frequent (the anti-monotone property)
• Any transaction containing {banana, diaper, milk} also
contains {banana, diaper}
• {banana, diaper, milk} is frequent 🡪 {banana, diaper} must
also be frequent
• In other words, any superset of an infrequent itemset
must also be infrequent
• No superset of any infrequent itemset should be generated
or tested
• Many item combinations can be pruned!
Apriori algorithm
• The Apriori algorithm is a widely used data mining technique for identifying
frequent itemsets in a dataset and generating association rules. It is often
applied in market basket analysis to find patterns in customer purchasing
behavior. The algorithm works on the principle that "all non-empty subsets
of a frequent itemset must also be frequent," which is known as the Apriori
property.
• Key Concepts:
• Support: The frequency of occurrence of an itemset in the dataset. It's
usually represented as a percentage of transactions containing the itemset.
• Confidence: A measure of the likelihood of one item being bought if
another is bought.
• Frequent Itemset: An itemset that meets a minimum support threshold.
Steps of the Apriori Algorithm
• Generate Candidate Itemsets:
• Start with a single item and find all possible single-item combinations that
meet the minimum support threshold. These are the initial frequent itemsets.
• Generate Larger Itemsets:
• Using the frequent itemsets from the previous step, generate larger itemsets
by combining them with other frequent itemsets. For example, if {A, B} and
{A, C} are frequent itemsets, then {A, B, C} can be generated if it meets the
support threshold.
• Pruning:
• Remove itemsets that do not meet the minimum support threshold. This step
helps reduce the search space and keep only the itemsets that are potentially
frequent.
• Repeat Until No More Frequent Itemsets are Found:
  • Continue generating and pruning itemsets until no new frequent itemsets can be created that meet the support threshold.
• Generate Association Rules:
  • After finding all frequent itemsets, generate association rules by calculating confidence for each possible rule. Only rules that meet a minimum confidence threshold are kept.
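
A compact Python sketch of the loop just described (join, prune using the Apriori property, count, repeat). The apriori function name and the small basket list are illustrative choices, not part of the original material:

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset: support_count} for all frequent itemsets."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets to form k-itemset candidates.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count supports with one scan of the data.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Hypothetical transactions for illustration only.
baskets = [frozenset(t) for t in [{"Milk", "Bread"}, {"Bread", "Butter"},
                                  {"Milk", "Bread", "Jam"}, {"Bread", "Jam", "Butter"},
                                  {"Milk", "Bread", "Butter"}]]
print(apriori(baskets, min_support_count=3))
```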
Example of the Apriori Algorithm
• Step 1: Generate Candidate Itemsets of Size 1
• Minimum support threshold is set to 50% (meaning an itemset must
appear in at least 3 transactions).
• Calculate the support for each individual item:
• Milk: 4/5 = 80%
• Bread: 5/5 = 100%
• Butter: 3/5 = 60%
• Jam: 3/5 = 60%
• Since all items meet the support threshold, they are all frequent
itemsets.
• Step 2: Generate Candidate Itemsets of Size 2
• Combine frequent items from the previous step to form pairs:
• {Milk, Bread}: 4/5 = 80%
• {Milk, Butter}: 2/5 = 40% (not frequent, so we discard it)
• {Milk, Jam}: 2/5 = 40% (not frequent, so we discard it)
• {Bread, Butter}: 3/5 = 60%
• {Bread, Jam}: 3/5 = 60%
• {Butter, Jam}: 2/5 = 40% (not frequent, so we discard it)
The frequent itemsets of size 2 are {Milk, Bread}, {Bread, Butter}, and
{Bread, Jam}.
• Step 3: Generate Candidate Itemsets of Size 3
• Combine frequent pairs to form triples:
• {Milk, Bread, Butter}: 2/5 = 40% (not frequent, so we discard it)
• {Milk, Bread, Jam}: 2/5 = 40% (not frequent, so we discard it)
• {Bread, Butter, Jam}: 2/5 = 40% (not frequent, so we discard it)
• Since no itemsets of size 3 meet the support threshold, we stop here.
• Step 4: Generate Association Rules
• Now, we create association rules from our frequent itemsets, such as:
• {Milk} -> {Bread} with confidence = Support({Milk, Bread}) / Support({Milk}) = 80%
/ 80% = 100%
• {Bread} -> {Butter} with confidence = Support({Bread, Butter}) / Support({Bread}) =
60% / 100% = 60%
• {Bread} -> {Jam} with confidence = Support({Bread, Jam}) / Support({Bread}) = 60%
/ 100% = 60%
• Only association rules that meet a minimum confidence threshold would be
selected.
Illustrating Apriori Principle
(Figure: the itemset lattice with an itemset found to be infrequent at an upper level; all of its supersets below it are pruned.)
An Example

Min. support 50%
Min. confidence 50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
•Calculate Support for Single Items:
•Support({A}): Appears in 3/4 transactions = 75%
•Support({B}): Appears in 2/4 transactions = 50%
•Support({C}): Appears in 2/4 transactions = 50%
•Support({D}): Appears in 1/4 transactions = 25%
•Support({E}): Appears in 1/4 transactions = 25%
•Support({F}): Appears in 1/4 transactions = 25%
Only {A}, {B}, and {C} meet the minimum support of 50%.
•Generate Candidate Pairs and Calculate Support:
•Support({A, B}): Appears in 1/4 transactions = 25% (does not meet the support threshold)
•Support({A, C}): Appears in 2/4 transactions = 50% (meets the support threshold)
•Support({A, D}): Appears in 1/4 transactions = 25% (does not meet the support threshold)
•Support({B, C}): Appears in 1/4 transactions = 25% (does not meet the support threshold)
Only {A, C} meets the minimum support of 50%.
1.Frequent Itemsets:
•Frequent Single Items: {A}, {B}, {C}
•Frequent Pair: {A, C}
2.Generate Association Rules and Calculate Confidence:
•Rule A ⇒ C:
•Support({A, C}) = 50%
•Confidence = Support({A, C}) / Support({A}) = 50% / 75% = 66.6% (meets the confidence threshold)

Thus, the only frequent itemsets are {A}, {B}, {C}, and {A, C}, and the rule A ⇒ C has sufficient support and confidence.
Finding frequent itemsets using the Apriori
Algorithm: Example

TID     List of Items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

• Consider a database D, consisting of 9 transactions.
• Each transaction is represented by an itemset.
• Suppose the min. support count required is 2 (2 out of 9 = 2/9 = 22%).
• Say the min. confidence required is 70%.
• We first have to find the frequent itemsets using the Apriori Algorithm.
• Then, association rules will be generated using min. support & min. confidence.
Step 1: Generating candidate and frequent 1-itemsets with min. support count = 2

Scan D for the count of each candidate (C1), then compare each candidate's support count with the minimum support count to obtain L1:

C1 = L1:
Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

▪ In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets C1, along with its support count.
▪ The set of frequent 1-itemsets L1 consists of the candidate 1-itemsets satisfying minimum support.
Step 2: Generating candidate and frequent 2-itemsets with min. support count = 2

Generate candidates C2 from L1 ⋈ L1, scan D for the count of each candidate, then compare each candidate's support count with the minimum support count to obtain L2:

C2 (with support counts)        L2
{I1, I2}  4                     {I1, I2}  4
{I1, I3}  4                     {I1, I3}  4
{I1, I4}  1                     {I1, I5}  2
{I1, I5}  2                     {I2, I3}  4
{I2, I3}  4                     {I2, I4}  2
{I2, I4}  2                     {I2, I5}  2
{I2, I5}  2
{I3, I4}  0
{I3, I5}  1
{I4, I5}  0

Note: We haven't used the Apriori Property yet!
Step 3: Generating candidate and frequent 3-itemsets with min. support count = 2

Generate candidates C3 from L2 ⋈ L2: {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}.
The last four candidates contain non-frequent 2-itemset subsets and are pruned.
Scan D for the count of each remaining candidate and compare with the minimum support count:

C3 = L3:
Itemset         Sup. Count
{I1, I2, I3}    2
{I1, I2, I5}    2

▪ The generation of the set of candidate 3-itemsets, C3, involves the use of the Apriori Property.
▪ When the Join step is complete, the Prune step is used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck.
Step 4: Generating frequent 4-itemset
• L3 Join L3 gives C4 = {{I1, I2, I3, I5}}
• This itemset is pruned since its subset {I2, I3, I5} is not frequent.
• Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets.

• This completes our Apriori Algorithm. What’s Next ?

• These frequent itemsets will be used to generate strong


association rules (where strong association rules satisfy both
minimum support & minimum confidence).
Step 5: Generating Association Rules from frequent
k-itemsets
• Procedure:
  • For each frequent itemset l, generate all nonempty subsets of l.
  • For every nonempty subset s of l, output the rule "s 🡪 (l - s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold (70% in our case).

• Back To Example:
  • Let's take l = {I1, I2, I5}
  • The nonempty subsets of l are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}
Step 5: Generating Association Rules from frequent k-itemsets
[Cont.]

• The resulting association rules are:


• R1: I1 ^ I2 🡪 I5
• Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%
• R1 is Rejected.
• R2: I1 ^ I5 🡪 I2
• Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%
• R2 is Selected.
• R3: I2 ^ I5 🡪 I1
• Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%
• R3 is Selected.
Step 5: Generating Association Rules from Frequent Itemsets
[Cont.]

• R4: I1 🡪 I2 ^ I5
• Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%
• R4 is Rejected.
• R5: I2 🡪 I1 ^ I5
  • Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%
  • R5 is Rejected.
• R6: I5 🡪 I1 ^ I2
  • Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%
  • R6 is Selected.

We have found three strong association rules.
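
The rule-generation step can be sketched as follows. The support counts are taken from the worked example above, while the function name generate_rules and the dictionary layout are our own illustrative choices:

```python
from itertools import combinations

def generate_rules(support_counts, itemset, min_conf):
    """Yield (antecedent, consequent, confidence) rules from one frequent itemset."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            conf = support_counts[frozenset(items)] / support_counts[frozenset(antecedent)]
            if conf >= min_conf:
                yield antecedent, consequent, conf

# Support counts taken from the worked example above.
support_counts = {
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2, frozenset(["I2", "I5"]): 2,
    frozenset(["I1", "I2", "I5"]): 2,
}
for rule in generate_rules(support_counts, {"I1", "I2", "I5"}, min_conf=0.7):
    print(rule)   # keeps R2, R3, and R6 from the example
```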


Handling large data sets in main memory
• A large volume of data poses new challenges, such as overloaded
memory and algorithms that never stop running.
• It forces you to adapt and expand your repertoire of techniques.
• But even when you can perform your analysis, you should take care of
issues such as I/O (input/output) and CPU starvation, because these
can cause speed issues
• A computer only has a limited amount of RAM. When you try to squeeze more data into this memory than actually fits, the
OS will start swapping out memory blocks to disks, which is far less efficient than having it all in memory. But only a few
algorithms are designed to handle large data sets; most of them load the whole data set into memory at once, which causes
the out-of-memory error. Other algorithms need to hold multiple copies of the data in memory or store intermediate results.
All of these aggravate the problem.

• Even when you cure the memory issues, you may need to deal with another limited resource: time. Although a computer
may think you live for millions of years, in reality you won’t (unless you go into cryostasis until your PC is done). Certain
algorithms don’t take time into account; they’ll keep running forever. Other algorithms can’t end in a reasonable amount of
time when they need to process only a few megabytes of data.

• A third thing you’ll observe when dealing with large data sets is that components of your computer can start to form a
bottleneck while leaving other systems idle. Although this isn’t as severe as a never-ending algorithm or out-of-memory
errors, it still incurs a serious cost. Think of the cost savings in terms of person days and computing infrastructure for CPU
starvation. Certain programs don’t feed data fast enough to the processor because they have to read data from the hard
drive, which is one of the slowest components on a computer. This has been addressed with the introduction of solid state
drives (SSD), but SSDs are still much more expensive than the slower and more widespread hard disk drive (HDD)
technology.
General techniques for handling large
volumes of data
• Never-ending algorithms, out-of-memory errors, and speed
issues are the most common challenges you face when working
with large data. In this section, we’ll investigate solutions to
overcome or alleviate these problems.
• The solutions can be divided into three categories: using the
correct algorithms, choosing the right data structure, and using
the right tools
• No clear one-to-one mapping exists between the problems and solutions
because many solutions address both lack of memory and computational
performance.
• For instance, data set compression will help you solve memory issues because
the data set becomes smaller.
• But this also affects computation speed with a shift from the slow hard disk to
the fast CPU.
• Contrary to RAM (random access memory), the hard disc will store everything
even after the power goes down, but writing to disc costs more time than
changing information in the fleeting RAM.
• When constantly changing the information, RAM is thus preferable over the
(more durable) hard disc.
• With an unpacked data set, numerous read and write operations (I/O) are
occurring, but the CPU remains largely idle, whereas with the compressed data
set the CPU gets its fair share of the workload.
CHOOSING THE RIGHT ALGORITHM
• Choosing the right algorithm can solve more problems than adding
more or better hardware.
• An algorithm that’s well suited for handling large data doesn’t need to
load the entire data set into memory to make predictions.
• Ideally, the algorithm also supports parallelized calculations.
• Three types of algorithms that can do that: online algorithms, block
algorithms, and MapReduce algorithms
Online learning algorithms

• Several, but not all, machine learning algorithms can be trained using one
observation at a time instead of taking all the data into memory.
• Upon the arrival of a new data point, the model is trained and the observation
can be forgotten; its effect is now incorporated into the model’s parameters.
• For example, a model used to predict the weather can use different parameters
(like atmospheric pressure or temperature) in different regions.
• When the data from one region is loaded into the algorithm, it forgets about
this raw data and moves on to the next region.
• This “use and forget” way of working is the perfect solution for the memory
problem as a single observation is unlikely to ever be big enough to fill up all
the memory of a modern-day computer.
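
As an illustration of this "use and forget" style, here is a hedged sketch using scikit-learn's SGDClassifier, whose partial_fit method updates the model one chunk of observations at a time. The data stream is synthetic, and this particular model is only one example of an online learner:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()            # a linear classifier trained incrementally with SGD
classes = np.array([0, 1])

# Pretend the data arrives in small chunks that never sit in memory together.
for _ in range(100):
    X_chunk = rng.normal(size=(10, 4))                          # 10 new observations
    y_chunk = (X_chunk[:, 0] + X_chunk[:, 1] > 0).astype(int)   # synthetic labels
    model.partial_fit(X_chunk, y_chunk, classes=classes)        # update, then forget the chunk

print(model.predict(rng.normal(size=(3, 4))))
```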
Main Memory
Handling Larger Datasets in Main Memory
• The A-Priori Algorithm is fine as long as the step with the greatest
requirement for main memory – typically the counting of the
candidate pairs C2– has enough memory that it can be accomplished
without thrashing (repeated moving of data between disk and main
memory). Several algorithms have been proposed to cut down on the
size of candidate set C2.
• Here, we consider the PCY Algorithm, which takes advantage of the
fact that in the first pass of A-Priori there is typically lots of main
memory not needed for the counting of single items.
• Then we look at the Multistage Algorithm, which uses the PCY trick
and also inserts extra passes to further reduce the size of C2
PCY (Park-Chen-Yu) Algorithm
• The PCY (Park-Chen-Yu) algorithm is a method used in data
analytics to identify frequent itemsets within large datasets efficiently.
• It is particularly useful in market basket analysis, where it helps in
discovering combinations of items frequently purchased together,
such as shirts and jeans.
• The algorithm improves performance by using a two-phase
approach:
• first, it hashes pairs of items to count their occurrences and uses a hash table
to reduce the number of candidate pairs, and
• Second, it scans the dataset again to determine the actual frequent itemsets.
• If there are a million items and gigabytes of main memory, we do not need
more than 10% of the main memory.
• The PCY algorithm uses hashing to efficiently count item set
frequencies and reduce overall computational cost.
• The basic idea is to use a hash function to map itemsets to
hash buckets, followed by a hash table to count the frequency
of itemsets in each bucket.
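
A minimal sketch of the two PCY passes described above (counting items while hashing pairs into buckets, then using the bucket bitmap to filter candidate pairs). The bucket count, hash function, and helper names are illustrative assumptions, not a prescribed implementation:

```python
from itertools import combinations

def pcy_first_pass(baskets, num_buckets, support_threshold):
    """Pass 1 of PCY: count single items and hash pair occurrences into buckets."""
    item_counts = {}
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for i, j in combinations(sorted(basket), 2):
            bucket_counts[hash((i, j)) % num_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support_threshold}
    # Bitmap: bucket b is "frequent" if its count reaches the threshold.
    bitmap = [c >= support_threshold for c in bucket_counts]
    return frequent_items, bitmap

def pcy_candidate_pairs(baskets, frequent_items, bitmap, num_buckets):
    """Pass 2: a pair is a candidate only if both items are frequent AND it hashes to a frequent bucket."""
    candidates = set()
    for basket in baskets:
        for i, j in combinations(sorted(basket), 2):
            if i in frequent_items and j in frequent_items and bitmap[hash((i, j)) % num_buckets]:
                candidates.add((i, j))
    return candidates
```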
• Problem: Apply the PCY algorithm on the following transactions to find the candidate sets (frequent sets), with a minimum threshold value of 3 and hash function (i*j) mod 10.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}
• Step 1: Find the frequency of each item and remove the 1-itemsets whose frequency is below the threshold.
• Step 2: One by one, transaction-wise, create all the possible pairs and write their frequency next to each. Note: pairs should not be repeated; skip pairs that have already been written.
• Step 3: List all pairs whose frequency is greater than or equal to the threshold and then apply the hash function (it gives the bucket number, i.e., the bucket this particular pair will be put into).
• Step 4: In the last step, prepare a table with the following columns:
  • Bit vector – 1 if the bucket's count is greater than or equal to the threshold, otherwise 0.
  • Bucket number – found in the previous step.
  • Highest support count – the frequency of the candidate pair, found in Step 2.
  • Pairs – the candidate pair itself.
  • Candidate set – if the bit vector is 1, the pair qualifies as a candidate.
• Step 1: Find the frequency of each item and remove the 1-itemsets whose frequency is below the threshold.
• Step 2: One by one, transaction-wise, create all the possible pairs and write their frequency next to each.
• Step 3: List all pairs whose frequency is greater than or equal to the threshold and then apply the hash function (it gives us the bucket number).
• Hash Function = (i * j) mod 10
• (1, 3) = (1*3) mod 10 = 3
• (2,3) = (2*3) mod 10 = 6
• (2,4) = (2*4) mod 10 = 8
• (3,4) = (3*4) mod 10 = 2
• (3,5) = (3*5) mod 10 = 5
• (4,5) = (4*5) mod 10 = 0
• (4,6) = (4*6) mod 10 = 4
• Step 4: Prepare candidate set
The SON Algorithm and Map – Reduce
• The SON algorithm work well in a parallel-computing
environment.
• Each of the chunks can be processed in parallel, and the
frequent itemsets from each chunk combined to form the
candidates.
• We can distribute the candidates to many processors, have
each processor count the support for each candidate in a
subset of the baskets, and finally sum those supports to get the
support for each candidate itemset in the whole dataset.
• There is a natural way of expressing each of the two passes as
a MapReduce operation
MapReduce-MapReduce sequence
• First Map function :-
• Take the assigned subset of the baskets and find the itemsets frequent in the
subset using the simple and randomized algorithm.
• Lower the support threshold from s to ps if each Map task gets fraction p of
the total input file.
• The output is a set of key-value pairs (F, 1), where F is a frequent itemset
from the sample.
• First Reduce Function :-
• Each Reduce task is assigned a set of keys, which are itemsets.
• The value is ignored, and the Reduce task simply produces those keys
(itemsets) that appear one or more times.
• Thus, the output of the first Reduce function is the candidate itemsets.
• Second Map function :-
• The Map tasks for the second Map function take all the output from the first Reduce
Function (the candidate itemsets) and a portion of the input data file.
• Each Map task counts the number of occurrences of each of the candidate itemsets
among the baskets in the portion of the dataset that it was assigned.
• The output is a set of key-value pairs (C, v), where C is one of the candidate sets
and v is the support for that itemset among the baskets that were input to this Map
task.
• Second Reduce function :-
• The Reduce tasks take the itemsets they are given as keys and sum the associated
values.
• The result is the total support for each of the itemsets that the Reduce task was
assigned to handle.
• Those itemsets whose sum of values is at least s are frequent in the whole dataset,
so the Reduce task outputs these itemsets with their counts.
• Itemsets that do not have total support at least s are not transmitted to the output of
the Reduce task.
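
A schematic, single-machine Python sketch of this two-pass SON flow (in a real deployment each loop body would run as a Map or Reduce task). The chunking scheme, the simple in-memory local_frequent helper, and the restriction to pairs are illustrative simplifications:

```python
from itertools import combinations
from collections import Counter

def local_frequent(chunk, threshold):
    """Find itemsets frequent within one chunk (simple brute force up to pairs)."""
    counts = Counter()
    for basket in chunk:
        for k in (1, 2):                          # pairs are enough for this sketch
            for itemset in combinations(sorted(basket), k):
                counts[itemset] += 1
    return {s for s, c in counts.items() if c >= threshold}

def son(baskets, s, num_chunks):
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]
    p = 1 / num_chunks

    # First Map/Reduce: candidates = union of itemsets frequent in some chunk (threshold p*s).
    candidates = set()
    for chunk in chunks:                          # each chunk could run on a separate Map task
        candidates |= local_frequent(chunk, max(1, int(p * s)))

    # Second Map/Reduce: count every candidate over the whole dataset, keep support >= s.
    totals = Counter()
    for basket in baskets:
        bset = set(basket)
        for cand in candidates:
            if set(cand) <= bset:
                totals[cand] += 1
    return {c: v for c, v in totals.items() if v >= s}

baskets = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}, {"A", "B"}, {"A", "C"}]
print(son(baskets, s=3, num_chunks=2))
```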
Clustering Techniques
What is Cluster Analysis?
• Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Cluster analysis
• Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined
classes
General Applications of Clustering

• Pattern Recognition
• Spatial Data Analysis
• create thematic maps in GIS by clustering feature spaces
• detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
• Document classification
• Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their


customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
What Is Good Clustering?
• A good clustering method will produce high quality clusters
with
• high intra-class similarity
• low inter-class similarity
• The quality of a clustering result depends on both the similarity
measure used by the method and its implementation.
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine
input parameters
• Able to deal with noise and outliers
• Incorporation of user-specified constraints
• Interpretability and usability
Type of data in clustering analysis

• Interval-scaled variables:

• Binary variables:

• Nominal, ordinal, and ratio variables:

• Variables of mixed types:


Interval-valued variables

• Standardize the data
  • Calculate the mean absolute deviation:
    s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|),
    where m_f = (1/n)(x_1f + x_2f + … + x_nf)
  • Calculate the standardized measurement (z-score):
    z_if = (x_if − m_f) / s_f
• Using mean absolute deviation is more robust than using standard deviation
Similarity and Dissimilarity Between
Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include the Minkowski distance:
  d(i,j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
  d(i,j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|

Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is the Euclidean distance:
  d(i,j) = √(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
• Properties
  • d(i,j) ≥ 0
  • d(i,i) = 0
  • d(i,j) = d(j,i)
  • d(i,j) ≤ d(i,k) + d(k,j)
• One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
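
A small pure-Python sketch of these distance measures; the two vectors are arbitrary examples:

```python
def minkowski(x, y, q):
    """Minkowski distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(minkowski(i, j, 1))   # Manhattan distance: 7.0
print(minkowski(i, j, 2))   # Euclidean distance: 5.0
```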
Binary Variables
• A contingency table for binary data (object i vs. object j):

                   Object j
                   1       0       sum
  Object i   1     a       b       a + b
             0     c       d       c + d
           sum   a + c   b + d       p

• Simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i,j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant if the binary variable is asymmetric):
  d(i,j) = (b + c) / (a + b + c)
Dissimilarity between Binary
Variables
• Example

• gender is a symmetric attribute


• the remaining attributes are asymmetric binary
• let the values Y and P be set to 1, and the value N be set to 0
Nominal Variables

• A generalization of the binary variable in that it can take more


than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
  • d(i,j) = (p − m) / p, where m: # of matches, p: total # of variables
• Method 2: use a large number of binary variables
  • create a new binary variable for each of the M nominal states
Ordinal Variables
• An ordinal variable can be discrete or continuous
• order is important, e.g., rank
• Can be treated like interval-scaled variables
  • replace x_if by its rank r_if ∈ {1, …, M_f}
  • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_if = (r_if − 1) / (M_f − 1)
  • compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables

• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(−Bt)
• Methods:
• treat them like interval-scaled variables — not a good choice!
(why?)
• apply logarithmic transformation
yif = log(xif)
• treat them as continuous ordinal data treat their rank as
interval-scaled.
Variables of Mixed Types
• A database may contain all the six types of variables
• symmetric binary, asymmetric binary, nominal, ordinal,
interval and ratio.
• One may use a weighted formula to combine their effects:
  d(i,j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f)
  • if f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
  • if f is interval-based: use the normalized distance
  • if f is ordinal or ratio-scaled:
    • compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
    • treat z_if as interval-scaled
Major Clustering Approaches

• Partitioning algorithms: Construct various partitions and then


evaluate them by some criterion
• Hierarchy algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
Partitioning Algorithms: Basic Concept

• Partitioning method: Construct a partition of a database D of n


objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in 4 steps:


• Partition objects into k nonempty subsets
• Compute seed points as the centroids of the clusters of
the current partition. The centroid is the center (mean
point) of the cluster.
• Assign each object to the cluster with the nearest seed point.
• Go back to Step 2; stop when assignments no longer change.
The K-Means Clustering Method
• Example (figure: points are repeatedly reassigned to the nearest of the k centroids and the centroids are recomputed until the assignment stabilizes)
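
A minimal NumPy sketch of the four-step loop above (seed, assign, recompute, repeat until stable). The synthetic data and k = 2 are illustrative, and empty clusters are not handled in this sketch:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k arbitrary seed points from the data as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its cluster
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):   # Step 4: stop when nothing moves
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = k_means(X, k=2)
print(centroids)
```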
Comments on the K-Means Method

• Strength
• Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing and
genetic algorithms
• Weakness
• Applicable only when mean is defined, then what about
categorical data?
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
• A few variants of the k-means which differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang’98)
• Replacing means of clusters with modes
• Using new dissimilarity measures to deal with categorical
objects
• Using a frequency-based method to update modes of clusters
• A mixture of categorical and numerical data: k-prototype
method
The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters


• PAM (Partitioning Around Medoids, 1987)
• starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale
well for large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)
PAM (Partitioning Around Medoids) (1987)

• PAM (Kaufman and Rousseeuw, 1987), built in Splus


• Use real object to represent the cluster
• Select k representative objects arbitrarily
• For each pair of non-selected object h and selected object i,
calculate the total swapping cost TCih
• For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most similar
representative object
• repeat steps 2-3 until there is no change
PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih

(Figure: the four cases used to compute C_jih for a non-selected object j when medoid i is swapped with non-medoid h, depending on whether j is currently assigned to i or to another medoid t, and which of h or t it is closer to after the swap.)
Hierarchical Clustering
• Use distance matrix as clustering criteria. This method does not
require the number of clusters k as an input, but needs a
termination condition

(Figure: on objects a, b, c, d, e, agglomerative clustering (AGNES) merges step by step: a and b into ab, d and e into de, then cde, and finally abcde, while divisive clustering (DIANA) proceeds in the reverse order, splitting abcde back down to single objects.)
A Dendrogram Shows How the
Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
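
A short sketch of agglomerative clustering and dendrogram cutting using SciPy's scipy.cluster.hierarchy; the synthetic data, the single-linkage choice, and the cut thresholds are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# Build the full agglomerative merge tree (AGNES-style), here with single linkage.
Z = linkage(X, method="single")

# "Cut" the dendrogram: either at a distance threshold or into a fixed number of clusters.
labels_by_distance = fcluster(Z, t=2.0, criterion="distance")
labels_by_count = fcluster(Z, t=2, criterion="maxclust")
print(labels_by_count)
```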
DIANA (Divisive Analysis)

• Introduced in Kaufmann and Rousseeuw (1990)


• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
More on Hierarchical Clustering Methods
• Major weakness of agglomerative clustering methods
• do not scale well: time complexity of at least O(n2), where n is
the number of total objects
• can never undo what was done previously
• Integration of hierarchical with distance-based clustering
• BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
• CURE (1998): selects well-scattered points from the cluster
and then shrinks them towards the center of the cluster by a
specified fraction
• CHAMELEON (1999): hierarchical clustering using dynamic
modeling
BIRCH (1996)
• Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
• Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve the
inherent clustering structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
• Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
• Weakness: handles only numeric data, and sensitive to the order
of the data record.
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
  N: number of data points
  LS: Σ_{i=1..N} X_i (linear sum of the points)
  SS: Σ_{i=1..N} X_i² (square sum of the points)

Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16, 30), (54, 190))
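
A small sketch that computes a CF entry for a set of 2-D points and reproduces the example above (the function name clustering_feature is ours). CF entries are additive, which is what lets BIRCH merge sub-clusters cheaply:

```python
import numpy as np

def clustering_feature(points):
    """Return the BIRCH clustering feature (N, LS, SS) of a set of points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ls = pts.sum(axis=0)           # linear sum, per dimension
    ss = (pts ** 2).sum(axis=0)    # square sum, per dimension
    return n, tuple(ls), tuple(ss)

print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# -> (5, (16.0, 30.0), (54.0, 190.0)), matching CF = (5, (16, 30), (54, 190)) above
```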
CF Tree

(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold CF entries CF1, CF2, … with pointers to their children; leaf nodes hold CF entries and are chained together with prev/next pointers.)
CURE (Clustering Using
REpresentatives )

• CURE: proposed by Guha, Rastogi & Shim, 1998


• Stops the creation of a cluster hierarchy if a level consists of
k clusters
• Uses multiple representative points to evaluate the distance
between clusters, adjusts well to arbitrary shaped clusters
and avoids single-link effect
Drawbacks of Distance-Based
Method

• Drawbacks of square-error based clustering method


• Consider only one point as representative of a cluster
• Good only for convex shaped, similar size and density, and
if k can be reasonably estimated
Cure: The Algorithm
• Draw random sample s.
• Partition sample to p partitions with size s/p
• Partially cluster partitions into s/pq clusters
• Eliminate outliers
• By random sampling
• If a cluster grows too slow, eliminate it.

• Cluster partial clusters.


• Label data in disk
Data Partitioning and Clustering
• s = 50, p = 2, so s/p = 25 and s/pq = 5

(Figure: scatter plots of the sampled points before and after partially clustering each partition.)
Cure: Shrinking Representative Points

(Figure: the representative points of a cluster before and after shrinking.)

• Shrink the multiple representative points towards the gravity center by a fraction of α.
• Multiple representatives capture the shape of the cluster.
Clustering Categorical Data: ROCK
• ROCK: Robust Clustering using linKs,
by S. Guha, R. Rastogi, K. Shim (ICDE’99).
• Use links to measure similarity/proximity
• Not distance based
• Computational complexity: O(n² + n·m_m·m_a + n² log n)
• Basic ideas:
  • Similarity function and neighbors:
    Let T1 = {1, 2, 3}, T2 = {3, 4, 5}; then Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = |{3}| / |{1, 2, 3, 4, 5}| = 1/5 = 0.2
Rock: Algorithm
• Links: the number of common neighbours of two points.
  • Example: consider the transactions {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}.
  • link({1,2,3}, {1,2,4}) = 3, since they have 3 common neighbours: {1,2,5}, {1,3,4}, {2,3,4}.
• Algorithm
• Draw random sample
• Cluster with links
• Label data in disk
CHAMELEON

• CHAMELEON: hierarchical clustering using dynamic modeling,


by G. Karypis, E.H. Han and V. Kumar’99
• Measures the similarity based on a dynamic model
• Two clusters are merged only if the interconnectivity and
closeness (proximity) between two clusters are high
relative to the internal interconnectivity of the clusters
and closeness of items within the clusters
• A two phase algorithm
• 1. Use a graph partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
• 2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining these
sub-clusters
Overall Framework of CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition
• Several interesting studies:
• DBSCAN: Ester, et al. (KDD’96)
• OPTICS: Ankerst, et al (SIGMOD’99).
• DENCLUE: Hinneburg & D. Keim (KDD’98)
• CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering: Background
• Two parameters:
• Eps: Maximum radius of the neighbourhood
• MinPts: Minimum number of points in an Eps-neighbourhood
of that point
• N_Eps(p): {q belongs to D | dist(p, q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
  • 1) p belongs to N_Eps(q)
  • 2) core point condition: |N_Eps(q)| ≥ MinPts
  (Figure: example with MinPts = 5 and Eps = 1 cm.)
Density-Based Clustering: Background (II)

• Density-reachable:
  • A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i
• Density-connected:
  • A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
DBSCAN: Density Based Spatial Clustering
of Applications with Noise

• Relies on a density-based notion of cluster: A cluster is defined


as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise

(Figure: core, border, and outlier points of a cluster, for Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p wrt Eps and
MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been
processed.
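
A hedged sketch of running DBSCAN through scikit-learn's DBSCAN class, where eps and min_samples play the roles of Eps and MinPts above; the synthetic data and parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points that should come out as noise.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-2, 6, (10, 2))])

model = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps ~ Eps, min_samples ~ MinPts
print(set(model.labels_))                       # cluster ids; -1 marks noise/outlier points
```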
OPTICS: A Cluster-Ordering Method (1999)

• OPTICS: Ordering Points To Identify the Clustering Structure


• Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
• Produces a special order of the database wrt its
density-based clustering structure
• This cluster-ordering contains info equiv to the
density-based clusterings corresponding to a broad range of
parameter settings
• Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
• Can be represented graphically or using visualization
techniques
OPTICS: Some Extension from DBSCAN

• Index-based:
  • k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5
  • Complexity: O(k·N²)
• Core distance of an object o: the smallest distance that makes o a core object
• Reachability distance of p wrt. o: max(core-distance(o), d(o, p))
  • Example (MinPts = 5, ε = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm

(Figure: reachability-distance plot over the cluster order of the objects; valleys correspond to clusters, and objects with undefined reachability separate them.)
DENCLUE: using density functions
• DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
• Major features
• Solid mathematical foundation
• Good for data sets with large amounts of noise
• Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
• Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
• But needs a large number of parameters
Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a
tree-based access structure.
• Influence function: describes the impact of a data point within its
neighborhood.
• Overall density of the data space can be calculated as the sum of
the influence function of all data points.
• Clusters can be determined mathematically by identifying density
attractors.
• Density attractors are local maxima of the overall density function.
• Gradient: the steepness of a slope.

(Figure: example density functions showing a density attractor, with center-defined and arbitrary-shape clusters.)
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
• STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using wavelet method
• CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical Information Grid
Approach

• Wang, Yang and Muntz (VLDB’97)


• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels
of resolution
STING: A Statistical Information Grid
Approach (2)

• Each cell at a high level is partitioned into a number of smaller


cells in the next lower level
• Statistical info of each cell is calculated and stored beforehand and
is used to answer queries
• Parameters of higher level cells can be easily calculated from
parameters of lower level cell
• count, mean, s, min, max
• type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small number of
cells
• For each cell in the current level compute the confidence interval
STING: A Statistical Information Grid
Approach (3)
• Remove the irrelevant cells from further consideration
• When finish examining the current layer, proceed to the
next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
• Query-independent, easy to parallelize, incremental update
• O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
• All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected
WaveCluster (1998)
• Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach which applies
wavelet transform to the feature space
• A wavelet transform is a signal processing technique
that decomposes a signal into different frequency
sub-band.
• Both grid-based and density-based
• Input parameters:
• # of grid cells for each dimension
• the wavelet, and the # of applications of wavelet
transform.
What is Wavelet (1)?

(Figure: a wavelet transform decomposing a signal into frequency sub-bands.)
WaveCluster (1998)
• How to apply wavelet transform to find clusters
  • Summarize the data by imposing a multidimensional grid structure onto the data space
  • These multidimensional spatial data objects are represented in an n-dimensional feature space
  • Apply a wavelet transform on the feature space to find the dense regions in the feature space
  • Apply the wavelet transform multiple times, which results in clusters at different scales from fine to coarse
What Is Wavelet (2)?

(Figure: the feature space is quantized into grid cells and a wavelet transformation is then applied.)
WaveCluster (1998)
• Why is wavelet transformation useful for clustering
• Unsupervised clustering
It uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries
• Effective removal of outliers
• Multi-resolution
• Cost efficiency
• Major features:
• Complexity O(N)
• Detect arbitrary shaped clusters at different scales
• Not sensitive to noise, not sensitive to input order
• Only applicable to low dimensional data
CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
• Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
• CLIQUE can be considered as both density-based and grid-based
• It partitions each dimension into the same number of equal-length intervals
• It partitions an m-dimensional data space into non-overlapping
rectangular units
• A unit is dense if the fraction of total data points contained in
the unit exceeds the input model parameter
• A cluster is a maximal set of connected dense units within a
subspace
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie
inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori
principle
• Identify clusters:
• Determine dense units in all subspaces of interests
• Determine connected dense units in all subspaces of
interests.
• Generate minimal description for the clusters
• Determine maximal regions that cover a cluster of connected
dense units for each cluster
• Determination of minimal cover for each cluster
(Figure: CLIQUE example with density threshold τ = 3. Dense units are found in the (age, salary in units of 10,000) and (age, vacation in weeks) subspaces; the intersection of the corresponding regions forms a candidate cluster in the 3-D (age, salary, vacation) space.)
Strength and Weakness of CLIQUE

• Strength
• It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
• It is insensitive to the order of records in input and does
not presume some canonical data distribution
• It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weakness
• The accuracy of the clustering result may be degraded at
the expense of simplicity of the method
Model-Based Clustering Methods
• Attempt to optimize the fit between the data and some
mathematical model
• Statistical and AI approach
• Conceptual clustering
• A form of clustering in machine learning
• Produces a classification scheme for a set of unlabeled objects
• Finds characteristic description for each concept (class)
• COBWEB (Fisher’87)
• A popular and simple method of incremental conceptual learning
• Creates a hierarchical clustering in the form of a classification tree
• Each node refers to a concept and contains a probabilistic description
of that concept
COBWEB Clustering Method

A classification tree
More on Statistical-Based Clustering
• Limitations of COBWEB
• The assumption that the attributes are independent of
each other is often too strong because correlation may
exist
• Not suitable for clustering large database data – skewed
tree and expensive probability distributions
• CLASSIT
• an extension of COBWEB for incremental clustering of
continuous data
• suffers similar problems as COBWEB
• AutoClass (Cheeseman and Stutz, 1996)
• Uses Bayesian statistical analysis to estimate the number
of clusters
• Popular in industry
Other Model-Based Clustering
Methods
• Neural network approaches
• Represent each cluster as an exemplar, acting as a
“prototype” of the cluster
• New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure
• Competitive learning
• Involves a hierarchical architecture of several units
(neurons)
• Neurons compete in a “winner-takes-all” fashion for the
object currently being presented
Model-Based Clustering Methods
Self-organizing feature maps (SOMs)

• Clustering is also performed by having several units competing for the current object
• The unit whose weight vector is closest to the
current object wins
• The winner and its neighbors learn by having their
weights adjusted
• SOMs are believed to resemble processing that can
occur in the brain
• Useful for visualizing high-dimensional data in 2- or
3-D space
What Is Outlier Discovery?
• What are outliers?
• The set of objects are considerably dissimilar from the
remainder of the data
• Example: Sports: Michael Jordon, Wayne Gretzky, ...
• Problem
• Find top n outlier points
• Applications:
• Credit card fraud detection
• Telecom fraud detection
• Customer segmentation
• Medical analysis
Outlier Discovery: Statistical Approaches

● Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
• Use discordancy tests depending on
• data distribution
• distribution parameter (e.g., mean, variance)
• number of expected outliers
• Drawbacks
• most tests are for single attribute
• In many cases, data distribution may not be known
Outlier Discovery:
Distance-Based Approach
• Introduced to counter the main limitations imposed by
statistical methods
• We need multi-dimensional analysis without knowing data
distribution.
• Distance-based outlier: A DB(p, D)-outlier is an object O in a
dataset T such that at least a fraction p of the objects in T lies
at a distance greater than D from O
• Algorithms for mining distance-based outliers
• Index-based algorithm
• Nested-loop algorithm
• Cell-based algorithm
Outlier Discovery:
Deviation-Based Approach
• Identifies outliers by examining the main characteristics of
objects in a group
• Objects that “deviate” from this description are considered
outliers
• sequential exception technique
• simulates the way in which humans can distinguish
unusual objects from among a series of supposedly like
objects
• OLAP data cube technique
• uses data cubes to identify regions of anomalies in large
multidimensional data
Problems and Challenges

• Considerable progress has been made in scalable clustering


methods
• Partitioning: k-means, k-medoids, CLARANS
• Hierarchical: BIRCH, CURE
• Density-based: DBSCAN, CLIQUE, OPTICS
• Grid-based: STING, WaveCluster
• Model-based: Autoclass, Denclue, Cobweb
• Current clustering techniques do not address all the
requirements adequately
• Constraint-based clustering analysis: Constraints exist in data
space (bridges and highways) or in user queries
Constraint-Based Clustering Analysis

• Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
