
Unit-4

Frequent Itemsets & Clustering


Topic 1: Association Rule Mining (ARM) & Market Basket Analysis (MBA)

• ARM searches for interesting relationships among items in a given data set.
• MBA is a mathematical modelling technique based on the theory that if you buy a certain group of items, you are likely to buy another group of items.
• The set of items a customer buys is referred to as an itemset, and MBA seeks to find relationships between purchases.
• It is used to analyse customer purchasing behaviour and helps in increasing sales and maintaining inventory.
• In market basket analysis, association rules are used to predict the likelihood of products being purchased together. Association rules count the frequency of items that occur together, seeking to find associations that occur far more often than expected.

• The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative, level-wise search in which frequent k-itemsets are used to find (k+1)-itemsets.

• Market Basket Analysis is modelled on Association rule mining, i.e., the IF {}, THEN {}
construct. For example, IF a customer buys bread, THEN he is likely to buy butter as well.
Association rules are usually represented as: {Bread} -> {Butter}

• The Apriori algorithm: it first identifies the frequent individual items in the database and then extends them to larger and larger itemsets, keeping only those that still appear frequently enough in the data.
Apriori Property

“All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.”

Components of Apriori Algorithm


1. Support
2. Confidence
Suppose you have 4000 customer transactions in a Big Bazar.
Out of these 4000 transactions, 400 contain biscuits and 600 contain chocolates, and 200 transactions contain both biscuits and chocolates.
Support
Support refers to the default popularity of any product. You find the support by dividing the number of transactions containing that product by the total number of transactions.
Support(Biscuits) = (Transactions containing biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought chocolates. So, you divide the number of transactions containing both biscuits and chocolates by the number of transactions containing biscuits to get the confidence.
Confidence(Biscuits -> Chocolates) = (Transactions containing both biscuits and chocolates) / (Transactions containing biscuits)
= 200/400 = 50 percent.
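The figures above can be reproduced with a few lines of code. This is a minimal sketch, assuming hypothetical transaction counts that mirror the biscuits-and-chocolates example (4000 transactions, 400 containing biscuits, 200 containing both):

```python
# Minimal sketch: support and confidence for the biscuits/chocolates example.
# The counts below are the hypothetical figures used in the text, not real data.

total_transactions = 4000
biscuit_transactions = 400          # transactions containing biscuits
both_transactions = 200             # transactions containing biscuits AND chocolates

support_biscuits = biscuit_transactions / total_transactions          # 0.10 -> 10%
confidence_biscuits_choc = both_transactions / biscuit_transactions   # 0.50 -> 50%

print(f"Support(Biscuits) = {support_biscuits:.0%}")
print(f"Confidence(Biscuits -> Chocolates) = {confidence_biscuits_choc:.0%}")
```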
SUPPORT
Support is calculated as the number of transactions containing the itemset divided by the total number of transactions:
Support(A, B) = freq(A, B) / N
support(pen) = (transactions containing pen) / (total transactions)
i.e. support = 500/5000 = 10 percent

CONFIDENCE
Confidence measures whether a product is popular through individual sales or through combined sales. It is calculated as the number of combined transactions divided by the number of individual transactions:
Confidence(A -> B) = freq(A, B) / freq(A)
Confidence = (combined transactions) / (individual transactions)
i.e. confidence = 1000/5000 = 20 percent
Apriori Algorithm

Apriori property: All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Question:
minimum support count is 2
minimum confidence is 60%

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if the support_count of a candidate item is less than min_support, remove it. This gives us the frequent itemset L1.

Step-2: K=2
(I) Generate candidate set C2 by joining L1 with itself and pruning with the Apriori property.
(II) Compare the candidate (C2) support counts with the minimum support count (min_support = 2); itemsets whose support_count is less than min_support are removed. This gives us the frequent itemset L2.

Step-3: K=3
Repeat the same join, prune and count procedure on L2 to obtain C3 and L3.

Step-4: We stop here because no further frequent itemsets are found.
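The level-wise procedure above can be sketched in a few lines of Python. This is a simplified illustration rather than a full Apriori implementation; the toy transactions and min_support value are assumptions for demonstration only.

```python
from itertools import combinations

# Toy transactions (assumed for illustration); items are strings such as "I1".
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
min_support = 2  # minimum support count, as in the worked example

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    # C1 / L1: frequent single items
    items = sorted({i for t in transactions for i in t})
    L = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support]
    frequent = list(L)
    k = 2
    while L:
        # Candidate generation: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune with the Apriori property, then with the support threshold
        L = [c for c in candidates
             if all(frozenset(s) in frequent for s in combinations(c, k - 1))
             and support_count(c) >= min_support]
        frequent.extend(L)
        k += 1
    return frequent

for itemset in apriori(transactions, min_support):
    print(sorted(itemset), support_count(itemset))
```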
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)

So here, by taking one frequent itemset as an example, we show the rule generation.
Itemset I = {I1, I2, I3} //from L3 (the other frequent 3-itemset is {I1, I2, I5})
The non-empty proper subsets S of {I1, I2, I3} are {I1}, {I2}, {I3}, {I1, I2}, {I1, I3}, {I2, I3}.
Rule form: S => (I − S)
So the rules can be:

[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%

So if the minimum confidence is 50%, then the first 3 rules can be considered strong association rules.
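The rule-generation step (enumerate every non-empty proper subset S of a frequent itemset and test the confidence of S => I−S) can be sketched as follows. The support counts are the ones quoted in the example above and are hard-coded here purely for illustration.

```python
from itertools import combinations

# Support counts taken from the worked example above.
support = {
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I3"]): 6,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I3"]): 4, frozenset(["I2", "I3"]): 4,
    frozenset(["I1", "I2", "I3"]): 2,
}

def rules_from_itemset(itemset, min_confidence=0.5):
    """Generate rules S -> (I - S) whose confidence meets the threshold."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):                 # all non-empty proper subsets
        for s in combinations(itemset, r):
            s = frozenset(s)
            conf = support[itemset] / support[s]     # sup(I) / sup(S)
            if conf >= min_confidence:
                rules.append((set(s), set(itemset - s), conf))
    return rules

for lhs, rhs, conf in rules_from_itemset({"I1", "I2", "I3"}):
    print(f"{sorted(lhs)} => {sorted(rhs)}  confidence = {conf:.0%}")
```

With a minimum confidence of 50%, only the three rules with two-item antecedents are printed, matching the conclusion above.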
Tutorial
Step-1: Calculating C1 and L1:
• In the first step, we create a table that contains the support count (the frequency of each itemset counted individually in the dataset) of each itemset in the given dataset. This table is called the candidate set, or C1.
• Now, we take all the itemsets that have a support count greater than or equal to the minimum support (2). This gives us the table for the frequent itemset L1.
• Since all the itemsets except E have a support count greater than or equal to the minimum support, the itemset E is removed.

Step-2: Candidate Generation C2, and L2:

Again, we compare the C2 support counts with the minimum support count; after comparing, the itemsets with a lower support count are eliminated from table C2. This gives us the table for L2.

Step-3: Candidate generation C3, and L3:


Step-4: Finding the association rules for the subsets:

Rules        Support   Confidence
A^B -> C     2         sup(A^B^C)/sup(A^B) = 2/4 = 0.5  = 50%
B^C -> A     2         sup(A^B^C)/sup(B^C) = 2/4 = 0.5  = 50%
A^C -> B     2         sup(A^B^C)/sup(A^C) = 2/4 = 0.5  = 50%
C -> A^B     2         sup(A^B^C)/sup(C)   = 2/5 = 0.4  = 40%
A -> B^C     2         sup(A^B^C)/sup(A)   = 2/6 = 0.33 = 33.33%
B -> A^C     2         sup(A^B^C)/sup(B)   = 2/7 = 0.28 = 28%
Handling large data set in main Memory- PCY Algorithm(Park Chen Yu)
Handling large data set in main Memory- Multistage Algorithm
Handling large data set in main Memory- Multihash Algorithm
Limited Pass Algorithm
Simple Randomized Algorithm
Simple Randomized Algorithm
Avoiding Errors in Sampling Algorithm

• The simple sampling algorithm cannot be relied upon either to produce all itemsets that are frequent in the whole dataset, or to produce only itemsets that are frequent in the whole dataset.

• An itemset that is frequent in the whole dataset but not in the sample is a false negative, while an itemset that is frequent in the sample but not in the whole dataset is a false positive.

• We can eliminate false positives by making a pass through the full dataset and counting all itemsets that were identified as frequent in the sample, keeping only those that are also frequent in the whole dataset.

• We cannot eliminate false negatives completely, but we can reduce their number if the amount of main memory allows it (for example, by lowering the support threshold used on the sample).
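A hedged sketch of this idea: run the frequent-itemset search on a random sample with a proportionally scaled threshold, then make one full pass to discard false positives. The `find_frequent_itemsets` parameter is assumed to be any in-memory algorithm (for example the Apriori sketch shown earlier) that returns itemsets as frozensets.

```python
import random

def sample_then_verify(transactions, min_support_count, find_frequent_itemsets,
                       sample_fraction=0.1, seed=0):
    """Sampling algorithm with false-positive elimination (false negatives may remain)."""
    rng = random.Random(seed)

    # Pass over a sample with a support threshold scaled down by the sample fraction.
    sample = [t for t in transactions if rng.random() < sample_fraction]
    sample_threshold = min_support_count * sample_fraction
    candidates = find_frequent_itemsets(sample, sample_threshold)

    # Full pass: count the candidates over the whole dataset to remove false positives.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return [c for c, n in counts.items() if n >= min_support_count]
```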
The Algorithm of Savasere, Omiecinski, and Navathe (SON)

• An improvement over the simple sampling algorithm.
• Avoids both false negatives and false positives, at the cost of making two full passes.
• The idea is to divide the input file into chunks.
• Treat each chunk as a sample and run the algorithm on that chunk.
• Once all chunks are processed, take the union of all itemsets that have been found frequent in one or more chunks. These are the candidate itemsets.
• Every itemset that is frequent in the whole dataset is frequent in at least one chunk, so we can be sure that all truly frequent itemsets are among the candidates, i.e., there are no false negatives.
The SON Algorithm and MapReduce
Toivonen's Algorithm

• One pass over a small sample and one full pass over the data.
• Avoids both false negatives and false positives, but there is a small probability that it will fail to produce any answer at all.

1st pass: candidates
1. Select a small sample.
2. Use a smaller threshold, such as 0.9·p·s (where p is the fraction of the data sampled and s is the support threshold), to find the candidate frequent itemsets F.
3. Construct the negative border N: itemsets that are not frequent in the sample, but all of whose immediate subsets (subsets constructed by deleting exactly one item) are frequent in the sample.

2nd pass: check, counting all itemsets in F and N.
4. If no member of N is frequent in the whole dataset, output the itemsets that are frequent in the whole dataset.
5. Otherwise, give no answer and resample.
The Algorithm of Savasere, Omiecinski, and Navathe (SON)

Avoids both false negatives and false positives, at the cost of making two full passes.

1st pass: find candidates.
1. Divide the input file into chunks.
2. Treat each chunk as a sample, using p·s as the threshold (p = fraction of the data in the chunk, s = support threshold).
3. Candidate itemsets: the union of all the itemsets that have been found frequent in one or more chunks.
Idea: every itemset that is frequent in the whole dataset is frequent in at least one chunk.

2nd pass: count all the candidates and check which are frequent in the whole dataset.


The SON Algorithm and MapReduce

1. First Map Function: emit (F, 1), where F is a frequent itemset found in the map task's chunk.
2. First Reduce Function: combine all the F's to construct the candidate itemsets.
3. Second Map Function: emit (C, v), where C is one of the candidate itemsets and v is its support count in the chunk.
4. Second Reduce Function: sum the support counts and keep the itemsets whose total meets the support threshold (the truly frequent itemsets).
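A minimal sketch of the two MapReduce rounds, simulated in plain Python. The chunking, the `find_frequent_itemsets` helper, and the proportional threshold scaling are assumptions made for illustration; a real deployment would run these phases as actual MapReduce or Spark jobs.

```python
def son(transaction_chunks, min_support_count, find_frequent_itemsets):
    """Simulated SON: two 'MapReduce' rounds over a list of transaction chunks."""
    total = sum(len(chunk) for chunk in transaction_chunks)

    # Round 1 map: each chunk emits (F, 1) for itemsets frequent in that chunk,
    # using a proportionally lowered support threshold.
    round1 = []
    for chunk in transaction_chunks:
        chunk_threshold = min_support_count * len(chunk) / total
        round1.extend((f, 1) for f in find_frequent_itemsets(chunk, chunk_threshold))

    # Round 1 reduce: union of all locally frequent itemsets = candidate itemsets.
    candidates = {f for f, _ in round1}

    # Round 2 map: each chunk emits (C, local support count) for every candidate.
    round2 = []
    for chunk in transaction_chunks:
        for c in candidates:
            round2.append((c, sum(1 for t in chunk if c <= t)))

    # Round 2 reduce: sum counts and keep itemsets frequent in the whole dataset.
    totals = {}
    for c, v in round2:
        totals[c] = totals.get(c, 0) + v
    return [c for c, v in totals.items() if v >= min_support_count]
```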
Toivonen's Algorithm

• One pass over a small sample and one full pass over the data.
• Avoids both false negatives and false positives, but there is a small probability that it will fail to produce any answer at all.

1st pass: candidates
1. Select a small sample.
2. Use a smaller threshold, such as 0.9·p·s, to find the candidate frequent itemsets F.
3. Construct the negative border N: itemsets that are not frequent in the sample, but all of whose immediate subsets (subsets constructed by deleting exactly one item) are frequent in the sample.

2nd pass: check, counting all itemsets in F and N.
1. If no member of N is frequent in the whole dataset, output the itemsets of F that are frequent in the whole dataset.
2. Otherwise, give no answer and resample.

Why it works:
1. False positives are eliminated because every candidate is checked against the full dataset.
2. False negatives are eliminated (i.e., every itemset that is truly frequent in the whole dataset is found): if some truly frequent itemset were missing from F, one of its subsets would belong to the negative border and would turn out to be frequent in the whole dataset, in which case the algorithm gives no answer rather than a wrong one.
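The negative border is the easiest part of Toivonen's algorithm to get wrong, so here is a small sketch of how it could be computed from the itemsets found frequent in the sample. The item universe and the frozenset representation are assumptions for illustration.

```python
from itertools import combinations

def negative_border(frequent_in_sample, items):
    """Itemsets not frequent in the sample whose immediate subsets all are."""
    frequent = set(frequent_in_sample)
    border = set()
    max_size = max((len(f) for f in frequent), default=0) + 1
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(items), k):
            candidate = frozenset(combo)
            if candidate in frequent:
                continue
            immediate_subsets = [candidate - {i} for i in candidate]
            # A singleton's only immediate subset is the empty set, treated as frequent.
            if all(s in frequent or not s for s in immediate_subsets):
                border.add(candidate)
    return border

# Example: items {A, B, C}; frequent in the sample: {A}, {B}, {A, B}
freq = {frozenset("A"), frozenset("B"), frozenset({"A", "B"})}
print(negative_border(freq, {"A", "B", "C"}))
# -> {frozenset({'C'})}: C is not frequent in the sample, but its only immediate
#    subset (the empty set) is, so it sits on the negative border.
```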
Clustering

It is basically a type of unsupervised learning method.

Clustering is the task of dividing unlabeled data points into different clusters such that similar data points fall in the same cluster, while dissimilar data points fall in different clusters.
Types of clustering methods:
• Partitioning Clustering
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
K-means Clustering

K-Means Clustering is an unsupervised learning algorithm which groups the unlabeled dataset into different clusters.
K defines the number of pre-defined clusters that need to be created in the process; if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
K-means Algorithm

Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they need not be points from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster (the mean of the points assigned to it).
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
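A compact sketch of these steps in plain Python, using the 15 points of the solved example below. The fixed random seed and the simple "no reassignment" convergence test are assumptions made for this illustration.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means on 2-D points, following the steps listed above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # Step 2: pick K random points as centroids
    for _ in range(max_iters):
        # Steps 3/5: assign every point to its closest centroid (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:             # Step 6: stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(2, 10), (2, 6), (11, 11), (6, 9), (6, 4), (1, 2), (5, 10), (4, 9),
          (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)]
centroids, clusters = kmeans(points, k=3)
print(centroids)
```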
Measures of Distance in Data Mining
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance

Euclidean Distance:
• It can be simply explained as the ordinary straight-line distance between two points.
• It is one of the most commonly used distance measures in cluster analysis.
• One of the algorithms that uses this formula is K-means.
• Mathematically, it is the square root of the sum of squared differences between the coordinates of the two objects: d(x, y) = sqrt(Σᵢ (xᵢ − yᵢ)²).
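The three distance measures can be written directly from their definitions; a small sketch:

```python
def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """Generalisation: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(euclidean((2, 10), (2, 6)))   # 4.0, matching A1's distance from centroid (2,6) below
```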
Solved Example
Que: Given k = 3, cluster the following points using K-means (the initial centroids, taken from the first-iteration table, are A2 (2,6), A7 (5,10) and A15 (6,11)).
Point Coordinates
A1 (2,10)
A2 (2,6)
A3 (11,11)
A4 (6,9)
A5 (6,4)
A6 (1,2)
A7 (5,10)
A8 (4,9)
A9 (10,12)
A10 (7,5)
A11 (9,11)
A12 (4,6)
A13 (3,10)
A14 (3,8)
A15 (6,11)
Results from 1st iteration of K-means clustering
(Centroid 1 = (2,6), Centroid 2 = (5,10), Centroid 3 = (6,11))

Point        Dist. from C1 (2,6)   Dist. from C2 (5,10)   Dist. from C3 (6,11)   Assigned Cluster
A1  (2,10)   4                     3                      4.123106               Cluster 2
A2  (2,6)    0                     5                      6.403124               Cluster 1
A3  (11,11)  10.29563              6.082763               5                      Cluster 3
A4  (6,9)    5                     1.414214               2                      Cluster 2
A5  (6,4)    4.472136              6.082763               7                      Cluster 1
A6  (1,2)    4.123106              8.944272               10.29563               Cluster 1
A7  (5,10)   5                     0                      1.414214               Cluster 2
A8  (4,9)    3.605551              1.414214               2.828427               Cluster 2
A9  (10,12)  10                    5.385165               4.123106               Cluster 3
A10 (7,5)    5.09902               5.385165               6.082763               Cluster 1
A11 (9,11)   8.602325              4.123106               3                      Cluster 3
A12 (4,6)    2                     4.123106               5.385165               Cluster 1
A13 (3,10)   4.123106              2                      3.162278               Cluster 2
A14 (3,8)    2.236068              2.828427               4.242641               Cluster 1
A15 (6,11)   6.403124              1.414214               0                      Cluster 3
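The first-iteration table can be reproduced by measuring each point against the three initial centroids; a short check (centroids taken from the table header above):

```python
import math

points = {"A1": (2, 10), "A2": (2, 6), "A3": (11, 11), "A4": (6, 9), "A5": (6, 4),
          "A6": (1, 2), "A7": (5, 10), "A8": (4, 9), "A9": (10, 12), "A10": (7, 5),
          "A11": (9, 11), "A12": (4, 6), "A13": (3, 10), "A14": (3, 8), "A15": (6, 11)}
centroids = [(2, 6), (5, 10), (6, 11)]   # initial centroids used in the 1st iteration

for name, p in points.items():
    dists = [math.dist(p, c) for c in centroids]
    cluster = dists.index(min(dists)) + 1
    print(name, [round(d, 6) for d in dists], f"Cluster {cluster}")
```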
Results from 2nd iteration of K-means clustering
(Centroid 1 = (3.833, 5.167), Centroid 2 = (4, 9.6), Centroid 3 = (9, 11.25))

Point        Dist. from C1   Dist. from C2   Dist. from C3   Assigned Cluster
A1  (2,10)   5.169           2.040           7.111           Cluster 2
A2  (2,6)    2.013           4.118           8.750           Cluster 1
A3  (11,11)  9.241           7.139           2.016           Cluster 3
A4  (6,9)    4.403           2.088           3.750           Cluster 2
A5  (6,4)    2.461           5.946           7.846           Cluster 1
A6  (1,2)    4.249           8.171           12.230          Cluster 1
A7  (5,10)   4.972           1.077           4.191           Cluster 2
A8  (4,9)    3.837           0.600           5.483           Cluster 2
A9  (10,12)  9.204           6.462           1.250           Cluster 3
A10 (7,5)    3.171           5.492           6.562           Cluster 1
A11 (9,11)   7.792           5.192           0.250           Cluster 3
A12 (4,6)    0.850           3.600           7.250           Cluster 1
A13 (3,10)   4.904           1.077           6.129           Cluster 2
A14 (3,8)    2.953           1.887           6.824           Cluster 2
A15 (6,11)   6.223           2.441           3.010           Cluster 2
Results from 3rd iteration of K-means clustering
(Centroid 1 = (4, 4.6), Centroid 2 = (4.143, 9.571), Centroid 3 = (10, 11.333))

Point        Dist. from C1   Dist. from C2   Dist. from C3   Assigned Cluster
A1  (2,10)   5.758           2.186           8.110           Cluster 2
A2  (2,6)    2.441           4.165           9.615           Cluster 1
A3  (11,11)  9.485           7.004           1.054           Cluster 3
A4  (6,9)    4.833           1.943           4.631           Cluster 2
A5  (6,4)    2.088           5.872           8.353           Cluster 1
A6  (1,2)    3.970           8.197           12.966          Cluster 1
A7  (5,10)   5.492           0.958           5.175           Cluster 2
A8  (4,9)    4.400           0.589           6.438           Cluster 2
A9  (10,12)  9.527           6.341           0.667           Cluster 3
A10 (7,5)    3.027           5.390           7.008           Cluster 1
A11 (9,11)   8.122           5.063           1.054           Cluster 3
A12 (4,6)    1.400           3.574           8.028           Cluster 1
A13 (3,10)   5.492           1.221           7.126           Cluster 2
A14 (3,8)    3.544           1.943           7.753           Cluster 2
A15 (6,11)   6.705           2.343           4.014           Cluster 2
Tutorial
K-Medoids clustering
Example:
Hierarchical Clustering in Machine Learning

Hierarchical clustering is another unsupervised machine learning algorithm, used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis, or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.

The hierarchical clustering technique has two approaches:

1. Agglomerative: a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merges them until one cluster is left.
2. Divisive: the reverse of the agglomerative algorithm, as it is a top-down approach that starts with one cluster containing all points and recursively splits it.
Question. Find the clusters using a single link technique. Use Euclidean distance and
draw the dendrogram.
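Since the question's distance matrix lives on a slide that is not reproduced here, the sketch below applies single-link (minimum-distance) agglomerative clustering with Euclidean distance to a small hypothetical point set using SciPy; the points and labels are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical 2-D points (the question's own data is on the slide, not shown here).
points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Single-link agglomerative clustering with Euclidean distance.
Z = linkage(points, method="single", metric="euclidean")

dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.title("Single-link dendrogram")
plt.show()
```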
Tutorial
1. Complete this matrix and create the dendrogram. 2. Implement k-Medoids clustering.

Clustering High-Dimensional Data: CLIQUE and ProCLUS

CLIQUE (Clustering In QUEst)

CLIQUE is a density-based and grid-based subspace clustering algorithm. So let's first take a look at what grid-based and density-based clustering techniques are.

• Grid-Based Clustering Technique: in grid-based methods, the space of instances is divided into a grid structure. Clustering techniques are then applied using the cells of the grid, instead of individual data points, as the base units.

• Density-Based Clustering Technique: in density-based methods, a cluster is a maximal set of connected dense units in a subspace.
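A toy sketch of the grid-plus-density idea behind CLIQUE: partition each dimension into equal-width intervals, count how many points fall into each grid cell, and keep the cells whose count reaches a density threshold. The cell width and threshold are assumed values, and full CLIQUE additionally combines dense units across subspaces, which is omitted here.

```python
from collections import Counter

def dense_cells(points, cell_width=1.0, density_threshold=2):
    """Map each point to its grid cell and keep cells with enough points."""
    counts = Counter(tuple(int(coord // cell_width) for coord in p) for p in points)
    return {cell: n for cell, n in counts.items() if n >= density_threshold}

points = [(0.2, 0.3), (0.4, 0.8), (0.9, 0.1), (2.1, 2.2), (2.4, 2.9), (2.8, 2.5)]
print(dense_cells(points))   # two dense unit cells: (0, 0) and (2, 2)
```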
ProCLUS (Projected Clustering)

Projected clustering, also known as subspace clustering, is a technique used to identify clusters in high-dimensional data by considering subsets of dimensions, i.e., projections of the data into lower-dimensional subspaces.

The projected clustering algorithm is based on the concept of k-medoid clustering and was presented by Aggarwal et al. (1999). It starts by selecting medoids from a sample of the data and then iteratively refines the result.

The quality of clusters in the projected clustering algorithm is typically measured by the average distance between data points and their closest medoid. This measure helps in determining how compact and well separated the clusters in the output are.
Clustering in Non-Euclidean Space
GRGPF Algorithm (V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell and J. French)
Algorithm

1) Representing clusters in the GRGPF algorithm
If P is a point in a cluster, then
ROWSUM(P) = sum of the squares of the distances from P to each of the other points in the cluster.
The clustroid of a cluster is the point with the smallest ROWSUM.

2) Initializing the cluster tree

3) Adding points in the GRGPF algorithm
When a new point P is added to a cluster with N points and clustroid C, its row-sum is estimated as
ROWSUM(P) = ROWSUM(C) + N · d(P, C)²
where d(P, C) is the distance between P and the clustroid C.

4) Splitting and merging clusters
The radius of a cluster is √(ROWSUM(C) / N).
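A small sketch of the two formulas above: estimating ROWSUM for a newly inserted point from the clustroid's ROWSUM, and computing the cluster radius. The distance value and the stored cluster summary are simplified assumptions; the real GRGPF algorithm keeps these statistics for many clusters in a B-tree-like structure.

```python
import math

def estimated_rowsum(rowsum_clustroid, n_points, dist_to_clustroid):
    """ROWSUM(P) = ROWSUM(C) + N * d(P, C)^2 for a point P added to the cluster."""
    return rowsum_clustroid + n_points * dist_to_clustroid ** 2

def cluster_radius(rowsum_clustroid, n_points):
    """Radius = sqrt(ROWSUM(C) / N)."""
    return math.sqrt(rowsum_clustroid / n_points)

# Hypothetical cluster summary: 50 points, clustroid ROWSUM of 180.0
rowsum_c, n = 180.0, 50
print(estimated_rowsum(rowsum_c, n, dist_to_clustroid=1.5))  # 180 + 50*2.25 = 292.5
print(round(cluster_radius(rowsum_c, n), 3))                 # sqrt(3.6) ≈ 1.897
```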


CLUSTERING FOR STREAMS & PARALLELISM
BDMO Algorithm (B. Babcock, M. Datar, R. Motwani, L. O'Callaghan)
