unit-4.pptx
• ARM searches for interesting relationships among the items in a given data set.
• MBA is a mathematical modelling technique based on the theory that if you buy a certain group of
items, you are likely to buy another group of items.
• The set of items a customer buys is referred to as an itemset, and MBA seeks to find relationships
between purchases.
• It is used to analyse customer purchasing behaviour and helps in increasing sales and
maintaining inventory.
• In market basket analysis, association rules are used to predict the likelihood of products being
purchased together. Association rules count the frequency of items that occur together, seeking
to find associations that occur far more often than expected.
• The algorithm is named Apriori because it uses prior knowledge of frequent itemset
properties. We apply an iterative, level-wise search in which frequent k-itemsets are
used to find (k+1)-itemsets.
• Market Basket Analysis is modelled on Association rule mining, i.e., the IF {}, THEN {}
construct. For example, IF a customer buys bread, THEN he is likely to buy butter as well.
Association rules are usually represented as: {Bread} -> {Butter}
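The sketch below (a minimal Python example with a made-up basket list; the item names and counts are assumptions, not data from the slides) shows the counting step behind a rule such as {Bread} -> {Butter}: tally how often each item and each pair of items occurs together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; item names are for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)                          # individual frequencies
    pair_counts.update(combinations(sorted(basket), 2)) # co-occurrence frequencies

# A rule like {bread} -> {butter} becomes interesting when the pair occurs
# far more often than one would expect from the individual frequencies.
print(item_counts)
print(pair_counts.most_common(3))
```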
• The Apriori algorithm: it identifies frequent individual items in the database and then extends
them to larger and larger itemsets, evaluating the frequency of each as the itemsets grow.
Apriori Property
All non-empty subsets of a frequent itemset must also be frequent; equivalently, if an itemset is infrequent, none of its supersets can be frequent.
CONFIDENCE
Confidence indicates whether a product sells because it is popular on its own or because it is
bought together with other items. It is calculated as the number of combined
transactions divided by the number of individual transactions.
Step-2: K=2
(II) Compare the candidate set (C2) support counts with the minimum support count (here min_support = 2;
if the support count of a candidate itemset is less than min_support, remove that itemset). This gives us
the itemset L2.
Step-3:
Step-4: We stop here because no further frequent itemsets are found.
Confidence –
A confidence of 60% means that 60% of the customers, who purchased milk and bread also bought butter.
Confidence(A->B)=Support_count(A∪B)/Support_count(A)
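A short Python sketch of this formula on a hypothetical transaction list (the baskets below are assumptions chosen so the result matches the 60% figure above):

```python
# Hypothetical transactions used only to illustrate the formula.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
]

def support_count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

A = {"milk", "bread"}
B = {"butter"}

# Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
confidence = support_count(A | B, transactions) / support_count(A, transactions)
print(f"Confidence({A} -> {B}) = {confidence:.0%}")   # 60%
```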
So here, by taking an example of any frequent itemset, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
I = {1, 2, 3} and {1, 2, 5}
S = {1}, {2}, {3}, {1,2}, {1,3}, {2,3} and {1}, {2}, {5}, {1,2}, {1,5}, {2,5}
Rule: S → (I − S)
So the rules can be:
So if minimum confidence is 50%, then first 3 rules can be considered as strong association rules.
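A sketch of the S → (I − S) rule-generation step in Python. The support counts below are assumptions for illustration (not the slide's dataset); with them, exactly the three rules generated from two-item subsets reach 50% confidence.

```python
from itertools import combinations

# Hypothetical support counts, assumed for illustration only.
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4, frozenset({"I1", "I2", "I3"}): 2,
}

I = frozenset({"I1", "I2", "I3"})   # a frequent itemset from L3
min_confidence = 0.5

# For every non-empty proper subset S of I, form the rule S -> (I - S).
for r in range(1, len(I)):
    for S in map(frozenset, combinations(I, r)):
        confidence = support[I] / support[S]
        strong = "strong" if confidence >= min_confidence else "weak"
        print(f"{set(S)} -> {set(I - S)}  confidence={confidence:.0%} ({strong})")
```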
Tutorial
Step-1: Calculating C1 and L1:
• In the first step, we create a table that contains the support count (the frequency of each itemset individually
in the dataset) of each itemset in the given dataset. This table is called the candidate set C1.
• Now, we take all the itemsets whose support count is greater than or equal to the minimum support (2). This
gives us the table for the frequent itemset L1.
• Since all the itemsets except E have a support count greater than or equal to the minimum support, only the E
itemset is removed.
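A minimal Python sketch of Step-1, assuming a small made-up transaction table over items A–E and min_support = 2:

```python
from collections import Counter

# Assumed transactions over items A..E, for illustration only.
transactions = [
    {"A", "B", "C"},
    {"B", "D"},
    {"A", "B", "D"},
    {"A", "C"},
    {"B", "C", "E"},   # E appears only once, so it will be pruned
]
min_support = 2

# C1: support count of every individual item (the candidate 1-itemsets)
C1 = Counter(item for t in transactions for item in t)
print("C1:", dict(C1))

# L1: keep only items whose support count >= min_support
L1 = {item: count for item, count in C1.items() if count >= min_support}
print("L1:", L1)
```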
• A simple random-sample approach cannot be relied upon either to produce all itemsets that are frequent in the whole
dataset, or to produce only itemsets that are frequent in the whole.
• An itemset that is frequent in the whole dataset but not in the sample is a false negative, while an
itemset that is frequent in the sample but not in the whole is a false positive.
• We can eliminate false positives by making a pass through the full dataset and counting
all itemsets that were identified as frequent in the sample, keeping only those that are also frequent in the whole (see the sketch below).
• We cannot eliminate false negatives completely, but we can reduce their number
if the amount of main memory allows it, for example by lowering the support threshold used on the sample.
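A rough Python sketch of this sampling idea: find frequent itemsets on a random sample, then make one pass over the full dataset to discard false positives (false negatives may remain). The dataset, sample size, and thresholds below are all assumptions for illustration.

```python
import random
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, min_support, max_size=2):
    """Count itemsets up to `max_size` items and keep those meeting min_support."""
    counts = Counter()
    for basket in baskets:
        for size in range(1, max_size + 1):
            counts.update(combinations(sorted(basket), size))
    return {iset for iset, c in counts.items() if c >= min_support}

# Assumed data: 1000 random baskets of 3 items drawn from 10 items.
random.seed(0)
items = [f"item{i}" for i in range(10)]
dataset = [set(random.sample(items, k=3)) for _ in range(1000)]

s = 200                                   # support threshold on the full data
sample = random.sample(dataset, k=100)    # 10% sample
candidates = frequent_itemsets(sample, min_support=int(s * 0.1))

# Verification pass over the full dataset eliminates false positives.
verified = {c for c in candidates
            if sum(1 for b in dataset if set(c) <= b) >= s}
print(len(candidates), "candidates,", len(verified), "verified frequent")
```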
The Algorithm of Savasere, Omiecinski, and Navathe (SON)
• An improvement over simple sampling.
• Avoids both false negatives and false positives, at the cost of making two full
passes.
• The idea is to divide the input file into chunks.
• Treat each chunk as a sample and run the algorithm on that chunk.
• Once all chunks are processed, take the union of all itemsets that have been found frequent
in one or more chunks. These are the candidate itemsets.
• Every itemset that is frequent in the whole is frequent in at least one chunk, so we can be
sure that all truly frequent itemsets are among the candidates, i.e., there are no false
negatives.
The SON Algorithm and MapReduce
SON avoids both false negatives and false positives, at the cost of making two full passes. Expressed as two MapReduce jobs:
1. First Map Function: emit $(F, 1)$, where $F$ is a frequent itemset found in the chunk (sample).
2. First Reduce Function: combine all the $F$ to construct the candidate itemsets.
3. Second Map Function: emit $(C, v)$, where $C$ is one of the candidate itemsets and $v$ is its support count in the chunk.
4. Second Reduce Function: sum the counts and filter out the itemsets below the support threshold, leaving the frequent itemsets.
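A plain-Python sketch of the two SON passes described above (the per-chunk "first map/reduce" producing candidates, then a full count as the "second map/reduce"); chunk count, thresholds, and the tiny dataset are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_in_chunk(chunk, min_support):
    """First Map: frequent itemsets (sizes 1 and 2) within a single chunk."""
    counts = Counter()
    for basket in chunk:
        counts.update((item,) for item in basket)
        counts.update(combinations(sorted(basket), 2))
    return {iset for iset, c in counts.items() if c >= min_support}

def son(baskets, s, num_chunks=4):
    chunk_size = len(baskets) // num_chunks
    chunks = [baskets[i:i + chunk_size] for i in range(0, len(baskets), chunk_size)]

    # First Reduce: union of per-chunk frequent itemsets -> candidate itemsets.
    candidates = set()
    for chunk in chunks:
        candidates |= frequent_in_chunk(chunk, min_support=s // num_chunks)

    # Second Map + Reduce: count each candidate over the full dataset and filter.
    return {c for c in candidates
            if sum(1 for b in baskets if set(c) <= b) >= s}

# Tiny assumed example.
baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"},
           {"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
print(son(baskets, s=4))
```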
Toivonen's Algorithm
• One pass over a small sample and one full pass over the data.
• Avoids both false negatives and false positives, but there is a small probability that it will fail to produce any
answer at all.
1. First pass: candidates
1. Select a small sample.
2. Use a lowered threshold, such as $0.9ps$ (where $p$ is the fraction of the data sampled and $s$ the support threshold), to find candidate frequent itemsets $F$.
3. Construct the negative border $N$:
itemsets that are not frequent in the sample, but all of whose immediate subsets (subsets constructed by
deleting exactly one item) are frequent in the sample.
2. Second pass: check, counting all of $F$ and $N$ in the full dataset.
1. If no member of $N$ is frequent in the whole dataset, output the members of $F$ that are frequent in the whole.
2. Otherwise, give no answer and repeat the algorithm with a new sample.
Why it works:
1. It eliminates false positives, because every candidate is checked against the full dataset.
2. It eliminates false negatives (i.e., it finds all itemsets that are truly frequent in the whole): if some itemset were frequent in the whole but missing from the candidates, then some member of the negative border would also be frequent in the whole, and the algorithm would have given no answer rather than a wrong one.
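A sketch of the negative-border construction in Python, limited to itemsets of size 1 and 2 over a hypothetical sample (the sample, item universe, and lowered threshold are assumptions):

```python
from collections import Counter
from itertools import combinations

def sample_frequent(sample, min_support):
    """Frequent itemsets (sizes 1 and 2) in the sample, using a lowered threshold."""
    counts = Counter()
    for basket in sample:
        counts.update(frozenset([i]) for i in basket)
        counts.update(frozenset(p) for p in combinations(basket, 2))
    return {s for s, c in counts.items() if c >= min_support}

def negative_border(frequent, all_items):
    """Itemsets not frequent in the sample whose immediate subsets all are."""
    border = set()
    # Singletons: the empty set is trivially frequent, so any non-frequent item qualifies.
    border |= {frozenset([i]) for i in all_items if frozenset([i]) not in frequent}
    # Pairs: both one-item subsets must be frequent, but the pair itself must not be.
    for a, b in combinations(sorted(all_items), 2):
        pair = frozenset([a, b])
        if pair not in frequent and frozenset([a]) in frequent and frozenset([b]) in frequent:
            border.add(pair)
    return border

# Hypothetical sample of baskets.
sample = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}, {"a"}]
all_items = {"a", "b", "c", "d"}
F = sample_frequent(sample, min_support=2)   # lowered threshold, e.g. 0.9 * p * s
N = negative_border(F, all_items)
print("F:", F)
print("N:", N)
# Second pass (not shown): count every set in F and N over the full data;
# if no member of N turns out frequent, output the members of F that are.
```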
Clustering
Clustering is the task of dividing unlabeled data points into different clusters such that
similar data points fall in the same cluster, while dissimilar points fall in different clusters.
Partitioning Clustering
Density-Based Clustering
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different
clusters.
K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two
clusters, for K=3 there will be three clusters, and so on.
K-means Algorithm
1. Euclidean Distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
2. Manhattan Distance: $d(x, y) = \sum_i |x_i - y_i|$
3. Minkowski Distance: $d(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$
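A minimal Python sketch of the three distance measures and of the K-means loop (assign to nearest centroid, then recompute centroids); the toy points, K=2, and the fixed iteration count are assumptions.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def kmeans(points, k, iterations=10):
    centroids = random.sample(points, k)          # step 1: pick K initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # step 2: assign each point to nearest centroid
            idx = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):    # step 3: recompute centroids as cluster means
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)), minkowski((0, 0), (3, 4), 3))

random.seed(1)
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```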
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.
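A short sketch of building that hierarchy and drawing the dendrogram, assuming scipy and matplotlib are available; the toy points and the choice of single linkage are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Assumed toy points; each row is one data point.
X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9]])

# Build the cluster hierarchy bottom-up (agglomerative, single linkage here).
Z = linkage(X, method="single")

# The dendrogram is the tree-shaped picture of that hierarchy.
dendrogram(Z, labels=["p0", "p1", "p2", "p3", "p4"])
plt.xlabel("points")
plt.ylabel("merge distance")
plt.show()
```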
CLIQUE (CLustering In QUEst)
Projected Clustering
Projected clustering, also known as subspace clustering, is a technique that is used to identify
clusters in high-dimensional data by considering subsets of dimensions or projections of the data
into lower dimensions.
The projected clustering algorithm is based on the concept of k-medoid clustering and was
presented by Aggarwal et al. (1999).
It starts by selecting medoids from a sample of the data and then iteratively
improves the result.
This measure helps in determining how compact and separated the clusters are in
the output.
Clustering in Non-Euclidean Space
GRGPF Algorithm (V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French)
Algorithm
If P is a point in a cluster, then
ROWSUM(P) = the sum of the squares of the distances from P to each of the other points in the cluster.
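A small sketch of the ROWSUM definition over an assumed cluster of 2-D points. Squared Euclidean distance is used here only for illustration; GRGPF itself is designed for arbitrary (non-Euclidean) distance measures.

```python
def rowsum(p, cluster):
    """Sum of the squares of the distances from p to every other point in the cluster."""
    return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
               for q in cluster if q != p)

# Assumed cluster of 2-D points (GRGPF keeps such statistics per cluster).
cluster = [(0, 0), (1, 0), (0, 2), (2, 2)]
for p in cluster:
    print(p, "ROWSUM =", rowsum(p, cluster))
```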
I) Initializing the cluster tree
II) Adding points in GRGPF Algorithm