Big Data Analytics AAM Unit 4
(20CS733)
Unit IV
Frequent Itemsets And Clustering :
Mining Frequent Itemsets – Market Basket
Model – Apriori Algorithm – Handling Large
Data Sets In Main Memory – Limited Pass
Algorithm – Counting Frequent Itemsets In A
Stream – Clustering Techniques –
Hierarchical – K- Means – Clustering High
Dimensional Data – CLIQUE And PROCLUS –
Frequent Pattern Based Clustering Methods
– Clustering In Non-Euclidean Space –
Clustering For Streams And Parallelism.
Association Rule Discovery
Supermarket shelf management – Market-
basket model:
Goal: Identify items that are bought together
For example: finding communities in graphs (e.g., Twitter)
◦ Baskets = nodes; Items = outgoing neighbors
◦ View each node i as a basket Bi of the nodes i points to
◦ How? Searching a big graph for complete bipartite subgraphs Ks,t then amounts to finding sets of t nodes that occur together in the baskets Bi of s nodes
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
Frequent pairs here: {m,b}, {b,c}, {c,j}
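As a quick check of the pairs listed above, here is a brute-force pair count over these baskets (our own snippet; the support threshold of 3 baskets is an assumption for this illustration):

from collections import Counter
from itertools import combinations

# Brute-force pair counting over the B1..B8 baskets above (threshold of 3 assumed).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):
        pair_counts[pair] += 1

frequent_pairs = [p for p, c in pair_counts.items() if c >= 3]
print(frequent_pairs)   # [('b', 'c'), ('b', 'm'), ('c', 'j')]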
Association Rules
If-then rules about the contents of baskets
{i1, i2, …, ik} → j means: “if a basket contains all of i1, …, ik then it is likely to contain j”
In practice there are many rules, want
to find significant/interesting ones!
Confidence of this association rule is the fraction of baskets containing I that also contain j:
conf(I → j) = support(I ∪ {j}) / support(I)
Interesting Association Rules
Not all high-confidence rules are
interesting
◦ The rule X → milk may have high confidence for
many itemsets X, because milk is just purchased
very often (independent of X) and the confidence
will be high
Interest of an association rule I → j:
difference between its confidence and the
fraction of baskets that contain j
Interest(I → j) = conf(I → j) − Pr[j]
◦ Interesting rules are those with high positive or negative interest values (usually above 0.5)
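A small sketch of these two formulas in code (the function names are ours, for illustration; baskets are represented as Python sets):

# Sketch: confidence and interest of a rule I -> j over a list of baskets (sets).
def support(itemset, baskets):
    """Number of baskets containing every item of `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j, baskets):
    """conf(I -> j) = support(I u {j}) / support(I)."""
    return support(I | {j}, baskets) / support(I, baskets)

def interest(I, j, baskets):
    """interest(I -> j) = conf(I -> j) - Pr[j]."""
    pr_j = support({j}, baskets) / len(baskets)
    return confidence(I, j, baskets) - pr_j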
Example: Confidence and Interest
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
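For illustration (the rule is our own choice), consider {m, b} → c over these baskets: {m, b} appears in B1, B3, B5 and B6, so support({m, b}) = 4; of those baskets, c appears in B1 and B6, so conf({m, b} → c) = 2/4 = 0.5. Since c appears in 5 of the 8 baskets, Pr[c] = 5/8 = 0.625, and Interest({m, b} → c) = 0.5 − 0.625 = −0.125, a rule of low interest.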
A-Priori: main-memory picture
◦ Pass 1: keep a count for each item (item counts)
◦ Pass 2: keep the list of frequent items plus counts only for pairs of frequent items (candidate pairs)
◦ Pair counts take 4 bytes per pair in a triangular matrix, versus 12 bytes per occurring pair when stored as triples
Detail for A-Priori
You can use the triangular matrix method with n = number of frequent items
◦ May save space compared with storing triples
Trick: re-number the frequent items 1, 2, … and keep a table relating the new numbers to the original item numbers
[Main-memory picture: Pass 1 holds the item counts; Pass 2 holds the old item numbers, the frequent-item numbers, and the counts of pairs of frequent items.]
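A minimal sketch (ours) of the re-numbering trick and triangular-matrix indexing, assuming items are renumbered 1..n and the count for pair {i, j} with i < j lives at a fixed position of a flat array; the item letters and the renumbering table below are our own illustration:

# Triangular-matrix method: one counter per pair of frequent items.
def pair_position(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n; equals (i-1)(n - i/2) + j - i."""
    if i > j:
        i, j = j, i
    return (i - 1) * (2 * n - i) // 2 + (j - i)

n = 4                                  # number of frequent items
counts = [0] * (n * (n - 1) // 2)      # n(n-1)/2 pair counters (4 bytes each in principle)

renumber = {"b": 1, "c": 2, "j": 3, "m": 4}    # original item -> new number 1..n
old_item = {v: k for k, v in renumber.items()} # table relating new numbers back to items

basket = {"m", "c", "b"}
new_ids = sorted(renumber[x] for x in basket if x in renumber)
for a in range(len(new_ids)):
    for b in range(a + 1, len(new_ids)):
        counts[pair_position(new_ids[a], new_ids[b], n) - 1] += 1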
Frequent Triples, Etc.
For each k, we construct two sets of
k-tuples (sets of size k):
◦ Ck = candidate k-tuples = those that might be
frequent sets (support > s) based on information
from the pass for k–1
◦ Lk = the set of truly frequent k-tuples
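A compact sketch (ours) of this Ck → Lk construction on the B1…B8 baskets from earlier, assuming a support threshold of s = 3 baskets and a deliberately simplified candidate-generation step (all (k+1)-subsets of the frequent items, pruned by the A-Priori condition):

from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3

def support(itemset):
    return sum(1 for b in baskets if itemset <= b)

# L1: the truly frequent single items
L = [frozenset([i]) for i in set().union(*baskets) if support(frozenset([i])) >= s]
k = 1
while L:
    print(k, sorted(sorted(x) for x in L))
    items = sorted(set().union(*L))
    # C_{k+1}: candidates whose every k-subset is already in L_k (A-Priori pruning)
    C = [frozenset(c) for c in combinations(items, k + 1)
         if all(frozenset(sub) in L for sub in combinations(c, k))]
    L = [c for c in C if support(c) >= s]   # L_{k+1}: count candidates, keep the frequent ones
    k += 1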
•Introduction
•Applications
•Association rules
•Support
•Confidence
•Example
Introduction:
•The name derives from the idea of customers throwing all their purchases into a shopping cart, or market basket, during grocery shopping.
•A method of data analysis for marketing and retailing.
•Determines which products are purchased together.
•The strength of this method lies in using computer tools for the mining and analysis.
Strength of Market Basket Analysis: the beer and diapers story
•A large supermarket chain in the US analysed the buying habits of its customers and found a statistically significant correlation between purchases of beer and purchases of diapers on weekends.
Association rule:
•Expresses which products are purchased together, or which entities go together.
•Support: the percentage of transactions that contain both A and B.
•Confidence: the percentage of transactions containing A which also contain B.
•Example:
Transaction ID   Products
1                Shoes, trouser, shirt, belt
2                Shoes, trouser, shirt, hat, belt, scarf
3                Shoes, shirt
4                Shoes, trouser, belt
Consider the rule Trouser => Shirt; we will check whether this rule is an interesting one or not.
Transaction   shoes   trouser   shirt   belt   hat   scarf
T1            1       1         1       1      0     0
T2            1       1         1       1      1     1
T3            1       0         1       0      0     0
T4            1       1         0       1      0     0
Minimum support: 40 %
Minimum confidence: 65%
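Working it out from the table (using the standard definitions of support and confidence): trousers and shirts appear together in T1 and T2, so support(Trouser => Shirt) = 2/4 = 50%, which meets the 40% minimum support. Trousers appear in T1, T2 and T4, so confidence = 2/3 ≈ 67%, which meets the 65% minimum confidence. The rule is therefore accepted as an interesting one.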
ID Items
1 HotDogs, Buns, Ketchup
2 HotDogs, Buns
3 HotDogs, Coke, Chips
4 Chips, Coke
5 Chips, Ketchup
6 HotDogs, Coke, Chips
Clustering
High Dimensional Data
Given a cloud of data points, we want to understand its structure
The Problem of Clustering
Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
◦ Members of a cluster are close/similar to each other
◦ Members of different clusters are dissimilar
Usually:
◦ Points are in a high-dimensional space
◦ Similarity is defined using a distance measure
Euclidean, Cosine, Jaccard, edit distance, …
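For reference, minimal sketches (ours; the function names are our own) of three of the distance measures named above:

import math

def euclidean(a, b):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def jaccard_distance(s, t):
    """1 - Jaccard similarity of two sets."""
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)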
Example: Clusters & Outliers
[Figure: a scatter of points (x) forming several clusters, with one isolated point marked as an outlier.]
Clustering is a hard problem!
Why is it hard?
◦ Clustering in two dimensions looks easy
◦ Clustering small amounts of data looks easy
◦ And in most cases, looks are not deceiving
But many applications involve high-dimensional data. Example: clustering CDs, in a space with one dimension per customer
◦ Values in a dimension may be 0 or 1 only
◦ A CD is a point in this space (x1, x2, …, xk), where xi = 1 iff the i-th customer bought the CD
Clustering methods fall into two broad families: hierarchical clustering and point assignment.
Point assignment:
◦ Maintain a set of clusters
◦ Points belong to the “nearest” cluster
Hierarchical Clustering
Key operation: repeatedly combine the two nearest clusters
(1) How do you represent a cluster of many points?
◦ Key problem: As you merge clusters, how do you represent the “location” of each cluster, to tell which pair of clusters is closest?
◦ Euclidean case: each cluster has a centroid = average of its (data) points
(2) How to determine “nearness” of clusters?
◦ Measure cluster distances by distances of centroids
Example: Hierarchical clustering
[Figure: data points o at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); successive centroids x at (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3); and the corresponding dendrogram.]
And in the Non-Euclidean Case?
What about the Non-Euclidean case?
The only “locations” we can talk about are the points themselves, i.e., there is no “average” of two points
Approach 1:
◦ (1) How to represent a cluster of many
points?
clustroid = (data)point “closest” to other points
◦ (2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid,
when computing inter-cluster distances
“Closest” Point?
(1) How to represent a cluster of many
points?
clustroid = point “closest” to other points
Possible meanings of “closest”: smallest maximum distance to the other points, smallest average distance to the other points, or smallest sum of squares of distances to the other points
[Figure: a cluster showing its centroid (an artificial point) and its clustroid (an actual data point).]
◦ Another, density-based approach to cluster cohesion: take the diameter or avg. distance, e.g., and divide by the number of points in the cluster
Implementation
Naïve implementation of hierarchical
clustering:
◦ At each step, compute pairwise distances
between all pairs of clusters, then merge
◦ O(N³)
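A minimal sketch (ours) of this naïve loop in the Euclidean case, with each cluster represented by its centroid as described above:

import math

# Naive agglomerative clustering: at each step compute the distance between every
# pair of cluster centroids and merge the closest pair -> O(N^3) overall.
def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def naive_hierarchical(points, stop_at):
    clusters = [[p] for p in points]              # start: each point is its own cluster
    while len(clusters) > stop_at:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]                # merge the two nearest clusters
        del clusters[j]
    return clusters

# The six data points from the figure above, merged down to two clusters:
print(naive_hierarchical([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], 2))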
Example: Assigning Clusters
[Figures: the same data points (x) with their centroids marked, showing the clusters after round 1, after round 2, and at the end of the point-assignment process.]
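The rounds shown above follow the point-assignment approach; below is a minimal sketch (our own, assuming k-means-style assign-and-update rounds with Euclidean distance and random initial centroids):

import math
import random

# Point-assignment clustering sketch: in each round, assign every point to its
# nearest centroid, then recompute each centroid as the mean of its points.
def point_assignment(points, k, rounds=10, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)          # initial centroids: k of the points
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):         # update step
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

pts = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
centroids, clusters = point_assignment(pts, k=2)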
Getting the k right
How to select k?
Try different k, looking at the change in the average distance to centroid as k increases; the average falls rapidly until the right k, then changes little.
[Figure: average distance to centroid plotted against k, with the best value of k at the knee of the curve.]
Example: Picking k
[Figures: the same point cloud clustered with three choices of k.
Too few clusters: many long distances to centroid.
Just right: distances rather short.
Too many clusters: little improvement in average distance.]
Clustering: a one-dimensional worked example
Data: 2 5 6 8 12 15 18 28 30
4. Find the mean of each cluster (here the clusters are {2, 5, 6}, {8, 12, 15, 18} and {28, 30}): 4.3, 13.25, 29. These will serve as the new cluster heads/centres.
Placing the new centres among the data, we get:
2 4.3 5 6 8 12 13.25 15 18 28 29 30
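Continuing the iteration in the usual assign-then-update fashion (our assumption; the notes stop at this step), each point would next be assigned to the nearest of the new centres, and the centres recomputed. A small sketch:

# Sketch (ours): one reassignment round using the new centres from step 4.
data = [2, 5, 6, 8, 12, 15, 18, 28, 30]
centres = [4.3, 13.25, 29]

clusters = {c: [] for c in centres}
for x in data:
    nearest = min(centres, key=lambda c: abs(x - c))   # assign to nearest centre
    clusters[nearest].append(x)

print(clusters)
# {4.3: [2, 5, 6, 8], 13.25: [12, 15, 18], 29: [28, 30]}
new_centres = [sum(v) / len(v) for v in clusters.values()]
print(new_centres)   # [5.25, 15.0, 29.0]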