Data Analytics Unit-4
Clustering
1 Mining Frequent Itemsets
Frequent itemset mining is a popular data mining task that involves identifying sets of items that frequently
co-occur in a given dataset. In other words, it involves finding the items that occur together frequently and
then grouping them into sets of items. One way to approach this problem is by using the Apriori algorithm,
which is one of the most widely used algorithms for frequent itemset mining.
The Apriori algorithm works by iteratively generating candidate itemsets and then checking their fre-
quency against a minimum support threshold. The algorithm starts by generating all possible itemsets of
size 1 and counting their frequencies in the dataset. The itemsets that meet the minimum support threshold
are then selected as frequent itemsets. The algorithm then proceeds to generate candidate itemsets of size
2 from the frequent itemsets of size 1 and counts their frequencies. This process is repeated until no more
frequent itemsets can be generated.
However, when dealing with large datasets, this approach can become computationally expensive due
to the potentially large number of candidate itemsets that need to be generated and counted. Point-wise
frequent itemset mining is a more efficient alternative that can reduce the computational complexity of the
mining process.
Point-wise frequent itemset mining works by iterating over the transactions in the dataset and identifying
the itemsets that occur in each transaction. For each transaction, the algorithm generates a bitmap vector
where each bit corresponds to an item in the dataset, and its value is set to 1 if the item occurs in the
transaction and 0 otherwise. The algorithm then performs a bitwise AND operation between the bitmap
vectors of each transaction to identify the itemsets that occur in all the transactions. The itemsets that meet
the minimum support threshold are reported as frequent itemsets.
The advantage of point-wise frequent itemset mining is that it avoids generating candidate itemsets that
are not present in the dataset, thereby reducing the number of itemsets that need to be generated and
counted. Additionally, point-wise frequent itemset mining can be parallelized, making it suitable for mining
large datasets.
In summary, point-wise frequent itemset mining is an efficient alternative to the Apriori algorithm for
frequent itemset mining. It works by iterating over the transactions in the dataset and identifying the
itemsets that occur in each transaction, thereby avoiding the generation of candidate itemsets that are not
present in the dataset.
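As a simple illustration of the bitmap idea, the following R sketch represents each transaction as a logical vector over the item universe and obtains a support count by combining the item columns with a logical AND; the item names and transactions are made up for illustration:
items <- c("bread", "butter", "milk", "sugar", "eggs")
transactions <- list(
  c("bread", "butter", "milk"),
  c("butter", "milk", "sugar"),
  c("bread", "butter", "milk", "sugar")
)
# one row per transaction, one column per item; TRUE means the item occurs
bitmap <- t(sapply(transactions, function(tr) items %in% tr))
colnames(bitmap) <- items
# support count of an itemset: AND the item columns, then count the transactions
support_count <- function(itemset) {
  present <- Reduce(`&`, lapply(itemset, function(it) bitmap[, it]))
  sum(present)
}
support_count(c("butter", "milk"))  # 3
support_count(c("bread", "sugar"))  # 1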
2 Market-Based Modeling
Market-based modeling is a technique used in economics and business to analyze and simulate the behavior of
markets, particularly in relation to the supply and demand of goods and services. This modeling technique
involves creating mathematical models that can simulate how different market participants (consumers,
producers, and other agents) behave and interact with one another.
One of the most common market-based models is the supply and demand model, which assumes that the
price of a good or service is determined by the balance between its supply and demand. In this model, the
price of a good or service will rise if the demand for it exceeds its supply, and will fall if the supply exceeds
the demand.
Another popular market-based model is the game theory model, which is used to analyze how different
participants in a market interact with each other. Game theory models assume that market participants are
rational and act in their own self-interest, and seek to identify the strategies that each participant is likely
to adopt.
Market-based models can be used to analyze a wide range of economic phenomena, from the pricing
of individual goods and services to the behavior of entire industries and markets. They can also be used
to test the potential impact of various policies and interventions on the behavior of markets and market
participants.
Overall, market-based modeling is a powerful tool for understanding and predicting the behavior of
markets and the economy as a whole. By creating mathematical models that simulate the behavior of
market participants and the interactions between them, economists and business analysts can gain valuable
insights into the workings of markets, and develop strategies for managing and optimizing their performance.
3 Apriori Algorithm
The Apriori algorithm is a popular algorithm used in data mining and machine learning to discover frequent
itemsets in large transactional datasets. It was proposed by Agrawal and Srikant in 1994 and is widely used
in association rule mining, market basket analysis, and other data mining applications.
The Apriori algorithm uses a bottom-up approach to generate all frequent itemsets by first identifying
frequent individual items and then using those items to generate larger itemsets. The algorithm works as
follows:
First, the algorithm scans the entire dataset to identify all individual items and their frequency of
occurrence. This information is used to generate the initial set of frequent itemsets.
Next, the algorithm uses a level-wise search strategy to generate larger itemsets by combining frequent
itemsets from the previous level. The algorithm starts with two-itemsets and then progressively generates
larger itemsets.
At each level, the algorithm prunes the search space by eliminating itemsets that cannot be frequent
based on the minimum support threshold. This is done using the Apriori principle, which states that
every non-empty subset of a frequent itemset must also be frequent.
The algorithm terminates when no more frequent itemsets can be generated or when the maximum
itemset size is reached.
Once all frequent itemsets have been identified, the Apriori algorithm can be used to generate association
rules that describe the relationships between different items in the dataset. An association rule is a statement
of the form X → Y, where X and Y are disjoint itemsets. The rule indicates that there is a strong association
between the presence of X and the presence of Y in a transaction.
The strength of an association rule is measured using two metrics: support and confidence. Support is
the percentage of transactions in the dataset that contain both X and Y, while confidence is the percentage
of transactions containing X that also contain Y.
Overall, the Apriori algorithm is a powerful tool for discovering frequent itemsets and association rules
in large datasets. By identifying patterns and relationships between different items in the dataset, it can
be used to gain valuable insights into consumer behavior, market trends, and other important business and
economic phenomena.
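As a hedged illustration of this workflow in R, assuming the arules package is installed (the basket contents below are illustrative), frequent itemsets and association rules can be mined as follows:
library(arules)
baskets <- list(
  c("butter", "bread", "milk", "sugar"),
  c("butter", "flour", "milk", "sugar"),
  c("butter", "eggs", "milk", "salt"),
  c("eggs"),
  c("butter", "flour", "milk", "salt", "sugar")
)
trans <- as(baskets, "transactions")
# frequent itemsets with minimum support 60%
freq <- apriori(trans, parameter = list(supp = 0.6, target = "frequent itemsets"))
inspect(freq)
# association rules with minimum support 60% and minimum confidence 80%
rules <- apriori(trans, parameter = list(supp = 0.6, conf = 0.8))
inspect(sort(rules, by = "confidence"))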
4 Handling Large Datasets in Main Memory
Handling large datasets in main memory can be a challenging task, as the amount of memory available on
most computer systems is often limited. However, there are several techniques and strategies that can be
used to effectively manage and analyze large datasets in main memory:
Use data compression: Data compression techniques can be used to reduce the amount of memory
required to store a dataset. Techniques such as gzip or bzip2 can compress text data, while binary or
columnar storage formats can reduce the footprint of numerical data.
Use data partitioning: Large datasets can be partitioned into smaller, more manageable subsets,
which can be processed and analyzed in main memory. This can be done using techniques such as
horizontal (row-wise) or vertical (column-wise) partitioning.
Use data sampling: Data sampling can be used to select a representative subset of data for analysis,
without requiring the entire dataset to be loaded into memory. Random sampling, stratified sampling,
and cluster sampling are some of the commonly used sampling techniques.
Use in-memory databases: In-memory databases can be used to store large datasets in main
memory for faster querying and analysis. Examples of in-memory databases include Apache Ignite,
Redis, and SAP HANA.
Use parallel processing: Parallel processing techniques can be used to distribute the processing of
large datasets across multiple processors or cores. This can be done using libraries and frameworks
such as Apache Spark or Dask.
Use data streaming: Data streaming techniques can be used to process large datasets in real-time
by processing data as it is generated, rather than storing it in memory. Apache Kafka, Apache Flink,
and Apache Storm are some of the popular data streaming platforms.
Overall, effective management of large datasets in main memory requires a combination of data compression,
partitioning, sampling, in-memory databases, parallel processing, and data streaming techniques. By
leveraging these techniques, it is possible to effectively analyze and process large datasets in main memory,
even on systems with limited resources.
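For instance, one hedged way to combine chunked reading with random sampling in base R (the file name, chunk size, and sampling rate are illustrative assumptions) is:
con <- file("big_data.csv", open = "r")   # illustrative file name
header <- readLines(con, n = 1)
sampled <- character(0)
repeat {
  chunk <- readLines(con, n = 100000)     # read 100,000 lines per pass
  if (length(chunk) == 0) break
  keep <- runif(length(chunk)) < 0.01     # keep roughly 1% of the rows
  sampled <- c(sampled, chunk[keep])
}
close(con)
# the sample is small enough to analyze entirely in main memory
sample_df <- read.csv(text = c(header, sampled))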
5 Limited Pass Algorithms
A limited pass algorithm is a technique used in data processing and analysis to efficiently process large
datasets that cannot fit entirely in main memory.
In a limited pass algorithm, the dataset is processed in a fixed number of passes or iterations, where each
pass involves processing a subset of the data. The algorithm ensures that each pass is designed to capture
the relevant information needed for the analysis, while minimizing the memory required to store the data.
For example, a limited pass algorithm for processing a large text file could involve reading the file in chunks
or sections, processing each section in memory, and then discarding the processed data before moving onto
the next section. This approach enables the algorithm to handle large datasets that cannot be loaded entirely
into memory.
Limited pass algorithms are often used in situations where the data cannot be stored in main memory,
or when the processing of the data requires significant computational resources. Examples of applications
that use limited pass algorithms include text processing, machine learning, and data mining.
While limited pass algorithms can be useful for processing large datasets with limited memory resources,
they can also be less efficient than algorithms that can process the entire dataset in a single pass. Therefore,
it is important to carefully design the algorithm to ensure that it can capture the relevant information needed
for the analysis, while minimizing the number of passes required to process the data.
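A minimal limited-pass sketch in base R, assuming a line-oriented text file named large_log.txt: the file is read through a connection in fixed-size chunks, each chunk is summarized into running word counts, and the chunk is discarded before the next read.
con <- file("large_log.txt", open = "r")           # illustrative file name
word_counts <- integer(0)
repeat {
  lines <- readLines(con, n = 50000)               # one chunk per iteration
  if (length(lines) == 0) break
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]
  chunk_counts <- table(words)
  old <- word_counts[names(chunk_counts)]          # running totals for these words
  old[is.na(old)] <- 0
  word_counts[names(chunk_counts)] <- old + as.integer(chunk_counts)
}                                                  # the processed chunk is now discarded
close(con)
head(sort(word_counts, decreasing = TRUE), 10)     # ten most frequent words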
6 Counting Frequent Itemsets in a Stream
Counting frequent itemsets in a stream is the problem of finding the most frequent itemsets in a continuous
stream of transactions. This problem is commonly known as the Frequent Itemset Mining problem. Here
is a general approach:
1. Initialize a hash table to store the counts of each itemset. The size of the hash table should be limited
to fit in the available memory.
2. Read the next transaction from the stream.
3. Generate all the possible itemsets from the transaction. This can be done using the Apriori algorithm
or a similar candidate-generation procedure.
4. Increment the counts of the generated itemsets in the hash table.
5. Prune infrequent itemsets from the hash table. An itemset is infrequent if its count is less than a
predefined threshold.
6. Repeat steps 2-5 for each incoming transaction.
7. Output the frequent itemsets that remain in the hash table after processing all the transactions.
The main challenge in counting frequent itemsets in a stream is to keep track of the changing frequencies
of the itemsets as new transactions arrive. This can be done efficiently using the hash table to store the
counts of the itemsets. However, the hash table can become too large if the number of distinct itemsets is
too large. To prevent this, the hash table can be limited in size by using a hash function that maps each
itemset to a fixed number of hash buckets. The size of the hash table can be adjusted dynamically based on
the available memory and the number of distinct itemsets observed in the stream.
Another challenge in counting frequent itemsets in a stream is to choose the threshold for the minimum
count of an itemset to be considered frequent. The threshold should be set high enough to exclude infrequent
itemsets, but low enough to include all the important frequent itemsets. The threshold can be determined
using heuristics or by using machine learning techniques to learn the optimal threshold from the data.
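A hedged sketch of this idea in base R, using an environment as the hash table and counting only 1- and 2-itemsets; the stream contents, pruning period, and threshold are illustrative assumptions:
counts <- new.env(hash = TRUE)
bump <- function(key) {
  old <- if (exists(key, envir = counts)) get(key, envir = counts) else 0L
  assign(key, old + 1L, envir = counts)
}
process_transaction <- function(items) {
  items <- sort(unique(items))
  for (it in items) bump(it)                             # 1-itemsets
  if (length(items) >= 2) {
    for (p in combn(items, 2, simplify = FALSE)) {
      bump(paste(p, collapse = ","))                     # 2-itemsets
    }
  }
}
prune <- function(min_count) {
  for (key in ls(counts)) {
    if (get(key, envir = counts) < min_count) rm(list = key, envir = counts)
  }
}
stream <- list(c("e", "k", "m"), c("e", "k", "o"), c("k", "o", "y"), c("e", "k", "o"))
for (i in seq_along(stream)) {
  process_transaction(stream[[i]])
  if (i %% 2 == 0) prune(min_count = 2)   # prune periodically, not after every transaction
}
mget(ls(counts), envir = counts)          # itemsets still counted as (approximately) frequent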
7 Clustering Techniques
Clustering techniques are used to group similar data points together in a dataset based on their similarity
or distance to one another. Commonly used techniques include the following.
K-Means is a popular clustering algorithm that partitions a dataset into K clusters based on the mean
distance of the data points to their assigned cluster centers. It involves an iterative process of assigning data
points to clusters and updating the cluster centers until convergence. K-Means is commonly used in image
segmentation, customer segmentation, and document clustering.
K-Means clustering is a popular unsupervised machine learning algorithm that partitions a dataset into k
clusters. The algorithm works as follows:
1. Initialize the k cluster centroids, for example by selecting k data points at random.
2. Assign each data point to the nearest cluster centroid based on its distance.
3. Calculate the new cluster centroids based on the mean of all data points assigned to that cluster.
4. Repeat steps 2-3 until the cluster centroids no longer change significantly, or a maximum number of
iterations is reached.
The distance metric used for step 2 is typically the Euclidean distance, but other distance metrics can
be used as well.
The K-Means algorithm aims to minimize the sum of squared distances between each data point and
its assigned cluster centroid. This objective function is known as the within-cluster sum of squares
(WCSS), also called the sum of squared errors (SSE).
To determine the optimal number of clusters, a common approach is to use the elbow method. This
involves plotting the WCSS or SSE against the number of clusters and selecting the number of clusters
at the "elbow" point, where the rate of decrease in WCSS or SSE begins to level off.
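A hedged sketch in base R, using kmeans() on synthetic two-dimensional data (the data and the candidate range of k are illustrative assumptions); tot.withinss is the WCSS used for the elbow plot:
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))
# elbow method: WCSS for k = 1..8
wcss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wcss, type = "b", xlab = "number of clusters k", ylab = "WCSS (SSE)")
fit <- kmeans(x, centers = 3, nstart = 10)   # k chosen at the elbow
table(fit$cluster)                           # cluster sizes
fit$centers                                  # final centroids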
K-Means is a computationally efficient algorithm that can scale to large datasets. It is particularly useful
when the data is high-dimensional and traditional clustering algorithms may be too slow. However, K-Means
requires the number of clusters to be pre-defined and may converge to a suboptimal solution if the initial
cluster centroids are not well chosen. It is also sensitive to non-linear data and may not work well with such
datasets.
Advantages:
Simple to understand and implement: K-Means is easy to understand and implement, making it a popular
choice for clustering tasks.
Computationally efficient: K-Means scales to large datasets. It is particularly useful when the data is
high-dimensional and traditional clustering algorithms may be too slow.
Works well with spherical clusters: K-Means works well with circular or spherical clusters, making it
suitable for datasets that exhibit such shapes.
Provides a clear and interpretable result: K-Means provides a clear and interpretable clustering result,
where each data point is assigned to one of the k clusters.
Disadvantages:
Requires a pre-defined number of clusters: K-Means requires the number of clusters to be pre-defined,
which can be a challenge when the appropriate number is not known in advance.
Sensitive to initial cluster centers: K-Means is sensitive to the initial placement of cluster centers and can
converge to a suboptimal solution.
Can converge to a local minimum: K-Means can converge to a local minimum rather than the globally
optimal clustering solution.
Not suitable for non-linear data: K-Means assumes that the data is linearly separable and may not work
well with non-linear data.
In summary, K-Means is a simple and fast clustering algorithm that works well with circular or spherical
clusters. However, it requires the number of clusters to be pre-defined and may converge to a suboptimal
solution if the initial cluster centers are not well chosen. It is also sensitive to non-linear data and may not
work well with such datasets.
Hierarchical clustering builds a hierarchy of clusters by recursively dividing or merging clusters based on their
similarity. In agglomerative (bottom-up) clustering, each
data point starts in its own cluster, and then pairs of clusters are successively merged until all data points
belong to a single cluster. Divisive clustering starts with all data points in a single cluster and recursively
divides them into smaller clusters. Hierarchical clustering is useful in gene expression analysis, social network
analysis, and image analysis.
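A minimal sketch of agglomerative hierarchical clustering in base R (the synthetic data and the cut into two clusters are illustrative assumptions):
set.seed(2)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
d <- dist(x, method = "euclidean")   # pairwise distances
hc <- hclust(d, method = "average")  # agglomerative merging
plot(hc)                             # dendrogram of the merge hierarchy
cutree(hc, k = 2)                    # flat clustering obtained by cutting the tree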
Density-based clustering identifies clusters based on the density of data points. It assumes that clusters are areas of
higher density separated by areas of lower density. Density-based clustering algorithms, such as DBSCAN
(Density-Based Spatial Clustering of Applications with Noise), group together data points that are closely
packed together and separate outliers. Density-based clustering is commonly used in image processing,
spatial data analysis, and anomaly detection.
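A hedged sketch using the dbscan package (assumed to be installed); eps and minPts are illustrative values that would normally be tuned, for example with a k-nearest-neighbour distance plot:
library(dbscan)
set.seed(4)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(runif(20, min = -1, max = 4), ncol = 2))   # scattered noise points
db <- dbscan(x, eps = 0.5, minPts = 5)
table(db$cluster)   # cluster labels; label 0 marks noise/outlier points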
Gaussian Mixture Model clustering models the distribution of data points using a mixture of Gaussian probability distributions.
Each component of the mixture represents a cluster, and the algorithm estimates the parameters of the
mixture using the Expectation-Maximization algorithm. Gaussian Mixture Models are commonly used in
speech recognition, image segmentation, and density estimation.
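A hedged sketch with the mclust package (assumed to be installed): Mclust fits Gaussian mixtures with the EM algorithm and selects the number of components by BIC; the synthetic data are an illustrative assumption.
library(mclust)
set.seed(6)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
fit <- Mclust(x)            # EM over candidate models, best model chosen by BIC
summary(fit)
table(fit$classification)   # hard cluster labels derived from the mixture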
Spectral clustering converts the data points into a graph and then partitions the graph into clusters based
on the eigenvalues and eigenvectors of the graph Laplacian matrix. Spectral clustering is useful in image
segmentation, social network analysis, and community detection.
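A hedged sketch with kernlab::specc (package assumed installed) on two concentric rings, a shape K-Means handles poorly but spectral clustering can separate; the data are an illustrative assumption.
library(kernlab)
set.seed(7)
theta <- runif(200, 0, 2 * pi)
x <- rbind(cbind(cos(theta[1:100]), sin(theta[1:100])),              # inner ring
           cbind(3 * cos(theta[101:200]), 3 * sin(theta[101:200])))  # outer ring
x <- x + matrix(rnorm(400, sd = 0.1), ncol = 2)                      # add a little noise
sc <- specc(x, centers = 2)   # spectral clustering into two clusters
plot(x, col = sc)             # points coloured by the assigned cluster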
Each clustering technique has its own strengths and weaknesses, and the choice of clustering algorithm
depends on the nature of the data, the clustering objective, and the computational resources available.
8 Clustering High-Dimensional Data
Clustering high-dimensional data is a challenging task because the distance or similarity measures used in
most clustering algorithms become less meaningful in high-dimensional space. Here are some techniques for
clustering high-dimensional data:
8.1 Dimensionality Reduction:
High-dimensional data can be transformed into a lower-dimensional space using dimensionality reduction
techniques, such as Principal Component Analysis (PCA) or t-SNE (t-distributed Stochastic Neighbor Em-
bedding). Dimensionality reduction can help to reduce the curse of dimensionality and make the clustering
more effective.
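A minimal sketch in base R: principal component analysis with prcomp() followed by K-Means in the reduced space (the synthetic 10-dimensional data and the choice of two components are illustrative assumptions):
set.seed(3)
x <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)   # 200 points in 10 dimensions
pca <- prcomp(x, center = TRUE, scale. = TRUE)
summary(pca)                                          # variance explained per component
x_low <- pca$x[, 1:2]                                 # keep the first two components
fit <- kmeans(x_low, centers = 3, nstart = 10)        # cluster in the reduced space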
8.2 Feature Selection:
Not all features in high-dimensional data are equally informative. Feature selection techniques can be used
to identify the most relevant features for clustering and discard the redundant or noisy features. This can
help to improve the clustering accuracy and reduce the computational cost.
8.3 Subspace Clustering:
Subspace clustering is a clustering technique that identifies clusters in subspaces of the high-dimensional
space. This technique assumes that the data points lie in a union of subspaces, each of which represents
a cluster. Subspace clustering algorithms, such as CLIQUE (CLustering In QUEst), identify the subspaces
that contain dense clusters.
8.4 Density-Based Clustering:
Density-based clustering algorithms, such as DBSCAN, can be used for clustering high-dimensional data by
defining the density of data points in each dimension. The clustering algorithm identifies regions of high
density and treats them as clusters.
8.5 Ensemble Clustering:
Ensemble clustering combines multiple clustering algorithms or different parameter settings of the same
algorithm to improve the clustering performance. Ensemble clustering can help to reduce the sensitivity of
the results to the choice of algorithm and parameter settings.
8.6 Deep Learning-Based Clustering:
Deep learning-based clustering techniques, such as Deep Embedded Clustering (DEC) and Autoencoder-based
Clustering (AE-Clustering), use neural networks to learn a low-dimensional representation of high-dimensional
data and cluster the data in the reduced space. These techniques have shown promising results in
clustering high-dimensional data in various domains, including image analysis and gene expression analysis.
Clustering high-dimensional data requires careful consideration of the choice of clustering algorithm,
feature selection or dimensionality reduction technique, and parameter settings. A combination of different
techniques may be needed to obtain good clustering results.
CLIQUE (CLustering In QUEst) and ProCLUS are two popular subspace clustering algorithms for high-
dimensional data.
CLIQUE is a density-based algorithm that works by identifying dense subspaces in the data. It assumes
that clusters exist in subspaces of the data that are dense in at least k dimensions, where k is a user-defined
parameter. The algorithm identifies all possible dense subspaces by enumerating all combinations of k
dimensions and checking if the corresponding subspaces are dense. It then merges the overlapping subspaces
to form clusters. CLIQUE is efficient for high-dimensional data because it only considers a small number of
dimensions at a time.
ProCLUS (PROjective CLUSters) is a subspace clustering algorithm that works by identifying clusters
in a low-dimensional projection of the data. It first selects a random projection matrix and projects the data
onto a lower-dimensional space. It then uses K-Means clustering to cluster the projected data. The algorithm
iteratively refines the projection matrix and re-clusters the data until convergence. The final clusters are
projected back to the original high-dimensional space. ProCLUS is effective for high-dimensional data
because it reduces the dimensionality of the data while preserving the clustering structure.
Both CLIQUE and ProCLUS are designed to handle high-dimensional data by identifying clusters in
subspaces of the data. They are effective for clustering data that have a natural subspace structure. However,
they may not work well for data that do not have a clear subspace structure or when the data points are
widely spread out in the high-dimensional space. It is important to carefully choose the appropriate algorithm
based on the characteristics of the data.
9 Frequent Pattern-Based Clustering Methods
Frequent pattern-based clustering methods combine frequent pattern mining with clustering techniques to
identify clusters based on frequent patterns in the data. Here are some examples of frequent pattern-based
clustering methods:
1. Frequent Pattern-based Clustering: is a clustering algorithm that uses frequent pattern mining to
identify clusters in transactional data. The algorithm first identifies frequent itemsets in the data
using Apriori or FP-Growth algorithms. It then constructs a graph where each frequent itemset is a
node, and the edges represent the overlap between the itemsets. The graph is partitioned into clusters
using a graph clustering algorithm. The resulting clusters are then used to assign objects to clusters
based on the frequent itemsets they contain.
2. Frequent Pattern-based Clustering Method: is a clustering algorithm that uses frequent pattern mining
to identify clusters in high-dimensional data. The algorithm first discretizes the continuous data into
categorical data. It then uses Apriori or FP-Growth algorithms to identify frequent itemsets in the
categorical data. The frequent itemsets are used to construct a binary matrix that represents the
membership of objects in the frequent itemsets. The binary matrix is clustered using a standard
clustering algorithm, such as K-Means or Hierarchical clustering. The resulting clusters are then used
to assign the original objects to clusters.
3. Clustering based on Frequent Pattern Combination: is a clustering algorithm that combines frequent
pattern mining with pattern combination techniques to identify clusters in transactional data. The
algorithm first identifies frequent itemsets in the data using Apriori or FP-Growth algorithms. It
then uses pattern combination techniques, such as Minimum Description Length (MDL) or Bayesian
Information Criterion (BIC), to generate composite patterns from the frequent itemsets. The composite
patterns are then used to construct a graph, which is partitioned into clusters using a graph clustering
algorithm.
Frequent pattern-based clustering methods are effective for identifying clusters based on frequent patterns
in the data. They can be applied to a wide range of data types, including transactional data and high-
dimensional data. However, these methods may suffer from the curse of dimensionality when applied to
high-dimensional data. It is important to carefully select the appropriate frequent pattern mining and
clustering techniques based on the characteristics of the data and the clustering objectives.
10 Clustering in non-Euclidean space
Clustering in non-Euclidean space refers to the clustering of data points that are not represented in the
Euclidean space, such as graphs, time series, or text data. Traditional clustering algorithms, such as K-
Means and Hierarchical clustering, assume that the data points are represented in the Euclidean space and
use distance metrics, such as Euclidean distance or cosine similarity, to measure the similarity between data
points. However, in non-Euclidean spaces, the notion of distance is different, and distance-based clustering
algorithms may not be directly applicable. The following approaches can be used instead:
1. Spectral clustering: Spectral clustering is a popular clustering algorithm that can be applied to data
represented in non-Euclidean spaces, such as graphs or time series. It uses the eigenvalues and eigen-
vectors of the Laplacian matrix of the data to identify clusters. Spectral clustering converts the data
points into a graph representation and then computes the Laplacian matrix of the graph. The eigen-
vectors of the Laplacian matrix are used to embed the data points into a lower-dimensional space,
where clustering is performed using a standard clustering algorithm, such as K-Means or Hierarchical
clustering.
2. DBSCAN: DBSCAN is a density-based clustering algorithm
that can be applied to data represented in non-Euclidean spaces. It does not rely on a distance
metric and can cluster data points based on their density. DBSCAN identifies clusters by defining two
parameters: the minimum number of points required to form a cluster and a radius that determines
the neighborhood of a point. DBSCAN labels each point as either a core point, a border point, or a
noise point, based on its neighborhood. The core points are used to form clusters.
3. Topic modeling: Topic modeling is a clustering method that can be applied to text data, which is
typically represented in a non-Euclidean space. Topic modeling identifies latent topics in the text data
by analyzing the co-occurrence of words. It represents each document as a distribution over topics,
and each topic as a distribution over words. The resulting topic distribution of each document can be
used to group documents with similar topics into clusters.
Clustering in non-Euclidean spaces requires careful consideration of the appropriate algorithms and tech-
niques that are suitable for the specific data type. Spectral clustering and DBSCAN are effective for clustering
data represented as graphs or time series, while topic modeling is suitable for text data. Other approaches,
such as manifold learning and kernel methods, can also be used for clustering in non-Euclidean spaces.
11 Clustering for Streams and Parallelism
Clustering for streams and parallelism are two important considerations for clustering large datasets. Stream
data refers to data that arrives continuously and in real-time, while parallelism refers to the ability to
process data using multiple computing resources simultaneously.
1. Online clustering: Online clustering is a technique that can be applied to streaming data. It updates
the clustering model continuously as new data arrives. Online clustering algorithms, such as BIRCH
and CluStream, are designed to handle data streams and can scale to large datasets. These algo-
rithms incrementally update the cluster model as new data arrives and discard outdated data points
so that memory usage stays bounded.
2. Parallel clustering: Parallel clustering refers to the use of multiple computing resources, such as multiple
processors or computing clusters, to speed up the clustering process. Parallel clustering algorithms,
such as K-Means Parallel, Hierarchical Parallel, and DBSCAN Parallel, distribute the clustering task
across multiple computing resources. These algorithms partition the data into smaller subsets and
assign each subset to a separate computing resource. The resulting clusters are then merged to produce
the final clustering result (a minimal sketch of this idea in R follows after this list).
3. Distributed clustering: Distributed clustering refers to the use of multiple computing resources that
are distributed across different physical locations, such as different data centers or cloud resources.
Distributed clustering algorithms, such as MapReduce and Hadoop, distribute the clustering task
across multiple computing resources and handle data that is too large to fit into a single computing
resource’s memory. These algorithms partition the data into smaller subsets and assign each subset to
a separate computing resource. The resulting clusters are then merged to produce the final clustering
result.
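A hedged sketch of the partition-cluster-merge idea using the base parallel package: each worker clusters one partition of the data, the local centroids are merged by a second K-Means, and every point is then assigned to the nearest global centroid. The data sizes, k, and the number of workers are illustrative assumptions.
library(parallel)
set.seed(5)
x <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
           matrix(rnorm(2000, mean = 6), ncol = 2))
k <- 2
parts <- split.data.frame(x, rep(1:4, length.out = nrow(x)))   # 4 row-wise partitions
cl <- makeCluster(2)                                           # 2 worker processes
local_centers <- parLapply(cl, parts,
                           function(p, k) kmeans(p, centers = k, nstart = 5)$centers,
                           k = k)
stopCluster(cl)
# merge step: cluster the collected local centroids into k global centroids
global_centers <- kmeans(do.call(rbind, local_centers), centers = k)$centers
# assign every point to its nearest global centroid
assignments <- apply(x, 1, function(row) which.min(colSums((t(global_centers) - row)^2)))
table(assignments)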
Clustering for streams and parallelism requires careful consideration of the appropriate algorithms and
techniques that are suitable for the specific clustering objectives and data types. Online clustering is effective
for clustering streaming data, while parallel clustering and distributed clustering can speed up the clustering
of large datasets.
Q1: Write R function to check whether the given number is prime or not.
is_prime <- function(num) {
  flag <- 0
  if (num > 1) {
    flag <- 1
    for (i in 2:(num - 1)) {
      if ((num %% i) == 0) {   # found a divisor, so num is not prime
        flag <- 0
        break
      }
    }
  }
  if (num == 2) flag <- 1      # the loop above wrongly flags 2, so correct it here
  if (flag == 1) {
    print(paste(num, "is a prime number"))
  } else {
    print(paste(num, "is not a prime number"))
  }
}
is_prime(13)   # "13 is a prime number"
Apriori algorithm: The Apriori algorithm solves the frequent itemsets problem. The algorithm analyzes
a data set to determine which combinations of items occur together frequently. The Apriori algorithm
is at the core of various algorithms for data mining problems. The best known problem is finding the
association rules that hold in a transaction database.
Numerical:
Given: a transaction database over the items A, C, D, E, I, K, M, N, O, U, Y, with minimum support count = 3.
ITERATION 1
STEP 1 (C1) — candidate 1-itemsets and their counts:
A 1, C 2, D 1, E 4, I 1, K 5, M 3, N 2, O 3, U 1, Y 3
STEP 2 (L1) — frequent 1-itemsets (count >= 3):
E 4, K 5, M 3, O 3, Y 3
ITERATION 2
STEP 3 (C2) — candidate 2-itemsets and their counts:
{E, K} 4, {E, M} 2, {E, O} 3, {E, Y} 2, {K, M} 3, {K, O} 3, {K, Y} 3, {M, O} 1, {M, Y} 2, {O, Y} 2
STEP 4 (L2) — frequent 2-itemsets:
{E, K} 4, {E, O} 3, {K, M} 3, {K, O} 3, {K, Y} 3
ITERATION 3
STEP 5 (C3) — candidate 3-itemsets and their counts:
{E, K, O} 3, {K, M, O} 1, {K, M, Y} 2
STEP 6 (L3) — frequent 3-itemsets:
{E, K, O} 3
ASSOCIATION RULES (from L3 = {E, K, O}, support count 3):
1. [E, K] → O = 3/4 = 75%
2. [E, O] → K = 3/3 = 100%
3. [K, O] → E = 3/3 = 100%
4. E → [K, O] = 3/4 = 75%
5. K → [E, O] = 3/5 = 60%
6. O → [E, K] = 3/3 = 100%
Example: Basket Data Analysis
• Transaction database
D= {{butter, bread, milk, sugar};
{butter, flour, milk, sugar};
{butter, eggs, milk, salt};
{eggs};
{butter, flour, milk, salt, sugar}}
1) Introduction
– Transaction databases, market basket data analysis
2) Mining Frequent Itemsets
– Apriori algorithm, hash trees, FP-tree
3) Simple Association Rules
– Basic notions, rule generation, interestingness measures
4) Further Topics
– Hierarchical Association Rules
• Motivation, notions, algorithms, interestingness
– Quantitative Association Rules
• Motivation, basic idea, partitioning numerical attributes, adaptation of
apriori algorithm, interestingness
5) Extensions and Summary
• Naïve Algorithm
– count the frequency of all possible subsets of 𝐼 in the database
– too expensive, since there are 2^m such itemsets for |𝐼| = 𝑚 items (the cardinality of the power set)
• The Apriori principle (anti-monotonicity):
Any non-empty subset of a frequent itemset is frequent, too!
A ⊆ I with support(A) ≥ minSup ⇒ ∀A′ ⊂ A ∧ A′ ≠ ∅: support(A′) ≥ minSup
Any superset of a non-frequent itemset is non-frequent, too!
A ⊆ I with support(A) < minSup ⇒ ∀A′ ⊃ A: support(A′) < minSup
• Method based on the Apriori principle
– First count the 1-itemsets, then the 2-itemsets, then the 3-itemsets, and so on
– When counting (k+1)-itemsets, only consider those (k+1)-itemsets where all subsets of length k have been determined as frequent in the previous step
[Figure: itemset lattice from Ø over A, B, C, D up to ABCD, with ABCD marked as not frequent]
The Apriori Algorithm
L1 = {frequent items}
for (k = 1; Lk ≠ ∅; k++) do begin
  // JOIN STEP: join Lk with itself to produce Ck+1
  // PRUNE STEP: discard (k+1)-itemsets from Ck+1 that contain non-frequent k-itemsets as subsets
  Ck+1 = candidates generated from Lk
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 that are contained in t
  Lk+1 = candidates in Ck+1 with count ≥ minSup
end
return the union of all Lk
Hash-Tree – Counting
• Search all candidate itemsets contained in a transaction T = (t1 t2 ... tn) for a
current itemset length of k
• At the root
– Determine the hash values for each item t1 t2 ... tn-k+1 in T
– Continue the search in the resulting child nodes
• At an internal node at level d (reached after hashing of item 𝑡𝑖)
– Determine the hash values and continue the search for each item 𝑡𝑗 with 𝑖 < 𝑗 ≤ 𝑛 − 𝑘 + 𝑑
• At a leaf node
– Check whether the itemsets in the leaf node are contained in transaction T
[Figure: hash-tree example with n = 5, k = 3, hash function h(K) = K mod 3, and transaction (1, 3, 7, 9, 12); the tested leaf nodes and the pruned subtrees are highlighted.]
FP-Growth: Mining Frequent Itemsets Without Candidate Generation
• Idea:
– Compress database into FP-tree, retaining the itemset association
information
– Divide the compressed database into conditional databases, each associated
with one frequent item and mine each such database separately.
Example (minSup = 0.5), steps 1 & 2: build the header table with items sorted in descending order of support: f:4, c:4, a:3, b:3, m:3, p:3.
[Figure: construction of the FP-tree from the transaction database DB]
Benefits of the FP-tree Structure
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be
shared
– never larger than the original database (not counting node-links and counts)
– Experiments demonstrate compression ratios over 100
Conditional Pattern Bases
• Node-link property
– For any frequent item ai, all the possible frequent patterns that contain ai
can be obtained by following ai's node-links, starting from ai's head in the
FP-tree header
• Prefix path property
– To calculate the frequent patterns for a node ai in a path P, only the prefix
sub-path of ai in P needs to be accumulated, and its frequency count should
carry the same count as node ai.
Conditional FP-tree
[Figure: example of a conditional FP-tree]
Major Steps to Mine FP-tree
Example: the m-conditional FP-tree {}|m is just a single path f:3 – c:3 – a:3. All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.
FP-tree: Full Example
Database:
TID   items bought     (ordered) frequent items
100   {b, c, f}        {f, b, c}
200   {a, b, c}        {b, c}
300   {d, f}           {f}
400   {b, c, e, f}     {f, b, c}
500   {f, g}           {f}
[Figure: the FP-tree built from the ordered frequent items]
Conditional pattern bases:
item   conditional pattern base
f      {}
b      f:2, {}
c      fb:2, b:1
Performance: Apriori vs. FP-Growth
[Figure: run time in seconds versus support threshold (%) on dataset D1, comparing the runtimes of FP-growth, Apriori, and tree-projection.]
Simple Association Rules: Introduction
• Transaction database:
D= {{butter, bread, milk, sugar};
{butter, flour, milk, sugar};
{butter, eggs, milk, salt};
{eggs};
{butter, flour, milk, salt, sugar}}
• Frequent itemsets (items with their support):
{butter} 4
{milk} 4
{butter, milk} 4
{sugar} 3
{butter, sugar} 3
{milk, sugar} 3
{butter, milk, sugar} 3
• Question of interest:
– If milk and sugar are bought, will the customer always buy butter as well?
𝑚𝑖𝑙𝑘, 𝑠𝑢𝑔𝑎𝑟 ⇒ 𝑏𝑢𝑡𝑡𝑒𝑟 ?
– In this case, what would be the probability of buying butter?
Interestingness of Association Rules
rule candidates: A ⇒ 𝐵; 𝐵 ⇒ 𝐴; A ⇒ 𝐶; 𝐶 ⇒ A; 𝐵 ⇒ 𝐶; C ⇒ 𝐵;
𝐴, 𝐵 ⇒ 𝐶; 𝐴, 𝐶 ⇒ 𝐵; 𝐶, 𝐵 ⇒ 𝐴; 𝐴 ⇒ 𝐵, 𝐶; 𝐵 ⇒ 𝐴, 𝐶; 𝐶 ⇒ 𝐴, 𝐵
• Objective measures
– Two popular measurements:
– support and
– confidence
Criticism to Support and Confidence
• Example 2:
X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1
• X and Y: positively correlated
• X and Z: negatively related
• support and confidence of X ⇒ Z dominate
• but items X and Z are negatively correlated
• Items X and Y are positively correlated
Hierarchical Association Rules: Motivation
[Figure: item hierarchy, e.g. clothes above outerwear, and outerwear above jackets and jeans]
• Examples
– "jeans ⇒ boots", "jackets ⇒ boots": support < minSup
• Characteristics
– Support("outerwear ⇒ boots") is not necessarily equal to the sum
support("jackets ⇒ boots") + support("jeans ⇒ boots"),
e.g. if a transaction with jackets, jeans and boots exists
– Support for sets of generalizations (e.g., product groups) is higher
than support for sets of individual items:
if the support of rule "outerwear ⇒ boots" exceeds minSup, then the
support of rule "clothes ⇒ boots" does, too
Interestingness of Hierarchical Association Rules: Filtering
Let X, X′, Y, Y′ ⊆ I be itemsets.
• An itemset X′ is an ancestor of X iff there exist ancestors x1′, …, xk′ of x1, …, xk ∈ X and items xk+1, …, xn with n = |X| such that X′ = {x1′, …, xk′, xk+1, …, xn}.
• Let X′ and Y′ be ancestors of X and Y. Then we call the rules X′ ⇒ Y′, X ⇒ Y′, and X′ ⇒ Y ancestors of the rule X ⇒ Y.
• The rule X′ ⇒ Y′ is a direct ancestor of rule X ⇒ Y in a set of rules if:
– Rule X′ ⇒ Y′ is an ancestor of rule X ⇒ Y, and
– There is no rule X″ ⇒ Y″ such that X″ ⇒ Y″ is an ancestor of X ⇒ Y and X′ ⇒ Y′ is an ancestor of X″ ⇒ Y″.
• A hierarchical association rule X ⇒ Y is called R-interesting if:
– There are no direct ancestors of X ⇒ Y, or
– The actual support is larger than R times the expected support, or
– The actual confidence is larger than R times the expected confidence.
Interestingness of Hierarchical Association Rules: Example
• Example with R = 1.6:
No.  rule                 support  R-interesting?
1    clothes ⇒ shoes      10       yes: no ancestors
2    outerwear ⇒ shoes    9        yes: support > R · expected support (wrt. rule 1) = 1.6 · (10/20 · 10) = 8
3    jackets ⇒ shoes      4        not wrt. support:
                                   support > R · expected support (wrt. rule 1) = 3.2, but
                                   support < R · expected support (wrt. rule 2) = 5.75
                                   ⇒ still need to check the confidence!
Multi-Dimensional Association: Concepts
• Single-dimensional rules:
– buys milk ⇒ buys bread
Quantitative Association Rules
• Static discretization
– Discretization of all attributes before mining the association rules
– E.g. by using a generalization hierarchy for each attribute
– Substitute numerical attribute values by ranges or intervals
• Dynamic discretization
– Discretization of the attributes during association rule mining
– Goal (e.g.): maximization of confidence
– Unification of neighboring association rules to a generalized rule
Partitioning of Numerical Attributes
• Solution
– First, partition the domain into many intervals
– Afterwards, create new intervals by merging adjacent intervals