Unit 4 - Part 1
• A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it
occurs frequently in a shopping history database, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices,
which may be combined with itemsets or subsequences.
• Finding such frequent patterns plays an essential role in mining associations, correlations, and many
other interesting relationships among data.
• Applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click
stream) analysis, and DNA sequence analysis.
• Frequent itemset mining leads to the discovery of associations and correlations among items in
large transactional or relational data sets, which help in many business decision-making processes,
such as customer shopping behavior analysis.
[Figure: itemset lattice, levels 1–5. Given I items, there are 2^I − 1 candidate itemsets.]
Frequent Itemset Identification: Brute-Force Approach
• Brute-force approach:
• Set up a counter for each itemset in the lattice
• Scan the database once, for each transaction T,
• check for each itemset S whether T⊇ S
• if yes, increase the counter of S by 1
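• A minimal sketch of this brute-force counting in Python; the transaction data and the minimum support count are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical transaction database; each transaction is a set of items.
transactions = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}]
items = sorted(set().union(*transactions))

# Set up a counter for every itemset in the lattice (2^I - 1 of them).
counters = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        counters[frozenset(candidate)] = 0

# Scan the database once: for each transaction T and itemset S, check T ⊇ S.
for T in transactions:
    for S in counters:
        if T >= S:              # T is a superset of S
            counters[S] += 1

min_support_count = 2           # hypothetical threshold
frequent = {S: c for S, c in counters.items() if c >= min_support_count}
print(frequent)
```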
[Lattice figure: once an itemset is found to be infrequent, all of its supersets are pruned.]
An Example
For rule A ⇒ C:
support = support({A ∪ C}) = 50%
confidence = support({A ∪ C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
•Calculate Support for Single Items:
•Support({A}): Appears in 3/4 transactions = 75%
•Support({B}): Appears in 2/4 transactions = 50%
•Support({C}): Appears in 2/4 transactions = 50%
•Support({D}): Appears in 1/4 transactions = 25%
•Support({E}): Appears in 1/4 transactions = 25%
•Support({F}): Appears in 1/4 transactions = 25%
Only {A}, {B}, and {C} meet the minimum support of 50%.
•Generate Candidate Pairs and Calculate Support:
•Support({A, B}): Appears in 1/4 transactions = 25% (does not meet the support threshold)
•Support({A, C}): Appears in 2/4 transactions = 50% (meets the support threshold)
•Support({A, D}): Appears in 1/4 transactions = 25% (does not meet the support threshold)
•Support({B, C}): Appears in 1/4 transactions = 25% (does not meet the support threshold)
Only {A, C} meets the minimum support of 50%.
1.Frequent Itemsets:
•Frequent Single Items: {A}, {B}, {C}
•Frequent Pair: {A, C}
2.Generate Association Rules and Calculate Confidence:
•Rule A ⇒ C:
•Support({A, C}) = 50%
•Confidence = Support({A, C}) / Support({A}) = 50% / 75% = 66.6% (meets the confidence threshold)
Thus, the only frequent itemsets are {A}, {B}, {C}, and {A, C}, and the rule A ⇒ C has sufficient support and confidence.
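This computation can be reproduced with a short script. The four transactions below are hypothetical (the slide only gives the support percentages), but they are chosen to be consistent with the stated values.

```python
from itertools import combinations

# Hypothetical 4-transaction database consistent with the supports above:
# A: 3/4, B: 2/4, C: 2/4, D/E/F: 1/4, {A,C}: 2/4, {A,B}/{A,D}/{B,C}: 1/4.
D = [{"A", "B", "C"}, {"A", "C", "D"}, {"A", "E"}, {"B", "F"}]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in D if itemset <= t) / len(D)

min_sup = 0.5

# Frequent single items.
items = sorted(set().union(*D))
L1 = [i for i in items if support({i}) >= min_sup]              # ['A', 'B', 'C']

# Candidate pairs from L1 and the frequent pairs among them.
L2 = [p for p in combinations(L1, 2) if support(p) >= min_sup]  # [('A', 'C')]

# Rule A => C.
conf = support({"A", "C"}) / support({"A"})                     # 0.5 / 0.75 = 0.666...
print(L1, L2, round(conf, 3))
```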
Finding frequent itemsets using the Apriori
Algorithm: Example
▪ In the first iteration of the algorithm, each item is a member of the set of candidate
1-itemsets, C1, along with its support count.
▪ The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying
minimum support.
Step 2: Generating candidate and frequent 2-itemsets with min.
support = 2
• Generate C2 candidates by joining L1 with itself (L1 ⋈ L1): {I1, I2}, {I1, I3}, {I1, I4}, …
• Scan D for the count of each candidate, e.g., {I1, I2}: 4, {I1, I3}: 4, …
• Compare each candidate's support count with the minimum support count to obtain L2, e.g., {I1, I2}: 4, {I1, I3}: 4, …
Step 3: Generating candidate and frequent
3-itemsets with min. support = 2
• Generate C3 candidates from L2 ⋈ L2: {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}.
• The last four candidates contain non-frequent 2-itemset subsets and are pruned, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}.
• Scan D for the count of each remaining candidate: {I1, I2, I3}: 2, {I1, I2, I5}: 2.
• Compare each candidate's support count with the minimum support count: L3 = {{I1, I2, I3}, {I1, I2, I5}}, each with count 2.
▪ The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori
property.
▪ Once the join step is complete, the prune step is used to reduce the size of C3.
The prune step helps to avoid heavy computation due to a large Ck.
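• A sketch of the join and prune steps just described, as a generic candidate-generation helper (not code from the lecture); L2 below is inferred from the C3 candidates listed above, and itemsets are represented as sorted tuples.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    Join step: merge two (k-1)-itemsets that share their first k-2 items.
    Prune step: drop candidates with any infrequent (k-1)-subset (Apriori property).
    """
    prev = set(L_prev)
    candidates = set()
    for a, b in combinations(sorted(prev), 2):
        if a[:k - 2] == b[:k - 2]:                       # join condition
            candidates.add(tuple(sorted(set(a) | set(b))))
    # Prune: every (k-1)-subset of a surviving candidate must be frequent.
    return sorted(c for c in candidates
                  if all(s in prev for s in combinations(c, k - 1)))

# L2 from the example above, used to generate C3.
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2, 3))   # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')] after pruning
```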
Step 4: Generating frequent 4-itemset
• L3 ⋈ L3 gives C4 = {{I1, I2, I3, I5}}.
• This itemset is pruned since its subset {I2, I3, I5} is not frequent, so C4 = ∅ and the
algorithm terminates.
• Back to the example:
• Let l = {I1, I2, I5}.
• The nonempty proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Step 5: Generating Association Rules from frequent k-itemsets
[Cont.]
• R4: I1 ⇒ I2 ∧ I5
• Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%
• R4 is rejected.
• R5: I2 ⇒ I1 ∧ I5
• Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%
• R5 is rejected.
• R6: I5 ⇒ I1 ∧ I2
• Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%
• R6 is selected.
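• The confidences above can be checked with a few lines of Python. The support counts are the ones used in the example (sc{I1} = 6, sc{I2} = 7, sc{I5} = 2, sc{I1, I2, I5} = 2); the 70% confidence threshold is an assumed value.

```python
# Support counts taken from the worked example above.
sc = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

min_conf = 0.70                       # assumed confidence threshold
l = frozenset({"I1", "I2", "I5"})

# Rules with a single item on the left-hand side (R4, R5, R6 above).
for item in sorted(l):
    antecedent = frozenset({item})
    conf = sc[l] / sc[antecedent]
    verdict = "selected" if conf >= min_conf else "rejected"
    print(f"{item} => {sorted(l - antecedent)}: confidence = {conf:.0%} ({verdict})")
```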
• Even when you cure the memory issues, you may need to deal with another limited resource: time. Although a computer
may live for millions of years, in reality you won't (unless you go into cryostasis until your PC is done). Certain
algorithms don't take time into account; they'll keep running forever. Other algorithms can't finish in a reasonable amount of
time even when they need to process only a few megabytes of data.
• A third thing you’ll observe when dealing with large data sets is that components of your computer can start to form a
bottleneck while leaving other systems idle. Although this isn’t as severe as a never-ending algorithm or out-of-memory
errors, it still incurs a serious cost. Think of the cost, in person-days and computing infrastructure, of CPU
starvation. Certain programs don't feed data fast enough to the processor because they have to read data from the hard
drive, which is one of the slowest components on a computer. This has been addressed with the introduction of solid state
drives (SSD), but SSDs are still much more expensive than the slower and more widespread hard disk drive (HDD)
technology.
General techniques for handling large
volumes of data
• Never-ending algorithms, out-of-memory errors, and speed
issues are the most common challenges you face when working
with large data. In this section, we’ll investigate solutions to
overcome or alleviate these problems.
• The solutions can be divided into three categories: using the
correct algorithms, choosing the right data structure, and using
the right tools
• No clear one-to-one mapping exists between the problems and solutions
because many solutions address both lack of memory and computational
performance.
• For instance, data set compression will help you solve memory issues because
the data set becomes smaller.
• But this also affects computation speed with a shift from the slow hard disk to
the fast CPU.
• Contrary to RAM (random access memory), the hard disk will store everything
even after the power goes down, but writing to disk costs more time than
changing information in the volatile RAM.
• When constantly changing the information, RAM is thus preferable over the
(more durable) hard disk.
• With an unpacked data set, numerous read and write operations (I/O) are
occurring, but the CPU remains largely idle, whereas with the compressed data
set the CPU gets its fair share of the workload.
CHOOSING THE RIGHT ALGORITHM
• Choosing the right algorithm can solve more problems than adding
more or better hardware.
• An algorithm that’s well suited for handling large data doesn’t need to
load the entire data set into memory to make predictions.
• Ideally, the algorithm also supports parallelized calculations.
• Three types of algorithms that can do that: online algorithms, block
algorithms, and MapReduce algorithms
Online learning algorithms
• Several, but not all, machine learning algorithms can be trained using one
observation at a time instead of taking all the data into memory.
• Upon the arrival of a new data point, the model is trained and the observation
can be forgotten; its effect is now incorporated into the model’s parameters.
• For example, a model used to predict the weather can use different parameters
(like atmospheric pressure or temperature) in different regions.
• When the data from one region is loaded into the algorithm, it forgets about
this raw data and moves on to the next region.
• This “use and forget” way of working is the perfect solution for the memory
problem as a single observation is unlikely to ever be big enough to fill up all
the memory of a modern-day computer.
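• A minimal sketch of this "use and forget" idea: an online linear model updated with stochastic gradient descent, one observation at a time. The data stream and learning rate below are made up for illustration.

```python
import random

# Hypothetical data stream: y = 3*x + 1 plus a little noise.
random.seed(0)
def stream(n):
    for _ in range(n):
        x = random.uniform(0, 10)
        yield x, 3.0 * x + 1.0 + random.gauss(0, 0.1)

w, b = 0.0, 0.0            # model parameters kept in memory
lr = 0.01                  # learning rate (assumed)

# Each observation updates the parameters and is then forgotten.
for x, y in stream(10_000):
    error = (w * x + b) - y
    w -= lr * error * x
    b -= lr * error

print(round(w, 2), round(b, 2))   # close to 3 and 1
```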
Main Memory
Handling Larger Datasets in Main Memory
• The A-Priori Algorithm is fine as long as the step with the greatest
requirement for main memory – typically the counting of the
candidate pairs C2– has enough memory that it can be accomplished
without thrashing (repeated moving of data between disk and main
memory). Several algorithms have been proposed to cut down on the
size of candidate set C2.
• Here, we consider the PCY Algorithm, which takes advantage of the
fact that in the first pass of A-Priori there is typically lots of main
memory not needed for the counting of single items.
• Then we look at the Multistage Algorithm, which uses the PCY trick
and also inserts extra passes to further reduce the size of C2
PCY (Park-Chen-Yu) Algorithm
• The PCY (Park-Chen-Yu) algorithm is a method used in data
analytics to identify frequent itemsets within large datasets efficiently.
• It is particularly useful in market basket analysis, where it helps in
discovering combinations of items frequently purchased together,
such as shirts and jeans.
• The algorithm improves performance by using a two-phase
approach:
• first, it hashes pairs of items to count their occurrences and uses a hash table
to reduce the number of candidate pairs, and
• Second, it scans the dataset again to determine the actual frequent itemsets.
• If there are a million items and gigabytes of main memory, counting the single items
in the first pass needs no more than about 10% of the main memory.
• The PCY algorithm uses hashing to efficiently count item set
frequencies and reduce overall computational cost.
• The basic idea is to use a hash function to map itemsets to
hash buckets, followed by a hash table to count the frequency
of itemsets in each bucket.
• Problem:Apply the PCY algorithm on the following transaction to find the candidate sets
(frequent sets) with threshold minimum value as 3 and Hash function as (i*j) mod 10.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}
• Step 1: Find the frequency of each element and remove the length-1 candidate sets
whose frequency is below the threshold.
Step 2: Transaction by transaction, create all the possible pairs and write their
frequency next to them. Note: pairs should not be repeated; skip any pair that has
already been listed.
Step 3: List all pairs whose frequency is greater than or equal to the threshold and
then apply the hash function to each (it gives the bucket number); the bucket number
defines in which bucket that particular pair will be put.
• Step 4: This is the last step; in it we create a table with the following details:
• Bit vector: 1 if the frequency of the candidate pair is greater than or equal to the
threshold, otherwise 0 (here mostly 1, since only frequent pairs were hashed).
• Bucket number: found in the previous step.
• Candidate set: if the bit vector is 1, the pair is kept as a candidate.
• Step 1: Find the frequency of each element and remove the length-1 candidate sets
whose frequency is below the threshold.
• Step 2: Transaction by transaction, create all the possible pairs and write their
frequency next to them.
• Step 3: List all pairs whose frequency is greater than or equal to the threshold
and then apply the hash function (it gives us the bucket number).
• Hash Function = ( i * j) mod 10
• (1, 3) = (1*3) mod 10 = 3
• (2,3) = (2*3) mod 10 = 6
• (2,4) = (2*4) mod 10 = 8
• (3,4) = (3*4) mod 10 = 2
• (3,5) = (3*5) mod 10 = 5
• (4,5) = (4*5) mod 10 = 0
• (4,6) = (4*6) mod 10 = 4
• Step 4: Prepare candidate set
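• The bucket counts for this example can be checked with a short script. The hash function (i*j) mod 10, the threshold of 3, and the transactions T1–T12 come from the problem statement; the code itself is a sketch of the standard two-pass PCY structure (bucket counts and bitmap in pass 1, filtered pair counting in pass 2), which the step-by-step table above simplifies slightly.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
    {1, 3, 4}, {2, 4, 5}, {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]
threshold = 3
n_buckets = 10

# Pass 1: count single items and hash every pair into a bucket.
item_counts = Counter()
bucket_counts = Counter()
for t in transactions:
    item_counts.update(t)
    for i, j in combinations(sorted(t), 2):
        bucket_counts[(i * j) % n_buckets] += 1

frequent_items = {i for i, c in item_counts.items() if c >= threshold}
bitmap = {b for b, c in bucket_counts.items() if c >= threshold}   # frequent buckets

# Pass 2: a pair is counted only if both items are frequent and its bucket bit is 1.
pair_counts = Counter()
for t in transactions:
    for i, j in combinations(sorted(t), 2):
        if i in frequent_items and j in frequent_items and (i * j) % n_buckets in bitmap:
            pair_counts[(i, j)] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= threshold}
print(frequent_pairs)
# Expected: {(1, 3): 3, (2, 3): 3, (2, 4): 5, (3, 4): 4, (3, 5): 3, (4, 5): 3, (4, 6): 4}
```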
The SON Algorithm and Map – Reduce
• The SON algorithm works well in a parallel-computing environment.
• Each of the chunks can be processed in parallel, and the
frequent itemsets from each chunk combined to form the
candidates.
• We can distribute the candidates to many processors, have
each processor count the support for each candidate in a
subset of the baskets, and finally sum those supports to get the
support for each candidate itemset in the whole dataset.
• There is a natural way of expressing each of the two passes as
a MapReduce operation
MapReduce-MapReduce sequence
• First Map function :-
• Take the assigned subset of the baskets and find the itemsets frequent in the
subset using the simple and randomized algorithm.
• Lower the support threshold from s to ps if each Map task gets fraction p of
the total input file.
• The output is a set of key-value pairs (F, 1), where F is a frequent itemset
from the sample.
• First Reduce Function :-
• Each Reduce task is assigned a set of keys, which are itemsets.
• The value is ignored, and the Reduce task simply produces those keys
(itemsets) that appear one or more times.
• Thus, the output of the first Reduce function is the candidate itemsets.
• Second Map function :-
• The Map tasks for the second Map function take all the output from the first Reduce
Function (the candidate itemsets) and a portion of the input data file.
• Each Map task counts the number of occurrences of each of the candidate itemsets
among the baskets in the portion of the dataset that it was assigned.
• The output is a set of key-value pairs (C, v), where C is one of the candidate sets
and v is the support for that itemset among the baskets that were input to this Map
task.
• Second Reduce function :-
• The Reduce tasks take the itemsets they are given as keys and sum the associated
values.
• The result is the total support for each of the itemsets that the Reduce task was
assigned to handle.
• Those itemsets whose sum of values is at least s are frequent in the whole dataset,
so the Reduce task outputs these itemsets with their counts.
• Itemsets that do not have total support at least s are not transmitted to the output of
the Reduce task.
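• A sketch of the two MapReduce passes described above, simulated with plain Python functions. The chunking, the lowered threshold p·s, the in-memory frequent-itemset helper, and the toy data are assumptions for illustration, not the lecture's code.

```python
from collections import Counter
from itertools import combinations

def find_frequent(baskets, support_count):
    """In-memory frequent-itemset finder used inside each first-pass Map task
    (stands in for the simple, randomized algorithm run on a chunk)."""
    counts = Counter()
    for b in baskets:
        for k in (1, 2):
            counts.update(frozenset(c) for c in combinations(sorted(b), k))
    return [s for s, c in counts.items() if c >= support_count]

def map1(chunk, s, p):
    # Emit (F, 1) for each itemset frequent in the chunk, threshold lowered to p*s.
    return [(F, 1) for F in find_frequent(chunk, int(p * s))]

def reduce1(pairs):
    # Ignore the values; the distinct keys are the candidate itemsets.
    return {F for F, _ in pairs}

def map2(chunk, candidates):
    # Count each candidate's occurrences in this chunk; emit (C, v).
    return [(C, sum(1 for b in chunk if C <= b)) for C in candidates]

def reduce2(pairs, s):
    # Sum the partial supports; keep itemsets with total support >= s.
    totals = Counter()
    for C, v in pairs:
        totals[C] += v
    return {C: v for C, v in totals.items() if v >= s}

# Toy run with two chunks (p = 1/2) and support threshold s = 3.
chunks = [[{"a", "b"}, {"a", "b", "c"}, {"a", "c"}],
          [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}]]
s, p = 3, 0.5
candidates = reduce1(sum((map1(c, s, p) for c in chunks), []))
print(reduce2(sum((map2(c, candidates) for c in chunks), []), s))
```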
Clustering Techniques
What is Cluster Analysis?
• Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Cluster analysis
• Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined
classes
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
• create thematic maps in GIS by clustering feature spaces
• detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
• Document classification
• Cluster Weblog data to discover groups of similar access
patterns
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine
input parameters
• Able to deal with noise and outliers
• Incorporation of user-specified constraints
• Interpretability and usability
Type of data in clustering analysis
• Interval-scaled variables:
• Binary variables:
• Standardize data
• Calculate the mean absolute deviation:
s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|),
where m_f = (1/n) (x_1f + x_2f + … + x_nf)
• Calculate the standardized measurement (z-score):
z_if = (x_if − m_f) / s_f
• Minkowski distance between objects i and j:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two
p-dimensional data objects, and q is a positive integer
• If q = 1, d is Manhattan distance:
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
Similarity and Dissimilarity Between
Objects (Cont.)
• If q = 2, d is Euclidean distance:
d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + … + |x_ip − x_jp|^2)
• Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
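• A small helper implementing the Minkowski family just defined (q = 1 gives Manhattan, q = 2 gives Euclidean); the sample vectors are arbitrary.

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional objects i and j."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))   # Manhattan: 7.0
print(minkowski(x, y, 2))   # Euclidean: 5.0
```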
• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 o.w.
• f is interval-based: use the normalized distance
• f is ordinal or ratio-scaled
• compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
• and treat z_if as interval-scaled
Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing and
genetic algorithms
• Weakness
• Applicable only when mean is defined, then what about
categorical data?
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
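• A compact k-means sketch illustrating the O(tkn) loop described above (assign each of n objects to the nearest of k means, then update the means, for t iterations); the sample points and k are arbitrary.

```python
import random

def kmeans(points, k, iterations=100):
    """Basic k-means: assign each point to the nearest mean, then update the means."""
    means = random.sample(points, k)               # initial k means
    clusters = []
    for _ in range(iterations):                    # t iterations
        clusters = [[] for _ in range(k)]
        for p in points:                           # n objects
            idx = min(range(k), key=lambda c: sum((a - b) ** 2
                                                  for a, b in zip(p, means[c])))
            clusters[idx].append(p)
        new_means = [tuple(sum(coord) / len(c) for coord in zip(*c)) if c else means[i]
                     for i, c in enumerate(clusters)]
        if new_means == means:                     # converged to a (local) optimum
            break
        means = new_means
    return means, clusters

points = [(1, 1), (1.5, 2), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(points, k=2))
```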
Variations of the K-Means Method
• A few variants of the k-means which differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang’98)
• Replacing means of clusters with modes
• Using new dissimilarity measures to deal with categorical
objects
• Using a frequency-based method to update modes of clusters
• A mixture of categorical and numerical data: k-prototype
method
The K-Medoids Clustering Method
[Figure: the four cases for computing the cost of swapping a current medoid i with a non-medoid object h, considering each remaining object t and another medoid j.]
Hierarchical Clustering
• Use distance matrix as clustering criteria. This method does not
require the number of clusters k as an input, but needs a
termination condition
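• A naive bottom-up (agglomerative) sketch using the single-link distance: start from singleton clusters and repeatedly merge the two closest clusters until a termination condition is met (here, a target number of clusters). The data and the termination value are arbitrary.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    """Distance between two clusters = smallest pairwise point distance."""
    return min(euclidean(a, b) for a in c1 for b in c2)

def agglomerative(points, target_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the two closest clusters according to the distance matrix.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(agglomerative(points, target_clusters=2))
```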
[Figure: BIRCH clustering feature example with the points (3,4), (2,6), (4,5), (4,7), (3,8); CF = (N, LS, SS) = (5, (16, 30), (54, 190)).]
[Figure: CF tree structure — a root and non-leaf nodes holding entries CF1, CF2, …, CF6, and leaf nodes whose entries are chained with prev/next pointers.]
CURE (Clustering Using REpresentatives)
Cure: Shrinking Representative Points
[Figure: CURE overview — the data set is partitioned, each partition is clustered, and the partial clusters are merged into the final clusters.]
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition
• Several interesting studies:
• DBSCAN: Ester, et al. (KDD’96)
• OPTICS: Ankerst, et al (SIGMOD’99).
• DENCLUE: Hinneburg & D. Keim (KDD’98)
• CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering: Background
• Two parameters:
• Eps: Maximum radius of the neighbourhood
• MinPts: Minimum number of points in an Eps-neighbourhood
of that point
• NEps(p): {q belongs to D | dist(p,q) <= Eps}
• Directly density-reachable: A point p is directly density-reachable
from a point q wrt. Eps, MinPts if
• 1) p belongs to NEps(q)
• 2) core point condition: |NEps(q)| >= MinPts
• [Figure: example with MinPts = 5, Eps = 1 cm.]
Density-Based Clustering: Background (II)
• Density-reachable:
• A point p is density-reachable from a point q wrt. Eps, MinPts if
there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is
directly density-reachable from pi
• Density-connected:
• A point p is density-connected to a point q wrt. Eps, MinPts if there
is a point o such that both p and q are density-reachable from o wrt.
Eps and MinPts.
• [Figures: a chain q → p1 → … → p illustrating density-reachability, and a common point o illustrating density-connectivity.]
DBSCAN: Density Based Spatial Clustering
of Applications with Noise
[Figure: core, border, and outlier (noise) points for Eps = 1 cm, MinPts = 5.]
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p wrt Eps and
MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been
processed.
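• A minimal DBSCAN sketch following the steps above (select a point, retrieve the points density-reachable from it, and form a cluster if it is a core point); Eps, MinPts, and the data are placeholders.

```python
def region_query(points, p, eps):
    """N_Eps(p): indices of all points within distance Eps of point p."""
    return [q for q in range(len(points))
            if sum((a - b) ** 2 for a, b in zip(points[p], points[q])) ** 0.5 <= eps]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for p in range(len(points)):                    # arbitrarily select a point p
        if labels[p] is not UNVISITED:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:               # not a core point: noise (for now)
            labels[p] = NOISE
            continue
        labels[p] = cluster_id                      # p is a core point: new cluster
        seeds = list(neighbours)
        while seeds:                                # expand via density-reachability
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id              # noise becomes a border point
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:        # q is also a core point
                seeds.extend(q_neighbours)
        cluster_id += 1
    return labels

data = [(1, 1), (1.2, 1.1), (0.9, 1.0), (5, 5), (5.1, 5.2), (4.9, 5.1), (9, 1)]
print(dbscan(data, eps=0.5, min_pts=3))   # two clusters plus one noise point (-1)
```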
OPTICS: A Cluster-Ordering Method (1999)
• Index-based:
• k = number of dimensions
• N = 20
• p = 75%
• M = N(1 − p) = 5
• Complexity: O(kN²)
• Core Distance
• Reachability Distance = max(core-distance(o), d(o, p))
• [Figure: example with MinPts = 5, ε = 3 cm; reachability-distances r(p1, o) = 2.8 cm and r(p2, o) = 4 cm.]
• [Figure: reachability plot — reachability-distance (undefined where not reachable) against the cluster order of the objects.]
DENCLUE: using density functions
• DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
• Major features
• Solid mathematical foundation
• Good for data sets with large amounts of noise
• Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
• Significantly faster than existing algorithms (faster than DBSCAN
by a factor of up to 45)
• But needs a large number of parameters
Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a
tree-based access structure.
• Influence function: describes the impact of a data point within its
neighborhood.
• Overall density of the data space can be calculated as the sum of
the influence function of all data points.
• Clusters can be determined mathematically by identifying density
attractors.
• Density attractors are local maxima of the overall density function.
• Gradient: the steepness of a slope
• [Figure: example density attractors — center-defined clusters and arbitrary-shaped clusters.]
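• A tiny sketch of the influence-function idea: each data point contributes a Gaussian influence, and the overall density at any location is the sum of those influences. The bandwidth sigma and the sample points are arbitrary.

```python
import math

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y at location x (Gaussian influence function)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def overall_density(x, data, sigma=1.0):
    """Overall density at x = sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9)]
print(overall_density((1.0, 1.0), data))   # high: near a density attractor
print(overall_density((3.0, 3.0), data))   # low: between the two groups of points
```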
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
• STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using wavelet method
• CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical Information Grid
Approach
[Figures: grid-based example over the attributes age, salary, and vacation (week), with density threshold τ = 3.]
Strength and Weakness of CLIQUE
• Strength
• It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
• It is insensitive to the order of records in input and does
not presume some canonical data distribution
• It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weakness
• The accuracy of the clustering result may be degraded at
the expense of simplicity of the method
Model-Based Clustering Methods
• Attempt to optimize the fit between the data and some
mathematical model
• Statistical and AI approach
• Conceptual clustering
• A form of clustering in machine learning
• Produces a classification scheme for a set of unlabeled objects
• Finds characteristic description for each concept (class)
• COBWEB (Fisher’87)
• A popular and simple method of incremental conceptual learning
• Creates a hierarchical clustering in the form of a classification tree
• Each node refers to a concept and contains a probabilistic description
of that concept
COBWEB Clustering Method
A classification tree
More on Statistical-Based Clustering
• Limitations of COBWEB
• The assumption that the attributes are independent of
each other is often too strong because correlation may
exist
• Not suitable for clustering large database data – skewed
tree and expensive probability distributions
• CLASSIT
• an extension of COBWEB for incremental clustering of
continuous data
• suffers similar problems as COBWEB
• AutoClass (Cheeseman and Stutz, 1996)
• Uses Bayesian statistical analysis to estimate the number
of clusters
• Popular in industry
Other Model-Based Clustering
Methods
• Neural network approaches
• Represent each cluster as an exemplar, acting as a
“prototype” of the cluster
• New objects are distributed to the cluster whose
exemplar is the most similar according to some distance
measure
• Competitive learning
• Involves a hierarchical architecture of several units
(neurons)
• Neurons compete in a “winner-takes-all” fashion for the
object currently being presented
Model-Based Clustering Methods
Self-organizing feature maps (SOMs)