DA Unit 4
Unit 4
Frequent Itemsets and Clustering
Syllabus
• For example, bread and butter, or a laptop and antivirus software, are items that are
frequently purchased together.
Frequent Itemset
• Frequent itemsets are those items whose support is greater than the
threshold value or user-specified minimum support. It means that if A and B are
frequent itemsets together, then individually A and B should also be
frequent itemsets.
Frequent Itemset (Contd…)
• For the frequent itemset mining method, consider only those transactions that
meet the minimum support and confidence requirements. Insights from
these mining algorithms offer a lot of benefits, including cost-cutting and
improved competitive advantage.
Frequent Pattern Mining (FPM)
• The frequent pattern mining algorithm is one of the most important techniques
of data mining to discover relationships between different items in a dataset.
These relationships are represented in the form of association rules. It helps to
find the regularities in data.
• FPM has many applications in the fields of data analysis, software bugs, cross-
marketing, sale campaign analysis, market basket analysis, etc.
• Frequent itemsets discovered through Apriori have many applications in data
mining tasks. Tasks such as finding interesting patterns in the database, finding
out sequences and mining of association rules – are the most important among
them.
• Association rules apply to supermarket transaction data, that is, they examine
customer behaviour in terms of the purchased products. Association rules
describe how often the items are purchased together.
Frequent Item set in Data set (Association Rule Mining)
• Frequent item sets are a fundamental concept
in association rule mining, which is a technique used in data mining to
discover relationships between items in a dataset. The goal of association
rule mining is to identify relationships between items in a dataset that occur
frequently together.
• A frequent item set is a set of items that occur together frequently in a
dataset. The frequency of an item set is measured by the support count, which
is the number of transactions or records in the dataset that contain the item set.
For example, if a dataset contains 100 transactions and the item set {milk,
bread} appears in 20 of those transactions, the support count for {milk, bread}
is 20.
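As a rough illustration of how a support count is obtained, the short Python sketch below scans a small, made-up set of transactions (the data and function name are hypothetical, not from the text):

# Hypothetical transactions; each transaction is a set of purchased items.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"milk", "bread"}, transactions))                       # 3
print(support_count({"milk", "bread"}, transactions) / len(transactions))   # support = 0.75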
Frequent Item set in Data set (Association Rule Mining) (Contd…)
• Frequent item sets and association rules can be used for a variety of tasks
such as market basket analysis, cross-selling and recommendation systems.
• Closed Itemset: An itemset is closed if none of its immediate supersets has the same
support count as the itemset.
• K-Itemset: An itemset which contains K items is a K-itemset. So, it can be said that an
itemset is frequent if the corresponding support count is greater than the minimum
support count.
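The closed-itemset condition can be checked directly from support counts. The sketch below uses hypothetical counts over the items {A, B, C} and tests whether any immediate superset has the same support count:

# Hypothetical support counts (not from the slides).
support = {
    frozenset("A"): 4, frozenset("B"): 4, frozenset("C"): 3,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("BC"): 2,
    frozenset("ABC"): 2,
}
all_items = {"A", "B", "C"}

def is_closed(itemset):
    # An itemset is closed if no immediate superset has the same support count.
    itemset = frozenset(itemset)
    for item in all_items - itemset:
        if support.get(itemset | {item}, 0) == support[itemset]:
            return False
    return True

print(is_closed("A"))    # False: the superset {A, B} has the same support count (4)
print(is_closed("AB"))   # True: {A, B, C} has a lower support count (2)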
Market-Based Modeling
• In simple terms, market basket analysis in data mining and data analysis is used to
analyse the combinations of products that have been bought together. It is a
technique that gives a careful study of purchases made by a customer in a
supermarket. This concept identifies the pattern of frequent purchase of items
by customers. This analysis can help companies to promote deals, offers, and sales,
and data mining techniques help to achieve this task.
• Example: Data mining concepts are in use for sales and marketing to provide
better customer service, to improve cross-selling opportunities, to increase
direct mail response rates.
Terminologies used with Market-Based Modeling
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.
• IF means Antecedent: An antecedent is an item found within the data.
• THEN means Consequent: A consequent is an item found in combination with the
antecedent.
Types of Market Basket Analysis
There are three types of Market Basket Analysis. They are as follows:
1. Descriptive market basket analysis: This type only derives insights from
past data and is the most frequently used approach. This kind of study is mostly
used to understand consumer behaviour, including what products are purchased
in combination and what the most typical item combinations are. Retailers can
place products in their stores more profitably by understanding which products
are frequently bought together with the aid of descriptive market basket
analysis. This type of modelling is known as unsupervised learning.
Applications of Market Basket Analysis
1. Retail: Market basket research is frequently used in the retail sector to examine
consumer buying patterns and inform decisions about product placement,
inventory management, and pricing tactics. Retailers can utilize market basket
research to identify which items are sluggish sellers and which ones are commonly
bought together and then modify their inventory management strategy accordingly.
2. E-commerce: Market basket analysis can help online merchants better understand
the customer buying habits and make data-driven decisions about product
recommendations and targeted advertising campaigns. The behaviour of visitors to
a website can be examined using market basket analysis to pinpoint problem areas.
Applications of Market Basket Analysis (Contd…)
3. Finance: Market basket analysis can be used to evaluate investor behaviour and forecast the
types of investment items that investors will likely buy in the future. The performance of
investment portfolios can be enhanced by using this information to create tailored investment
strategies.
4. Telecommunications: To evaluate consumer behaviour and make data-driven decisions about
which goods and services to provide, the telecommunications business might employ market
basket analysis. The usage of this data can enhance client happiness and the shopping
experience.
5. Manufacturing: To evaluate consumer behaviour and make data-driven decisions about
which products to produce and which materials to employ in the production process, the
manufacturing sector might use market basket analysis. Utilizing this knowledge will increase
effectiveness and cut costs.
Apriori Algorithm
• The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are connected.
This algorithm uses a breadth-first search and a Hash Tree to calculate the itemset
associations efficiently. It is an iterative process for finding the frequent itemsets in a
large dataset.
• This algorithm was given by R. Agrawal and R. Srikant in the year 1994. It is
mainly used for market basket analysis and helps to find those products that can be
bought together. It can also be used in the healthcare field to find drug reactions for
patients.
Steps for Apriori Algorithm
• Step-1: Determine the support of the itemsets in the transactional database, and select
the minimum support and confidence.
• Step-2: Take all the itemsets in the transactions with a support value higher than the
minimum or selected support value.
• Step-3: Find all the rules of these subsets that have a confidence value higher than
the threshold or minimum confidence.
In the first step, we will create a table that contains support count (The
frequency of each itemset individually in the dataset) of each itemset in the
given dataset. This table is called the Candidate set or C1.
Apriori Algorithm Working (Contd…)
o Now, we will take out all the itemsets that have a greater support count
than the Minimum Support (2). This will give us the table for the frequent
itemset L1. Since all the itemsets have a support count greater than or equal
to the minimum support, except E, the itemset {E} will be removed.
Apriori Algorithm Working (Contd…)
o Now we will create the L3 table. As we can see from the above C3 table, there is only one
combination of itemsets that has a support count equal to the minimum support count. So, L3 will have
only the one combination that occurred, i.e. {A, B, C}. For all the rules, we will calculate the confidence
using the formula confidence = sup(A ∧ B) / sup(A). After calculating the confidence value for all rules,
we will exclude the rules that have less confidence than the minimum threshold (50%).
• As the given threshold or minimum confidence is 50%, the first three rules,
A ∧ B → C, B ∧ C → A, and A ∧ C → B, can be considered as the strong
association rules for the given problem.
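The working described above can be sketched in Python. The transactions, the minimum support count of 2, and the 50% confidence threshold below are illustrative; the sketch builds frequent itemsets level by level and then derives the strong rules:

from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"A", "B", "C", "D"}]
min_support_count = 2
min_confidence = 0.5

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Level-wise generation of frequent itemsets (L1, L2, L3, ...).
items = sorted({i for t in transactions for i in t})
frequent = {}                                   # itemset -> support count
level = [frozenset([i]) for i in items]
while level:
    current = {s: count(s) for s in level if count(s) >= min_support_count}
    frequent.update(current)
    # Join step: candidate (k+1)-itemsets from pairs of frequent k-itemsets.
    keys = list(current)
    level = list({a | b for a in keys for b in keys if len(a | b) == len(a) + 1})

# Rule generation: for each frequent itemset, keep rules X -> (itemset - X)
# whose confidence sup(itemset) / sup(X) meets the minimum confidence.
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = sup / frequent[antecedent]
            if confidence >= min_confidence:
                print(set(antecedent), "->", set(itemset - antecedent), round(confidence, 2))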
Methods to Improve Apriori Efficiency
Many methods are available for improving the efficiency of the algorithm as
given below:
3. Partitioning : This method requires only two database scans to mine the
frequent itemsets. It says that for any itemset to be potentially frequent in the
database, it should be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from database D and then
searches for frequent itemset in S. It may be possible to lose a global frequent
itemset. This can be reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets
at any marked start point of the database during the scanning of the database.
Disadvantage of Apriori Algorithm
• It requires high computation if the itemsets are very large and the minimum
support is kept very low.
Generating Association Rules from Frequent Itemsets
1. Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong association
rules satisfy both minimum support and minimum confidence).
2. This can be done using equation (1) for confidence, which is shown here for
completeness: confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A).
5. Because the rules are generated from frequent itemsets, each one automatically
satisfies the minimum support.
6. Frequent itemsets can be stored ahead of time in hash tables along with their counts so
that they can be accessed quickly.
Applications of Frequent Itemset Analysis
• Related concepts :
1. Let items be words, and let baskets be documents (e.g., Web pages, blogs, tweets).
3. If we look for sets of words that appear together in many documents, the sets will
be dominated by the most common words (stop words).
4. If the documents contain many stop words such as “and” and “a”, then these words will
be considered the most frequent itemsets.
5. However, if we ignore all the most common words, then we would hope to find
among the frequent pairs some pairs of words that represent a joint concept.
Applications of Frequent Itemset Analysis (Contd…)
• Plagiarism :
4. In this application, we look for pairs of items that appear together in several baskets.
5. If we find such a pair, then we have two documents that share several sentences in
common.
Applications of Frequent Itemset Analysis (Contd…)
• Biomarkers :
1. Let the items be of two types such as genes or blood proteins, and diseases.
2. Each basket is the set of data about a patient: their genome and blood-
chemistry analysis, as well as their medical history of disease.
• When we refer to large data in this chapter, we mean data that cause problems to work
with in terms of memory or speed but can still be handled by a single computer. We
start this chapter with an overview of the problems you face when handling large
datasets.
• Then we offer three types of solutions to overcome these problems: adapt your
algorithms, choose the right data structures, and pick the right tools. Data scientists
aren't the only ones who must deal with large data volumes, so you can apply general
best practices to tackle the large data problem. Finally, we apply this knowledge to two
case studies. The first case shows you how to detect malicious URLs, and the second
case demonstrates how to build a recommender engine inside a database.
The Problems You Face when Handling Large Data
• A large volume of data poses new
challenges, such as overloaded memory
and algorithms that never stop running.
It forces you to adapt and expand your
repertoire of techniques. But even
when you can perform your analysis,
you should take care of issues such as
I/O (input/output) and CPU starvation,
because these can cause speed issues.
Figure: Overview of Problems Encountered when Working with More Data than Can
Fit in Memory
General Techniques for Handling Large Volumes of Data
• Algorithms can make or break your program, but the way you store your data
is of equal importance. Data structures have different storage requirements, but
also influence the performance of CRUD (create, read, update, and delete) and
other operations on the dataset. The figure shows that there are many different data
structures to choose from, three of which are discussed here: sparse data, tree data,
and hash data.
Choosing the Right Data Structure (Contd…)
Selecting the Right Tools
2. Work with a Smaller Sample: Take a random sample of your data, such as 1,000
or 100,000 rows. Use this smaller sample to work through your problem
before fitting a final model on all of your data.
3. Change the Data Format: Is your data stored in raw ASCII text, like a CSV file?
Perhaps you can speed up data loading and use less memory by using another data
format. A good example is a binary format like GRIB, NetCDF, or HDF.
Ways to Handle Large Data Files for Machine Learning (Contd…)
4. Stream Data or Use Progressive Loading: Does all the data need to be in
memory at the same time? Perhaps you can use code or a library to stream or
progressively load data as-needed into memory for training.
6. Use a Big Data Platform: In some cases, you may need to resort to a big data
platform.
Limited-Pass Algorithms
• The algorithms for frequent itemsets discussed so far use one pass for each
size of itemset we investigate. If main memory is too small to hold the
data and the space needed to count frequent itemsets of one size, there
does not seem to be any way to avoid k passes to compute the exact
collection of frequent itemsets. However, there are many applications
where it is not essential to discover every frequent itemset.
• In this section we explore some algorithms that have been proposed to find all or
most frequent itemsets using at most two passes. We begin with the obvious
approach of using a sample of the data rather than the entire dataset. An
algorithm called SON uses two passes, gets the exact answer, and lends itself
to implementation by map-reduce or another parallel computing regime.
Finally, Toivonen’s Algorithm uses two passes on average, gets an exact
answer, but may, rarely, not terminate in any given amount of time.
Simple and Randomized Algorithm
1. In simple and randomized algorithm, we pick a random subset of the baskets and
pretend it is the entire dataset instead of using the entire file of baskets.
2. We must adjust the support threshold to reflect the smaller number of baskets.
3. For instance, if the support threshold for the full dataset is s, and we choose a sample
of 1% of the baskets, then we should examine the sample for itemsets that appear in at
least s/100 of the baskets.
4. The best way to pick the sample is to read the entire dataset, and for each basket,
select that basket for the sample with some fixed probability p.
5. Suppose there are m baskets in the entire file. At the end, we shall have a sample
whose size is very close to pm baskets.
Simple and Randomized Algorithm (Contd…)
6. However, if the baskets appear in random order in the file already, then we do not even
have to read the entire file.
7. We can select the first pm baskets for our sample. Or, if the file is part of a distributed
file system, we can pick some chunks at random to serve as the sample.
8. Having selected our sample of the baskets, we use part of main memory to store these
baskets.
9. Remaining main memory is used to execute one of the algorithms such as A-Priori or
PCY. However, the algorithm must run passes over the main-memory sample for each
itemset size, until we find a size with no frequent items.
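A minimal sketch of this sampling step is shown below; the file contents, the probability p, and the full-data threshold s are hypothetical. Each basket is kept with probability p, and the support threshold is scaled by p before any in-memory algorithm (such as A-Priori) is run on the sample:

import random

def sample_baskets(basket_lines, p=0.01, seed=42):
    # Select each basket for the sample with fixed probability p (one pass over the file).
    rng = random.Random(seed)
    sample = []
    for line in basket_lines:
        if rng.random() < p:
            sample.append(set(line.split()))
    return sample

# If the support threshold for the full dataset is s, use roughly p*s on the sample.
s = 10_000           # hypothetical full-data support threshold
p = 0.01
sample_threshold = p * s    # look for itemsets appearing in at least 100 sampled baskets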
Savasere, Omiecinski, and Navathe (SON) algorithm to find all or most frequent itemsets using at most two passes
1. Divide the input file of baskets into chunks.
2. Treat each chunk as a sample and run the simple, randomized algorithm on that
chunk.
3. If each chunk is fraction p of the whole file and s is the support threshold, we use ps
as the threshold for mining each chunk.
4. Store on disk all the frequent itemsets found for each chunk.
5. Once all the chunks have been processed in that way, take the union of all the itemsets
that have been found frequent for one or more chunks. These are the candidate itemsets.
SON algorithm to find all or most frequent itemsets using at most
two passes (Contd…)
6. If an itemset is not frequent in any chunk, then its support is less than ps in each
chunk. Since the number of chunks is 1/p, we conclude that the total support for that
itemset is less than (1/p)ps = s.
7. Thus, every itemset that is frequent in the whole is frequent in at least one chunk,
and we can be sure that all the truly frequent itemsets are among the candidates; i.e.,
there are no false negatives. We have made a total of one pass through the data as we
read each chunk and processed it.
8. In a second pass, we count all the candidate itemsets and select those that have
support at least s as the frequent itemsets.
SON Algorithm and MapReduce
2. Each of the chunks can be processed in parallel, and the frequent itemsets
from each chunk combined to form the candidates.
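A rough single-machine sketch of the SON idea follows; the baskets, the number of chunks, and the brute-force helper frequent_itemsets are placeholders for whatever in-memory miner (for example, the Apriori sketch given earlier) is actually used on each chunk:

from itertools import combinations

def frequent_itemsets(baskets, threshold, max_size=3):
    # Placeholder in-memory miner: brute-force counts of itemsets up to max_size.
    counts = {}
    for b in baskets:
        for k in range(1, max_size + 1):
            for s in combinations(sorted(b), k):
                counts[frozenset(s)] = counts.get(frozenset(s), 0) + 1
    return {s for s, c in counts.items() if c >= threshold}

def son(baskets, s, num_chunks=4):
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]
    p = 1.0 / num_chunks
    # Pass 1: mine each chunk with the lowered threshold p*s; union the results as candidates.
    candidates = set()
    for chunk in chunks:
        candidates |= frequent_itemsets(chunk, p * s)
    # Pass 2: count every candidate over the full data; keep those with support at least s.
    return {c for c in candidates if sum(1 for b in baskets if c <= b) >= s}

baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}] * 5   # hypothetical data
print(son(baskets, s=8))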
Toivonen’s Algorithm
• Toivonen’s algorithm, given sufficient main memory, will use one pass over a
small sample and one full pass over the data. It will give neither false negatives
nor false positives, but there is a small but finite probability that it will fail to
produce any answer at all. In that case it needs to be repeated until it gives an
answer. However, the average number of passes needed before it produces all
and only the frequent itemsets is a small constant.
• Toivonen’s algorithm begins by selecting a small sample of the input dataset and
finding from it the candidate frequent itemsets.
Counting Frequent Items in a Stream
(a) Use the file that was collected while the first iteration of the algorithm was running. At the same
time, collect yet another file to be used at another iteration of the algorithm, when this current
iteration finishes.
(b) Start collecting another file of baskets now and run the algorithm when an adequate number of
baskets has been collected.
We can continue to count the numbers of occurrences of each of these frequent itemsets, along
with the total number of baskets seen in the stream, since the counting started. If any itemset is
discovered to occur in a fraction of the baskets that is significantly below the threshold fraction s,
then this set can be dropped from the collection of frequent itemsets. When computing the fraction, it is
important to include the occurrences from the original file of baskets as well. If not, we run the risk that
we shall encounter a short period in which a truly frequent itemset does not appear sufficiently
frequently and is thrown out by mistake.
Counting Frequent Items in a Stream (Contd…)
• We should also allow some way for new frequent itemsets to be added to the
current collection. Possibilities include:
(a) Periodically gather a new segment of the baskets in the stream and use it as the
data file for another iteration of the chosen frequent itemsets algorithm. The new
collection of frequent items is formed from the result of this iteration and the
frequent itemsets from the previous collection that have survived the possibility
of having been deleted for becoming infrequent.
(b) Add some random itemsets to the current collection, and count their fraction
of occurrences for a while, until one has a good idea of whether they are currently
frequent. Rather than choosing new itemsets completely at random, one might
focus on sets with items that appear in many itemsets already known to be
frequent.
Clustering Techniques
• Cluster analysis is the process of finding similar groups of objects to form clusters. It
is an unsupervised machine learning-based algorithm that acts on unlabeled data. A
group of data points would come together to form a cluster in which all the objects
would belong to the same group.
• Clustering is the process of grouping a set of data objects into multiple groups or
clusters so that objects within a cluster have high similarity but are very dissimilar to
objects in other clusters. Dissimilarities and similarities are assessed based on the
attribute values describing the objects and often involve distance measures.
• Clustering is the process of partitioning a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are like one another, yet
dissimilar to objects in other clusters. Clustering as a data analysis tool has its roots in
many application areas such as biology, security and business intelligence and Web
search.
Clustering Techniques (Contd…)
• The intra-cluster similarities are high; this implies that the data present inside
the cluster is similar to one another.
• The inter-cluster similarity is low, and it means each cluster holds data that is
not similar to other data.
Applications of Cluster Analysis in Data Mining
• It assists marketers in finding different groups in their client base based on their
purchasing patterns, so that they can characterise their customer groups.
• Clustering is also used in tracking applications such as detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data and analyse the characteristics of each cluster.
Types of Clustering
There are different types of clustering algorithms that handle all kinds of unique data.
1. Partitioning methods: Given a set of n objects, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k<= n. That is, it
divides the data into k groups such that each group must contain at least one object.
The basic partitioning methods typically adopt exclusive cluster separation. That is,
each object must belong to exactly one group. Most partitioning methods are
distance-based. The general criterion of a good partitioning is that objects in the same
cluster are 'close' or related to each other, whereas objects in different clusters are 'far
apart' or very different.
Types of Clustering (Contd…)
• In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
• Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but they differ in how they work, as there is no requirement in hierarchical
clustering to predetermine the number of clusters as there is in the K-means algorithm.
Hierarchical Clustering in Machine Learning (Contd…)
o Step-1: Create each data point as a single cluster. If there are N data points,
there will be N clusters.
o Step-2: Take the two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
How Does the Agglomerative Hierarchical Clustering Work? (Contd…)
Step-3: Again, take the two closest clusters and merge them
together to form one cluster. There will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters.
Consider the below images:
Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
Measure for the distance between two clusters
As we have seen, the closest distance between the two clusters is crucial for
the hierarchical clustering. There are various ways to calculate the distance
between two clusters, and these ways decide the rule for clustering. These
measures are called Linkage methods. Some of the popular linkage
methods are given below:
2. Complete Linkage: It is the farthest distance between two points of two
different clusters. It is one of the popular linkage methods, as it forms tighter
clusters than single linkage.
Measure for the distance between two clusters (Contd…)
• The working of the dendrogram can be explained using the below diagram:
Working of Dendrogram in Hierarchical Clustering (Contd…)
In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.
• As we have discussed above, firstly, the datapoints P2 and P3 combine and form a cluster,
and correspondingly a dendrogram is created, which connects P2 and P3 with a rectangular
shape. The height is decided according to the Euclidean distance between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It
is higher than the previous one, as the Euclidean distance between P5 and P6 is a little
greater than that between P2 and P3.
• Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram,
and P4, P5, and P6, in another dendrogram.
• At last, the final dendrogram is created that combines all the data points together.
We can cut the dendrogram tree structure at any level as per our requirement.
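The agglomerative process and its dendrogram can be reproduced with SciPy; the six points below are made up, and method='complete' corresponds to the complete-linkage measure discussed above (method='single' would give single linkage):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical 2-D points P1..P6.
X = np.array([[1, 1], [1.2, 1.1], [1.3, 0.9], [5, 5], [5.1, 5.2], [5.3, 4.9]])

Z = linkage(X, method='complete')      # merge history of agglomerative clustering
dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.show()

# Cut the dendrogram to obtain a flat clustering, e.g. 2 clusters.
print(fcluster(Z, t=2, criterion='maxclust'))   # cluster label for each point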
K-Means: A centroid-Based Technique (Partitioning Based
Method)
• It allows us to cluster the data into different groups and is a convenient way to
discover the categories of groups in an unlabeled dataset on its own,
without the need for any training.
2. Assigns each data point to its closest k-center. Those data points which are
near to the k-center, create a cluster.
The k-means Clustering (Contd…)
The below diagram explains the working of the K-means Clustering Algorithm:
k-means Clustering Algorithm
K-means: The k-means algorithm for partitioning, where each cluster's center is represented by the
mean value of the objects in the cluster.
Input:
K: the number of clusters,
D: a dataset containing n objects.
Output: A set of K clusters.
Method:
1. Arbitrarily choose K objects from D as the initial cluster centers;
2. Repeat
3. (Re) assign each object to the cluster to which the object is the most similar, based on the mean
value of the objects in the cluster;
4. Update the cluster means, that is, calculate the mean value of the objects for each cluster;
5. Until no change;
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points or centroids. (They may be points other than those
from the input dataset.)
• Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.
• Step-4: Calculate the variance and place a new centroid for each cluster.
• Step-5: Repeat the third step, which means reassign each datapoint to the new closest
centroid of each cluster.
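A minimal NumPy sketch of these steps is given below (the data points and K are illustrative). It assigns each point to the nearest centroid, recomputes the means, and stops when the centroids no longer change:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: choose K initial centroids arbitrarily from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step-3: assign each data point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                      # no change in the centroids: converged
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)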
Subspace Clustering Methods
1. Subspace Search Methods: A subspace search method searches the subspaces for
clusters. Here, the cluster is a group of similar types of objects in a subspace. The
similarity between the clusters is measured by using distance or density features.
The CLIQUE algorithm is a subspace clustering method. Subspace search methods search a
series of subspaces. There are two approaches in Subspace Search Methods: Bottom-up
approach starts to search from the low-dimensional subspaces. If the hidden clusters are
not found in low-dimensional subspaces then it searches in higher dimensional subspaces.
The top-down approach starts to search from the high-dimensional subspaces and then
search in subsets of low-dimensional subspaces. Top-down approaches are effective if
the subspace of a cluster can be defined by the local neighborhood sub-space clusters.
Bi-Clustering Methods
Bi-clustering means clustering the data based on two factors: we can cluster both objects and
attributes at the same time in some applications. The resultant clusters are biclusters. To perform
bi-clustering there are four requirements:
• The data objects can take part in multiple clusters, or an object may not be included in any cluster.
• Objects and attributes are not treated in the same way. Objects are clustered according to their
attribute values. We treat objects and attributes differently in biclustering analysis.
CLIQUE (Clustering in QUEst)
• The CLIQUE algorithm first divides the data space into grids. It is done by dividing
each dimension into equal intervals called units. After that, it identifies dense units.
A unit is dense if the number of data points in it exceeds the threshold value.
• Once the algorithm finds dense cells along one dimension, it tries to find
dense cells along two dimensions, and it works until all dense cells along all the
dimensions are found.
• After finding all dense cells in all dimensions, the algorithm proceeds to find the
largest set (“cluster”) of connected dense cells. Finally, the CLIQUE algorithm
generates a minimal description of the cluster. Clusters are then generated from all
dense subspaces using the apriori approach.
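A toy sketch of the first CLIQUE step is shown below; the data, the number of intervals, and the density threshold are all illustrative. Each dimension is split into equal-width units, and the units whose point counts exceed the threshold are reported as dense one-dimensional units (CLIQUE would then combine these, dimension by dimension, to find higher-dimensional dense cells):

import numpy as np

def dense_units_1d(X, n_intervals=4, threshold=3):
    # For each dimension, return the grid units holding more than `threshold` points.
    dense = {}
    for dim in range(X.shape[1]):
        col = X[:, dim]
        edges = np.linspace(col.min(), col.max(), n_intervals + 1)
        # np.digitize maps each value to a unit; clip so the maximum value stays in the last unit.
        units = np.clip(np.digitize(col, edges) - 1, 0, n_intervals - 1)
        counts = np.bincount(units, minlength=n_intervals)
        dense[dim] = [u for u in range(n_intervals) if counts[u] > threshold]
    return dense

X = np.random.default_rng(1).normal(loc=[0, 5], scale=0.5, size=(100, 2))   # hypothetical data
print(dense_units_1d(X))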
Advantage of CLIQUE Algorithm
• CLIQUE can find clusters of any shape and is able to find any number of
clusters in any number of dimensions, where the number is not predetermined
by a parameter.
PROCLUS (Projected Clustering)
• Projected clustering is the first top-down partitioning projected clustering algorithm, based on the
notion of k-medoid clustering, which was presented by Aggarwal (1999). It determines medoids
for each cluster repetitively on a sample of data using a greedy hill-climbing technique and then
upgrades the results repetitively. Cluster quality in projected clustering is a function of the average
distance between data points and the closest medoid. Also, the subspace dimensionality is an input
parameter, which generates clusters of alike sizes.
2. PROCLUS samples the data and then selects a set of k-medoids and iteratively improves the
clustering.
3. PROCLUS is actually faster than CLIQUE due to the sampling of large data sets.
PROCLUS (Projected Clustering) (Contd…)
a) Initialization phase : Select a set of potential medoids that are far apart using a greedy algorithm.
b) Iteration phase :
i. Select a random set of k-medoids from this reduced data set to determine if clustering quality
improves by replacing current medoids with randomly chosen new medoids.
ii. Cluster quality is based on the average distance between instances and the nearest medoid.
iii. For each medoid, a set of dimensions is chosen whose average distances are small compared to
statistical expectation.
iv. Once the subspaces have been selected for each medoid, the average Manhattan segmental
distance is used to assign points to medoids, forming clusters.
PROCLUS (Projected Clustering) (Contd…)
c) Refinement phase :
i. Compute a new list of relevant dimensions for each medoid
based on the clusters formed and reassign points to medoids,
removing outliers.
ii. The distance-based approach of PROCLUS is biased toward
clusters that are hyper-spherical in shape.
Clustering For Streams and Parallelism
• Data stream clustering refers to the clustering of data that arrives continually
such as financial transactions, multimedia data, or telephonic records. It is usually
studied as a “Streaming Algorithm.” The purpose of Data Stream Clustering
is to construct a good clustering of the stream using a small amount of time and memory.
• Technically, Clustering is the act of grouping elements using sets. The main
purpose of this type of separation is to unite items that are similar to each other,
using the comparison of a series of characteristics of these. When we talk about
Data Stream, we can separate its methods into five categories, namely
partitioning, hierarchical, density-based, grid-based and model-based.
Clustering For Streams and Parallelism (Contd…)
• There is one more factor to take into account when talking about clusters. It
is possible to divide the possible distances into four: the minimum (or
single connection), maximum (or complete connection), mean distance and
the average. Each one has its own characteristics regarding the cost of
implementation and computational power, and the minimum distance
and mean distance are the more common choices in Data Stream Clustering.
Basic Subspace Clustering Approaches
1. Grid-based subspace clustering :
a) In this approach, data space is divided into axis-parallel cells. Then the
cells containing objects above a predefined threshold value given as a
parameter are merged to form subspace clusters. Number of intervals is
another input parameter which defines range of values in each grid.
b) Apriori property is used to prune non-promising cells and to improve
efficiency.
c) If a unit is found to be dense in k – 1 dimension, then it is considered for
finding dense unit in k dimensions.
d) If grid boundaries are strictly followed to separate objects, the accuracy of the
clustering result is decreased, as it may miss neighboring objects which get
separated by a strict grid boundary. Clustering quality is highly dependent
on the input parameters.
Basic Subspace Clustering Approaches (Contd…)
c) The size of the sliding window is one of the parameters. These algorithms
generate axis-parallel subspace clusters.
Basic Subspace Clustering Approaches (Contd…)
3. Density- based subspace clustering :
a) A density-based subspace clustering overcome drawbacks of grid-based
subspace clustering algorithms by not using grids.
b) A cluster is defined as a collection of objects forming a chain which fall
within a given distance and exceed predefined threshold of object count. Then
adjacent dense regions are merged to form bigger clusters.
c) As no grids are used, these algorithms can find arbitrarily shaped subspace
clusters.
d) Clusters are built by joining together the objects from adjacent dense regions.
e) These approaches are sensitive to the values of the distance parameters.
f) The effect of the curse of dimensionality is overcome in density-based algorithms
by utilizing a density measure that is adaptive to the subspace size.
Clustering High-Dimensional Data
• Most clustering methods are designed for clustering low-dimensional data and
encounter challenges when the dimensionality of the data grows really high
(say, over 10 dimensions, or even over thousands of dimensions for some
tasks). This is because when the dimensionality increases, usually only a small
number of dimensions are relevant to certain clusters, but data in the irrelevant
dimensions may produce much noise and mask the real clusters to be
discovered.
• To overcome this difficulty, we may consider using feature (or attribute)
transformation and feature (or attribute) selection methods. Transformation methods,
such as principal component analysis and singular value decomposition, transform the
data onto a smaller space while generally preserving the original relative
distance between objects. They summarise data by creating linear combinations
of the attributes and may discover hidden structures in the data.
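As a small illustration of feature transformation before clustering (random data; scikit-learn's PCA and KMeans are used only as example tools), high-dimensional points can be projected onto a few principal components and then clustered in the reduced space:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 50))   # hypothetical 50-dimensional data

# Project onto the first 3 principal components, then cluster in the reduced space.
X_reduced = PCA(n_components=3).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(X_reduced.shape, labels[:10])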
Clustering High-Dimensional Data (Contd…)
• Another way of tackling the curse of dimensionality is to try to remove some of the
dimensions. Attribute subset selection (or feature subset selection) is commonly used for data
reduction by removing irrelevant or redundant dimensions (or attributes).
• Given a set of attributes, attribute subset selection finds the subset of attributes that are most
relevant to the data mining task. It is most commonly performed by supervised learning: the
most relevant set of attributes is found with respect to the given class labels. It can also be
performed by an unsupervised process, such as entropy analysis, which is based on the
property that entropy tends to be low for data that contain tight clusters. Other evaluation
functions, such as category utility, may also be used.
• Subspace clustering is an extension to attribute subset selection that has shown its strength at
high-dimensional clustering. It is based on the observation that different subspaces may
contain different, meaningful clusters. Subspace clustering searches for groups of clusters
within different subspaces of the same dataset. The problem becomes how to find such
subspace clusters effectively and efficiently.
Frequent Pattern-Based Clustering Methods
• Frequent pattern mining, as the name implies, searches for patterns (such as sets
of items or objects) that occur frequently in large datasets.
• Frequent pattern mining can lead to the discovery of interesting associations and
correlations among data objects. The idea behind frequent pattern-based cluster
analysis is that the frequent patterns discovered may also indicate clusters.
• Frequent pattern-based cluster analysis is well suited to high-dimensional data. It
can be viewed as an extension of the dimension-growth subspace clustering
approach. However, the boundaries of different dimensions are not obvious,
since here they are represented by sets of frequent itemsets. That is, rather than
growing the clusters dimension by dimension, we grow sets of frequent itemsets,
which eventually lead to cluster descriptions.
• Typical examples of frequent pattern-based cluster analysis include the
clustering of text documents that contain thousands of distinct keywords, and the
analysis of microarray data that contain tens of thousands of measured values or
'features'.
Frequent Term-Based Text Clustering
• In frequent term-based text clustering, text documents are clustered based on
the frequent terms they contain. Using the vocabulary of text document
analysis, a term is any sequence of characters separated from other terms by a
delimiter. A term can be made up of a single word or several words.
• In general, we first remove nontext information (such as HTML tags and
punctuation) and stop words. Terms are then extracted. A stemming algorithm
is then applied to reduce each term to its basic stem. In this way, each
document can be represented as a set of terms. Each set is typically large.
Collectively, a large set of documents will contain a very large set of distinct
terms. If we treat each term as a dimension, the dimension space will be of very
high dimensionality.
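A rough sketch of this preprocessing is shown below; the documents, the stop-word list, and the crude suffix-stripping stemmer are all made up for illustration (a real system would use a proper stemming algorithm). Each document becomes a set of stemmed terms, after which frequent term sets can be mined exactly like frequent itemsets:

import re

docs = [
    "Clustering groups similar documents together",
    "Hierarchical clustering builds a tree of clusters",
    "Frequent terms help cluster text documents",
]
stop_words = {"a", "of", "the", "together", "help"}

def stem(word):
    # Very crude stand-in for a real stemming algorithm.
    for suffix in ("ing", "ical", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def terms(doc):
    words = re.findall(r"[a-z]+", doc.lower())
    return {stem(w) for w in words if w not in stop_words}

term_sets = [terms(d) for d in docs]
print(term_sets)    # each document is now a 'basket' of terms for frequent itemset mining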
Clustering in Non-Euclidean Spaces
• Now we discuss an algorithm that handles non-main-memory data but does not
require a Euclidean space. The algorithm, which we shall refer to as GRGPF for
its authors (V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French),
takes ideas from both hierarchical and point-assignment approaches. Like
CURE, it represents clusters by sample points in main memory.
• However, it also tries to organise the clusters hierarchically, in a tree, so a new
point can be assigned to the appropriate cluster by passing it down the tree.
Leaves of the tree hold summaries of some clusters, and interior nodes hold
subsets of the information describing the clusters reachable through that node.
An attempt is made to group clusters by their distance from one another, so the
clusters at a leaf are close, and the clusters reachable from one interior node are
relatively close as well.
Clustering in Non-Euclidean Spaces (Contd…)
• As we assign points to clusters, the clusters can grow large. Most of the points
in a cluster are stored on disk and are not used in guiding the assignment of
points, although they can be retrieved. The representation of a cluster in main
memory consists of several features. Before listing these features, if p is any
point in a cluster, let ROWSUM(p) be the sum of the squares of the distances
from p to each of the other points in the cluster. Note that, although we are not
in a Euclidean space, there is some distance measure d that applies to points, or
else it is not possible to cluster points at all.
Clustering in Non-Euclidean Spaces (Contd…)
The following features form the representation of a cluster:
1. N, the number of points in the cluster.
2. The clustroid of the cluster, which is defined specifically to be the point in the
cluster that minimises the sum of the squares of the distances to the other points; that
is, the clustroid is the point in the cluster with the smallest ROWSUM.
3. The rowsum of the clustroid of the cluster.
4. For some chosen constant k, the k points of the cluster that are closest to the
clustroid, and their rowsums. These points are part of the representation in case the
addition of points to the cluster causes the clustroid to change. The assumption is
made that the new clustroid would be one of these k points near the old clustroid.
5. The k points of the cluster that are furthest from the clustroid and their rowsums.
These points are part of the representation so that we can consider whether two
clusters are close enough to merge. The assumption is made that if two clusters are
close, then a pair of points distant from their respective clustroids would be close.
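A small sketch of these quantities follows; the cluster of strings and the distance function are hypothetical, and any other non-Euclidean distance measure could be substituted:

def rowsum(p, cluster, dist):
    # ROWSUM(p): sum of squared distances from p to every other point in the cluster.
    return sum(dist(p, q) ** 2 for q in cluster if q != p)

def clustroid(cluster, dist):
    # The clustroid is the point in the cluster with the smallest ROWSUM.
    return min(cluster, key=lambda p: rowsum(p, cluster, dist))

# A simple non-Euclidean distance on equal-length strings:
# the number of positions at which the two strings differ.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

cluster = ["abcde", "abcdf", "abxde", "zzzde"]
c = clustroid(cluster, hamming)
print(c, rowsum(c, cluster, hamming))    # "abcde" with ROWSUM 11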