
Data Analytics (BCS-052)

Unit 4
Frequent Itemsets and Clustering
Syllabus

Frequent Itemsets and Clustering: Mining frequent itemsets, market-based modelling, Apriori algorithm, handling large data sets in main memory, limited pass algorithm, counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern-based clustering methods, clustering in non-Euclidean space, clustering for streams and parallelism.
Mining Frequent Itemsets
• Mining frequent itemsets is a fundamental technique in data mining, used to
identify patterns, correlations, or associations among large sets of data items.

• It is a key step in association rule learning, particularly in applications like


market basket analysis, where it helps to discover items frequently bought
together.

• The main objective is to find itemsets (combinations of items) that appear


frequently in a dataset, meeting a user-defined threshold for "support" (the
minimum number of times an itemset must appear to be considered frequent).
Itemset

• A set of items together is called an itemset. If an itemset contains k items, it is called a k-itemset. An itemset typically consists of two or more items.

• An itemset that occurs frequently is called a frequent itemset. Thus, frequent itemset mining is a data mining technique to identify the items that often occur together.

• For Example, bread and butter, laptop and an antivirus software, etc.
Frequently used Itemset

• A set of items is called frequent if it satisfies a minimum threshold value for support and confidence. Support indicates transactions in which the items are purchased together in a single transaction. Confidence indicates transactions in which the items are purchased one after the other.

• Frequent itemsets are those itemsets whose support is greater than the threshold value or user-specified minimum support. It also means that if {A, B} is a frequent itemset, then A and B individually must also be frequent itemsets.
Frequently used Itemset (Contd…)

• Suppose there are two transactions: A = { 1, 2, 3, 4, 5}, and B = { 2, 3,7}, in


these two transactions, 2 and 3 are the frequent itemsets.

• For the frequent itemset mining method, consider only those transactions that meet the minimum threshold support and confidence requirements. Insights from these mining algorithms offer a lot of benefits, including cost-cutting and improved competitive advantage.
Frequent Pattern Mining (FPM)
• The frequent pattern mining algorithm is one of the most important techniques of data mining to discover relationships between different items in a dataset. These relationships are represented in the form of association rules. It helps to find the regularities in data.
• FPM has many applications in the fields of data analysis, software bugs, cross-
marketing, sale campaign analysis, market basket analysis, etc.
• Frequent itemsets discovered through Apriori have many applications in data
mining tasks. Tasks such as finding interesting patterns in the database, finding
out sequences and mining of association rules – are the most important among
them.
• Association rules apply to supermarket transaction data, that is, to examine
the customer behaviour in terms of the purchased products. Association rules
describe how often the items are purchased together.
Frequent Item set in Data set (Association Rule Mining)

• Frequent item sets are a fundamental concept in association rule mining, which is a technique used in data mining to discover relationships between items in a dataset. The goal of association rule mining is to identify relationships between items in a dataset that occur frequently together.
• A frequent item set is a set of items that occur together frequently in a
dataset. The frequency of an item set is measured by the support count, which
is the number of transactions or records in the dataset that contain the item set.
For example, if a dataset contains 100 transactions and the item set {milk,
bread} appears in 20 of those transactions, the support count for {milk, bread}
is 20.
Frequent Item set in Data set (Association Rule Mining) (Contd…)

• Association rule mining algorithms, such as Apriori or FP-Growth, are used


to find frequent item sets and generate association rules.

• Frequent item sets and association rules can be used for a variety of tasks
such as market basket analysis, cross-selling and recommendation systems.

• Association Rule Mining Consists of the following two steps:

1. Find all the frequent itemsets.

2. Generate association rules from the above frequent itemsets.


Need of Association Mining

• Frequent itemset mining enables the generation of association rules from a transactional dataset. If two items X and Y are frequently purchased together, then it is good to place them together in stores or to provide a discount offer on one item on purchase of the other item. This can really increase sales. For example, it is likely that if a customer buys milk and bread, he/she also buys butter. So the association rule is {milk, bread} => {butter}, and the seller can suggest that the customer buy butter if he/she buys milk and bread.
Important Definitions

• Support : It is one of the measures of interestingness. This tells about the


usefulness and certainty of rules. 5% Support means total 5% of transactions
in the database follow the rule.

Support(A -> B) = Support_count(A ∪ B)

• Confidence: A confidence of 60% means that 60% of the customers who


purchased a milk and bread also bought butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)


Important Definitions (Contd…)

• Support_count(X): Number of transactions in which X appears. If X is A union B,


then it is the number of transactions in which A and B both are present.

• Maximal Itemset: An itemset is maximal frequent if none of its supersets are


frequent.

• Closed Itemset: An itemset is closed if none of its immediate supersets have the same support count as the itemset.

• K- Itemset: Itemset which contains K items is a K-itemset. So, it can be said that an
itemset is frequent if the corresponding support count is greater than the minimum
support count.
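To make these definitions concrete, here is a minimal Python sketch; the toy transactions and the rule {milk, bread} -> {butter} are made up for illustration and are not part of the slides:

    # Toy transactions (hypothetical, for illustration only)
    transactions = [
        {"milk", "bread", "butter"},
        {"milk", "bread"},
        {"bread", "butter"},
        {"milk", "butter"},
        {"milk", "bread", "butter"},
    ]

    def support_count(itemset):
        # number of transactions that contain every item of the itemset
        return sum(1 for t in transactions if itemset <= t)

    A, B = {"milk", "bread"}, {"butter"}
    sup_AB = support_count(A | B)                 # Support_count(A U B)
    support = sup_AB / len(transactions)          # fraction of transactions following the rule
    confidence = sup_AB / support_count(A)        # Support_count(A U B) / Support_count(A)
    print(support, confidence)                    # 0.4 and about 0.67 for this toy data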
Market-Based Modeling

• A data mining technique that is used to uncover purchase patterns in any


retail setting is known as Market Basket Analysis. Basically, market basket
analysis in data mining involves analyzing the combinations of products that
are bought together.

• This is a technique that gives a careful study of purchases made by a customer in a supermarket. It identifies the pattern of items frequently purchased together by customers. This analysis can help companies promote deals, offers, and sales, and data mining techniques help to achieve this analysis task.
Market Basket Analysis (Contd…)

• Market basket analysis mainly works with the ASSOCIATION RULE


MINING.

• For example: IF a customer buys bread, THEN he is likely to buy butter as


well. Association rules are usually represented as: {Bread} -> {Butter}.

• Examples of Market Basket Analysis: Retail, Telecom, Medicine.


Working of Market Basket Analysis

• In simple terms market basket analysis in data mining and data analysis is to
analyse the combination of products that have been bought together. This is a
technique that gives a careful study of purchases made by a customer in a
supermarket. This concept identifies the pattern of frequent purchase of items
by customers. This analysis can help to promote deals, offers, and sales by the
companies and data mining techniques help to achieve this task.

• Example: Data mining concepts are in use for sales and marketing to provide
better customer service, to improve cross-selling opportunities, to increase
direct mail response rates.
Terminologies used with Market-Based Modeling
Market basket analysis mainly works
with the ASSOCIATION RULE {IF}
-> {THEN}.
• IF means Antecedent: An
antecedent is an item found within
the data.
• THEN means Consequent: A
consequent is an item found in
combination with the antecedent.
Types of Market Basket Analysis

There are three types of Market Basket Analysis. They are as follow:

1. Descriptive market basket analysis: This type only derives insights from past data and is the most frequently used approach. This kind of study is mostly used to understand consumer behavior, including which products are purchased in combination and what the most typical item combinations are. Retailers can place products in their stores more profitably by understanding which products are frequently bought together with the aid of descriptive market basket analysis. This type of modelling is known as unsupervised learning.
Types of Market Basket Analysis (Contd…)

2. Predictive Market Basket Analysis: Market basket analysis that predicts


future purchases based on past purchasing patterns is known as predictive
market basket analysis. Large volumes of data are analyzed using machine
learning algorithms in this type of analysis in order to create predictions about
which products are most likely to be bought together in the future. Retailers
may make data-driven decisions about which products to carry, how to price
them, and how to optimize shop layouts with the use of predictive market
basket research. This type uses supervised learning models like classification
and regression.
Types of Market Basket Analysis (Contd…)

3. Differential Market Basket Analysis: This type of analysis is beneficial for


competitor analysis. It compares purchase history between stores, between
seasons, between two time periods, between different days of the week, etc., to
find interesting patterns in consumer behaviour. For example, it can help determine why some users prefer to purchase the same product at the same price on Amazon vs. Flipkart. The answer could be that the Amazon reseller has more warehouses and can deliver faster, or maybe something more profound like user experience.
Benefits of Market Basket Analysis

1. Enhanced Customer Understanding: Market basket research offers insights


into customer behavior, including what products they buy together and which
products they buy the most frequently. Retailers can use this information to
better understand their customers and make informed decisions.

2. Improved Inventory Management: By examining market basket data,


retailers can determine which products are sluggish sellers and which ones are
commonly bought together. Retailers can use this information to make well-
informed choices about what products to stock and how to manage their
inventory most effectively.
Benefits of Market Basket Analysis (Contd…)

3. Better Pricing Strategies: A better understanding of the connection between


product prices and consumer behavior might help merchants develop better
pricing strategies. Using this knowledge, pricing plans that boost sales and
profitability can be created.

4. Sales Growth: Market basket analysis can assist businesses in determining


which products are most frequently bought together and where they should be
positioned in the store to grow sales. Retailers may boost revenue and enhance
customer shopping experiences by improving store layouts and product
positioning.
Applications of Market Basket Analysis

1. Retail: Market basket research is frequently used in the retail sector to examine
consumer buying patterns and inform decisions about product placement,
inventory management, and pricing tactics. Retailers can utilize market basket
research to identify which items are sluggish sellers and which ones are commonly
bought together and then modify their inventory management strategy accordingly.

2. E-commerce: Market basket analysis can help online merchants better understand
the customer buying habits and make data-driven decisions about product
recommendations and targeted advertising campaigns. The behaviour of visitors to
a website can be examined using market basket analysis to pinpoint problem areas.
Applications of Market Basket Analysis (Contd…)
3. Finance: Market basket analysis can be used to evaluate investor behaviour and forecast the
types of investment items that investors will likely buy in the future. The performance of
investment portfolios can be enhanced by using this information to create tailored investment
strategies.
4. Telecommunications: To evaluate consumer behaviour and make data-driven decisions about
which goods and services to provide, the telecommunications business might employ market
basket analysis. The usage of this data can enhance client happiness and the shopping
experience.
5. Manufacturing: To evaluate consumer behaviour and make data-driven decisions about
which products to produce and which materials to employ in the production process, the
manufacturing sector might use market basket analysis. Utilizing this knowledge will increase
effectiveness and cut costs.
Apriori Algorithm

• The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on the databases that contain transactions. With the help of these
association rule, it determines how strongly or how weakly two objects are connected.
This algorithm uses a breadth-first search and Hash Tree to calculate the itemset
associations efficiently. It is the iterative process for finding the frequent itemsets from
the large dataset.

• This algorithm was given by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to find those products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.
Steps for Apriori Algorithm

Below are the steps for the apriori algorithm:

• Step-1: Determine the support of itemsets in the transactional database, and


select the minimum support and confidence.

• Step-2: Take all the itemsets in the transactional database with a support value higher than the minimum or selected support value.

• Step-3: Find all the rules of these subsets that have higher confidence value than
the threshold or minimum confidence.

• Step-4: Sort the rules in decreasing order of lift.
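These steps can be reproduced end to end with the third-party mlxtend library, assuming it is installed; the transaction list below is a made-up placeholder, not the dataset of the worked example that follows:

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # Hypothetical baskets; replace with the real transactional database
    transactions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C", "D"]]

    # One-hot encode the baskets into a boolean item matrix
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

    # Steps 1-2: frequent itemsets above the chosen minimum support
    frequent = apriori(onehot, min_support=0.4, use_colnames=True)

    # Steps 3-4: rules above the minimum confidence, sorted by decreasing lift
    rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
    print(rules.sort_values("lift", ascending=False))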


Apriori Algorithm Working

We will understand the apriori


algorithm using an example
and mathematical calculation:
• Example: Suppose we have
the following dataset that has
various transactions, and
from this dataset, we need to
find the frequent itemsets and
generate the association rules
using the Apriori algorithm:
Apriori Algorithm Working (Contd…)
Solution:
• Step-1: Calculating C1 and L1:

In the first step, we will create a table that contains support count (The
frequency of each itemset individually in the dataset) of each itemset in the
given dataset. This table is called the Candidate set or C1.
Apriori Algorithm Working (Contd…)

o Now, we will take out all the itemsets that have a support count greater than the Minimum Support (2). It will give us the table for the frequent itemset L1. Since all the itemsets except E have a support count greater than or equal to the minimum support, the itemset {E} will be removed.
Apriori Algorithm Working (Contd…)

• Step-2: Candidate Generation C2, and L2:


o In this step, we will generate C2 with the help
of L1. In C2, we will create the pair of the
itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find
the support count from the main transaction
table of datasets, i.e., how many times these
pairs have occurred together in the given
dataset. So, we will get the below table for C2:
Apriori Algorithm Working (Contd…)

o Again, we need to compare the C2 Support count with the minimum


support count, and after comparing, the itemset with less support count
will be eliminated from the table C2. It will give us the below table for L2
Apriori Algorithm Working (Contd…)
• Step-3: Candidate generation C3, and L3:
o For C3, we will repeat the same two processes, but now we will form the
C3 table with subsets of three itemsets together, and will calculate the
support count from the dataset. It will give the below table:
Apriori Algorithm Working (Contd…)

o Now we will create the L3 table. As we can see from the above C3 table, there is only one

combination of itemset that has support count equal to the minimum support count. So, the L3 will

have only one combination, i.e., {A, B, C}.

• Step-4: Finding the association rules for the subsets:


• To generate the association rules, first, we will create a new table with the possible rules from the occurred combination {A, B, C}. For all the rules, we will calculate the Confidence using the formula sup(A^B)/sup(A). After calculating the confidence value for all rules, we will exclude the rules that have less confidence than the minimum threshold (50%).

• Consider the below table:


Apriori Algorithm Working (Contd…)
Rules        Support   Confidence

A^B → C      2         sup{(A^B)^C}/sup(A^B) = 2/4 = 0.5 = 50%

B^C → A      2         sup{(B^C)^A}/sup(B^C) = 2/4 = 0.5 = 50%

A^C → B      2         sup{(A^C)^B}/sup(A^C) = 2/4 = 0.5 = 50%

C → A^B      2         sup{C^(A^B)}/sup(C) = 2/5 = 0.4 = 40%

A → B^C      2         sup{A^(B^C)}/sup(A) = 2/6 = 0.33 = 33.33%

B → A^C      2         sup{B^(A^C)}/sup(B) = 2/7 = 0.29 = 28.57%

• As the given threshold or minimum confidence is 50%, so the first three rules
A ^B → C, B^C → A, and A^C → B can be considered as the strong
association rules for the given problem.
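The confidence values in this table can be checked with a few lines of Python, using the support counts that appear in the denominators above:

    # Support counts taken from the worked example above
    sup = {"A": 6, "B": 7, "C": 5, "AB": 4, "AC": 4, "BC": 4, "ABC": 2}

    confidences = {
        "A^B -> C": sup["ABC"] / sup["AB"],   # 2/4 = 0.50
        "B^C -> A": sup["ABC"] / sup["BC"],   # 2/4 = 0.50
        "A^C -> B": sup["ABC"] / sup["AC"],   # 2/4 = 0.50
        "C -> A^B": sup["ABC"] / sup["C"],    # 2/5 = 0.40
        "A -> B^C": sup["ABC"] / sup["A"],    # 2/6 = 0.33
        "B -> A^C": sup["ABC"] / sup["B"],    # 2/7 = 0.29
    }

    strong = {rule: c for rule, c in confidences.items() if c >= 0.5}   # min_conf = 50%
    print(strong)   # only the three 50% rules survive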
Methods to Improve Apriori Efficiency

Many methods are available for improving the efficiency of the algorithm as
given below:

1. Hash-Based Technique: This method uses a hash-based structure called a


hash table for generating the k-itemsets and its corresponding count. It uses a
hash function for generating the table.

2. Transaction Reduction: This method reduces the number of transactions scanned in subsequent iterations. The transactions which do not contain any frequent items are marked or removed.
Methods to Improve Apriori Efficiency (Contd…)

3. Partitioning : This method requires only two database scans to mine the
frequent itemsets. It says that for any itemset to be potentially frequent in the
database, it should be frequent in at least one of the partitions of the database.

4. Sampling: This method picks a random sample S from database D and then
searches for frequent itemset in S. It may be possible to lose a global frequent
itemset. This can be reduced by lowering the min_sup.

5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the database during the scanning of the database.
Applications of Apriori Algorithm

1. In Education Field: Extracting association rules in data mining of admitted


students through characteristics and specialties.

2. In Medical Field: For example Analysis of the patient’s database.

3. In Forestry: Analysis of probability and intensity of forest fire with the


forest fire data.

4. Apriori is used by many companies like Amazon in the Recommender


System and by Google for the auto-complete feature.
Advantages of Apriori Algorithm

o Easy to understand algorithm.


o The join and prune steps of the algorithm can be easily implemented on
large datasets.

Disadvantages of Apriori Algorithm

• It requires high computation if the itemsets are very large and the minimum
support is kept very low.

• The entire database needs to be scanned.


Generating association rules from frequent itemsets

1. Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong association
rules satisfy both minimum support and minimum confidence).

2. This can be done using equation (1) for confidence, which is shown here for
completeness :

Confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)        ...(1)

3. The conditional probability is expressed in terms of itemset support count, where


support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
Generating association rules from frequent itemsets (Contd…)

4. Based on equation (1), association rules can be generated as follows :


a. For each frequent itemset l, generate all non-empty subsets of l.
b. For every non-empty subset s of l, output the rule s ⇒ (l - s) if support_count(l)/support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

5. Because the rules are generated from frequent itemsets, each one automatically
satisfies the minimum support.

6. Frequent itemsets can be stored ahead of time in hash tables along with their counts so
that they can be accessed quickly.
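A minimal sketch of this rule-generation procedure, assuming the frequent itemsets and their support counts are already stored in a dictionary keyed by frozenset (the counts below reuse the earlier worked example):

    from itertools import combinations

    def generate_rules(freq_counts, min_conf):
        # freq_counts maps frozenset itemsets to their support counts
        rules = []
        for l, count_l in freq_counts.items():
            if len(l) < 2:
                continue
            for r in range(1, len(l)):                       # all non-empty proper subsets s of l
                for s in combinations(l, r):
                    s = frozenset(s)
                    conf = count_l / freq_counts[s]          # support_count(l) / support_count(s)
                    if conf >= min_conf:
                        rules.append((set(s), set(l - s), conf))   # rule: s => (l - s)
        return rules

    # Hypothetical counts, e.g. from the Apriori worked example
    freq_counts = {frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 5,
                   frozenset("AB"): 4, frozenset("AC"): 4, frozenset("BC"): 4,
                   frozenset("ABC"): 2}
    print(generate_rules(freq_counts, min_conf=0.5))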
Applications of Frequent Itemset Analysis

• Related concepts :

1. Let items be words, and let baskets be documents (e.g., Web pages, blogs, tweets).

2. A basket/document contains those items/words that are present in the document.

3. If we look for sets of words that appear together in many documents, the sets will
be dominated by the most common words (stop words).

4. If the documents contain many stop words such as “and” and “a”, these words will dominate the frequent itemsets.
5. However, if we ignore all the most common words, then we would hope to find
among the frequent pairs some pairs of words that represent a joint concept.
Applications of Frequent Itemset Analysis (Contd…)

• Plagiarism :

1. Let the items be documents and the baskets be sentences.

2. An item is in a basket if the sentence is in the document.

3. This arrangement appears backwards, and we should remember that the


relationship between items and baskets is an arbitrary many-many relationship.

4. In this application, we look for pairs of items that appear together in several baskets.

5. If we find such a pair, then we have two documents that share several sentences in
common.
Applications of Frequent Itemset Analysis (Contd…)

• Biomarkers :

1. Let the items be of two types such as genes or blood proteins, and diseases.

2. Each basket is the set of data about a patient: their genome and blood-
chemistry analysis, as well as their medical history of disease.

3. A frequent itemset that consists of one disease and one or more biomarkers suggests a test for the disease.
Handling Large Dataset in Main Memory

• When we refer to large data in this chapter, we mean data that cause problems to work
with in terms of memory or speed but can still be handled by a single computer. We
start this chapter with an overview of the problems you face when handling large
datasets.

• Then we offer three types of solutions to overcome these problems: adapt your algorithms, choose the right data structures, and pick the right tools. Data scientists aren't the only ones who must deal with large data volumes, so you can apply general best practices to tackle the large data problem. Finally, we apply this knowledge to two case studies. The first case shows you how to detect malicious URLs, and the second case demonstrates how to build a recommender engine inside a database.
The Problems You Face when Handling Large Data
• A large volume of data poses new
challenges, such as overloaded memory
and algorithms that never stop running.
It forces you to adapt and expand your
repertoire of techniques. But even
when you can perform your analysis,
you should take care of issues such as
I/O (input/output) and CPU starvation,
because these can cause speed issues.

Figure: Overview of Problems Encountered when Working with More Data than Can Fit in Memory
General Techniques for Handling Large Volumes of Data

• Never-ending algorithms, out-of-


memory errors, and speed issues are
the most common challenges you face
when working with large data. In this
section, we’ll investigate solutions to
overcome or alleviate these problems.
• The solutions can be divided into three categories: using the correct algorithms, choosing the right data structures, and using the right tools.

Figure: Overview of Solutions for Handling Large Datasets
Choosing the Right Algorithm

• Choosing the right algorithm can solve


more problems than adding more or better
hardware. An algorithm that’s well suited
for handling large data doesn’t need to load
the entire data set into memory to make
predictions. Ideally, the algorithm also
supports parallelized calculations. In this
section we’ll dig into three types of
algorithms that can do that: online
algorithms, block algorithms, and
MapReduce algorithms, as shown in figure.
Figure: Overview of Techniques to Adapt Algorithms to Large Dataset
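As a small illustration of the online-algorithm idea, scikit-learn's SGDClassifier (assumed to be installed) can be updated chunk by chunk with partial_fit, so only one chunk has to sit in memory at a time; the data here is randomly generated for illustration:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()            # a linear classifier that supports incremental (online) learning
    classes = np.array([0, 1])

    rng = np.random.default_rng(0)
    for _ in range(10):                # pretend each iteration is one chunk read from disk
        X_chunk = rng.normal(size=(1000, 5))
        y_chunk = (X_chunk[:, 0] > 0).astype(int)
        model.partial_fit(X_chunk, y_chunk, classes=classes)   # update without reloading earlier chunks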
Choosing the Right Data Structure

• Algorithms can make or break your program, but the way you store your data is of equal importance. Data structures have different storage requirements, but also influence the performance of CRUD (create, read, update, and delete) and other operations on the dataset. The figure shows that you have many different data structures to choose from, three of which will be discussed here: sparse data, tree data, and hash data.
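As a quick illustration of the sparse-data idea, SciPy (assumed to be installed) can store a huge, mostly-empty matrix by keeping only its non-zero entries:

    from scipy import sparse

    # A 100,000 x 100,000 item/transaction-style matrix with only three non-zero entries.
    # Stored densely this would need roughly 80 GB; as a CSR matrix only the non-zeros are kept.
    rows, cols, vals = [0, 42, 99_999], [7, 42, 0], [1.0, 1.0, 1.0]
    m = sparse.csr_matrix((vals, (rows, cols)), shape=(100_000, 100_000))
    print(m.nnz, m.data.nbytes)        # 3 stored values, a few dozen bytes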
Choosing the Right Data Structure (Contd…)
Selecting the Right Tools

• With the right class of


algorithms and data
structures in place, it’s time
to choose the right tool for
the job. The right tool can
be a Python library or at
least a tool that’s
controlled from Python, as
shown in figure. The
number of helpful tools
available is enormous, so
we’ll look at only handful
of them.
Ways to Handle Large Data Files for Machine Learning

1. Allocate More Memory: Some machine learning tools or libraries may be


limited by a default memory configuration. Check if you can re-configure your
tool or library to allocate more memory.

2. Work with a Smaller Sample: Take a random sample of your data, such as 1,000 or 100,000 rows. Use this smaller sample to work through your problem before fitting a final model on all of your data.

3. Change the Data Format: Is your data stored in raw ASCII text, like a CSV file?
Perhaps you can speed up data loading and use less memory by using another data
format. A good example is a binary format like GRIB, NetCDF, or HDF.
Ways to Handle Large Data Files for Machine Learning (Contd…)

4. Stream Data or Use Progressive Loading: Does all the data need to be in
memory at the same time? Perhaps you can use code or a library to stream or
progressively load data as-needed into memory for training.

5. Use a Relational Database: Relational databases provide a standard way of


storing and accessing very large datasets.

6. Use a Big Data Platform: In some cases, you may need to resort to a big data
platform.
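As a sketch of point 4 above (streaming / progressive loading), pandas (assumed installed) can read a CSV file in chunks so the whole file never has to fit in memory; the file name and column name are hypothetical:

    import pandas as pd

    totals = None
    # Read the (hypothetical) file one million rows at a time instead of all at once
    for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
        counts = chunk["item"].value_counts()              # partial result for this chunk
        totals = counts if totals is None else totals.add(counts, fill_value=0)

    print(totals.sort_values(ascending=False).head())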
Limited-Pass Algorithms

• The algorithms for frequent itemsets discussed so far use one pass for each
size of itemset we investigate. If main memory is too small to hold the
data and the space needed to count frequent itemsets of one size, there
does not seem to be any way to avoid k passes to compute the exact
collection of frequent itemsets. However, there are many applications
where it is not essential to discover every frequent itemset.

• For instance, if we are looking for items purchased together at a supermarket,


we are not going to run a sale based on every frequent itemset we find, so it
is quite sufficient to find most but not all the frequent itemsets.
Limited Pass Algorithm (Contd…)

• In this section we explore some algorithms that have been proposed to find all or
most frequent itemsets using at most two passes. We begin with the obvious
approach of using a sample of the data rather than the entire dataset. An
algorithm called SON uses two passes, gets the exact answer, and lends itself
to implementation by map-reduce or another parallel computing regime.
Finally, Toivonen’s Algorithm uses two passes on average, gets an exact
answer, but may, rarely, not terminate in any given amount of time.
Simple and Randomized Algorithm

1. In simple and randomized algorithm, we pick a random subset of the baskets and
pretend it is the entire dataset instead of using the entire file of baskets.

2. We must adjust the support threshold to reflect the smaller number of baskets.

3. For instance, if the support threshold for the full dataset is s, and we choose a sample
of 1% of the baskets, then we should examine the sample for itemsets that appear in at
least s/100 of the baskets.

4. The best way to pick the sample is to read the entire dataset, and for each basket,
select that basket for the sample with some fixed probability p.

5. Suppose there are m baskets in the entire file. At the end, we shall have a sample whose size is very close to pm baskets.
Simple and Randomized Algorithm (Contd…)

6. However, if the baskets appear in random order in the file already, then we do not even
have to read the entire file.

7. We can select the first pm baskets for our sample. Or, if the file is part of a distributed
file system, we can pick some chunks at random to serve as the sample.

8. Having selected our sample of the baskets, we use part of main memory to store these
baskets.

9. Remaining main memory is used to execute one of the algorithms such as A-Priori or
PCY. However, the algorithm must run passes over the main-memory sample for each
itemset size, until we find a size with no frequent items.
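A minimal sketch of this sampling step, with a hypothetical basket file and illustrative values of p and s:

    import random

    p = 0.01                 # keep roughly 1% of the baskets
    s = 10_000               # support threshold for the full dataset
    sample = []

    # One pass over the basket file (hypothetical, one basket per line); keep each basket with probability p
    with open("baskets.txt") as f:
        for line in f:
            if random.random() < p:
                sample.append(set(line.split()))

    # Now run A-Priori (or PCY) on `sample` in main memory, with the scaled-down threshold p*s
    sample_threshold = p * s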
Savasere, Omiecinski, and Navathe (SON) algorithm to find all or most frequent itemsets using at most two passes

1. The idea is to divide the input file into chunks.

2. Treat each chunk as a sample and run the simple and randomized algorithm on that chunk.

3. We use ps as the threshold, if each chunk is fraction p of the whole file, and s is the
support threshold.

4. Store on disk all the frequent itemsets found for each chunk.

5. Once all the chunks have been processed in that way, take the union of all the itemsets
that have been found frequent for one or more chunks. These are the candidate itemsets.
SON algorithm to find all or most frequent itemsets using at most
two passes (Contd…)

6. If an itemset is not frequent in any chunk, then its support is less than ps in each
chunk. Since the number of chunks is 1/p, we conclude that the total support for that
itemset is less than (1/p)ps = s.

7. Thus, every itemset that is frequent in the whole is frequent in at least one chunk,
and we can be sure that all the truly frequent itemsets are among the candidates; i.e.,
there are no false negatives. We have made a total of one pass through the data as we
read each chunk and processed it.

8. In a second pass, we count all the candidate itemsets and select those that have
support at least s as the frequent itemsets.
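A compact sketch of the two SON passes, assuming the baskets fit in a Python list and, for brevity, restricting the in-memory step to itemsets of size one and two:

    from itertools import combinations

    def frequent_in_chunk(chunk, threshold):
        # stand-in for running the in-memory algorithm (e.g. A-Priori) on one chunk;
        # only singletons and pairs are counted to keep the sketch short
        counts = {}
        for basket in chunk:
            for k in (1, 2):
                for items in combinations(sorted(basket), k):
                    counts[items] = counts.get(items, 0) + 1
        return {i for i, c in counts.items() if c >= threshold}

    def son(baskets, s, num_chunks):
        p = 1.0 / num_chunks
        chunks = [baskets[i::num_chunks] for i in range(num_chunks)]

        # Pass 1: locally frequent itemsets per chunk with threshold ps; their union is the candidate set
        candidates = set()
        for chunk in chunks:
            candidates |= frequent_in_chunk(chunk, p * s)

        # Pass 2: count every candidate over the whole file and keep those with support at least s
        totals = dict.fromkeys(candidates, 0)
        for basket in baskets:
            for c in candidates:
                if set(c) <= basket:
                    totals[c] += 1
        return {c for c, total in totals.items() if total >= s}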
SON Algorithm and MapReduce

1. The SON algorithm works well in a parallel-computing environment.

2. Each of the chunks can be processed in parallel, and the frequent itemsets
from each chunk combined to form the candidates.

3. We can distribute the candidates to many processors, have each processor


count the support for each candidate in a subset of the baskets, and finally sum
those supports to get the support for each candidate itemset in the whole dataset.

4. This process does not have to be implemented in map-reduce, but there is a


natural way of expressing each of the two passes as a MapReduce operation.
SON Algorithm and MapReduce (Contd…)

This map-reduce sequence is summarised below:


• First Map function :
a. Take the assigned subset of the baskets and find the itemsets frequent in the subset using
the simple and randomized algorithm.
b. Lower the support threshold from s to ps if each Map task gets fraction p of the total input
file.
c. The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample.
• First Reduce Function :
a. Each Reduce task is assigned a set of keys, which are itemsets.
b. The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first Reduce function is the candidate itemsets.
SON Algorithm and MapReduce (Contd…)

• Second Map function :


a. The Map tasks for the second Map function take all the output from the first Reduce Function
(the candidate itemsets) and a portion of the input data file.
b. Each Map task counts the number of occurrences of each of the candidate itemsets among the
baskets in the portion of the dataset that it was assigned.
c. The output is a set of key-value pairs (C, v), where C is one of the candidate sets and v is the
support for that itemset among the baskets that were input to this Map task.

• Second Reduce function :


a. The Reduce tasks take the itemsets they are given as keys and sum the associated values.
b. The result is the total support for each of the itemsets that the Reduce task was assigned to handle.
c. Those itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce
task outputs these itemsets with their counts.
d. Itemsets that do not have total support at least s are not transmitted to the output of the Reduce
task.
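The same two rounds can be written schematically as plain Python functions, reusing the frequent_in_chunk helper from the previous sketch; this only outlines the data flow and is not tied to a real MapReduce framework:

    def map1(basket_subset, p, s):
        # find locally frequent itemsets in this subset, using the lowered threshold ps
        return [(F, 1) for F in frequent_in_chunk(basket_subset, p * s)]

    def reduce1(key_value_pairs):
        # the candidate itemsets are simply the distinct keys; the values are ignored
        return {F for F, _ in key_value_pairs}

    def map2(basket_subset, candidates):
        # count each candidate itemset in the portion of the data assigned to this task
        return [(C, sum(1 for b in basket_subset if set(C) <= b)) for C in candidates]

    def reduce2(pairs, s):
        # sum the partial counts and keep the itemsets whose total support is at least s
        totals = {}
        for C, v in pairs:
            totals[C] = totals.get(C, 0) + v
        return {C: v for C, v in totals.items() if v >= s}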
Toivonen’s Algorithm

• Toivonen’s algorithm, given sufficient main memory, will use one pass over a small sample and one full pass over the data. It will give neither false negatives nor false positives, but there is a small but finite probability that it will fail to produce any answer at all. In that case it needs to be repeated until it gives an answer. However, the average number of passes needed before it produces all and only the frequent itemsets is a small constant.

• Toivonen’s algorithm begins by selecting a small sample of the input dataset and
finding from it the candidate frequent itemsets.
Counting Frequent Items in a Stream

• The simplest approach to maintaining a current estimate of the frequent


itemsets in a stream is to collect some number of baskets and store it as a file.
Run one of the frequent-itemset algorithms, meanwhile ignoring the stream
elements that arrive, or storing them as another file to be analysed later. When
the frequent-itemsets algorithm finishes, we have an estimate of the frequent
itemsets in the stream. We then have several options as:

• We can use this collection of frequent itemsets for whatever application is at


hand but start running another iteration of the chosen frequent-itemset
algorithm immediately. This algorithm can either:
Counting Frequent Items in a Stream (Count…)

(a) Use the file that was collected while the first iteration of the algorithm was running. At the same
time, collect yet another file to be used at another iteration of the algorithm, when this current
iteration finishes.

(b) Start collecting another file of baskets now and run the algorithm when an adequate number of
baskets has been collected.

We can continue to count the numbers of occurrences of each of these frequent itemsets, along with the total number of baskets seen in the stream, since the counting started. If any itemset is discovered to occur in a fraction of the baskets that is significantly below the threshold fraction s, then this set can be dropped from the collection of frequent itemsets. If not, we run the risk that we shall encounter a short period in which a truly frequent itemset does not appear sufficiently often and is erroneously dropped from the collection.
Counting Frequent Items in a Stream (Count…)
• We should also allow some way for new frequent itemsets to be added to the
current collection. Possibilities include:
(a) Periodically gather a new segment of the baskets in the stream and use it as the
data file for another iteration of the chosen frequent itemsets algorithm. The new
collection of frequent items is formed from the result of this iteration and the
frequent itemsets from the previous collection that have survived the possibility
of having been deleted for becoming infrequent.
(b) Add some random itemsets to the current collection, and count their fraction
of occurrences for a while, until one has a good idea of whether they are currently
frequent. Rather than choosing new itemsets completely at random, one might
focus on sets with items that appear in many itemsets already known to be
frequent.
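A minimal sketch of this stream-maintenance idea, with hypothetical seed itemsets and illustrative threshold values:

    # Counts for the itemsets currently believed frequent, updated as baskets stream in
    counts = {frozenset({"milk", "bread"}): 0, frozenset({"butter"}): 0}   # hypothetical seed collection
    baskets_seen = 0
    s = 0.02        # threshold fraction
    slack = 0.5     # drop an itemset only if it falls well below s (here, below half of s)

    def process(basket):
        global baskets_seen
        baskets_seen += 1
        for itemset in counts:
            if itemset <= basket:
                counts[itemset] += 1
        if baskets_seen % 10_000 == 0:                      # periodic pruning
            for itemset in list(counts):
                if counts[itemset] / baskets_seen < slack * s:
                    del counts[itemset]                     # significantly below the threshold fraction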
Clustering Techniques
•Cluster analysis is the process of finding similar groups of objects to form clusters. It
is an unsupervised machine learning-based algorithm that acts on unlabeled data. A
group of data points would come together to form a cluster in which all the objects
would belong to the same group.
• Clustering is the process of grouping a set of data objects into multiple groups or
clusters so that objects within a cluster have high similarity but are very dissimilar to
objects in other clusters. Dissimilarities and similarities are assessed based on the
attribute values describing the objects and often involve distance measures.
• Clustering is the process of partitioning a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are like one another, yet
dissimilar to objects in other clusters. Clustering as a data analysis tool has its roots in
many application areas such as biology, security and business intelligence and Web
search.
Clustering Techniques (Contd…)

• Clustering or cluster analysis is a machine learning technique, which groups the


unlabeled dataset. It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points. The objects with the possible
similarities remain in a group that has less or no similarities with another group.
"It does it by finding some similar patterns in the unlabeled dataset such as
shape, size, colour, behaviour, etc., and dividing them as per the presence and
absence of those similar patterns.
• It is an unsupervised learning method; hence no supervision is provided to the
algorithm, and it deals with the unlabeled dataset. After applying this clustering
technique, each cluster or group is provided with a clusterID. ML system can
use this id to simplify the processing of large and complex datasets.
Clustering Techniques (Contd…)

Clustering only utilises the input data to determine patterns, anomalies, or similarities in that data. A good clustering algorithm aims to obtain clusters such that:

• The intra-cluster similarities are high; this implies that the data present inside
the cluster is similar to one another.

• The inter-cluster similarity is low, and it means each cluster holds data that is
not similar to other data.
Applications of Cluster Analysis in Data Mining

• In many applications, clustering analysis is widely used, such as data analysis,


market research, pattern recognition, and image processing.

• It assists marketers in finding different groups in their client base based on their purchasing patterns, so they can characterise their customer groups.

• It helps in organising documents on the internet for information discovery.

• Clustering is also used in tracking applications such as detection of credit card fraud.

• As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data and analyse the characteristics of each cluster.
Requirements for Cluster Analysis

The following are typical requirements of clustering in data mining:

1. Scalability: Many clustering algorithms work well on small datasets


containing fewer than several hundred data objects; however, a large database
may contain millions or even billions of objects, particularly in Web search
scenarios. Clustering on only a sample of a given large dataset may lead to
biased results. Therefore, highly scalable clustering algorithms are needed.
Scalability in clustering implies that as the number of data objects increases, the
time required to perform clustering should roughly scale to the complexity of the
algorithm.
Requirements for Cluster Analysis (Contd…)

2. Ability to deal with different types of attributes: Many algorithms are


designed to cluster numeric (interval-based) data. However, applications may
require clustering other data types, such as binary, nominal (categorical), and
ordinal data, or mixtures of these data types. Recently, more and more
applications need clustering techniques for complex data types such as graphs,
sequences, images, and documents.
3. Discovery of clusters with arbitrary shape: Many clustering algorithms
determine clusters based on Euclidean distance measures. Algorithms based on
such distance measures tend to find spherical clusters with similar size and
density. However, a cluster could be of any shape. It is important to develop
algorithms that can detect clusters of arbitrary shape.
Requirements for Cluster Analysis (Contd…)
4. Requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to provide domain knowledge in the
form of input parameters such as the desired number of clusters. Consequently,
the clustering results may be sensitive to such parameters. Parameters are often
hard to determine, especially for high-dimensional datasets where users have yet to grasp a deep understanding of their data.
5. Ability to deal with noisy data: Most real-world datasets contain outliers
and/or missing, unknown data. Sensor readings, for example, are often noisy-
some readings may be inaccurate due to the sensing mechanisms, and some
readings may be erroneous due to interferences from surrounding transient
objects. Clustering algorithms can be sensitive to such noise and may produce
poor-quality clusters. Therefore, we need clustering methods that are robust to
noise.
Requirements for Cluster Analysis (Contd…)
6. Incremental clustering and insensitivity to input order: In many
applications, incremental updates (representing newer data) may arrive at any
time. Some clustering algorithms cannot incorporate incremental updates into
existing clustering structures and, instead, must be recomputed to a new
clustering from scratch. Clustering algorithms may also be sensitive to the input
data order. Incremental clustering algorithms and algorithms that are insensitive
to the input order are needed.
7. Capability of clustering high-dimensional data: A dataset can contain
numerous dimensions or attributes. When clustering documents, for example,
each keyword can be regarded as a dimension, and there are often thousands of
keywords. Most clustering algorithms are good at handling low-dimensional data
such as datasets involving only two or three dimensions. Finding clusters of data
objects in a high dimensional space is challenging, especially considering that
such data can be very sparse and highly skewed.
Requirements for Cluster Analysis (Contd…)
8. Constraint-based clustering: Real-world applications may need to perform
clustering under various kinds of constraints. A challenging task is to find data
groups with good clustering behaviour that satisfy specified constraints.

9. Interpretability and usability: Users want clustering results to be


interpretable, comprehensible, and usable. That is, clustering may need to be tied
into specific semantic interpretations and applications. It is important to study
how an application goal may influence the selection of clustering features and
clustering methods.
Types of Clustering

There are different types of clustering algorithms that handle all kinds of unique data.
1. Partitioning methods: Given a set of n objects, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k<= n. That is, it
divides the data into k groups such that each group must contain at least one object.
The basic partitioning methods typically adopt exclusive cluster separation. That is,
each object must belong to exactly one group. Most partitioning methods are
distance-based. The general criterion of a good partitioning is that objects in the same
cluster are 'close' or related to each other, whereas objects in different clusters are 'far
apart' or very different.
Types of Clustering (Contd…)

2. Hierarchical methods: A hierarchical method creates a hierarchical


decomposition of the given set of data objects. A hierarchical method can be
classified as:
• The Agglomerative Approach: The agglomerative approach, also called the
bottom-up approach, starts with each object forming a separate group. It
successively merges the objects or groups close to one another, until all the
groups are merged into one (the topmost level of the hierarchy), or a
termination condition holds.
• The Divisive approach: The divisive approach, also called the top-down
approach, starts with all the objects in the same cluster. In each successive
iteration a cluster is split into smaller clusters, until each object is in one cluster,
or a termination condition holds. Hierarchical clustering methods can be
distance based, or density and continuity based.
Types of Clustering (Contd…)

3. Density-based methods: The density-based method mainly focuses on density. In this method, a given cluster will keep on growing as long as the density in its neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Density-based methods can divide a set of objects into multiple exclusive clusters or a hierarchy of clusters. Typically, density-based methods consider exclusive clusters only, and do not consider fuzzy clusters.
Types of Clustering (Contd…)

4. Grid-based Methods: Grid-based methods quantise the object space into a


finite number of cells that form a grid structure. All the clustering operations are
performed on the grid structure (i.e., on the quantised space). The main
advantage of this approach is its fast-processing time, which is typically
independent of the number of data objects and dependent only on the number of
cells in each dimension in the quantised space. Using grids is often an efficient
approach to many spatial data mining problems, including clustering.
Types of Clustering (Contd…)
Hierarchical Clustering in Machine Learning

• Hierarchical clustering is another unsupervised machine learning algorithm, which is


used to group the unlabeled datasets into a cluster and also known as hierarchical
cluster analysis or HCA.

• In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.

• Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work; for example, there is no requirement to predetermine the number of clusters as we did in the K-means algorithm.
Hierarchical Clustering in Machine Learning (Contd…)

The hierarchical clustering technique has two approaches:


1. Agglomerative: Agglomerative is a bottom-up approach, in which the
algorithm starts with taking all data points as single clusters and merging
them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as
it is a top-down approach.
Why Hierarchical Clustering?

• As we already have other clustering algorithms such as K-Means Clustering, why do we need hierarchical clustering? As we have seen, K-means clustering has some challenges: it needs a predetermined number of clusters, and it always tries to create clusters of the same size. To solve these two challenges, we can opt for the hierarchical clustering algorithm because, in this algorithm, we don't need to have knowledge about the predefined number of clusters.

• In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.


Agglomerative Hierarchical Clustering

• The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the datasets into clusters, it follows the bottom-up approach. This means the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pair of clusters together. It does this until all the clusters are merged into a single cluster that contains all the data points.
• This hierarchy of clusters is represented in the form of the dendrogram.
How the Agglomerative Hierarchical clustering Work?
• The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let’s say there are N data points, so the number
of clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
How the Agglomerative Hierarchical clustering Work? (Contd…)

Step 3: Again, take the two closest clusters and merge them
together to form one cluster. There will be N-2 clusters.

Step-4: Repeat Step 3 until only one cluster left. So, we will get the following clusters.
Consider the below images:

Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as per the problem.
Measure for the distance between two clusters

As we have seen, the closest distance between the two clusters is crucial for
the hierarchical clustering. There are various ways to calculate the distance
between two clusters, and these ways decide the rule for clustering. These
measures are called Linkage methods. Some of the popular linkage
methods are given below:

1. Single Linkage: It is the Shortest Distance between the closest points of


the clusters. Consider the below image:
Measure for the distance between two clusters (Contd…)

2. Complete Linkage: It is the farthest distance between two points of two different clusters. It is one of the popular linkage methods as it forms tighter clusters than single-linkage.
Measure for the distance between two clusters (Contd…)

3. Average Linkage: It is the linkage method in which the distance between each pair of points, one from each cluster, is added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.
Measure for the distance between two clusters (Contd…)

4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the clusters is calculated. Consider the below image:

• From the above-given approaches, we can apply any of them according


to the type of problem or business requirement.
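All four linkage methods map directly onto SciPy's hierarchical-clustering routines (assuming SciPy is installed); the data below is random and purely illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)                       # 20 random 2-D points, purely illustrative

    for method in ("single", "complete", "average", "centroid"):
        Z = linkage(X, method=method)               # build the hierarchy with the chosen linkage
        labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
        print(method, labels)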
Working of Dendrogram in Hierarchical Clustering
• The dendrogram is a tree-like structure that is mainly used to store each
step as a memory that the HC algorithm performs. In the dendrogram plot,
the Y-axis shows the Euclidean distances between the data points, and the
x-axis shows all the data points of the given dataset.

• The working of the dendrogram can be explained using the below diagram:
Working of Dendrogram in Hierarchical Clustering (Contd…)
In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.

• As we have discussed above, firstly, the data points P2 and P3 combine to form a cluster, and correspondingly a dendrogram is created, which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
• Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram,
and P4, P5, and P6, in another dendrogram.
• At last, the final dendrogram is created that combines all the data points together.
We can cut the dendrogram tree structure at any level as per our requirement.
K-Means: A centroid-Based Technique (Partitioning Based
Method)

• K-Means Clustering is an unsupervised learning algorithm that is used to


solve the clustering problems in machine learning or data science. It groups
the unlabeled dataset into different clusters.
• The main aim of this algorithm is to minimise the sum of distances between the data points and their corresponding clusters.
What is K-Means Clustering ?

• K-Means Clustering is an Unsupervised Learning algorithm, which groups the


unlabeled dataset into different clusters. Here K defines the number of pre-
defined clusters that need to be created in the process, as if K=2, there will be
two clusters, and for K=3, there will be three clusters, and so on.

• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties.
What is K-Means Clustering ? (Contd…)

• It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own without the need for any training.

• It is a centroid-based algorithm, where each cluster is associated with a


centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding clusters. The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means Clustering

The k-means clustering algorithm mainly performs two tasks:

1. Determines the best value for K center points or centroids by an iterative


process.

2. Assigns each data point to its closest k-center. Those data points which are
near to the k-center, create a cluster.
The k-means Clustering (Contd…)

The below diagram explains the working of the K-means Clustering Algorithm:
k-means Clustering Algorithm

K-means: The k-means algorithm for partitioning, where each cluster's center is represented by the
mean value of the objects in the cluster.
Input:
K: the number of clusters,
D: a dataset containing n objects.
Output: A set of K clusters.
Method:
1. Arbitrarily choose K objects from D as the initial cluster centers;
2. Repeat
3. (Re) assign each object to the cluster to which the object is the most similar, based on the mean
value of the objects in the cluster;
4. Update the cluster means, that is, calculate the mean value of the objects for each cluster;
5. Until there is no change.
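A minimal NumPy version of this pseudocode, using Euclidean distance and random illustrative data (the sketch ignores the rare empty-cluster case):

    import numpy as np

    def k_means(D, K, seed=0):
        rng = np.random.default_rng(seed)
        centers = D[rng.choice(len(D), K, replace=False)]   # 1. arbitrarily choose K objects as centers
        while True:
            # 3. (re)assign each object to the cluster whose mean is nearest (Euclidean distance)
            dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 4. update the cluster means
            new_centers = np.array([D[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centers, centers):           # 5. stop when nothing changes
                return labels, centers
            centers = new_centers

    D = np.random.rand(100, 2)
    labels, centers = k_means(D, K=3)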
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:

• Step-1: Select the number K to decide the number of clusters.

• Step-2: Select K random points or centroids. (They may be points other than those from the input dataset.)

• Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.

• Step-4: Calculate the variance and place a new centroid of each cluster.

• Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.

• Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

• Step-7: The model is ready.


Advantages of K-Means Clustering Algorithm

• It is fast and Robust.


• Comparatively efficient.
• If datasets are distinct, then gives the best results.
• Produce tighter clusters.
• When centroids are recomputed, the cluster changes.
• Flexible & Easy to interpret.
• Better computational cost.
• Enhances Accuracy.
• Works better with spherical clusters.
Disadvantages of K-Means Clustering Algorithm

• Needs prior specification for the number of cluster centers.


• If there are two highly overlapping data, then it cannot be distinguished
and cannot tell that there are two clusters.
• Euclidean distance can unequally weigh the factors.
• It gives the local optima of the squared error function.
• Sometimes choosing the centroids randomly cannot give fruitful results.
• It can be used only if the mean is defined (i.e., for numeric data).
• Cannot handle outliers and noisy data.
• Does not work well for non-linear datasets.
K-Means Clustering
Clustering High-Dimensional Data

• Clustering of high-dimensional data returns groups of objects as clusters. It is required to group similar types of objects together to perform the cluster analysis of high-dimensional data, but the high-dimensional data space is huge and it has complex data types and attributes. A major challenge is that we need to find out the set of attributes that is present in each cluster. A cluster is defined and characterized based on the attributes present in the cluster.
• The High-Dimensional data is reduced to low-dimension data to make the
clustering and search for clusters simple. Some applications need the
appropriate models of clusters, especially the high-dimensional data.
Clusters in the high-dimensional data are significantly small.
Subspace Clustering Methods

There are three subspace clustering methods:

• Subspace search methods

• Correlation-based clustering methods

• Bi-clustering methods

Subspace clustering approaches search for clusters existing in subspaces of the
given high-dimensional data space, where a subspace is defined using a subset of
attributes in the full space.
Subspace Clustering Methods (Contd…)

1. Subspace Search Methods: A subspace search method searches a series of subspaces for
clusters. Here, a cluster is a group of similar objects in a subspace, and similarity is
measured using distance or density features. The CLIQUE algorithm is a subspace search
method. There are two approaches in subspace search methods: the bottom-up approach starts
searching from the low-dimensional subspaces and, if hidden clusters are not found there,
continues in higher-dimensional subspaces; the top-down approach starts searching from the
high-dimensional subspaces and then searches subsets of low-dimensional subspaces. Top-down
approaches are effective only if the subspace of a cluster can be determined by the local
neighborhood of the subspace clusters.
Subspace Clustering Methods (Contd…)

2. Correlation-Based Clustering: Correlation-based approaches discover hidden clusters by
building advanced correlation models. They are preferred when it is not possible to cluster
the objects using subspace search methods, and they include advanced mining techniques for
correlation cluster analysis. Bi-clustering methods are correlation-based clustering methods
in which both the objects and the attributes are clustered.
Subspace Clustering Methods (Contd…)
3. Bi-clustering Methods:

Bi-clustering means clustering the data based on two factors: in some applications we can cluster
both objects and attributes at the same time. The resulting clusters are called biclusters. To
perform bi-clustering there are four requirements:

• Only a small set of objects participates in a cluster.

• A cluster only involves a small number of attributes.

• A data object can take part in multiple clusters, or may not be included in any cluster at all.

• An attribute may be involved in multiple clusters, or may not be involved in any cluster.

Objects and attributes are not treated in the same way: objects are clustered according to their
attribute values, and objects and attributes are treated differently in bicluster analysis.
CLIQUE (Clustering in QUEst)

CLIQUE is a density-based and grid-based subspace clustering algorithm. So,
let’s first take a look at what the grid-based and density-based clustering
techniques are.

• Grid-Based Clustering Technique: In grid-based methods, the instance space is
divided into a grid structure. Clustering techniques are then applied using the
cells of the grid, instead of individual data points, as the base units.

• Density-Based Clustering Technique: In density-based methods, a cluster is a
maximal set of connected dense units in a subspace.
CLIQUE Algorithm
• The CLIQUE algorithm combines density-based and grid-based techniques, i.e., it
is a subspace clustering algorithm, and it finds clusters by taking a density
threshold and the number of grid intervals as input parameters. It is specially
designed to handle datasets with many dimensions. CLIQUE is very scalable with
respect to both the number of records and the number of dimensions in the
dataset, because it is grid-based and uses the Apriori property effectively.
• The Apriori property states that if an X-dimensional unit is dense, then all of
its projections in (X-1)-dimensional space are also dense.
• This means that dense regions in any subspace must produce dense regions when
projected onto a lower-dimensional subspace. Because CLIQUE uses the Apriori
property, it restricts its search for high-dimensional dense cells to the
intersections of dense cells found in lower-dimensional subspaces.
Working of CLIQUE Algorithm

• The CLIQUE algorithm first divides the data space into grids by dividing each
dimension into equal intervals called units. It then identifies dense units: a
unit is dense if the number of data points falling in it exceeds the threshold value.

• Once the algorithm has found dense cells along one dimension, it tries to find
dense cells along two dimensions, and it continues in this way until dense cells
in all relevant subspaces are found.

• After finding all dense cells, the algorithm finds the largest sets (“clusters”)
of connected dense cells and generates a minimal description of each cluster.
Clusters are generated from all dense subspaces using the Apriori approach (see
the sketch below).
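A minimal sketch of the first two passes just described: partition each dimension into units, keep the dense 1-D units, then join dense 1-D units into candidate 2-D units in Apriori fashion. This is not a full CLIQUE implementation; the parameter names xi (number of intervals) and tau (density threshold as a fraction of points) are illustrative.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def grid_index(X, xi):
    """Map each value to an interval index in [0, xi) for every dimension."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    idx = ((X - lo) / (hi - lo + 1e-12) * xi).astype(int)
    return np.minimum(idx, xi - 1)

def clique_first_passes(X, xi=10, tau=0.05):
    n, d = X.shape
    cells = grid_index(X, xi)
    # Pass 1: dense 1-D units, i.e. (dimension, interval) cells holding > tau*n points.
    dense1 = set()
    for dim in range(d):
        counts = Counter(cells[:, dim])
        dense1 |= {(dim, int(i)) for i, c in counts.items() if c > tau * n}
    # Pass 2 (Apriori step): a 2-D unit can only be dense if both of its 1-D
    # projections are dense, so candidates come from pairs of dense 1-D units.
    dense2 = set()
    for (d1, i1), (d2, i2) in combinations(sorted(dense1), 2):
        if d1 == d2:
            continue
        in_both = (cells[:, d1] == i1) & (cells[:, d2] == i2)
        if in_both.sum() > tau * n:
            dense2.add(((d1, i1), (d2, i2)))
    return dense1, dense2

# Toy usage: two blobs in 3-D.
X = np.vstack([np.random.randn(200, 3), np.random.randn(200, 3) + 4])
dense1, dense2 = clique_first_passes(X, xi=8, tau=0.05)
```

A full implementation would continue this Apriori-style candidate generation to higher dimensions and then connect adjacent dense cells into clusters with minimal descriptions.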
Advantage of CLIQUE Algorithm

• CLIQUE is a subspace clustering algorithm that has been reported to outperform
K-means, DBSCAN, and Farthest First in both execution time and accuracy on some
datasets.

• CLIQUE can find clusters of any shape and is able to find any number of clusters
in any number of dimensions; the number of clusters is not predetermined by a
parameter.

• It is one of the simplest methods, and its results are easy to interpret.
Disadvantage of CLIQUE Algorithm

• The main disadvantage of the CLIQUE algorithm is its dependence on the grid: if
the cell size is unsuitable for the data, the density estimates become inaccurate
and the correct clusters cannot be found.
PROCLUS (Projected Clustering)

• Projected clustering is the first top-down partitioning projected clustering algorithm, based on
the notion of k-medoid clustering, presented by Aggarwal et al. (1999). It determines medoids for
each cluster iteratively on a sample of the data using a greedy hill-climbing technique and then
refines the results iteratively. Cluster quality in projected clustering is a function of the
average distance between data points and the closest medoid. The average subspace dimensionality
is also an input parameter, and the algorithm tends to generate clusters of similar sizes.

1. Projected clustering (PROCLUS) is a top-down subspace clustering algorithm.

2. PROCLUS samples the data, then selects a set of k medoids and iteratively improves the
clustering.

3. PROCLUS is faster than CLIQUE in practice because it works on a sample of large data sets.
PROCLUS (Projected Clustering) (Contd…)

The three phases of PROCLUS are as follows (a sketch of the segmental-distance assignment step
appears after the phases):

a) Initialization phase: select a set of potential medoids that are far apart using a greedy algorithm.

b) Iteration phase:

i. Select a random set of k medoids from this reduced data set and determine whether clustering
quality improves by replacing current medoids with randomly chosen new medoids.

ii. Cluster quality is based on the average distance between instances and the nearest medoid.

iii. For each medoid, a set of dimensions is chosen whose average distances are small compared to
statistical expectation.

iv. Once the subspaces have been selected for each medoid, the average Manhattan segmental distance
is used to assign points to medoids, forming clusters.
PROCLUS (Projected Clustering) (Contd…)
c) Refinement phase:
i. Compute a new list of relevant dimensions for each medoid
based on the clusters formed, and reassign points to medoids,
removing outliers.
ii. The distance-based approach of PROCLUS is biased toward
clusters that are hyperspherical in shape.
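A small sketch of the assignment step mentioned above: the average Manhattan segmental distance between a point and a medoid, computed only over that medoid's selected dimensions. The medoids and dimension sets are given as inputs here; producing them is the job of the full PROCLUS phases, which this sketch does not implement.

```python
import numpy as np

def manhattan_segmental_distance(x, medoid, dims):
    """Average L1 distance restricted to the medoid's relevant dimensions."""
    dims = list(dims)
    return np.abs(x[dims] - medoid[dims]).sum() / len(dims)

def assign_points(X, medoids, medoid_dims):
    """Assign every point to the medoid with the smallest segmental distance."""
    labels = []
    for x in X:
        d = [manhattan_segmental_distance(x, m, dims)
             for m, dims in zip(medoids, medoid_dims)]
        labels.append(int(np.argmin(d)))
    return np.array(labels)

# Toy usage: two medoids, each with its own relevant subspace.
X = np.random.randn(100, 5)
medoids = [X[0], X[1]]
medoid_dims = [[0, 1], [2, 3, 4]]
labels = assign_points(X, medoids, medoid_dims)
```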
Clustering For Streams and Parallelism

• Data stream clustering refers to the clustering of data that arrives continually,
such as financial transactions, multimedia data, or telephone records. It is usually
studied as a streaming algorithm. The purpose of data stream clustering is to
construct a good clustering of the stream using a small amount of time and memory.

• Technically, clustering is the act of grouping elements into sets. The main
purpose of this kind of separation is to unite items that are similar to each other,
by comparing a series of their characteristics. For data streams, the methods can be
separated into five categories, namely partitioning, hierarchical, density-based,
grid-based and model-based.
Clustering For Streams and Parallelism (Contd…)

• There is one more factor to take into account when talking about clusters: the
possible inter-cluster distances can be divided into four kinds, namely the minimum
(or single linkage), the maximum (or complete linkage), the mean distance and the
average distance. Each has its own characteristics regarding implementation cost and
computational power, and the minimum distance and the mean distance are the more
common choices in data stream clustering. A minimal sketch of an online,
partition-style stream clusterer follows.
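As an illustration of the partitioning category for streams, the sketch below keeps only K centroids and their counts in memory and updates them one arriving point at a time (a simple online k-means). This is a generic sketch of the idea, not a specific published streaming algorithm such as BIRCH or CluStream.

```python
import numpy as np

class OnlineKMeans:
    """Keep K centroids and per-cluster counts; update incrementally per arriving point."""
    def __init__(self, initial_centroids):
        self.centroids = np.array(initial_centroids, dtype=float)
        self.counts = np.zeros(len(self.centroids), dtype=int)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # Assign the new point to its nearest centroid...
        j = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
        self.counts[j] += 1
        # ...and move that centroid toward the point by a shrinking step size.
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

# Simulated stream: points arrive one at a time and are never stored.
stream = np.vstack([np.random.randn(500, 2), np.random.randn(500, 2) + 5])
np.random.shuffle(stream)
model = OnlineKMeans(initial_centroids=stream[:2])
for point in stream:
    model.update(point)
print(model.centroids)
```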
Basic Subspace Clustering Approaches
1. Grid-based subspace clustering:
a) In this approach, the data space is divided into axis-parallel cells. The cells
containing more objects than a predefined threshold (given as a parameter) are
merged to form subspace clusters. The number of intervals is another input
parameter, which defines the range of values in each grid cell.
b) The Apriori property is used to prune non-promising cells and to improve
efficiency.
c) If a unit is found to be dense in k – 1 dimensions, then it is considered as a
candidate for finding dense units in k dimensions.
d) If grid boundaries are strictly followed to separate objects, the accuracy of
the clustering result decreases, as neighboring objects separated by a strict grid
boundary may be missed. Clustering quality is highly dependent on the input
parameters.
Basic Subspace Clustering Approaches (Contd…)

2. Window-based subspace clustering:

a) Window-based subspace clustering overcomes a drawback of cell-based (grid-based)
subspace clustering, namely that it may omit significant results.

b) Here a window slides across the attribute values and obtains overlapping
intervals, which are used to form subspace clusters.

c) The size of the sliding window is one of the parameters. These algorithms
generate axis-parallel subspace clusters.
Basic Subspace Clustering Approaches (Contd…)
3. Density-based subspace clustering:
a) Density-based subspace clustering overcomes the drawbacks of grid-based
subspace clustering algorithms by not using grids.
b) A cluster is defined as a collection of objects forming a chain, each within a
given distance of the next, whose object count exceeds a predefined threshold.
Adjacent dense regions are then merged to form bigger clusters.
c) As no grids are used, these algorithms can find arbitrarily shaped subspace
clusters.
d) Clusters are built by joining together the objects from adjacent dense regions.
e) These approaches are sensitive to the values of the distance parameters.
f) The effect of the curse of dimensionality is mitigated in density-based
algorithms by using a density measure that adapts to the subspace size.
Clustering High-Dimensional Data
• Most clustering methods are designed for clustering low-dimensional data and
encounter challenges when the dimensionality of the data grows really high
(say, over 10 dimensions, or even over thousands of dimensions for some
tasks). This is because when the dimensionality increases, usually only a small
number of dimensions are relevant to certain clusters, but data in the irrelevant
dimensions may produce much noise and mask the real clusters to be
discovered.
• To overcome this difficulty, we may consider using feature (or attribute)
transformation and feature (or attribute) selection. Transformation methods, such
as principal component analysis and singular value decomposition, transform the
data onto a smaller space while generally preserving the original relative
distances between objects. They summarise data by creating linear combinations
of the attributes and may discover hidden structures in the data. A sketch of this
transform-then-cluster approach is given below.
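A brief sketch of the feature-transformation idea, assuming scikit-learn is available: project the data onto a few principal components, then run an ordinary clustering algorithm in the reduced space. The data and parameter values are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# High-dimensional synthetic data: 50 attributes, only a few carry cluster structure.
X = np.random.randn(300, 50)
X[:150, :3] += 5  # the structure lives in the first three dimensions

# Transform the data onto a smaller space that preserves most of the variance...
X_low = PCA(n_components=3).fit_transform(X)

# ...then cluster in the reduced space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
```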
Clustering High-Dimensional Data (Contd…)

• Another way of tackling the curse of dimensionality is to try to remove some of the
dimensions. Attribute subset selection (or feature subset selection) is commonly used for data
reduction by removing irrelevant or redundant dimensions (or attributes).
• Given a set of attributes, attribute subset selection finds the subset of attributes that is most
relevant to the data mining task. It is most commonly performed by supervised learning, where the
most relevant set of attributes is found with respect to the given class labels. It can also be
performed by an unsupervised process, such as entropy analysis, which is based on the
property that entropy tends to be low for data that contain tight clusters. Other evaluation
functions, such as category utility, may also be used.
• Subspace clustering is an extension of attribute subset selection that has shown its strength at
high-dimensional clustering. It is based on the observation that different subspaces may
contain different, meaningful clusters. Subspace clustering searches for groups of clusters
within different subspaces of the same dataset. The problem becomes how to find such
subspace clusters effectively and efficiently.
Frequent Pattern-Based Clustering Methods
• Frequent pattern mining, as the name implies, searches for patterns (such as sets
of items or objects) that occur frequently in large datasets.
• Frequent pattern mining can lead to the discovery of interesting associations and
correlations among data objects. The idea behind frequent pattern-based cluster
analysis is that the frequent patterns discovered may also indicate clusters.
• Frequent pattern-based cluster analysis is well suited to high-dimensional data. It
can be viewed as an extension of the dimension-growth subspace clustering
approach. However, the boundaries of different dimensions are not obvious,
since here they are represented by sets of frequent itemsets. That is, rather than
growing the clusters dimension by dimension, we grow sets of frequent itemsets,
which eventually lead to cluster descriptions.
• Typical examples of frequent pattern-based cluster analysis include the
clustering of text documents that contain thousands of distinct keywords, and the
analysis of microarray data that contain tens of thousands of measured values or
'features'.
Frequent Term-Based Text Clustering
• In frequent term-based text clustering, text documents are clustered based on
the frequent terms they contain. Using the vocabulary of text document
analysis, a term is any sequence of characters separated from other terms by a
delimiter. A term can be made up of a single word or several words.
• In general, we first remove nontext information (such as HTML tags and
punctuation) and stop words. Terms are then extracted. A stemming algorithm
is then applied to reduce each term to its basic stem. In this way, each
document can be represented as a set of terms. Each set is typically large.
Collectively, a large set of documents will contain a very large set of distinct
terms. If we treat each term as a dimension, the dimension space will be of very
high dimensionality.
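A small sketch of the preprocessing pipeline just described: strip non-text content, drop stop words, apply a stemmer, and represent each document as a set of terms. The stop-word list and the toy suffix-stripping stemmer below are illustrative stand-ins for real resources.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "by", "they"}

def toy_stem(term):
    """Very crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def extract_terms(document):
    """Return the set of stemmed terms in one document."""
    text = re.sub(r"<[^>]+>", " ", document.lower())  # remove HTML tags
    tokens = re.findall(r"[a-z]+", text)              # split on delimiters, drop punctuation
    return {toy_stem(t) for t in tokens if t not in STOP_WORDS}

docs = [
    "<p>Mining frequent itemsets in large transaction databases</p>",
    "Clustering text documents by the frequent terms they contain",
]
term_sets = [extract_terms(d) for d in docs]
# Terms that occur in many of these sets are the frequent terms used for clustering.
```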
Clustering in Non-Euclidean Spaces
• Now we discuss an algorithm that handles non-main-memory data but does not
require a Euclidean space. The algorithm, which we shall refer to as GRGPF for
its authors (V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French),
takes ideas from both hierarchical and point-assignment approaches. Like
CURE, it represents clusters by sample points in main memory.
• However, it also tries to organise the clusters hierarchically, in a tree, so a new
point can be assigned to the appropriate cluster by passing it down the tree.
Leaves of the tree hold summaries of some clusters, and interior nodes hold
subsets of the information describing the clusters reachable through that node.
An attempt is made to group clusters by their distance from one another, so the
clusters at a leaf are close, and the clusters reachable from one interior node are
relatively close as well.
Clustering in Non-Euclidean Spaces (Contd…)

Representing Clusters in the GRGPF Algorithm
• As we assign points to clusters, the clusters can grow large. Most of the points
in a cluster are stored on disk and are not used in guiding the assignment of
points, although they can be retrieved. The representation of a cluster in main
memory consists of several features. Before listing these features, if p is any
point in a cluster, let ROWSUM(p) be the sum of the squares of the distances
from p to each of the other points in the cluster. Note that, although we are not
in a Euclidean space, there is some distance measure d that applies to points, or
else it is not possible to cluster points at all.
Clustering in Non-Euclidean Spaces (Contd…)
The following features form the representation of a cluster:
1. N, the number of points in the cluster.
2. The clustroid of the cluster, which is defined specifically to be the point in the
cluster that minimises the sum of the squares of the distances to the other points; that
is, the clustroid is the point in the cluster with the smallest ROWSUM.
3. The rowsum of the clustroid of the cluster.
4. For some chosen constant k, the k points of the cluster that are closest to the
clustroid, and their rowsums. These points are part of the representation in case the
addition of points to the cluster causes the clustroid to change. The assumption is
made that the new clustroid would be one of these k points near the old clustroid.
5. The k points of the cluster that are furthest from the clustroid, and their rowsums.
These points are part of the representation so that we can consider whether two
clusters are close enough to merge. The assumption is made that if two clusters are
close, then a pair of points distant from their respective clustroids would be close.
(A small sketch of this representation follows.)
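A compact sketch of this cluster representation, assuming an arbitrary (possibly non-Euclidean) distance function is supplied by the caller; k is the constant for the nearest and furthest sample points. The function name, return structure, and the toy distance in the usage example are illustrative.

```python
import numpy as np

def cluster_features(points, dist, k=3):
    """GRGPF-style summary of one cluster.

    points: list of objects; dist(p, q): any distance measure on those objects.
    Returns N, the clustroid, its rowsum, and the k closest / k furthest points with rowsums.
    """
    n = len(points)
    # ROWSUM(p) = sum of squared distances from p to every other point in the cluster.
    rowsums = np.array([sum(dist(p, q) ** 2 for q in points) for p in points])
    c = int(np.argmin(rowsums))  # clustroid = point with the smallest rowsum
    # Order the remaining points by their distance to the clustroid.
    order = sorted((i for i in range(n) if i != c),
                   key=lambda i: dist(points[i], points[c]))
    closest = [(points[i], rowsums[i]) for i in order[:k]]
    furthest = [(points[i], rowsums[i]) for i in order[-k:]]
    return {"N": n, "clustroid": points[c], "rowsum": rowsums[c],
            "closest": closest, "furthest": furthest}

# Toy usage on strings with a deliberately simple, illustrative "distance"
# (absolute difference of lengths); any measure, e.g. edit distance, could be used.
words = ["cluster", "clusters", "clustering", "cloud", "class"]
features = cluster_features(words, dist=lambda a, b: abs(len(a) - len(b)), k=2)
```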