Data Analytics Unit 4
For example, it is likely to find that if a customer buys milk and bread, he/she also
buys butter. So the association rule is {milk, bread} => {butter}, and the seller can
suggest that the customer buy butter if he/she buys milk and bread.
Important Definitions :
The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are connected.
This algorithm uses a breadth-first search and a hash tree to calculate the itemset
associations efficiently. It is an iterative process for finding the frequent itemsets
in a large dataset.
Step-1: Determine the support of each itemset in the transactional database, and select the
minimum support and minimum confidence.
Step-2: Take all the itemsets in the transactions whose support value is higher than the
minimum (selected) support value.
Step-3: Find all the rules over these subsets that have a confidence value higher than the
threshold (minimum confidence).
Example: Suppose we have the following dataset that has various transactions, and
from this dataset, we need to find the frequent itemsets and generate the association
rules using the Apriori algorithm:
Solution:
o In the first step, we will create a table that contains the support count (the
frequency of each itemset individually in the dataset) of each itemset in the
given dataset. This table is called the candidate set, or C1.
o Now, we will take out all the itemsets that have a support count greater than
the minimum support (2). This will give us the table for the frequent itemset L1.
Since all the itemsets except E have a support count greater than or equal to the
minimum support, the E itemset will be removed.
o In this step, we will generate C2 with the help of L1. In C2, we will create
pairs of the itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find the support count from the main
transaction table of datasets, i.e., how many times these pairs have occurred
together in the given dataset. So, we will get the below table for C2:
o Again, we need to compare the C2 support counts with the minimum support
count, and after comparing, the itemsets with a lower support count will be
eliminated from the table C2. This will give us the below table for L2.
o For C3, we will repeat the same two processes, but now we will form the C3
table with subsets of three itemsets together, and will calculate the support
count from the dataset. It will give the below table:
o Now we will create the L3 table. As we can see from the above C3 table, there
is only one combination of itemset that has support count equal to the
minimum support count. So, the L3 will have only one combination, i.e., {A,
B, C}.
As the given threshold (minimum confidence) is 50%, the first three rules, A^B
→ C, B^C → A, and A^C → B, can be considered strong association rules for
the given problem.
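The walk-through above can be condensed into a short Python sketch. The transaction table from the original example is not reproduced in these notes, so the data below is a hypothetical stand-in; what matters is the structure of the loop (generate Ck, prune to Lk, repeat).

# A minimal sketch of the Apriori walk-through above, using a hypothetical
# transaction table (the original dataset table is not reproduced in the notes).
from itertools import combinations

transactions = [            # hypothetical data, assumed for illustration only
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]
min_support = 2             # minimum support count, as in the example

def support_count(itemset):
    """Count how many transactions contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# C1 / L1: candidate and frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
L = [frozenset([i]) for i in items if support_count({i}) >= min_support]

frequent = list(L)
k = 2
while L:
    # Ck: join frequent (k-1)-itemsets to form k-item candidates
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    # Lk: keep only candidates whose support count meets the minimum support
    L = [c for c in candidates if support_count(c) >= min_support]
    frequent.extend(L)
    k += 1

for itemset in frequent:
    print(sorted(itemset), support_count(itemset))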
1. Once the frequent itemsets from the transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong
association rules satisfy both minimum support and minimum confidence).
2. This can be done using equation (4.6.1) for confidence, which is shown here for
completeness:
confidence(A ⇒ B) = support_count(A ∪ B) / support_count(A)   ...(4.6.1)
6. Frequent itemsets can be stored ahead of time in hash tables along with their counts
so that they can be accessed quickly.
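As a rough illustration of points 1, 2 and 6, the sketch below assumes the frequent itemsets and their support counts are already stored in a dictionary (the hash table of point 6) and generates every rule whose confidence meets the minimum; the support counts used here are made up for illustration.

# Sketch of rule generation from frequent itemsets, assuming their support
# counts have already been stored in a dictionary (the "hash table" of point 6).
from itertools import combinations

# hypothetical support counts, assumed for illustration
support = {
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 6,
    frozenset("AB"): 4, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("ABC"): 2,
}
min_confidence = 0.5

for itemset in [s for s in support if len(s) > 1]:
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            # confidence(A => B) = support_count(A ∪ B) / support_count(A)
            conf = support[itemset] / support[antecedent]
            if conf >= min_confidence:
                print(f"{set(antecedent)} => {set(consequent)}  conf={conf:.2f}")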
Related concepts :
1. Let items be words, and let baskets be documents (e.g., Web pages, blogs, tweets).
2. A basket/document contains those items/words that are present in the document.
3. If we look for sets of words that appear together in many documents, the sets will
be dominated by the most common words (stop words).
4. If documents contain many stop words such as “and” and “a”, then those stop
words will dominate the most frequent itemsets.
5. However, if we ignore all the most common words, then we would hope to find
among the frequent pairs some pairs of words that represent a joint concept.
Plagiarism :
1. Let the items be documents and the baskets be sentences.
2. An item is in a basket if the sentence is in the document.
3. This arrangement appears backwards, and we should remember that the
relationship between items and baskets is an arbitrary many-many relationship.
4. In this application, we look for pairs of items that appear together in several baskets.
5. If we find such a pair, then we have two documents that share several sentences in
common.
Biomarkers :
1. Let the items be of two types: biomarkers, such as genes or blood proteins, and diseases.
2. Each basket is the set of data about a patient: their genome and blood-chemistry
analysis, as well as their medical history of disease.
3. A frequent itemset that consists of one disease and one or more biomarkers suggests
a test for the disease.
A large volume of data poses new challenges, such as overloaded memory and
algorithms that never stop running. It forces you to adapt and expand your repertoire
of techniques. But even when you can perform your analysis, you should take care of
issues such as I/O (input/output) and CPU starvation, because these can cause speed
issues.
General techniques for handling large volumes of data
Never-ending algorithms, out-of-memory errors, and speed issues are the most
common challenges you face when working with large data. In this section, we’ll
investigate solutions to overcome or alleviate these problems.
The solutions can be divided into three categories: using the correct algorithms,
choosing the right data structure, and using the right tools.
1. In the first pass of the Apriori algorithm, there may be much unused space in main
memory.
2. The PCY Algorithm uses this unused space for an array of integers that generalizes
the idea of a Bloom filter. The idea is shown schematically in the figure below.
Figure: Main-memory organization of the PCY algorithm. In Pass 1, memory holds the
item-names-to-integers table, the item counts, and the hash table of bucket counts; in
Pass 2, it holds the item-names-to-integers table, the frequent items, the bitmap
summarizing frequent buckets, and the data structure for counting candidate pairs.
3. The array is treated as a hash table, whose buckets hold integers rather than sets of
keys or bits. Pairs of items are hashed to buckets of this hash table. As we examine a
basket during the first pass, we not only add 1 to the count for each item in the basket,
but we also generate all the pairs, using a double loop.
4. We hash each pair, and we add 1 to the bucket into which that pair hashes.
5. At the end of the first pass, each bucket has a count, which is the sum of the counts
of all the pairs that hash to that bucket.
6. If the count of a bucket is at least as great as the support threshold s, it is called a
frequent bucket. We can say nothing about the pairs that hash to a frequent bucket;
they could all be frequent pairs from the information available to us.
7. But if the count of the bucket is less than s (an infrequent bucket), we know that no
pair that hashes to this bucket can be frequent, even if the pair consists of two frequent
items.
8. We can define the set of candidate pairs C2 to be those pairs {i, j} such that:
a. i and j are frequent items.
b. {i, j} hashes to a frequent bucket.
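A rough Python sketch of PCY's first pass and the candidate-pair test described above is given below; the number of buckets, the support threshold, and the transactions are illustrative assumptions, and a real implementation would size the bucket array to fill the available main memory.

# A rough sketch of PCY's first pass and the candidate-pair test.
from itertools import combinations

NUM_BUCKETS = 11          # assumed bucket count, for illustration only
support_threshold = 2

transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "B", "D"},
]

item_counts = {}
bucket_counts = [0] * NUM_BUCKETS     # the array of integers in unused memory

def bucket(pair):
    """Hash an (unordered) pair of items to a bucket number."""
    return hash(frozenset(pair)) % NUM_BUCKETS

# Pass 1: count items and hash every pair of each basket to a bucket
for basket in transactions:
    for item in basket:
        item_counts[item] = item_counts.get(item, 0) + 1
    for pair in combinations(sorted(basket), 2):
        bucket_counts[bucket(pair)] += 1

# Between passes: a bitmap of frequent buckets and the set of frequent items
frequent_bucket = [c >= support_threshold for c in bucket_counts]
frequent_items = {i for i, c in item_counts.items() if c >= support_threshold}

# Candidate pairs C2: both items frequent AND the pair hashes to a frequent bucket
candidate_pairs = [
    pair
    for basket in transactions
    for pair in combinations(sorted(basket), 2)
    if set(pair) <= frequent_items and frequent_bucket[bucket(pair)]
]
print(sorted(set(candidate_pairs)))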
Limited-Pass Algorithms
The algorithms for frequent itemsets discussed so far use one pass for each size of
itemset we investigate. If main memory is too small to hold the data and the space
needed to count frequent itemsets of one size, there does not seem to be any way to
avoid k passes to compute the exact collection of frequent itemsets. However, there
are many applications where it is not essential to discover every frequent itemset. For
instance, if we are looking for items purchased together at a supermarket, we are not
going to run a sale based on every frequent itemset we find, so it is quite sufficient to
find most but not all of the frequent itemsets. In this section we explore some
algorithms that have been proposed to find all or most frequent itemsets using at most
two passes. We begin with the obvious approach of using a sample of the data rather
than the entire dataset. An algorithm called SON uses two passes, gets the exact
answer, and lends itself to implementation by map-reduce or another parallel
computing regime. Finally, Toivonen’s Algorithm uses two passes on average, gets an
exact answer, but may, rarely, not terminate in any given amount of time.
The SON algorithm finds all or most frequent itemsets using at most two passes over the data.
MapReduce-MapReduce sequence :
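The details of the MapReduce-MapReduce sequence are not reproduced here; as a rough illustration of the idea, the plain-Python sketch below simulates the two passes over chunks of the file: the first pass proposes itemsets that are frequent in some chunk (using any in-memory miner, here a toy pair counter standing in for Apriori or PCY), and the second pass counts those candidates over the whole file.

# A plain-Python simulation of the SON two-pass (MapReduce-MapReduce) idea.
from itertools import combinations

def find_frequent_itemsets(chunk, local_threshold):
    """Toy in-memory miner: frequent pairs within one chunk (stand-in for Apriori/PCY)."""
    counts = {}
    for basket in chunk:
        for pair in combinations(sorted(basket), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {frozenset(p) for p, c in counts.items() if c >= local_threshold}

def son(baskets, support_threshold, num_chunks):
    chunk_size = max(1, len(baskets) // num_chunks)
    chunks = [baskets[i:i + chunk_size] for i in range(0, len(baskets), chunk_size)]

    # First stage: each chunk proposes its locally frequent itemsets,
    # using the threshold scaled down to the chunk's share of the file.
    local_threshold = max(1, support_threshold // num_chunks)
    candidates = set()
    for chunk in chunks:
        candidates |= find_frequent_itemsets(chunk, local_threshold)

    # Second stage: count every candidate over all baskets and keep the truly frequent ones.
    counts = {c: 0 for c in candidates}
    for basket in baskets:
        for c in candidates:
            if c <= basket:
                counts[c] += 1
    return {c for c, n in counts.items() if n >= support_threshold}

baskets = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]   # toy data
print(son(baskets, support_threshold=2, num_chunks=2))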
Density-based
In density-based clustering, data is grouped by areas of high concentrations of data
points surrounded by areas of low concentrations of data points. Basically, the
algorithm finds the places that are dense with data points and calls those clusters. The
great thing about this is that the clusters can be any shape. You aren't constrained to
expected conditions. The clustering algorithms under this type don't try to assign
outliers to clusters, so they get ignored.
Distribution-based
With a distribution-based clustering approach, all of the data points are considered
parts of a cluster based on the probability that they belong to a given cluster. It works
like this: there is a center point, and as the distance of a data point from the center
increases, the probability of it being a part of that cluster decreases. If you aren't sure
how the distribution in your data might look, you should consider a different type of
algorithm.
Centroid-based
Centroid-based clustering is the one you probably hear about the most. It's a little
sensitive to the initial parameters you give it, but it's fast and efficient. These types of
algorithms separate data points based on multiple centroids in the data. Each data
point is assigned to a cluster based on its squared distance from the centroid. This is
the most commonly used type of clustering.
Hierarchical-based
Hierarchical-based clustering is typically used on hierarchical data, like you would get
from a company database or taxonomies. It builds a tree of clusters so everything is
organized from the top down. This is more restrictive than the other clustering types,
but it's perfect for specific kinds of data sets.
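If scikit-learn is available, the four families above map naturally onto common estimators; the snippet below is only an illustrative pairing, and the parameter values are arbitrary.

# Illustration of the four clustering families using scikit-learn estimators
# (assuming scikit-learn is installed); each line maps one family to one algorithm.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)                                                  # toy 2-D data

density_labels      = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)         # density-based
distribution_labels = GaussianMixture(n_components=3).fit_predict(X)        # distribution-based
centroid_labels     = KMeans(n_clusters=3, n_init=10).fit_predict(X)        # centroid-based
hierarchy_labels    = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # hierarchical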
Hierarchical Clustering in Machine Learning
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but they differ in how they work. Unlike the K-Means algorithm, there is no
requirement to predetermine the number of clusters.
o Step-1: Initially, treat each data point as a single cluster, so there will be N
clusters at the start.
o Step-2: Take the two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form one
cluster. There will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left.
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
As we have seen, the closest distance between two clusters is crucial for
hierarchical clustering. There are various ways to calculate the distance between two
clusters, and these ways decide the rule for clustering. These measures are
called Linkage methods. Some of the popular linkage methods are given below:
o Single linkage: the minimum distance between the closest points of the two clusters.
o Complete linkage: the maximum distance between the farthest points of the two clusters.
o Average linkage: the average of all pairwise distances between the points of the two clusters.
o Centroid linkage: the distance between the centroids of the two clusters.
From the above-given approaches, we can apply any of them according to the type of
problem or business requirement.
The dendrogram is a tree-like structure that is mainly used to record each merge step
that the HC algorithm performs. In the dendrogram plot, the Y-axis shows
the Euclidean distances between the data points, and the x-axis shows all the data
points of the given dataset.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding dendrogram.
We can cut the dendrogram tree structure at any level as per our requirement.
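A small SciPy sketch of agglomerative clustering and its dendrogram is shown below (assuming SciPy and Matplotlib are installed); the data is synthetic, and the choice of single linkage is just one of the linkage methods listed above.

# Agglomerative clustering and dendrogram with SciPy (synthetic data).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.random.rand(20, 2)                 # toy data points

Z = linkage(X, method="single")           # "single" = minimum-distance linkage
dendrogram(Z)                             # Y-axis: merge distances, X-axis: data points
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)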
It allows us to cluster the data into different groups and is a convenient way to discover
the categories of groups in an unlabeled dataset on its own, without the need for any
training.
Hence, each cluster has data points with some commonalities, and each cluster is far
away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new
closest centroid of its cluster.
Step-6: If any reassignment occurred, go back to Step-4; otherwise, the model is ready.
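The steps above can be sketched compactly with NumPy; the data, the value of K, and the exact stopping rule (stop when the centroids no longer move) are illustrative choices.

# A compact NumPy sketch of the K-Means steps listed above.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Step-2: random initial centroids
    for _ in range(n_iters):
        # Step-3: assign each data point to its closest centroid (squared distance)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Step-4: place a new centroid at the mean of each cluster (keep old one if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 5-6: reassign on the next iteration; stop once centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.rand(100, 2)   # toy data, assumed for illustration
labels, centroids = kmeans(X, k=3)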
Clustering of high-dimensional data returns groups of objects that form clusters.
Cluster analysis of high-dimensional data requires grouping similar types of objects
together, but the high-dimensional data space is huge and has complex data types and
attributes. A major challenge is that we need to find out the set of attributes that are
present in each cluster. A cluster is defined and characterized based on the attributes
present in it. When clustering high-dimensional data, we need to search for clusters
and find out the subspace in which the clusters exist.
High-dimensional data is reduced to low-dimensional data to make the clustering
and the search for clusters simpler. Some applications need appropriate models of
clusters, especially for high-dimensional data. Clusters in high-dimensional data
are significantly small, and the conventional distance measures can be ineffective.
Instead, to find the hidden clusters in high-dimensional data we need to apply
sophisticated techniques that can model correlations among the objects in subspaces.
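One common way to make the search simpler, as noted above, is to reduce the dimensionality before clustering. The snippet below is a minimal sketch of that idea, assuming scikit-learn: project the data to a few components with PCA and then run K-Means on the projection.

# Sketch of "reduce then cluster" for high-dimensional data (assuming scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(500, 100)                     # hypothetical high-dimensional data

X_low = PCA(n_components=5).fit_transform(X)     # reduce 100 dimensions to 5
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X_low)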
Subspace Clustering Methods: There are 3 Subspace Clustering Methods:
CLIQUE
CLIQUE is a density-based and grid-based subspace clustering algorithm. So let’s
first take a look at what grid-based and density-based clustering techniques are.
Grid-Based Clustering Technique: In grid-based methods, the space of
instances is divided into a grid structure. Clustering techniques are then applied
using the cells of the grid, instead of individual data points, as the base units.
Density-Based Clustering Technique: In density-based methods, a cluster
is a maximal set of connected dense units in a subspace.
CLIQUE Algorithm:
The CLIQUE Algorithm uses density-based and grid-based techniques, i.e., it is a subspace
clustering algorithm, and it finds clusters by taking a density threshold and the number of grids
as input parameters. It is specially designed to handle datasets with a large number
of dimensions. The CLIQUE Algorithm is very scalable with respect to the number of
records and the number of dimensions in the dataset because it is grid-based and uses
the Apriori property effectively.
Apriori approach:
The Apriori approach states that if an X-dimensional unit is dense, then all of its
projections in (X-1)-dimensional space are also dense.
This means that dense regions in a given subspace must produce dense regions when
projected to a lower-dimensional subspace. CLIQUE restricts its search for high-
dimensional dense cells to the intersection of dense cells in subspaces because
it uses the Apriori property.
The CLIQUE algorithm first divides the data space into grids. This is done by dividing
each dimension into equal intervals called units. After that, it identifies dense units.
A unit is dense if the number of data points in it exceeds the threshold value.
Once the algorithm finds dense cells along one dimension, it tries to find
dense cells along two dimensions, and this continues until dense cells across all
dimensions are found.
After finding all dense cells in all dimensions, the algorithm proceeds to find the
largest set (“cluster”) of connected dense cells. Finally, the CLIQUE algorithm
generates a minimal description of the cluster. Clusters are then generated from all
dense subspaces using the apriori approach.
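A very simplified sketch of the CLIQUE idea follows: partition each dimension into equal intervals, find the dense 1-D units, and, using the Apriori property, form candidate 2-D units only from dense 1-D units. The grid size and density threshold are illustrative assumptions, and the sketch omits the later stages (connecting dense units into clusters and producing the minimal cluster description).

# Simplified CLIQUE-style dense-unit search (illustrative parameters only).
import numpy as np
from itertools import combinations

X = np.random.rand(300, 3)           # toy data: 300 points, 3 dimensions, values in [0, 1)
n_intervals = 5                      # units per dimension (the grid resolution)
density_threshold = 20               # minimum number of points for a unit to be dense

# cell index of every point in every dimension
cells = np.minimum((X * n_intervals).astype(int), n_intervals - 1)

# dense 1-D units: (dimension, interval) pairs holding enough points
dense_1d = {
    (d, u)
    for d in range(X.shape[1])
    for u in range(n_intervals)
    if np.sum(cells[:, d] == u) >= density_threshold
}

# candidate 2-D units: combine dense 1-D units from two different dimensions (Apriori property)
dense_2d = set()
for (d1, u1), (d2, u2) in combinations(sorted(dense_1d), 2):
    if d1 == d2:
        continue
    count = np.sum((cells[:, d1] == u1) & (cells[:, d2] == u2))
    if count >= density_threshold:
        dense_2d.add(((d1, u1), (d2, u2)))

print(len(dense_1d), "dense 1-D units,", len(dense_2d), "dense 2-D units")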
Advantage:
The CLIQUE Algorithm automatically finds subspaces of the highest dimensionality
that contain dense clusters, it is insensitive to the order of the input records, and it
scales linearly with the size of the input.
Disadvantage:
The main disadvantage of the CLIQUE Algorithm is that if the size of the cells is
unsuitable for the data values, then too much estimation will take place and the
correct clusters cannot be found.
PROCLUS. (Projected Clustering) :
Projected clustering is the first top-down partitioning projected clustering
algorithm, based on the notion of k-medoid clustering, which was presented by
Aggarwal (1999). It determines medoids for each cluster repetitively on a sample of
data using a greedy hill-climbing technique and then refines the results
repetitively. Cluster quality in projected clustering is a function of the average distance
between data points and the closest medoid. Also, the subspace dimensionality is an
input parameter, which generates clusters of similar sizes.
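As a hedged illustration of the k-medoid notion that PROCLUS builds on, the sketch below implements plain k-medoids; it is not PROCLUS itself, which additionally selects, for each medoid, the subset of dimensions (the projected subspace) in which its cluster is tight.

# Plain k-medoids sketch (NOT PROCLUS: no per-cluster subspace selection).
import numpy as np

def k_medoids(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iters):
        # assign every point to its closest medoid
        d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # for each cluster, pick the member minimising total distance to the others
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            within = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
            new_idx[j] = members[within.sum(axis=1).argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return labels, medoid_idx

X = np.random.rand(200, 4)    # toy data, assumed for illustration
labels, medoids = k_medoids(X, k=3)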
Clustering For Streams and Parallelism: Data stream clustering refers to the
clustering of data that arrives continually, such as financial transactions, multimedia
data, or telephonic records. It is usually studied as a “streaming algorithm.” The
purpose of data stream clustering is to construct a good clustering of the stream using a
small amount of time and memory.
Technically, clustering is the act of grouping elements into sets. The main purpose
of this type of separation is to unite items that are similar to each other, by
comparing a series of their characteristics. When we talk about Data Stream,
we can separate its methods into five categories, namely partitioning, hierarchical,
density-based, grid-based and model-based.
There is one more factor to take into account when talking about clusters. It is
possible to divide the possible distances into four: the minimum (or single
linkage), the maximum (or complete linkage), the mean distance, and the average, and
each one has its own characteristics regarding implementation cost and
computational power, the minimum distance and mean distance being the more
common choices in Data Stream Clustering.
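As a toy illustration of the time and memory constraint, the sketch below performs an online (one-pass) k-means style update: each arriving point updates a single centroid and is then discarded. It is a didactic sketch, not a specific published stream-clustering algorithm such as BIRCH or CluStream.

# Toy online (one-pass) k-means update for a stream of points.
import numpy as np

class OnlineKMeans:
    def __init__(self, k, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.random((k, dim))   # hypothetical initial centroids
        self.counts = np.zeros(k)

    def update(self, point):
        point = np.asarray(point, dtype=float)
        j = np.linalg.norm(self.centroids - point, axis=1).argmin()
        self.counts[j] += 1
        # move the chosen centroid toward the point by a shrinking step size
        self.centroids[j] += (point - self.centroids[j]) / self.counts[j]
        return j

model = OnlineKMeans(k=3, dim=2)
for point in np.random.rand(1000, 2):            # simulated stream of points
    model.update(point)
print(model.centroids)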
Basic subspace clustering approaches are :
1. Grid-based subspace clustering :
a. In this approach, the data space is divided into axis-parallel cells. Then the cells
containing objects above a predefined threshold value, given as a parameter, are
merged to form subspace clusters. The number of intervals is another input parameter,
which defines the range of values in each grid cell.
b. Apriori property is used to prune non-promising cells and to improve efficiency.
c. If a unit is found to be dense in k-1 dimensions, then it is considered for finding
dense units in k dimensions.
d. If grid boundaries are strictly followed to separate objects, the accuracy of the
clustering result decreases, as it may miss neighbouring objects that get separated by a
strict grid boundary. Clustering quality is highly dependent on the input parameters.