Unit - 1-1
In artificial intelligence, machine learning that takes place in the absence of human supervision is known as unsupervised machine learning. Unsupervised machine learning models, in contrast to supervised learning, are given unlabeled data and are left to discover patterns and insights on their own, without explicit direction or instruction.
Unsupervised machine learning analyzes and clusters unlabeled datasets using machine learning algorithms. These algorithms find hidden patterns in the data without any human intervention, i.e., we don't give the output to our model. The training data contains only input parameter values, and the model discovers the groups or patterns on its own.
Unsupervised Learning
Unsupervised learning works by analyzing unlabeled data to identify patterns and relationships. The
data is not labeled with any predefined categories or outcomes, so the algorithm must find these
patterns and relationships on its own. This can be a challenging task, but it can also be very
rewarding, as it can reveal insights into the data that would not be apparent from a labeled dataset.
The dataset in Figure A is mall data that contains information about the clients who subscribe to the mall. Once subscribed, each client is provided a membership card, and the mall has complete information about the customer and his/her every purchase. Using this data and unsupervised learning techniques, the mall can easily group clients based on the parameters we feed in.
The input to the unsupervised learning models is as follows:
• Unstructured data: may contain noisy (meaningless) data, missing values, or unknown data.
• Unlabeled data: the data contains values only for the input parameters; there is no target value (output). It is easier to collect than the labeled data used in the supervised approach.
There are mainly three types of algorithms used for unsupervised datasets:
• Clustering
• Association Rule Learning
• Dimensionality Reduction
Clustering
Clustering in unsupervised machine learning is the process of grouping unlabeled data into clusters
based on their similarities. The goal of clustering is to identify patterns and relationships in the data
without any prior knowledge of the data’s meaning.
Broadly, this technique is applied to group data based on the patterns, such as similarities or differences, that the machine learning model finds. These algorithms are used to process raw, unclassified data objects into groups. For example, in the figure above, we have not given output parameter values, so this technique can be used to group clients based on the input parameters provided by our data.
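To make this concrete, here is a minimal sketch (not the actual mall dataset) of how unlabeled client records could be grouped with k-means in Python; the feature names and values are hypothetical, chosen only for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input parameters for a few clients: [annual_income_k, spending_score]
clients = np.array([
    [15, 39], [16, 81], [17, 6], [18, 77],
    [60, 50], [62, 42], [90, 15], [88, 13],
])

# No output labels are supplied; the model discovers the groups on its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(clients)
print(kmeans.labels_)           # cluster assignment for each client
print(kmeans.cluster_centers_)  # center of each discovered group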
Association Rule Learning
Association rule learning, also known as association rule mining, is a common technique used to discover associations in unsupervised machine learning. It is a rule-based ML technique that finds very useful relations between parameters of a large dataset. This technique is mainly used for market basket analysis, which helps to better understand the relationship between different products. For example, shopping stores use algorithms based on this technique to find the relationship between the sale of one product and the sales of other products based on customer behavior: if a customer buys milk, he may also buy bread, eggs, or butter. Once trained well, such models can be used to increase sales by planning different offers.
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features in a dataset while preserving as much information as possible. This technique is useful for improving the performance of machine learning algorithms and for data visualization.
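As a rough illustration, the sketch below uses Principal Component Analysis (PCA), one common dimensionality reduction method (chosen here as an assumption, since the text does not name a specific algorithm), to compress 10 synthetic features down to 2.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10))  # 100 samples, 10 correlated features

pca = PCA(n_components=2)              # keep only 2 features
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)       # (100, 10) -> (100, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of information preserved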
Advantages of Unsupervised Learning
• No labeled data required: Unlike supervised learning, unsupervised learning does not require labeled data, which can be expensive and time-consuming to collect.
• Can uncover hidden patterns: Unsupervised learning algorithms can identify patterns and relationships in data that may not be obvious to humans.
• Can be used for a variety of tasks: Unsupervised learning can be used for a variety of tasks, such as clustering, dimensionality reduction, and anomaly detection.
• Can be used to explore new data: Unsupervised learning can be used to explore new data and gain insights that may not be possible with other methods.
Disadvantages of Unsupervised Learning
• Overfitting: Unsupervised learning algorithms can overfit to the specific dataset used for training, limiting their ability to generalize to new data.
• Data quality: Unsupervised learning algorithms are sensitive to the quality of the input data. Noisy or incomplete data can lead to misleading or inaccurate results.
An example application is fraud detection: unsupervised learning can be used to detect fraud in financial data by identifying transactions that deviate from expected patterns. This can help prevent fraud by flagging these transactions for further investigation.
Supervised Machine Learning
Supervised learning is a type of machine learning algorithm that learns from labeled data. Labeled data is data that has been tagged with a correct answer or classification.
Supervised learning, as the name indicates, has the presence of a supervisor as a teacher. Supervised learning is when we teach or train the machine using data that is well labelled, which means some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
For example, a labeled dataset of images of elephants, camels, and cows would have each image tagged with either “Elephant”, “Camel”, or “Cow”.
Key Points:
• The machine learns the relationship between the inputs (animal images) and outputs (animal labels).
• The trained machine can then make predictions on new, unlabeled data.
Supervised vs. Unsupervised Machine Learning
Parameters        Supervised machine learning        Unsupervised machine learning
Model             We can test our model.             We cannot test our model.
What is Clustering?
The task of grouping data points based on their similarity with each other is called clustering or cluster analysis. This method falls under the branch of unsupervised learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, we don't have a target variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity based on a metric like Euclidean distance, cosine similarity, Manhattan distance, etc., and then groups the points with the highest similarity scores together.
For example, in the graph given below, we can clearly see that there are 3 circular clusters forming on the basis of distance.
It is not necessary that the clusters formed are circular in shape; the shape of clusters can be arbitrary. There are many algorithms that work well at detecting arbitrarily shaped clusters. For example, in the graph given below, we can see that the clusters formed are not circular in shape.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:
• Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. Each data point will then belong to either cluster C1 or cluster C2.
Data Point   Cluster
A            C1
B            C2
C            C2
D            C1
• Soft Clustering: In this type of clustering, instead of assigning each data point to a single cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. We then evaluate, for every data point, the probability of it belonging to each of the two clusters (a sketch contrasting the two styles follows the table below).
Data Point   Probability of C1   Probability of C2
A            0.91                0.09
B            0.3                 0.7
C            0.17                0.83
D            1                   0
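The following minimal sketch contrasts the two styles on four made-up points A-D, using K-Means for hard assignments and a Gaussian mixture for soft (probabilistic) assignments; the data values are invented for illustration and will not reproduce the exact table above.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [7.8, 8.2], [8.0, 8.0], [1.2, 0.8]])  # points A, B, C, D

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("hard assignments:", hard)  # each point belongs to exactly one cluster

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("soft assignments:")
print(gmm.predict_proba(X).round(2))  # probability of each cluster per point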
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through the use cases of
Clustering algorithms. Clustering algorithms are majorly used for:
• Market Segmentation – Businesses use clustering to group their customers and use targeted
advertisements to attract more audience.
• Market Basket Analysis – Shop owners analyze their sales and figure out which items are frequently bought together by customers. For example, in the USA, according to a study, diapers and beer were often bought together by fathers.
• Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content
recommendations.
• Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like
X-rays.
• Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. Now, an entire feature set can be reduced to its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID. Using the same principle, clustering can make complex datasets simpler.
There are many more use cases for clustering, but these are some of the major and common ones. Moving forward, we will discuss the clustering algorithms that will help you perform the above tasks.
At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest
distance, and the density of the data points are a few of the elements that influence cluster
formation. Clustering is the process of determining how related the objects are based on a metric
called the similarity measure. Similarity metrics are easier to locate in smaller sets of features. It gets
harder to create similarity measures as the number of features increases. Depending on the type of
clustering algorithm being utilized in data mining, several techniques are employed to group the data
from the datasets. In this part, the clustering techniques are described. The various types of clustering algorithms are:
1. Partitioning (Centroid-based) Clustering
2. Density-based Clustering
3. Hierarchical (Connectivity-based) Clustering
4. Distribution-based Clustering
1. Partitioning (Centroid-based) Clustering
Partitioning methods are the easiest clustering algorithms. They group data points on the basis of their closeness. Generally, the similarity measures chosen for these algorithms are Euclidean distance, Manhattan distance, or Minkowski distance. The dataset is separated into a predetermined number of clusters, and each cluster is referenced by a vector of values. Each input data point is compared with these vectors and joins the cluster to which it is closest.
The primary drawback of these algorithms is the requirement that we establish the number of clusters, "k", either intuitively or scientifically (using the Elbow Method, sketched below) before the clustering system starts allocating the data points. Despite this, it is still the most popular type of clustering. K-Means and K-Medoids clustering are examples of this type of clustering.
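Here is a minimal sketch of the Elbow Method mentioned above: fit K-Means for several values of k on synthetic data and look for the point where the inertia (within-cluster sum of squared errors) stops dropping sharply.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia drops sharply until k reaches the true cluster count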
2. Density-based Clustering
Density-based clustering, a model-based method, finds groups based on the density of data points.
Contrary to centroid-based clustering, which requires that the number of clusters be predefined and
is sensitive to initialization, density-based clustering determines the number of clusters automatically
and is less susceptible to beginning positions. They are great at handling clusters of different sizes
and forms, making them ideally suited for datasets with irregularly shaped or overlapping clusters.
These methods manage both dense and sparse data regions by focusing on local density and can
distinguish clusters with a variety of morphologies.
In contrast, centroid-based clustering, like k-means, has trouble finding arbitrarily shaped clusters. Due to its requirement for a preset number of clusters and its extreme sensitivity to the initial positioning of centroids, the outcomes can vary. Furthermore, the tendency of centroid-based approaches to
produce spherical or convex clusters restricts their capacity to handle complicated or irregularly
shaped clusters. In conclusion, density-based clustering overcomes the drawbacks of centroid-based
techniques by autonomously choosing cluster sizes, being resilient to initialization, and successfully
capturing clusters of various sizes and forms. The most popular density-based clustering algorithm
is DBSCAN.
3. Hierarchical Clustering
A method for assembling related data points into hierarchical clusters is called hierarchical clustering.
Each data point is initially taken into account as a separate cluster, which is subsequently combined
with the clusters that are the most similar to form one large cluster that contains all of the data
points.
Think about how you may arrange a collection of items based on how similar they are. Each object
begins as its own cluster at the base of the tree when using hierarchical clustering, which creates a
dendrogram, a tree-like structure. The closest pairings of clusters are then combined into larger
clusters after the algorithm examines how similar the objects are to one another. When every object
is in one cluster at the top of the tree, the merging process has finished. Exploring various granularity
levels is one of the fun things about hierarchical clustering. To obtain a given number of clusters, you
can select to cut the dendrogram at a particular height. The more similar two objects are within a
cluster, the closer they are. It’s comparable to classifying items according to their family trees, where
the nearest relatives are clustered together and the wider branches signify more general
connections. There are 2 approaches for Hierarchical clustering:
• Divisive Clustering: It follows a top-down approach; here we consider all data points to be part of one big cluster, and then this cluster is divided into smaller groups.
• Agglomerative Clustering: It follows a bottom-up approach, here we consider all data points
to be part of individual clusters and then these clusters are clubbed together to make one big
cluster with all data points.
4. Distribution-based Clustering
In distribution-based clustering, data points are grouped and organized according to their propensity to fall into the same probability distribution (such as a Gaussian, binomial, or other distribution) within the data. The data elements are grouped using a probability-based approach built on statistical distributions: data objects that have a higher likelihood of belonging to a distribution are included in its cluster. Every cluster has a central point, and the further a data point is from that central point, the less likely it is to be included in the cluster.
A notable drawback of density and boundary-based approaches is the need to specify the clusters a
priori for some algorithms, and primarily the definition of the cluster form for the bulk of algorithms.
There must be at least one tuning or hyper-parameter selected, and while doing so should be simple,
getting it wrong could have unanticipated repercussions. Distribution-based clustering has a definite
advantage over proximity and centroid-based clustering approaches in terms of flexibility, accuracy,
and cluster structure. The key issue is that, in order to avoid overfitting, many clustering methods
only work with simulated or manufactured data, or when the bulk of the data points certainly belong
to a preset distribution. The most popular distribution-based clustering algorithm is the Gaussian Mixture Model (GMM).
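A minimal sketch of distribution-based clustering with a Gaussian Mixture Model is shown below: each cluster is modeled as a Gaussian distribution, and points are assigned according to the probability of having been generated by each one. The data is synthetic.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=7)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)        # most likely distribution for each point
probs = gmm.predict_proba(X)   # membership probabilities (soft assignment)
print(gmm.means_)              # estimated center of each Gaussian
print(gmm.covariances_.shape)  # estimated spread/shape of each cluster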
Applications of Clustering
1. Marketing: It can be used to characterize and discover customer segments for marketing purposes.
2. Biology: It can be used for classification among different species of plants and animals.
3. Libraries: It is used in clustering different books on the basis of topics and information.
4. Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.
5. City Planning: It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.
6. Earthquake studies: By learning the earthquake-affected areas we can determine the
dangerous zones.
7. Image Processing: Clustering can be used to group similar images together, classify images
based on content, and identify patterns in image data.
8. Genetics: Clustering is used to group genes that have similar expression patterns and identify
gene networks that work together in biological processes.
9. Finance: Clustering is used to identify market segments based on customer behavior, identify
patterns in stock market data, and analyze risk in investment portfolios.
10. Customer Service: Clustering is used to group customer inquiries and complaints into
categories, identify common issues, and develop targeted solutions.
11. Manufacturing: Clustering is used to group similar products together, optimize production
processes, and identify defects in manufacturing processes.
12. Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases,
which helps in making accurate diagnoses and identifying effective treatments.
13. Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial
transactions, which can help in detecting fraud or other financial crimes.
14. Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak
hours, routes, and speeds, which can help in improving transportation planning and
infrastructure.
15. Social network analysis: Clustering is used to identify communities or groups within social
networks, which can help in understanding social behavior, influence, and trends.
16. Cybersecurity: Clustering is used to group similar patterns of network traffic or system
behavior, which can help in detecting and preventing cyberattacks.
17. Climate analysis: Clustering is used to group similar patterns of climate data, such as
temperature, precipitation, and wind, which can help in understanding climate change and
its impact on the environment.
18. Sports analysis: Clustering is used to group similar patterns of player or team performance
data, which can help in analyzing player or team strengths and weaknesses and making
strategic decisions.
19. Crime analysis: Clustering is used to group similar patterns of crime data, such as location,
time, and type, which can help in identifying crime hotspots, predicting future crime trends,
and improving crime prevention strategies.
Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters that have to be generated for the clustering method. In the partitioning method, given a database (D) that contains N objects, the method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. There are many algorithms that come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and the CLARA algorithm (Clustering Large Applications). Below, we will see the working of the K-Means algorithm in detail.
K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the resulting similarity among the data objects inside a group (intra-cluster) is high, while the similarity of data objects with objects outside the cluster (inter-cluster) is low. The similarity of a cluster is determined with respect to the mean value of the cluster. It is a type of squared-error algorithm. At the start, K objects are randomly chosen from the dataset, each representing a cluster mean (center). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then calculated with the added data objects.
Algorithm:
K-Means:
Input:
K: the number of clusters into which the dataset has to be divided
D: a dataset containing N objects
Output:
A set of K clusters
Method:
1. Arbitrarily choose K objects from D as the initial cluster centers (means).
2. (Re)assign each object to the cluster to which it is most similar, based on the cluster mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat steps 2 and 3 until the cluster assignments no longer change.
Example: Suppose we want to group the visitors to a website using just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between Iteration 3 and Iteration 4, so we stop. Therefore, we get two clusters, (16-29) and (36-66), using the K-Means algorithm.
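The short sketch below re-runs this worked example with scikit-learn, starting from the same initial centroids 16 and 22; it should converge to roughly the same two clusters (means near 20.5 and 48.9).

import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36,
                 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

km = KMeans(n_clusters=2, init=np.array([[16.0], [22.0]]), n_init=1).fit(ages)
for c in range(2):
    members = ages[km.labels_ == c].ravel().tolist()
    print(f"cluster {c}: mean={km.cluster_centers_[c][0]:.2f}, members={members}")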
Hierarchical Clustering
Hierarchical clustering is a connectivity-based clustering model that groups the data points together
that are close to each other based on the measure of similarity or distance. The assumption is that
data points that are close to each other are more similar or related than data points that are farther
apart.
We can look at the dendrogram and measure the height at which the branches of the dendrogram
form distinct clusters to calculate the ideal number of clusters. The dendrogram can be sliced at this
height to determine the number of clusters.
There are two approaches to hierarchical clustering:
1. Agglomerative Clustering
2. Divisive Clustering
Agglomerative Clustering Algorithm:
# compute the distance matrix
for i = 1 to N:
    for j = 1 to i:
        dis_mat[i][j] = distance(d_i, d_j)
# treat each data point as a singleton cluster
repeat
    merge the two closest clusters and update the distance matrix
until only a single cluster remains
Steps:
• Consider each alphabet as a single cluster and calculate the distance of one cluster from all
the other clusters.
• In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them in the second step; similarly for clusters (D) and (E). At last, we get the clusters [(A), (BC), (DE), (F)].
• We recalculate the proximity according to the algorithm and merge the two nearest
clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
• Repeating the same process; The clusters DEF and BC are comparable and merged together
to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
• At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
Divisive Clustering Algorithm:
start with all data points in a single cluster
repeat
    split the chosen cluster using a flat clustering method, e.g., K-Means
until each data point is in its own singleton cluster
While merging two clusters, we check the distance between every pair of clusters and merge the pair with the least distance (most similarity). But the question is: how is that distance determined? There are different ways of defining inter-cluster distance/similarity. Some of them are:
1. Min Distance: Find the minimum distance between any two points of the cluster.
2. Max Distance: Find the maximum distance between any two points of the cluster.
3. Group Average: Find the average distance between every two points of the clusters.
4. Ward’s Method: The similarity of two clusters is based on the increase in squared error when
two clusters are merged.
For example, if we group a given dataset using different linkage methods, we may get different results, as the sketch below illustrates:
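Below is a minimal sketch of this idea: the same synthetic data is clustered hierarchically with single, complete, average, and Ward linkage, and the resulting dendrogram is cut into three clusters; the labels can differ from method to method.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=1)

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                    # build the hierarchy (dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(method, labels)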
• Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. The time complexity of a naive agglomerative clustering is O(n^3) because we exhaustively scan the N x N matrix dis_mat for the lowest distance in each of the N-1 iterations. Using a priority queue data structure, we can reduce this complexity to O(n^2 log n), and with some more optimizations it can be brought down to O(n^2). For divisive clustering, given a fixed number of top levels and using an efficient flat algorithm like K-Means, divisive algorithms are linear in the number of patterns and clusters.
DBSCAN Clustering
Clusters are dense regions in the data space, separated by regions of lower point density.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that
for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum
number of points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact and
well-separated clusters. Moreover, they are also severely affected by the presence of noise and
outliers in the data.
Real-life data may contain irregularities, such as:
1. Clusters can be of arbitrary shape, such as those shown in the figure below.
2. Data may contain noise.
The DBSCAN algorithm uses two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two points is lower than or equal to 'eps', then they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will end up in the same cluster. One way to find the eps value is based on the k-distance graph.
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1. MinPts must be chosen to be at least 3.
The DBSCAN algorithm proceeds in the following steps:
1. Find all the neighbor points within eps and identify the core points, i.e., those with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster as the
core point.
Points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are within eps distance of c. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not
belong to any cluster are noise.
DBSCAN(dataset, eps, MinPts):
    # cluster index
    C = 1
    for each unvisited point p in dataset:
        mark p as visited
        # find neighbors
        N = points within eps distance of p
        if |N| >= MinPts:
            for each point p' in N:
                if p' is a core point: N = N U N'
                if p' is not a member of any cluster: add p' to cluster C
            C = C + 1
        else:
            mark p as noise
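In practice, the algorithm above is available off the shelf; here is a minimal sketch using scikit-learn's DBSCAN with the two parameters discussed earlier (eps and MinPts, called min_samples in scikit-learn). Points labeled -1 are treated as noise.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)  # non-spherical clusters

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))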
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-Means are both clustering algorithms that group together data points with similar characteristics. However, they work on different principles and are suitable for different types of data. We prefer to use DBSCAN when the data is not spherical in shape or when the number of classes is not known beforehand.
DBSCAN vs. K-Means:
• DBSCAN can find clusters of arbitrary shape; K-Means works best with spherical or convex clusters.
• DBSCAN does not need the number of clusters in advance; K-Means requires K to be specified beforehand.
• DBSCAN identifies noise points and is robust to outliers; K-Means is sensitive to noise and outliers.
Spectral Co-Clustering
Spectral co-clustering is a clustering algorithm that uses spectral graph theory to find clusters in both
rows and columns of a data matrix simultaneously. This is done by constructing a bi-partite graph
from the data matrix, where the rows and columns of the matrix are represented as nodes in the
graph, and the entries in the matrix are represented as edges between the nodes.
The spectral co-clustering algorithm then uses the eigenvectors of the graph Laplacian to find the
clusters in the data matrix. This is done by treating the rows and columns of the data matrix as two
separate sets of nodes and using the eigenvectors to partition each set into clusters.
One advantage of the spectral co-clustering algorithm is that it can handle data with missing entries.
This is because the algorithm only uses the non-zero entries in the data matrix to construct the bi-
partite graph, and therefore does not require the matrix to be complete.
Another advantage of the spectral co-clustering algorithm is that it can find clusters of different sizes
and shapes. This is because the algorithm uses the eigenvectors of the graph Laplacian, which are
sensitive to the local structure of the graph and can therefore identify clusters of different shapes
and sizes.
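A minimal sketch, assuming scikit-learn's SpectralCoclustering implementation, is given below: a synthetic matrix with a planted bicluster structure is generated and the model recovers row and column assignments simultaneously.

import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

data, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3, noise=0.5, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(data)
print(model.row_labels_)     # bicluster index of each row
print(model.column_labels_)  # bicluster index of each column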
Biclustering
Analyzing patterns to partition the data samples according to some criteria is called clustering. The data mining technique that allows simultaneous clustering of the rows and columns of a matrix is called biclustering. Given a set of m samples, each represented by an n-dimensional feature vector, the entire dataset can be represented as a matrix with m rows and n columns. The biclustering algorithm generates biclusters: subsets of rows that exhibit similar behavior across a subset of columns. A biclustering of a dataset is a collection of pairs of sample and feature subsets B = (L1, F1), (L2, F2), ..., (Lr, Fr) such that the collection (L1, L2, ..., Lr) forms a partition of the set of samples and the collection (F1, F2, ..., Fr) forms a partition of the set of features. Each pair (Lk, Fk) is a bicluster.
Types of Biclusters:
• Biclusters with a constant value: Rows and columns are reordered to group together rows and columns with similar, constant values. A perfect constant bicluster is a matrix in which all values are equal.
• Bicluster with constant values on rows or columns: In these biclusters, the rows or columns should be normalized.
• Bicluster with coherent values: The subsets of rows and columns have values that follow a coherent pattern.
o Additive: the values differ by a constant additive offset across rows or columns.
o Multiplicative: the values differ by a constant multiplicative factor across rows or columns.
Bi-Partite Graph:
A bipartite graph is one whose vertex set divides into two disjoint sets V1 and V2, such that each edge in the graph joins a vertex in V1 to a vertex in V2.
Spectral Co-Clustering:
The algorithm takes a bipartite graph as input: the data is divided into two sets of nodes connected by edges. It finds biclusters with higher values and rearranges the matrix so that these higher values lie along the diagonal. The matrix is first normalized as
An = R^(-1/2) * A * C^(-1/2)
where R and C are the diagonal matrices of row sums and column sums of A. The algorithm then uses
L = ceil(log2 K)
singular vectors of An to partition the rows and columns into K biclusters.
Spectral Biclustering:
It assumes the input matrix has a hidden checkerboard structure. In this structure, rows and columns are partitioned so that the entries of any bicluster in the Cartesian product of row clusters and column clusters are approximately constant.
Association Rule Mining
Association rule mining finds interesting associations and relationships among large sets of data items. A rule shows how frequently an itemset occurs in a transaction. A typical example is market basket analysis. Market basket analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently. Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Before we start defining the rule, let us first look at the basic definitions.
• Support Count (σ) – The frequency of occurrence of an itemset.
• Support (s) – The fraction of transactions that contain the items in both the {X} and {Y} parts of the rule, expressed as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together relative to all transactions.
• Confidence (c) – The ratio of the number of transactions that include all items in both {A} and {B} to the number of transactions that include all items in {A}.
• Lift (X=>Y) = Conf(X=>Y) / Supp(Y) – A lift value near 1 indicates that X and Y appear together about as often as expected; a value greater than 1 means they appear together more often than expected; and a value less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.
For example, for the rule {Milk, Diaper} => {Beer} on the table above:
Support = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
Confidence = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
Lift = Supp(Milk, Diaper, Beer) / (Supp(Milk, Diaper) * Supp(Beer)) = 0.4/(0.6*0.6) = 1.11
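The short sketch below recomputes these three numbers in plain Python, assuming the five-transaction table shown above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
supp = support(X | Y)               # 2/5 = 0.4
conf = support(X | Y) / support(X)  # 2/3 ~ 0.67
lift = conf / support(Y)            # 0.4 / (0.6 * 0.6) ~ 1.11
print(round(supp, 2), round(conf, 2), round(lift, 2))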
The association rule is very useful in analyzing datasets. The data is collected using bar-code scanners in supermarkets. Such databases consist of a large number of transaction records, each of which lists all the items bought by a customer in a single purchase. So the manager can know whether certain groups of items are consistently purchased together and use this data for adjusting store layouts, cross-selling, and promotions based on statistics.
Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative approach, or level-wise search, where frequent k-itemsets are used to find (k+1)-itemsets.
Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (the Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
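Before the step-by-step walkthrough, here is a compact, illustrative sketch of this level-wise search in plain Python (not an optimized implementation), applied to the five-transaction basket data used earlier with min_support = 2 transactions; the join step is simplified, while the prune step enforces the Apriori property.

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
min_support = 2

def count(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if count(frozenset([i])) >= min_support]
k, frequent = 1, {1: Lk}

while Lk:
    k += 1
    # Join step: build candidate k-itemsets from frequent (k-1)-itemsets
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step (Apriori property): every (k-1)-subset must be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    Lk = [c for c in candidates if count(c) >= min_support]
    if Lk:
        frequent[k] = Lk

for level, itemsets in frequent.items():
    print(level, [sorted(s) for s in itemsets])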
Before working through the algorithm step by step, recall the definitions of support and confidence explained in the previous section. Consider the following dataset; we will find the frequent itemsets and generate association rules for them.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare the candidate set items' support counts with the minimum support count (here min_support = 2); if the support_count of a candidate set item is less than min_support, remove that item. This gives us itemset L1.
Step-2: K=2
• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 and Lk-1 is that they should have (K-2) elements in common.
• Check whether all subsets of an itemset are frequent or not, and if not frequent, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
(II) Compare the candidate set (C2) support counts with the minimum support count (here min_support = 2); if the support_count of a candidate set item is less than min_support, remove that item. This gives us itemset L2.
Step-3:
o Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1 is that they should have (K-2) elements in common. So here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.
o Check whether all subsets of these itemsets are frequent or not, and if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
o Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate set (C3) support counts with the minimum support count (here min_support = 2); if the support_count of a candidate set item is less than min_support, remove that item. This gives us itemset L3.
Step-4:
o Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1 (K=4) is that they should have (K-2) elements in common. So here, for L3, the first 2 elements (items) should match.
o Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that, we need to calculate the confidence of each rule.
Confidence –
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
For example, a confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
So here, taking one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if the minimum confidence is 50%, then the first three rules can be considered strong association rules.
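A small sketch that checks these confidences from the support counts used above (e.g., sup(I1, I2, I3) = 2, sup(I1, I2) = 4) and flags the strong rules:

support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4, frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 2,
}

def confidence(antecedent, consequent):
    both = frozenset(antecedent) | frozenset(consequent)
    return support_count[both] / support_count[frozenset(antecedent)]

rules = [({"I1", "I2"}, {"I3"}), ({"I1", "I3"}, {"I2"}), ({"I2", "I3"}, {"I1"}),
         ({"I1"}, {"I2", "I3"}), ({"I2"}, {"I1", "I3"}), ({"I3"}, {"I1", "I2"})]

min_conf = 0.5
for a, c in rules:
    conf = confidence(a, c)
    print(sorted(a), "=>", sorted(c), f"{conf:.0%}",
          "strong" if conf >= min_conf else "weak")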