ML UNIT 2


UNIT – 2: UNSUPERVISED LEARNING ALGORITHM

K-MEANS CLUSTERING ALGORITHM


K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process; for example, if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way
that each data point belongs to only one group, and the points in a group have similar properties.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k must be predetermined in
this algorithm. The k-means clustering algorithm mainly performs two tasks:
➢ Determines the best value for K center points or centroids by an iterative process.
➢ Assigns each data point to its closest k-center; the data points nearest to a particular k-center form a cluster.

Hence each cluster contains data points with some commonalities and is far away from the other clusters.

How does the K-Means Algorithm Work? The working of the K-Means algorithm is explained in
the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They may be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
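
To make these steps concrete, below is a minimal sketch of K-Means using scikit-learn; the synthetic 2-D points, the choice of K=2, and the variable names are illustrative assumptions rather than part of the notes.

```python
# A minimal K-Means sketch using scikit-learn (assumes scikit-learn is installed).
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points standing in for the two variables M1 and M2 (illustrative data).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

# K is chosen up front (Step-1); the centroids are then refined iteratively (Steps 2-6).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels :", labels)                 # cluster index assigned to each point
print("Final centroids:", kmeans.cluster_centers_)
```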

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:
➢ Let's take the number of clusters K=2, i.e., we will try to group the dataset into two different clusters.
➢ We need to choose some random K points or centroids to form the clusters. These points can be
either points from the dataset or any other points. So, here we are selecting the below two
points as K points, which are not part of our dataset.

➢ Now we will assign each data point of the scatter plot to its closest K-point or centroid. We
will compute this using the usual mathematics for calculating the distance between two points,
and then draw a median line (the perpendicular bisector) between the two centroids.

From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue
centroid, and the points on the right of the line are closer to the yellow centroid. Let's color them blue
and yellow for clear visualization.
➢ As we need to find the closest cluster, we will repeat the process by choosing new
centroids. To choose the new centroids, we compute the center of gravity of the points in each
cluster and place the new centroids there, as below:

➢ Next, we will reassign each data point to the new closest centroid. For this, we will repeat the same
process of finding a median line. The new median line will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the line and two blue points
are to the right of the line, so these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to step-4, which is finding new centroids or
K-points.
➢ We will repeat the process by finding the center of gravity of the points in each cluster, so the new
centroids will be as shown in the below image:

➢ As we have the new centroids, we will again draw the median line and reassign the data points.
The image will then look like:

➢ We can see in the above image that there are no data points on the wrong side of the line,
which means our model is formed. Consider the below image:

As our model is ready, so we can now remove the assumed centroids, and the two final clusters will
be as shown in the below image:
HIERARCHICAL CLUSTERING
Hierarchical clustering is another unsupervised machine learning algorithm used to group
unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA. In this
algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is
known as the dendrogram. The hierarchical clustering technique has two approaches:
(1) Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with
taking all data points as single clusters and merging them until one cluster is left.
(2) Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it follows a top-down
approach.

Why hierarchical clustering? As we already have other clustering algorithms such as K-Means
clustering, why do we need hierarchical clustering? As we have seen, K-means clustering has some
challenges: it needs a predetermined number of clusters, and it always tries to create clusters of roughly
the same size. Hierarchical clustering addresses these challenges, since we do not need prior knowledge
of the number of clusters.

Agglomerative Hierarchical Clustering: The agglomerative hierarchical clustering algorithm is a
popular example of HCA. To group the datasets into clusters, it follows the bottom-up approach. This
means the algorithm considers each data point as a single cluster at the beginning and then starts
combining the closest pairs of clusters. It does this until all the clusters are merged into a single
cluster that contains all the data points.

How does Agglomerative Hierarchical Clustering work? The working of the AHC algorithm can be
explained using the below steps:
Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number of
clusters will also be N.

Step-2: Take the two closest data points or clusters and merge them to form one cluster. So, there will
now be N-1 clusters.

Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will
be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. We will then get the clusters shown in the
below images:

Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the
clusters as per the problem.
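
As a concrete counterpart to these steps, here is a minimal sketch using SciPy's hierarchical clustering utilities; the small 2-D dataset, the 'ward' linkage choice, and the cut into two clusters are illustrative assumptions.

```python
# A minimal agglomerative clustering sketch using SciPy (assumes SciPy is installed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small illustrative dataset; each row starts as its own cluster (Step-1).
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 5.1], [9.0, 1.0]])

# Repeatedly merge the two closest clusters (Steps 2-4); 'ward' is one possible linkage choice.
Z = linkage(X, method="ward")

# Cut the resulting dendrogram into a chosen number of clusters (Step-5).
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)

# scipy.cluster.hierarchy.dendrogram(Z) could be plotted with Matplotlib to visualize the hierarchy.
```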

Measure for the distance between two clusters: There are various ways to calculate the distance
between two clusters, and these ways decide the rule for clustering. These measures are called
Linkage methods.
(1) Single Linkage: It is the shortest distance between the closest points of the two clusters.
(2) Complete Linkage: It is the farthest distance between two points of two different clusters. It
is one of the popular linkage methods, as it forms tighter clusters than single linkage.

(3) Average Linkage: It is the linkage method in which the distances between all pairs of points (one
from each cluster) are added up and then divided by the total number of pairs to calculate the average
distance between the two clusters. It is also one of the most popular linkage methods.

(4) Centroid Linkage: It is the linkage method in which the distance between the centroids of the two
clusters is calculated.
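
In library implementations, the linkage measure is usually just a parameter. The short sketch below (again SciPy, with an illustrative four-point dataset) shows how the four measures listed above can be selected by name.

```python
# Selecting different linkage measures with SciPy's hierarchical clustering (illustrative sketch).
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# 'single', 'complete', 'average', and 'centroid' correspond to the four measures listed above.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)
    # The last row of the linkage matrix records the distance at which the final two clusters merged.
    print(f"{method:>8} linkage, final merge distance: {Z[-1, 2]:.3f}")
```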

ASSOCIATION RULE LEARNING


Association rule learning is a type of unsupervised learning technique that checks for the dependency
of one data item on another data item and maps them accordingly so that the result can be more profitable.
It tries to find interesting relations or associations among the variables of a dataset.

Association rule learning is one of the important concepts of machine learning, and it is
employed in market basket analysis, web usage mining, continuous production, etc. Association rule
learning can be divided into three types of algorithms:
➢ Apriori
➢ Eclat
➢ F-P Growth Algorithm

How does Association Rule Learning work? Association rule learning works on the concept of an
If-Then statement, for example:

If A, then B (written as A → B), e.g., if a customer buys bread, then they are likely to buy butter.

Here the If element (A) is called the antecedent, and the Then element (B) is called the consequent.
These types of relationships, where we can find an association or relation between two items, are known
as single cardinality.
It is all about creating rules, and as the number of items increases, the cardinality also increases
accordingly. So, to measure the associations between thousands of data items, there are several
metrics. These metrics are given below:
➢ Support
➢ Confidence
➢ Lift
1. Support: Support is the frequency of A, or how frequently an item appears in the dataset. It is
defined as the fraction of the transactions T that contain the itemset X. It can be written as:

Support(X) = freq(X) / T

where freq(X) is the number of transactions (rows) in which X appears and T is the total number of transactions.
2. Confidence: Confidence indicates how often the rule has been found to be true, i.e., how often the
items X and Y occur together in the dataset given that X has occurred. It is the ratio of the number of
transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = freq(X ∪ Y) / freq(X)

3. Lift: Lift measures the strength of a rule. It is the ratio of the observed support to the expected
support if X and Y were independent of each other:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It has three possible values:
➢ Lift = 1: The occurrence of the antecedent and the consequent are independent of each other.
➢ Lift > 1: The two item sets are positively dependent on each other (they occur together more often than expected).
➢ Lift < 1: One item is a substitute for the other, which means one item has a negative effect on the occurrence of the other.
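
These three metrics can be computed directly from a transaction list. The sketch below uses plain Python on a made-up basket of transactions; the items, the itemsets X and Y, and the numbers are purely illustrative.

```python
# Computing support, confidence, and lift from a small made-up transaction list.
transactions = [
    {"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"},
    {"milk"}, {"bread", "butter"},
]
T = len(transactions)  # total number of transactions

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / T

X, Y = {"bread"}, {"butter"}               # illustrative antecedent and consequent
sup_x, sup_y, sup_xy = support(X), support(Y), support(X | Y)

confidence = sup_xy / sup_x                # how often Y appears given X
lift = sup_xy / (sup_x * sup_y)            # observed vs. expected support if X and Y were independent

print(f"Support = {sup_xy:.2f}, Confidence = {confidence:.2f}, Lift = {lift:.2f}")
```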

Types of Association Rule Learning: Association rule learning can be divided into three algorithms:
1. Apriori Algorithm: This algorithm uses frequent itemsets to generate association rules. It is
designed to work on databases that contain transactions. This algorithm uses a breadth-first search
and a Hash Tree to calculate the itemsets efficiently.

2. Eclat Algorithm: Eclat algorithm stands for Equivalence Class Transformation. This algorithm
uses a depth-first search technique to find frequent item sets in a transaction database. It performs
faster execution than Apriori Algorithm.

3. F-P Growth Algorithm: FP-Growth stands for Frequent Pattern Growth, and it is an
improved version of the Apriori algorithm. It represents the database in the form of a tree structure
known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most
frequent patterns.
APRIORI ALGORITHM
The Apriori algorithm uses frequent item sets to generate association rules, and it is designed to work
on databases that contain transactions. With the help of these association rules, it determines how
strongly or how weakly two objects are connected. This algorithm uses a breadth-first search and a
Hash Tree to calculate the itemset associations efficiently.

It is an iterative process for finding the frequent item sets from a large dataset. This algorithm was
given by R. Agrawal and R. Srikant in the year 1994. It is mainly used for market basket analysis
and helps to find products that can be bought together.

What is a Frequent Itemset? Frequent itemsets are those itemsets whose support is greater than the
threshold value, or user-specified minimum support. The Apriori property implies that if A and B
together form a frequent itemset, then A and B individually must also be frequent itemsets.

Steps for Apriori Algorithm: Below are the steps for the apriori algorithm:
Step-1: Determine the support of the itemsets in the transactional database, and select the minimum
support and confidence.
Step-2: Take all the itemsets in the transactions with a support value higher than the minimum
(selected) support value.
Step-3: Find all the rules from these subsets that have a confidence value higher than the threshold
(minimum confidence).
Step-4: Sort the rules in decreasing order of lift.

Apriori Algorithm Working: We will understand the apriori algorithm using an example and
mathematical calculation:
Example: Suppose we have the following dataset that has various transactions, and from this dataset,
we need to find the frequent item sets and generate the association rules using the Apriori algorithm:
TID ITEM SETS
T1 A, B
T2 B, D
T3 B, C
T4 A, B, D
T5 A, C
T6 B, C
T7 A, C
T8 A, B, C, E
T9 A, B, C

Given: Minimum Support = 2 and Minimum Confidence = 50%


Solution:
Step-1: Calculating C1 and L1:
In the first step, we will create a table that contains support count (The frequency of each itemset
individually in the dataset) of each itemset in the given dataset. This table is called the Candidate set
or C1.
ITEM SET SUPPORT COUNT
A 6
B 7
C 6
D 2
E 1

Now, we will take out all the itemsets that have a support count greater than or equal to the Minimum
Support (2). This gives us the table for the frequent itemset L1. Since all the itemsets except E have
a support count greater than or equal to the minimum support, itemset E will be removed.

ITEM SET SUPPORT COUNT


A 6
B 7
C 6
D 2

Step-2: Candidate Generation C2, and L2:


In this step, we will generate C2 with the help of L1. In C2, we will create pairs of the itemsets of
L1 in the form of two-item subsets.

After creating the subsets, we will again find the support count from the main transaction table of
datasets, i.e., how many times these pairs have occurred together in the given dataset. So, we will get
the below table for C2:

ITEM SET SUPPORT COUNT


{A, B} 4
{A, C} 4
{A, D} 1
{B, C} 4
{B, D} 2
{C, D} 0

Again, we need to compare the C2 support counts with the minimum support count; after comparing,
the itemsets with a lower support count will be eliminated from table C2. This gives us the below
table for L2:

ITEM SET SUPPORT COUNT


{A, B} 4
{A, C} 4
{B, C} 4
{B, D} 2

Step-3: Candidate generation C3, and L3:


For C3, we will repeat the same two processes, but now we will form the C3 table with subsets of
three item sets together, and will calculate the support count from the dataset. It will give the below
table:
ITEM SET SUPPORT COUNT
{A, B, C} 2
{B, C, D} 0
{A, C, D} 0
{A, B, D} 1

Now we will create the L3 table. As we can see from the above C3 table, there is only one itemset
combination with a support count greater than or equal to the minimum support count. So L3 will
have only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:


To generate the association rules, first we will create a new table with the possible rules from the
discovered combination {A, B, C}. For all the rules, we will calculate the confidence using the formula
Confidence(A → B) = Sup(A ^ B) / Sup(A). After calculating the confidence value for all rules, we will
exclude the rules that have a confidence lower than the minimum threshold (50%).
Rules        Support   Confidence
A^B → C      2         Sup{(A^B) ^ C} / Sup(A^B) = 2/4 = 0.50 = 50%
B^C → A      2         Sup{(B^C) ^ A} / Sup(B^C) = 2/4 = 0.50 = 50%
A^C → B      2         Sup{(A^C) ^ B} / Sup(A^C) = 2/4 = 0.50 = 50%
C → A^B      2         Sup{C ^ (A^B)} / Sup(C)   = 2/6 = 0.33 = 33.33%
A → B^C      2         Sup{A ^ (B^C)} / Sup(A)   = 2/6 = 0.33 = 33.33%
B → A^C      2         Sup{B ^ (A^C)} / Sup(B)   = 2/7 = 0.29 = 28.57%
As the given threshold or minimum confidence is 50%, the first three rules, A^B → C, B^C → A,
and A^C → B, can be considered strong association rules for the given problem.
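
The hand calculation above can be checked in code. Below is a hedged sketch using the mlxtend library (assuming it is installed) on the T1-T9 transactions; apart from the library calls, the variable names are illustrative.

```python
# Reproducing the worked example with mlxtend's Apriori implementation
# (assumes pandas and mlxtend are installed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# The T1-T9 transactions from the example above.
transactions = [
    ["A", "B"], ["B", "D"], ["B", "C"], ["A", "B", "D"], ["A", "C"],
    ["B", "C"], ["A", "C"], ["A", "B", "C", "E"], ["A", "B", "C"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# A minimum support of 2 transactions out of 9 is expressed as a fraction.
frequent = apriori(df, min_support=2 / 9, use_colnames=True)

# Keep only rules with confidence >= 50%, as in the example.
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```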

Advantages of Apriori Algorithm


➢ This is an easy-to-understand algorithm.
➢ The join and prune steps of the algorithm can be easily implemented on large datasets.

Disadvantages of Apriori Algorithm


➢ The Apriori algorithm is slow compared to other algorithms.
➢ The overall performance can be reduced, as it scans the database multiple times.
➢ The time complexity and space complexity of the Apriori algorithm is O(2^D), which is very
high. Here D represents the horizontal width (the number of distinct items) present in the database.

FP-GROWTH ALGORITHM
The FP-Growth algorithm was proposed by Han. It is an efficient and scalable method for mining
the complete set of frequent patterns by pattern-fragment growth, using an extended prefix-tree
structure for storing compressed and crucial information about frequent patterns, called the frequent-
pattern tree (FP-tree).

The FP-Growth algorithm is an alternative way to find frequent itemsets without using candidate
generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The
core of this method is the usage of a special data structure named the frequent-pattern tree (FP-tree),
which retains the itemset association information.
Working of FP-growth algorithm
➢ First, it compresses the input database creating an FP-tree instance to represent frequent items.
➢ After this first step, it divides the compressed database into a set of conditional databases,
each associated with one frequent pattern.
➢ Finally, each such database is mined separately.

FP-Tree: The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then mapped onto a
path in the FP-tree. This is done until all transactions have been read.

A frequent pattern tree is built from the initial itemsets of the database. The purpose of the FP-tree
is to mine the most frequent patterns. Each node of the FP-tree represents an item of an itemset. Han
defines the FP-tree as the tree structure given below:
1. One root is labelled as "null" with a set of item-prefix subtrees as children and a frequent-item-
header table.
2. Each node in the item-prefix subtree consists of three fields:
➢ Item-name: registers which item is represented by the node;
➢ Count: the number of transactions represented by the portion of the path reaching the node;
➢ Node-link: links to the next node in the FP-tree carrying the same item name or null if there is
none.
3. Each entry in the frequent-item-header table consists of two fields:
➢ Item-name: the same as in the node;
➢ Head of node-link: a pointer to the first node in the FP-tree carrying the item name.
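
As a concrete illustration of this structure, an FP-tree node could be declared as in the sketch below; the class name, fields, and header-table layout are illustrative choices, not Han's original code.

```python
# Illustrative FP-tree node with the three fields described above (not Han's original code).
class FPNode:
    def __init__(self, item_name, parent=None):
        self.item_name = item_name   # which item this node represents ("null" for the root)
        self.count = 0               # number of transactions represented by the path reaching this node
        self.node_link = None        # next node in the tree carrying the same item name, or None
        self.parent = parent         # parent pointer, handy when extracting prefix paths later
        self.children = {}           # child nodes keyed by item name

# A header-table entry pairs an item name with the head of its node-link chain.
header_table = {"I2": {"support": 0, "head": None}}
```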

The below diagram is an example of the best-case scenario, which occurs when all transactions have the
same itemset; in that case the FP-tree consists of only a single branch of nodes.

The worst-case scenario occurs when every transaction has a unique item set. So, the space needed to
store the tree is greater than the space used to store the original data set because the FP-tree requires
additional space to store pointers between nodes and the counters for each item.
Algorithm by Han: The original algorithm to construct the FP-Tree defined by Han is given below:
Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.

1. The first step is to scan the database to find the occurrences of the item sets in the database. This
step is the same as the first step of Apriori. The count of 1-itemsets in the database is called support
count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the first
transaction and find the itemset in it. The item with the maximum count is placed at the top, followed
by the next item with a lower count, and so on. It means that a branch of the tree is constructed from
the transaction's items in descending order of count.
4. The next transaction in the database is examined, with its items again ordered in descending order of
count. If some prefix of this transaction is already present in another branch, then this transaction's
branch shares that common prefix starting from the root, and new nodes are created and linked only for
the remaining items of the transaction.
5. Also, the count of an itemset is incremented each time it occurs in a transaction: the count of a
common node is increased by 1, and new nodes are created with a count of 1 and linked according to
the transactions.
6. The next step is to mine the created FP-tree. For this, the lowest node is examined first, along with
the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From there,
traverse the path(s) in the FP-tree. This path or set of paths is called a conditional pattern base. A
conditional pattern base is a sub-database consisting of the prefix paths in the FP-tree that occur with
the lowest node (the suffix).
7. Construct a Conditional FP Tree, formed by a count of item sets in the path. The item sets meeting
the threshold support are considered in the Conditional FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.
Example:
Support threshold=50%, Confidence= 60%
Table 1
Transaction List of items
T1 I1, I2, I3
T2 I2, I3, I4
T3 I4, I5
T4 I1, I2, I4
T5 I1, I2, I3, I5
T6 I1, I2, I3, I4
Solution: Support threshold=50% => 0.5*6= 3 => min_sup=3

Table 2: Count of each item


Item Count
I1 4
I2 5
I3 4
I4 4
I5 2

Table 3: Sort the itemset in descending order


Item Count
I2 5
I1 4
I3 4
I4 4

Build FP Tree: Let's build the FP tree in the following steps, such as:
1. Considering the root node null.
2. The first scan of transaction T1: I1, I2, I3 contains
three items {I1:1}, {I2:1}, {I3:1}, where I2 is linked as
a child of the root, I1 is linked to I2, and I3 is linked to I1.
3. T2: I2, I3, and I4 contain I2, I3, and I4, where I2 is
linked to root, I3 is linked to I2 and I4 is linked to I3.
But this branch would share the I2 node as common as it
is already used in T1.
4. Increment the count of I2 by 1, and I3 is linked as a
child to I2, and I4 is linked as a child to I3. The count is
{I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to
I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is
already linked to the root node. Hence it will be
incremented by 1. Similarly, I1 will be incremented by 1
as it is already linked with I2 in T1, thus {I2:3}, {I1:2},
{I4:1}.
7. T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.

Mining of FP-tree is summarized below:


1. The lowest node item, I5, is not considered as it does not meet the minimum support count; hence it
is deleted.
2. The next lower node is I4. I4 occurs in two branches: {I2, I1, I3, I4:1} and {I2, I3, I4:1}. Therefore,
considering I4 as the suffix, the prefix paths will be {I2, I1, I3:1} and {I2, I3:1}; this forms the
conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and an FP-tree is constructed
from it. This tree contains {I2:2, I3:2}; I1 is not considered as it does not meet the minimum support count.
4. This path generates all combinations of frequent patterns: {I2, I4:2}, {I3, I4:2}, {I2, I3, I4:2}.
5. For I3, the prefix paths would be {I2, I1:3} and {I2:1}; this generates a 2-node FP-tree {I2:4, I1:3},
and the frequent patterns generated are {I2, I3:4}, {I1, I3:3}, {I2, I1, I3:3}.
6. For I1, the prefix path would be {I2:4}; this generates a single-node FP-tree {I2:4}, and the
frequent pattern generated is {I2, I1:4}.
Item Conditional Pattern Base Conditional FP-tree Frequent Patterns Generated
I4 {I2, I1, I3:1}, {I2, I3:1} {I2:2, I3:2} {I2, I4:2}, {I3, I4:2}, {I2, I3, I4:2}
I3 {I2, I1:3}, {I2:1} {I2:4, I1:3} {I2, I3:4}, {I1, I3:3}, {I2, I1, I3:3}
I1 {I2:4} {I2:4} {I2, I1:4}

The diagram given below depicts the conditional FP tree associated with the conditional node I3.
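
For comparison with the hand-built tree, the same frequent patterns can be mined programmatically. Below is a hedged sketch using mlxtend's fpgrowth function (assuming the library is installed) on the Table 1 transactions.

```python
# Mining the Table 1 transactions with mlxtend's FP-Growth implementation
# (assumes pandas and mlxtend are installed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
    ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support = 3 out of 6 transactions = 0.5, matching the example's threshold.
frequent = fpgrowth(df, min_support=0.5, use_colnames=True)
print(frequent.sort_values("support", ascending=False))
```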

Advantages of FP Growth Algorithm


➢ This algorithm needs to scan the database only twice, compared to Apriori, which scans the
transactions in each iteration.
➢ The pairing of items is not done in this algorithm, making it faster.
➢ The database is stored in a compact version in memory.
➢ It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm


➢ FP Tree is more cumbersome and difficult to build than Apriori.
➢ It may be expensive.
➢ The algorithm may not fit in the shared memory when the database is large.
Difference between Apriori and FP Growth Algorithm
(1) Apriori generates frequent patterns by making itemsets using pairings such as single itemset,
double itemset, and triple itemset, whereas FP Growth generates an FP-tree for making frequent patterns.
(2) Apriori uses candidate generation, where frequent subsets are extended one item at a time, whereas
FP Growth generates a conditional FP-tree for every item in the data.
(3) Since Apriori scans the database in each step, it becomes time-consuming for data where the number
of items is large, whereas the FP-tree requires only one database scan in its beginning steps, so it
consumes less time.
(4) In Apriori, a converted version of the database is saved in memory, whereas in FP Growth a set of
conditional FP-trees for every item is saved in memory.
(5) Apriori uses a breadth-first search, whereas FP Growth uses a depth-first search.
GAUSSIAN MIXTURE MODELS (GMM)
Gaussian mixture models (GMM) are a probabilistic concept used to model real-world data sets.
GMMs are a generalization of Gaussian distributions and can be used to represent any data set that
can be clustered into multiple Gaussian distributions. A Gaussian mixture model can be used for
clustering, which is the task of grouping a set of data points into clusters.

The Gaussian mixture model can be understood as a probabilistic model where a Gaussian distribution is
assumed for each group, and each group has a mean and a covariance that define its parameters. A GMM's
parameters consist of the mean vectors (μ) and covariance matrices (Σ) of its components (together with
their mixing coefficients). Another name for the Gaussian distribution is the normal distribution.

GMM algorithm: Clustering algorithms like the Gaussian mixture models in machine learning are
used to organize data by identifying commonalities and distinguishing them from one another. It may
be used to classify consumers into subgroups defined by factors like demographics and buying habits.
This is how the GMM algorithm works:
1. Initialize phase: Gaussian distributions’ parameters should be initialized (means, covariances, and
mixing coefficients).
2. Expectation phase: Determine the likelihood that each data point was created using each of the
Gaussian distributions.
3. Maximization phase: Apply the probabilities found in the expectation step to re-estimate the
Gaussian distribution parameters.
4. Final phase: To achieve convergence of the parameters, repeat steps 2 and 3.
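
These four phases are exactly what an off-the-shelf EM implementation runs. Below is a minimal sketch using scikit-learn's GaussianMixture on synthetic data; the two-blob dataset and the choice of two components are illustrative assumptions.

```python
# A minimal Gaussian mixture model sketch using scikit-learn (assumes scikit-learn is installed).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian "blobs" standing in for two subgroups of customers.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.8, size=(100, 2)),
])

# K = 2 components; fit() runs the expectation-maximization loop (phases 1-4 above).
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print("Mixing coefficients:", gmm.weights_)
print("Means:\n", gmm.means_)
print("Hard cluster labels of first 5 points:", gmm.predict(X[:5]))
print("Soft responsibilities of the first point:", gmm.predict_proba(X[:1]))
```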
GMM Equation: Suppose there are K clusters (for the sake of simplicity, it is assumed here that the
number of clusters is known and is K). So µk and ∑k are also estimated for each k. Had there been only
one distribution, they would have been estimated by the maximum-likelihood method. But since there
are K such clusters, the probability density is defined as a linear combination of the densities of all
these K distributions, i.e.

p(X) = π1 N(X | µ1, ∑1) + π2 N(X | µ2, ∑2) + ... + πK N(X | µK, ∑K)

where πk is the mixing coefficient for the kth distribution and N(X | µk, ∑k) is the Gaussian density
with mean µk and covariance ∑k. For estimating the parameters by the maximum log-likelihood method,
compute p(X | µ, ∑, π) over all data points and maximize its logarithm.

Now define a random variable γk(X) such that γk(X) = p(k | X), i.e., the responsibility of the kth
cluster for the point X.

From Bayes' theorem,

γk(X) = πk N(X | µk, ∑k) / [ π1 N(X | µ1, ∑1) + ... + πK N(X | µK, ∑K) ]
Now, for the log-likelihood function to be maximum, its derivative with respect to µ, ∑, and π should
be zero. Equating the derivative of the log-likelihood with respect to µk to zero and rearranging the
terms gives, for data points x1, ..., xN,

µk = [ γk(x1) x1 + ... + γk(xN) xN ] / [ γk(x1) + ... + γk(xN) ]

Similarly, taking the derivative with respect to ∑k and πk respectively, one can obtain the following
expressions:

∑k = [ γk(x1)(x1 − µk)(x1 − µk)ᵀ + ... + γk(xN)(xN − µk)(xN − µk)ᵀ ] / [ γk(x1) + ... + γk(xN) ]

and

πk = [ γk(x1) + ... + γk(xN) ] / N

These are the updates applied repeatedly in the maximization phase until the parameters converge.
