
Unit 2

Unsupervised ML Algorithms
Types of unsupervised learning:
 Clustering
 Association
 Dimensionality reduction
 Clustering: Clustering is the task of dividing unlabelled data points into different clusters such
that data points in the same cluster are more similar to each other than to those in other clusters.
Or
A way of grouping data points into different clusters, in which each cluster consists of similar data points:
objects with possible similarities remain in one group, and that group has little or no similarity with other groups.
After applying a clustering technique, each cluster or group is given a cluster-ID. An ML system can use
this ID to simplify the processing of large and complex datasets.
Clustering has applications in pattern recognition, image analysis, customer analytics, market segmentation,
social network analysis, and more. A broad range of industries use clustering, from airlines to healthcare and
beyond.

Types of clustering methods:


1. Hard clustering: each input data point either belongs to a cluster completely or not at all.
2. Soft clustering: instead of assigning each input data point to exactly one cluster, a probability or likelihood of
that data point belonging to each cluster is assigned.

Clustering algorithms:
 Hierarchical clustering (connectivity-based clustering)
 K-means clustering
 Fuzzy clustering
 Distribution-based clustering (model-based clustering)
 Density-based clustering
 Centroid-based clustering (partitioning methods)
 Constraint-based (supervised) clustering

Centroid-based or partitioning clustering:


The cluster's centre is chosen in such a way that the distance of a data point to its own cluster centroid is
minimal compared to its distance to any other cluster centroid.

It works on the closeness of the data points to a chosen central value. The dataset is divided into a given
number of clusters, and a vector of values references every cluster. Each input data point is compared to the
vector values and enters the cluster with the minimal difference.
The most common example of partitioning clustering is the K-Means clustering algorithm.

Density-based clustering: The density-based clustering method connects highly dense areas into
clusters; the structure of a cluster depends on the dense regions.
This algorithm identifies different clusters in the dataset by connecting the areas of high density into clusters.
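For illustration (not part of the original notes), here is a minimal density-based clustering sketch using scikit-learn's DBSCAN; the sample points and the eps/min_samples values are invented for this example, assuming scikit-learn and NumPy are installed:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1], [0.9, 1.2],    # dense region 1
              [8, 8], [8.1, 7.9], [7.9, 8.2],  # dense region 2
              [50, 50]])                       # isolated point in a sparse area

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; label -1 marks low-density (noise) points

Points in the two dense regions are connected into two clusters, while the isolated point is left out as noise.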
Distribution model-based clustering: the data is divided based on the probability of how likely a data point is to
belong to a particular distribution. The grouping is done by assuming some distributions, most commonly the Gaussian
distribution.
An example of this type is the Expectation-Maximization (EM) clustering algorithm, which uses Gaussian Mixture
Models (GMM).
It is a probability-based approach that uses statistical distributions to cluster the data objects. A cluster
includes the data objects or points that have a higher probability of belonging to it. Each cluster has a central point, and the
greater the distance of a data point from that central point, the lower its probability of being included in the
cluster.

Hierarchical clustering (connectivity-based clustering): the dataset is divided into clusters to create a tree-like
structure, which is also called a dendrogram. Any desired number of clusters can be obtained by
cutting the tree at the appropriate level. The most common example of this method is the agglomerative
hierarchical algorithm.
Fuzzy clustering: a type of soft clustering in which a data object may belong to more than one group or cluster.
Each data point has a set of membership coefficients, which give its degree of membership in each
cluster.
The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy K-
means algorithm.
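As a rough illustration (not from the notes), the following NumPy sketch shows the Fuzzy C-means idea: every point receives a membership weight in every cluster instead of a single hard label. The function name fuzzy_c_means, the fuzziness exponent m, the tolerance, and the sample points are choices made for this sketch:

import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means: returns cluster centres and a membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m                            # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:      # stop when memberships stabilise
            return centers, U_new
        U = U_new
    return centers, U

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
centers, U = fuzzy_c_means(X, c=2)
print(U.round(2))   # each row: degree of membership of that point in each cluster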

Clustering algorithms:
K-means clustering:
K-Means clustering is an unsupervised learning algorithm which groups an unlabelled dataset into different
clusters. Here K defines the number of pre-defined clusters: if K=2, there will be two clusters, for K=3,
there will be three clusters, and so on.
"It is an iterative algorithm that divides the unlabelled dataset into K different clusters in such a way that each
data point belongs to only one group, made up of points with similar properties."
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabelled dataset as input, divides the dataset into K clusters, and repeats
the process until the clusters stop changing. The value of K should be predetermined in this algorithm.

Or
K-Means clustering is an unsupervised learning algorithm. There is no labelled data for this clustering, unlike in
supervised learning. K-Means divides objects into clusters that share similarities with each other and are
dissimilar to the objects belonging to other clusters.

The term 'K' is a number: you need to tell the system how many clusters to create. For example, K =
2 refers to two clusters. There are ways of finding the best or optimum value of K for a given dataset.
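A minimal usage sketch with scikit-learn's KMeans (assuming scikit-learn and NumPy are installed; the toy points and K=2 are chosen only for illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster id assigned to each point
print(km.cluster_centers_)  # the two final centroids
print(km.inertia_)          # sum of squared distances to the centroids; plotting this
                            # against several values of K is one way to pick a good K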

How does K-means clustering work? An example:


Step-1: Select the number K to decide the number of clusters. Let's take K=2.

Step-2: Select K random points or centroids (they need not be points from the input dataset).
Here we select two points as the K centroids, which are not part of our dataset.

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Now we assign each data point of the scatter plot to its closest K-point or centroid.
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to
the right of the line are closer to the yellow centroid. Let's colour them blue and yellow for clear
visualization.

Step-4: Calculate the variance and place a new centroid for each cluster.
As we need to find the closest cluster, we repeat the process by choosing new centroids. To choose the
new centroids, we compute the centre of gravity (mean) of the points in each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its
cluster.

We reassign each data point to the new centroids. For this, we repeat the same process of finding a
median line.
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right
of the line. So, these three points will be assigned to the new centroids.

So we again go to Step-4, which is finding new centroids or K-points.


Step-6: If any reassignment occurs, go to Step-4; otherwise, go to FINISH.
We have the new centroids, so we again draw the median line and reassign the data points.

We can see in the above image that there are no data points left on the wrong side of the line, which means our
model has converged.
Step-7: The model is ready, so we can now remove the assumed centroids, and the two final clusters will be as
shown in the below image:
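The procedure above can be condensed into a short from-scratch sketch (an illustration under assumptions, not the exact data in the missing figures; the function name kmeans, the random initialisation from data points, and the convergence test are choices made for this sketch):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the centre of gravity (mean) of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 5-6: stop once the centroids no longer move (no reassignment occurs)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
print(kmeans(X, k=2))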

Advantages of k-means:
Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a
popular choice for clustering tasks.
Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
Scalability: K-means can handle large datasets with a large number of data points and can be easily scaled to
handle even larger datasets.
Flexibility: K-means can be easily adapted to different applications and can be used with different distance
metrics and initialization methods.
Disadvantages of K-Means:
Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a
suboptimal solution.
Requires specifying the number of clusters: The number of clusters k needs to be specified before running the
algorithm, which can be challenging in some applications.
Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting
clusters.
2. Hierarchical clustering algorithm:
Why do we need hierarchical clustering?
K-means clustering has some challenges or problems, such as:
 a predetermined number of clusters is required,
 it always tries to create clusters of roughly the same size.
To solve these two challenges, we can use the hierarchical clustering algorithm, because in this
algorithm we don't need to know the number of clusters in advance.
What is the hierarchical clustering algorithm? Hierarchical clustering is another unsupervised machine learning
algorithm, which is used to group unlabelled datasets into clusters; it is also known as hierarchical cluster
analysis (HCA). We develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is
known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ
in how they work, and there is no requirement to predetermine the number of clusters as there was in
the K-Means algorithm.
The hierarchical clustering technique has two approaches:
1. Agglomerative
2. Divisive
1. Agglomerative: To group the datasets into clusters, it follows a bottom-up approach. This
algorithm considers each data point as a single cluster at the beginning, and then starts combining the closest pairs
of clusters. It does this until all the clusters are merged into a single cluster that contains all the
data points.
This hierarchy of clusters is represented in the form of a dendrogram.
Working example:
Step-1: Treat each data point as a single cluster. Let's say there are N data points, so the number of clusters
will also be N.
Step-2: Take the two closest data points or clusters and merge them to form one cluster. So, there will now be N-1
clusters.

Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2
clusters.

Step-4: Repeat Step 3 until only one cluster is left. We will get the following clusters; consider the below
images:
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram and cut it to divide the clusters
as per the problem.
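As an illustration of this bottom-up merging (not part of the original notes), here is a small sketch using scikit-learn's AgglomerativeClustering; the sample points and linkage='single' are arbitrary choices for the example:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 9], [9.2, 8.8]])

# bottom-up merging; 'linkage' controls how the distance between clusters is measured
agg = AgglomerativeClustering(n_clusters=3, linkage='single')
print(agg.fit_predict(X))   # e.g. three clusters of two points each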
Measure of the distance between two clusters:
There are various ways to calculate the distance between two clusters; these distance measures are called
linkage methods.
Single linkage: the shortest distance between the closest points of the two clusters. Consider the below
image:

Complete linkage: the longest distance between two points of two different clusters.

Centroid linkage: the linkage method in which the distance between the centroids of the clusters is
calculated. Consider the below image:
Working of the dendrogram in hierarchical clustering:

The dendrogram is a tree-like structure that records each merging step. In the dendrogram
plot, the Y-axis shows the Euclidean distances between the data points (or clusters), and the X-axis shows all the data points
of the given dataset. The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part shows how clusters are created by agglomerative clustering, and the
right part shows the corresponding dendrogram.
As discussed above, the data points P2 and P3 first combine to form a cluster, and
correspondingly a dendrogram link is created, connecting P2 and P3 with a rectangular shape. The height is
decided according to the Euclidean distance between the data points.
In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the
previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
Again, two new dendrogram links are created that combine P1, P2, and P3 in one branch, and P4, P5, and P6
in another.
At last, the final dendrogram link is created that combines all the data points together.
We can cut the dendrogram tree structure at any level, as per our requirement.
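A hedged sketch of building and cutting a dendrogram with SciPy (assuming scipy, NumPy and matplotlib are installed; the data points and the cut distance t=2.0 are invented for the example):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 9], [9.2, 8.8]])

Z = linkage(X, method='single')   # merge history; 'method' is the linkage criterion
dendrogram(Z)                     # Y-axis: distance at which clusters are merged
plt.show()

# "cutting the tree": keep every merge below the chosen distance threshold
print(fcluster(Z, t=2.0, criterion='distance'))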
Association Rule Mining
An association rule is an unsupervised learning method which is used for finding relationships between
variables or items in a large database. It determines the sets of items that occur together in the dataset.
Association rules make marketing strategy more effective; for example, people who buy item X (say bread)
also tend to purchase item Y (butter/jam). Typical applications of association rules are market basket analysis,
customer segmentation, fraud detection, social network analysis, and recommendation systems.
Or
Association rule learning is a rule-based unsupervised machine learning technique which can be used to discover interesting relations
between the items/elements/variables/features in a large dataset stored in relational databases,
transactional databases or any other data repository. Given a set of transactions,
ARL aims to find the rules which will help us predict the occurrence of a specific item based on the
occurrences of the other items in the transaction or dataset.
Association rule mining algorithms:
 Apriori algo
 FP-growth algo
 ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal)

1. Apriori algorithm: an association rule learning algorithm which identifies the frequent individual items in
a transactional database. The algorithm's name is "Apriori" because it uses prior knowledge
of frequent-itemset properties. The algorithm applies an iterative approach to mine frequent item-sets, where
k-item frequent item-sets are used to find (k+1)-item item-sets.

The algorithm is very useful for effective market basket analysis. It helps customers purchase items
easily without any hassle, thereby increasing sales. Or

The apriori algorithm has become one of the most widely used algorithms for frequent itemset mining and
association rule learning. It has been applied to a variety of applications, including market basket analysis,
recommendation systems, and fraud detection.
Algorithm Details: The apriori algorithm starts by setting the minimum support threshold. This is the minimum
number of times an item must occur in the database in order for it to be considered a frequent itemset. The
algorithm then filters out any candidate item-sets that do not meet the minimum support threshold.
The algorithm then generates a list of all possible combinations of frequent item-sets and counts the number
of times each combination appears in the database. The algorithm then generates a list of association rules
based on the frequent itemset combinations.
An association rule is a statement of the form "if item A is present in a transaction, then item B is also likely to
be present". The strength of the association is measured using the confidence of the rule, which is the
probability that item B is present given that item A is present.
Or (from Javatpoint)
The Apriori algorithm uses frequent item-sets to generate association rules, and it is designed to work on
databases that contain transactions. With the help of these association rules, it determines how strongly or
how weakly two objects or items are connected. The algorithm uses a breadth-first search and a hash tree to
calculate the item-set associations efficiently. It is an iterative process for finding the frequent item-sets in
a large dataset.
It is mainly used for market basket analysis and helps to find products that can be bought together. It can
also be used in the healthcare field to find drug reactions for patients.
What is a frequent itemset?
Frequent item-sets are those item-sets whose support is greater than the threshold value (the user-specified
minimum support). It means that if {A, B} is a frequent itemset, then A and B individually must also
be frequent item-sets.
Suppose there are two transactions, A = {1,2,3,4,5} and B = {2,3,7}; in these two transactions, 2 and 3 are
the frequent item-sets.
The Apriori algorithm uses three measures:
1. Support
2. Confidence
3. Lift
Support(I) = (Number of transactions containing itemset I) / (Total number of transactions)
Confidence (I1 -> I2) = (Number of transactions containing I1 and I2) / (Number of transactions containing I1)
Lift (I1 -> I2) = Confidence (I1 -> I2) / Support(I2)
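To make the three measures concrete, here is a tiny hand-rolled Python computation on an invented toy transaction list (the items bread/butter/jam/milk are only illustrative):

# toy transaction database
transactions = [
    {'bread', 'butter'},
    {'bread', 'jam'},
    {'bread', 'butter', 'jam'},
    {'milk', 'butter'},
]
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / n

conf = support({'bread', 'butter'}) / support({'bread'})   # Confidence(bread -> butter)
lift = conf / support({'butter'})                          # Lift(bread -> butter)
print(support({'bread', 'butter'}), conf, lift)
# 0.5, 0.666..., 0.888... (a lift below 1 means no positive association in this toy data)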
Step-1: Determine the support of the item-sets in the transactional database, and select the minimum support
and confidence.
Step-2: Take all the item-sets in the transactions with a support value higher than the minimum (selected) support
value.
Step-3: Find all the rules from these subsets that have a confidence value higher than the threshold (minimum
confidence).
Step-4: Sort the rules in decreasing order of lift.
Example: we need to find the frequent item-sets and generate the association rules using the Apriori
algorithm:
TABLE 1
Solution:
Step-1: Calculating C1 (the candidate 1-item-sets) and L1 (the frequent 1-item-sets):

Remove all the item-sets that have a support count lower than the minimum support (2). This gives us the
table for the frequent itemset L1.
If support count < minimum support (2),
then the itemset is removed and does not enter L1.
If support count >= minimum support (2),
it is not removed.
So the itemset {E} will be removed.

THIS IS L1
Step 2: Calculating C2 (the candidate 2-item-sets) and L2 (the frequent 2-item-sets):
we generate C2 with the help of L1.
In C2, we create pairs of the item-sets of
L1 in the form of subsets.

FROM TABLE 1 WE CALCULATE THE SUPPORT COUNT OF EACH PAIR.


AGAIN:
If support count < minimum support (2),
then the itemset is removed and does not enter L2.
If support count >= minimum support (2),
it is not removed.
THE PAIRS {C, D} AND {A, D} ARE REMOVED FROM C2.

THIS IS L2

Step 3: Calculating C3 (the candidate 3-item-sets) and L3 (the frequent 3-item-sets):


For C3, we repeat the same two processes, but now we form the C3 table with subsets of three items
together, and calculate their support counts from the dataset.

C3 IS GENERATED FROM L2.

AGAIN:
If support count < minimum support (2),
then the itemset is removed and does not enter L3.
If support count >= minimum support (2),
it is not removed.
THE ITEM-SETS {B, C, D}, {A, C, D}, AND {A, B, D} ARE REMOVED FROM C3, AND THE RESULTING TABLE IS KNOWN AS L3.

So L3 has only one combination, i.e., {A, B, C}.

STEP-4: Finding the association rules for the subsets:
To generate the association rules, we first create a new table with the possible rules from the frequent
combination {A, B, C}. For each rule X → Y, we calculate the Confidence using the formula sup(X ^ Y) / sup(X).
After calculating the confidence value for all rules, we exclude the rules that have a confidence lower than the
minimum threshold (50%).
Consider the below table:

Rules     | Support | Confidence
A^B → C   | 2       | Sup{(A^B)^C} / sup(A^B) = 2/4 = 0.5 = 50%
B^C → A   | 2       | Sup{(B^C)^A} / sup(B^C) = 2/4 = 0.5 = 50%
A^C → B   | 2       | Sup{(A^C)^B} / sup(A^C) = 2/4 = 0.5 = 50%
C → A^B   | 2       | Sup{C^(A^B)} / sup(C)   = 2/5 = 0.4 = 40%
A → B^C   | 2       | Sup{A^(B^C)} / sup(A)   = 2/6 = 0.33 = 33.33%
B → A^C   | 2       | Sup{B^(A^C)} / sup(B)   = 2/7 = 0.28 = 28.57%

As the given threshold or minimum confidence is 50%, the first three rules (A^B → C, B^C → A, and A^C →
B) can be considered the strong association rules for the given problem.
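In practice the same procedure can be run with a library. Below is a hedged sketch using the mlxtend package (assuming mlxtend and pandas are installed); the transaction list and the min_support / min_threshold values are illustrative, not the exact table from the worked example:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [['A', 'B', 'C'], ['A', 'B'], ['A', 'C'], ['B', 'C'], ['A', 'B', 'C', 'D']]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)           # frequent item-sets
rules = association_rules(frequent, metric='confidence', min_threshold=0.5)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])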
FP-GROWTH ALGORITHM (FREQUENT PATTERN GROWTH ALGORITHM)
The FP-Growth algorithm is also used for frequent pattern mining. The FP-Growth or Frequent Pattern Growth
algorithm is an improvement on the Apriori algorithm. While using the Apriori algorithm for association rule
mining, we need to scan the transaction dataset multiple times; in the FP-Growth algorithm, we only need to
scan the dataset twice.
We also don't need to generate candidate sets, as we do in the Apriori algorithm.
Instead, we create an FP-Tree and use it to determine the frequent item-sets. Thus, the FP-Growth
algorithm helps us perform frequent pattern mining with fewer computing resources and in even less time.
Example:

The frequency of each individual item is computed:


Let the minimum support be 3.

A frequent pattern set is built which will contain all the elements whose frequency is greater than or equal to
the minimum support. These elements are stored in descending order of their respective frequencies.
After insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Ordered-item sets:

Now, all the ordered-item sets are inserted into a trie data structure.
Inserting the set {K, E, M, O, Y}:
Inserting the set {K, E, O, Y}:
Up to the insertion of the elements K and E, the support count is simply increased by 1. On inserting O we can
see that there is no direct link between E and O, therefore a new node for the item O is initialized with the
support count 1, and item E is linked to this new node. On inserting Y, we first initialize a new node for the
item Y with support count 1 and link the new node of O with the new node of Y.

Inserting the set {K, E, M}:


Here simply the support count of each element is increased by 1.
Inserting the set {K, M, Y}:
As in the earlier insertions, first the support count of K is increased, then new nodes for M and Y are initialized and
linked accordingly.

Inserting the set {K, E, O}:


Here simply the support counts of the respective elements are increased. Note that the support count of the
new node of item O is increased.
Conditional Pattern Base:

Conditional Frequent Pattern Tree: It is built by taking the set of elements that is common to all the paths in
the Conditional Pattern Base of that item, and calculating its support count by summing the support counts
of all the paths in the Conditional Pattern Base.
Frequent pattern rules:
From the Conditional Frequent Pattern Tree, the frequent pattern rules are generated by pairing the items of
the Conditional Frequent Pattern Tree set with the corresponding item, as given in the below table.

ky , y k
kO, EO, OE , EK, KE
KM , MK
EK , KE
GAUSSIAN MIXTURE MODEL:
Gaussian Mixture Models are probabilistic models and use the soft clustering approach for distributing the
points in different clusters.
The Gaussian Mixture Model (GMM) is a probabilistic model used for clustering and density estimation.
It assumes that the data is generated from a mixture of several Gaussian distributions, each representing a
distinct cluster. GMM assigns probabilities to data points, allowing them to belong to multiple clusters
simultaneously. The model is widely used in machine learning and pattern recognition applications.
Gaussian Mixture Models (GMMs) assume that there are a certain number of Gaussian distributions, and each
of these distributions represents a cluster. Hence, a Gaussian Mixture Model tends to group the data points
belonging to a single distribution together.
The Gaussian (normal) distribution in one dimension is given by:

f(x) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) )

where μ is the mean and σ² is the variance.

In higher dimensions the density of a single component becomes N(x | μ, Σ), with mean vector μ and covariance matrix Σ.
The probability density function of the whole mixture is then given by the weighted sum of its K components:

p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k),   with mixing weights π_k ≥ 0 and Σ_k π_k = 1.

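A minimal usage sketch with scikit-learn's GaussianMixture (assuming scikit-learn and NumPy are installed); the two synthetic one-dimensional blobs with means 0 and 6 are generated only for the illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two synthetic 1-D Gaussian "blobs"
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())        # estimated component means (close to 0 and 6)
print(gmm.weights_)              # mixing weights pi_k
print(gmm.predict_proba(X[:3]))  # soft memberships: probability of belonging to each component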