
Unit 3: Unsupervised Learning

Clustering (k-means, hierarchical), dimensionality reduction (PCA), association rules.

Clustering (k-means, hierarchical)

What is K-means Clustering?


K-means clustering is a simple and widely used unsupervised machine learning
algorithm that iteratively groups a collection of data points into a fixed number
of clusters (k) according to their similarity. The algorithm aims to minimize the
distance between each data point and its corresponding cluster center, also
called the centroid. The algorithm terminates when either the centroids remain
stable or a maximum number of iterations is reached. K-means clustering has
various applications, such as data analysis, image segmentation, and anomaly
detection.
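
As a quick illustration, here is a minimal sketch of K-means in Python using
scikit-learn; the library choice and the toy data are our own, not prescribed
by the text above:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two obvious groups, around x = 1 and x = 10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# k must be fixed in advance; n_init restarts guard against bad initial centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroids after convergence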

What is Hierarchical Clustering?


Hierarchical clustering is a type of unsupervised machine learning algorithm
that organizes data points into a hierarchy of clusters based on their similarity
or distance. It is also called hierarchical cluster analysis or HCA. Hierarchical
clustering has two variants: agglomerative and divisive.
Agglomerative clustering: begins with each data point as a separate cluster and
then repeatedly combines the nearest clusters until only one cluster is left.
Divisive clustering: begins with all data points in a single cluster and then
repeatedly divides clusters until each data point has its own cluster.
Hierarchical clustering can be represented by a dendrogram, a tree diagram
that illustrates the nested arrangement of clusters and the distances at which
they merge.
Hierarchical clustering has various uses, such as finding patterns, discovering
hierarchies, or detecting outliers in data.
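
As a minimal sketch, the agglomerative variant can be run with scikit-learn
(again, the library and toy data are our choice; the divisive variant has no
direct counterpart in that library):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Bottom-up merging of nearest clusters; here we cut the hierarchy at 2 clusters
agg = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = agg.fit_predict(X)
print(labels)  # cluster index assigned to each point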
Differences Between K-Means Clustering and Hierarchical Clustering
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm,
which is used to group unlabelled datasets into clusters; it is also known
as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and
this tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may
look similar, but the two differ in how they work: unlike the K-means
algorithm, hierarchical clustering has no requirement to predetermine the
number of clusters.
The hierarchical clustering technique has two approaches:

1. Agglomerative: a bottom-up approach, in which the algorithm starts by
taking every data point as its own cluster and merges the closest pairs
until one cluster is left.

2. Divisive: a top-down approach, working in reverse of the agglomerative
algorithm.

Why hierarchical clustering?

Since we already have other clustering algorithms such as K-means
clustering, why do we need hierarchical clustering? As we have seen,
K-means clustering comes with some challenges: it requires a
predetermined number of clusters, and it tends to create clusters of
similar size. To address these two challenges, we can opt for the
hierarchical clustering algorithm, because it does not require prior
knowledge of the number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering
algorithm.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of
HCA. To group a dataset into clusters, it follows the bottom-up approach:
the algorithm treats each data point as a single cluster at the beginning,
and then starts combining the closest pairs of clusters. It does this until
all the clusters are merged into a single cluster that contains every data
point.

This hierarchy of clusters is represented in the form of a dendrogram.

How does Agglomerative Hierarchical clustering work?

The working of the AHC algorithm can be explained using the steps below,
followed by a short code sketch:

o Step-1: Treat each data point as a single cluster. If there
are N data points, the number of clusters will also be N.

o Step-2: Take the two closest data points or clusters and merge them to
form one cluster. There will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them
together to form one cluster. There will be N-2 clusters.

o Step-4: Repeat Step-3 until only one cluster is left.

o Step-5: Once all the clusters are combined into one big cluster,
use the dendrogram to divide the clusters as the problem requires.
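
These steps can be traced in code. Below is a small sketch using SciPy (our
choice of library; any hierarchical clustering implementation would do) that
performs the merges and draws the dendrogram:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# linkage() performs Steps 1-4: starting from N singleton clusters,
# it repeatedly merges the two closest clusters (Ward distance here)
Z = linkage(X, method='ward')

# Each row of Z records one merge: (cluster a, cluster b, distance, new size)
print(Z)

# Step 5: the dendrogram visualizes the nested merges; cutting it at a
# chosen height yields the desired number of clusters
dendrogram(Z)
plt.show()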

Association rules offer a powerful tool for data analysis, providing
insights into patterns and relationships within large datasets. While they
are a staple in market basket analysis, their application extends across
various domains, offering invaluable insights into customer behaviour and
beyond.

Introduction to Association Rules in Data Mining

Association rule mining is a technique in data mining for discovering
interesting relationships, frequent patterns, associations, or correlations
between variables in large datasets. It is widely used in various fields such
as market basket analysis, web usage mining, bioinformatics, and more.
The basic idea is to find rules that predict the occurrence of an item based
on the occurrences of other items in the same transaction.

Understanding the Basics

To explain association rule mining, we can use a simple example of a
grocery store's transaction data. Let's start by defining a sample
transaction table and then move on to discuss itemsets and association
rules derived from this data.

Imagine a small dataset representing transactions in a grocery store:

Table 1: Sample Transaction Table
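
Transaction ID    Items Purchased
1                 Milk, Bread
2                 Bread, Butter
3                 Milk, Bread, Butter, Cola
4                 Bread, Diapers, Beer
5                 Milk, Bread, Diapers, Beer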

In this table, each row represents a transaction (a customer's purchase),
and each transaction has a unique ID. The 'Items Purchased' column lists
the items bought in that transaction.

Concept of Itemset

An 'itemset' is a collection of one or more items found within a dataset. For
example, consider a dataset containing various groceries. An itemset could
be a combination like {Cheese, Tomato}.

The 'length' of an itemset is the number of items it contains. Thus,
{Cheese, Tomato} is a 2-itemset.

· Single-item itemsets: {Milk}, {Bread}, {Butter}, {Diapers}, {Beer}, {Cola}

· Two-item itemsets: {Milk, Bread}, {Bread, Butter}, {Diapers, Beer}, etc.

· Three-item itemsets: {Milk, Bread, Butter}, {Bread, Diapers, Beer}, etc.
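
As a quick sketch, itemsets of a given length can be enumerated in Python
with the standard library (a minimal illustration, not a full frequent-itemset
miner):

from itertools import combinations

items = ['Milk', 'Bread', 'Butter', 'Diapers', 'Beer', 'Cola']

# Enumerate every two-item itemset from the item universe
for pair in combinations(items, 2):
    print(set(pair))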

Association Rules
· An association rule is a fundamental concept in data mining that reveals
how items within a dataset are connected. It is a directive that suggests a
strong, potentially useful relationship between two sets of items.

· These rules are expressed in the form of "If-Then" statements, typically
written as {X} → {Y}, where X and Y are disjoint sets of items.
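
To make "strong" concrete, two conventional measures of rule strength (not
defined in the text above, but standard in the field) are support and
confidence. A minimal Python sketch over the illustrative transactions from
Table 1:

# Each transaction from the illustrative table, as a set of items
transactions = [
    {'Milk', 'Bread'},
    {'Bread', 'Butter'},
    {'Milk', 'Bread', 'Butter', 'Cola'},
    {'Bread', 'Diapers', 'Beer'},
    {'Milk', 'Bread', 'Diapers', 'Beer'},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {'Bread'}, {'Butter'}
rule_support = support(X | Y)              # how often X and Y occur together
confidence = support(X | Y) / support(X)   # how often "If X Then Y" holds

print(f"support({X} -> {Y}) = {rule_support:.2f}")      # 0.40
print(f"confidence({X} -> {Y}) = {confidence:.2f}")     # 0.40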
