Chapter 3 p4

Unsupervised machine learning involves training models on unlabeled datasets, primarily using clustering and association algorithms. Clustering groups similar data points together, while association rules identify relationships between variables, aiding in tasks like market analysis. K-Means is a popular clustering algorithm that partitions data into predefined clusters based on proximity to centroids, with advantages in simplicity and efficiency but challenges with sensitivity to outliers and scalability.


Unsupervised Machine Learning

Unsupervised learning is a type of machine learning in which models are trained
on an unlabeled dataset and are allowed to act on that data without any
supervision.
Types of Unsupervised Learning Algorithms:
•Clustering:
• Clustering is a method of grouping objects into clusters
such that objects with the most similarities remain in one
group and have few or no similarities with the objects of
another group.
• Cluster analysis finds the commonalities between the data
objects and categorizes them according to the presence or
absence of those commonalities.

•Association:
• An association rule is an unsupervised learning method
used for finding relationships between variables in large
databases.
• It determines the sets of items that occur together in the
dataset.
• Association rules make marketing strategies more effective.
For example, people who buy item X (say, bread) also tend
to purchase item Y (butter or jam). A typical example of
association rules is Market Basket Analysis.
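The idea behind such rules can be sketched in base R by counting how often items co-occur across shopping baskets. The basket data below is made up purely for illustration; real analyses use dedicated packages such as arules:

```r
# Toy transactions (hypothetical data for illustration)
baskets <- list(
  c("bread", "butter", "jam"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter", "jam")
)

# Support of an itemset = fraction of baskets containing all its items
support <- function(items, baskets) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Confidence of the rule {bread} -> {butter}
conf <- support(c("bread", "butter"), baskets) / support("bread", baskets)

support(c("bread", "butter"), baskets)  # 0.5: bread and butter co-occur in 2 of 4 baskets
conf                                    # ~0.67: two thirds of bread buyers also buy butter
```

A rule like {bread} -> {butter} is considered interesting when both its support and its confidence exceed chosen thresholds.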
Advantages of Unsupervised Learning

• Unsupervised learning is used for more complex tasks
compared to supervised learning, because it does not
require labeled input data.
• Unsupervised learning is often preferable because unlabeled
data is easier to obtain than labeled data.

Disadvantages of Unsupervised Learning


• Unsupervised learning is intrinsically more difficult
than supervised learning because there is no corresponding
output to learn from.
• The results of an unsupervised learning algorithm may be
less accurate, since the input data is not labeled and the
algorithm does not know the exact output in advance.
Clustering in Machine Learning

• Clustering or cluster analysis is a machine learning technique
that groups an unlabeled dataset. It can be defined as "a
way of grouping the data points into different clusters
consisting of similar data points. The objects with
possible similarities remain in a group that has few or
no similarities with another group."
• It does this by finding similar patterns in the unlabeled
dataset, such as shape, size, color, or behavior, and divides
the data according to the presence or absence of those
patterns.
• It is an unsupervised learning method, hence no supervision is
provided to the algorithm, and it deals with an unlabeled
dataset.
• The clustering technique is commonly used for statistical
data analysis.
• Example: Let's understand the clustering technique with the
real-world example of a shopping mall. When we visit a mall,
we can observe that things with similar usage are grouped
together: t-shirts are in one section and trousers in
another, and in the produce section apples, bananas,
mangoes, etc. are grouped separately, so that we can easily
find things. The clustering technique works in the same
way.

• The clustering technique can be widely used in various
tasks. Some of the most common uses of this technique are:
1. Market segmentation
2. Statistical data analysis
3. Social network analysis
4. Image segmentation
5. Anomaly detection, etc.
The below diagram explains the working of the clustering
algorithm. We can see that the different fruits are divided into
several groups with similar properties.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
• Hard Clustering: in hard clustering, each data point either belongs to a cluster
completely or not. For example, in the retail scenario above, each customer is put into
exactly one of the 3 groups.
• Soft Clustering: in soft clustering, instead of putting each data point into exactly one
cluster, a probability or likelihood of that data point belonging to each cluster is
assigned. For example, in the same scenario, each customer is assigned a probability of
belonging to each of the 3 clusters of the retail store.
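The difference can be illustrated with a small sketch: hard clustering assigns a point to its single nearest cluster center, while soft clustering turns distances into membership probabilities. The centers, the data point, and the temperature constant below are all hypothetical choices for illustration:

```r
# Two hypothetical 1-D cluster centers and one data point
centers <- c(2, 8)
x <- 4

# Hard assignment: index of the nearest center
hard <- which.min(abs(x - centers))   # 1: x is closer to the center at 2

# Soft assignment: turn negative squared distances into probabilities
# (dividing by 10 is an arbitrary "temperature" that softens the split)
d2 <- (x - centers)^2
probs <- exp(-d2 / 10) / sum(exp(-d2 / 10))
round(probs, 3)                       # higher membership for the nearer cluster
```

Note that the soft memberships always sum to 1, so each point spreads its "weight" across all clusters instead of committing to one.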
Types of clustering algorithms
Connectivity models:
• As the name suggests, these models are based on the notion that data points closer
together in data space are more similar to each other than data points lying farther
apart.
• These models can follow two approaches. In the first approach, they start by
treating every data point as a separate cluster and then aggregate clusters as the
distance decreases.
• In the second approach, all data points start in a single cluster, which is then
partitioned as the distance increases. The choice of distance function is subjective.
• These models are very easy to interpret but lack scalability for handling big datasets.
Examples of these models are the hierarchical clustering algorithm and its variants.
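The bottom-up (agglomerative) approach described above is available in base R through hclust; a minimal sketch on the iris measurements (the choice of complete linkage and of 3 clusters here is illustrative):

```r
# Agglomerative (bottom-up) hierarchical clustering on the iris measurements
data(iris)
d <- dist(iris[, -5])                # pairwise Euclidean distances
hc <- hclust(d, method = "complete") # merge the closest clusters step by step

# Cut the resulting dendrogram to obtain 3 clusters
groups <- cutree(hc, k = 3)
table(groups)
```

Cutting the dendrogram at different heights yields different numbers of clusters, which is why these models are easy to interpret but expensive on large datasets (the distance matrix alone is O(n²)).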
Centroid models:
• These are iterative clustering algorithms in which the notion of similarity is derived
from the closeness of a data point to the centroid of a cluster.
• The K-Means clustering algorithm is a popular algorithm that falls into this category.
In these models, the number of clusters has to be specified beforehand, which makes
it important to have prior knowledge of the dataset. These models run iteratively to
find a local optimum.
Distribution models:
• These clustering models are based on the notion of how probable it is that all data
points in a cluster belong to the same distribution (for example, Normal or
Gaussian). These models often suffer from overfitting.
• A popular example of these models is the Expectation-Maximization algorithm,
which uses multivariate normal distributions.

Density models:
• These models search the data space for regions of varying density of data points.
• They isolate the different density regions and assign the data points within each
region to the same cluster.
• Popular examples of density models are DBSCAN and OPTICS.
K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used
to solve clustering problems in machine learning and data science.
In this topic, we will learn what the k-means clustering algorithm is
and how it works, along with an R implementation of k-means
clustering.
What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm which
groups an unlabeled dataset into different clusters. Here K defines
the number of pre-defined clusters that need to be created in the
process: if K=2 there will be two clusters, for K=3 there will be
three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k
different clusters in such a way that each data point belongs to only
one group, whose members share similar properties.
It allows us to cluster the data into different groups and is a
convenient way to discover the categories of groups in an unlabeled
dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a
centroid. The main aim of this algorithm is to minimize the sum of
distances between the data points and their corresponding cluster
centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset
into k clusters, and repeats the process until it finds the best
clusters. The value of k must be predetermined in this algorithm.
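Since k must be fixed in advance, a common heuristic (not covered in the text above) is the "elbow method": run k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. A base-R sketch, using the iris measurements as example data:

```r
# Elbow method: total within-cluster sum of squares for k = 1..6
data(iris)
iris_1 <- iris[, -5]
set.seed(240)

wss <- sapply(1:6, function(k)
  kmeans(iris_1, centers = k, nstart = 20)$tot.withinss)

plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")  # the bend ("elbow") suggests a good k
```

For iris, the curve bends around k = 2 or 3, which matches the three species in the data.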

The k-means clustering algorithm mainly performs two tasks:

•Determines the best values for the K center points (centroids) by an
iterative process.
•Assigns each data point to its closest k-center. The data points that
are near a particular k-center form a cluster.
Hence each cluster contains data points with some commonalities and is
distinct from the other clusters.
The below diagram explains the working of the K-means
Clustering Algorithm:
•Step 1: Choose K cluster centers in the feature space randomly.
•Step 2: Assign each observation to its nearest cluster center
(centroid). This results in initial groups of observations.
•Step 3: Shift each centroid to the mean of the coordinates within its
group.
•Step 4: Reassign observations according to the new centroids. New
boundaries are created, so observations may move from one group to
another.
•Repeat until no observation changes groups.
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not come from the input
dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Compute the variance and place a new centroid for each cluster.
Step-5: Repeat step 3, i.e. reassign each data point to the new closest centroid
of each cluster.
Step-6: If any reassignment occurred, go to step 4; otherwise go to FINISH.
Step-7: The model is ready.
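The steps above can be sketched directly in base R. This is a minimal illustration of the algorithm, not the optimized built-in kmeans() used in the example below; the function name kmeans_sketch and the guard against empty clusters are our own additions:

```r
# A from-scratch sketch of the k-means steps (illustration only)
kmeans_sketch <- function(X, k, iters = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # Step 2: random initial centroids
  cl <- rep(0L, nrow(X))
  for (i in seq_len(iters)) {
    # Steps 3/5: assign each point to its closest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centroids[j, ])^2))
    new_cl <- max.col(-d)
    if (all(new_cl == cl)) break        # Step 6: stop when no point changes cluster
    cl <- new_cl
    # Step 4: move each non-empty cluster's centroid to the mean of its points
    for (j in seq_len(k))
      if (any(cl == j)) centroids[j, ] <- colMeans(X[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centroids)
}

set.seed(240)
res <- kmeans_sketch(iris[, -5], 3)
table(res$cluster)
```

Because the initial centroids are random, different seeds can give different local optima, which is why the built-in kmeans() offers nstart to rerun from several initializations and keep the best result.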
Performing K-Means Clustering on Dataset

# Loading data
data(iris)

# Structure
str(iris)

# Installing packages
install.packages("ClusterR")
install.packages("cluster")

# Loading packages
library(ClusterR)
library(cluster)

# Removing initial label of
# Species from original dataset
iris_1 <- iris[, -5]

# Fitting K-Means clustering model
# to the dataset
set.seed(240) # Setting seed
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re

# Cluster identification for
# each observation
kmeans.re$cluster

# Confusion matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm
# Model evaluation and visualization
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster,
     main = "K-means with 3 clusters")

## Plotting cluster centers
kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]

# cex is the point size, pch is the plotting symbol
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)
## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste("Cluster iris"),
xlab = 'Sepal.Length',
ylab = 'Sepal.Width')
Advantages of the K-means algorithm
1) Simple and easy to understand and implement.
2) K-means is the most popular clustering algorithm because it
provides easily interpretable clustering results.
3) Fast and efficient in terms of computational cost; excellent
for pre-clustering in comparison to other clustering
algorithms.
Disadvantages of the K-means algorithm
1) The algorithm is only applicable if the mean is defined.
For categorical data, K-modes is used instead, where the
centroid is represented by the most frequent values.
2) The algorithm is sensitive to outliers (outliers are data
points that are very far away from the other data points).
3) The algorithm can be slow and does not scale to a very
large number of data points.
