Lect 2 Supervised and Unsupervised Learning

The document discusses supervised and unsupervised learning, focusing on clustering as an unsupervised learning technique. It explains the K-means clustering algorithm, its strengths and weaknesses, and highlights the importance of clustering in various fields. Additionally, it covers hierarchical clustering methods and their applications.

AIAS
Amity Institute of Applied Sciences

STAT669
Supervised learning vs. unsupervised learning

Dr. Niraj Kr. Singh
• Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute.
  – These patterns are then utilized to predict the values of the target attribute in future data instances.
• Unsupervised learning: the data have no target attribute.
  – We want to explore the data to find some intrinsic structure in them.
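A toy sketch of the contrast (all data and names here are made up for illustration): with a target attribute we can predict labels for future instances; without one we can only look for intrinsic structure, here by splitting the values at their largest gap.

```python
# Supervised: each instance carries a target (class) attribute.
labeled = [(1.0, "small"), (1.2, "small"), (5.0, "large"), (5.3, "large")]

def predict(x, data):
    """Predict the class of x from its nearest labeled instance (1-NN)."""
    return min(data, key=lambda item: abs(item[0] - x))[1]

# Unsupervised: no target attribute; we look for intrinsic structure,
# here simply by splitting the sorted values at the largest gap.
unlabeled = [1.0, 1.2, 5.0, 5.3]

def split_at_largest_gap(values):
    vs = sorted(values)
    gaps = [vs[i + 1] - vs[i] for i in range(len(vs) - 1)]
    cut = gaps.index(max(gaps))
    return vs[:cut + 1], vs[cut + 1:]

print(predict(1.1, labeled))            # class predicted for a new instance
print(split_at_largest_gap(unlabeled))  # two intrinsic groups
```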
Clustering
• Clustering is a technique for finding similarity groups in data, called clusters, i.e.,
  – it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
• Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning.
• For historical reasons, clustering is often considered synonymous with unsupervised learning.
  – In fact, association rule mining is also unsupervised.
• This chapter focuses on clustering.
An illustration
• The data set has three natural groups of data points, i.e., 3 natural clusters.
[Figure: scatter plot of the data set showing the three natural clusters]

What is clustering for?
• Let us see some real-life examples.
• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
  – Tailor-made for each person: too expensive.
  – One-size-fits-all: does not fit all.
• Example 2: in marketing, segment customers according to their similarities,
  – to do targeted marketing.
• Example 3: given a collection of text documents, we want to organize them according to their content similarities,
  – to produce a topic hierarchy.
• In fact, clustering is one of the most utilized data mining techniques.
  – It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
  – In recent years, due to the rapid increase of online documents, text clustering has become important.
Aspects of clustering
• A clustering algorithm
  – Partitional clustering
  – Hierarchical clustering
• A distance (similarity, or dissimilarity) function
• Clustering quality
  – Inter-cluster distance → maximized
  – Intra-cluster distance → minimized
• The quality of a clustering result depends on the algorithm, the distance function, and the application.
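The two quality criteria can be made concrete with Euclidean distance. This is a hypothetical sketch with our own names and numbers, not from the slides:

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def centroid(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def intra_cluster(cluster):
    """Average distance of the cluster's points to its own centroid (to be minimized)."""
    c = centroid(cluster)
    return sum(dist(p, c) for p in cluster) / len(cluster)

def inter_cluster(c1, c2):
    """Distance between two cluster centroids (to be maximized)."""
    return dist(centroid(c1), centroid(c2))

a = [(0, 0), (0, 2)]
b = [(10, 0), (10, 2)]
print(intra_cluster(a))     # small value: a tight cluster
print(inter_cluster(a, b))  # large value: well-separated clusters
```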
K-means clustering
• K-means is a partitional clustering algorithm.
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
  – Each cluster has a cluster center, called the centroid.
  – k is specified by the user.
K-means algorithm
• Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
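The four steps can be sketched in plain Python for 1-D data. This is an illustrative sketch, not the lecture's own code; for reproducibility it seeds with the first k points rather than a random choice.

```python
def kmeans(points, k, max_iter=50):
    centroids = points[:k]                      # step 1: initial seeds
    for _ in range(max_iter):
        # step 2: assign each data point to the closest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[j].append(x)
        # step 3: re-compute the centroids from current memberships
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # step 4: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.5, 8.0, 8.5, 9.0], k=2)
print(centroids)  # the two cluster centers
print(clusters)   # the corresponding memberships
```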

Stopping/convergence criterion
1. no (or minimum) re-assignment of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE):

   SSE = Σ_{j=1}^{k} Σ_{x ∈ C_j} dist(x, m_j)²    (1)

– C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j), and dist(x, m_j) is the distance between data point x and centroid m_j.
An example
[Figure: k-means iterations on sample data; “+” marks the centroids]

An example: distance function
[Figure: distance computation for the example data]

A disk version of k-means
• K-means can be implemented with data on disk.
  – In each iteration, it scans the data once,
  – as the centroids can be computed incrementally.
• It can be used to cluster large datasets that do not fit in main memory.
• We need to control the number of iterations.
  – In practice, a limit is set (e.g., < 50).
• It is not the best method; there are other scale-up algorithms, e.g., BIRCH.
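The incremental idea can be sketched as one sequential pass that keeps only k running sums and counts in memory (an illustrative sketch; the list below stands in for records streamed from disk):

```python
def assign_and_update(stream, centroids):
    """One scan: assign each point to its nearest centroid while
    accumulating sum/count, so new centroids need no second pass."""
    k = len(centroids)
    sums = [0.0] * k
    counts = [0] * k
    for x in stream:                    # sequential scan of the data
        j = min(range(k), key=lambda i: abs(x - centroids[i]))
        sums[j] += x
        counts[j] += 1
    return [sums[i] / counts[i] if counts[i] else centroids[i]
            for i in range(k)]

print(assign_and_update([1.0, 2.0, 9.0, 11.0], [1.0, 10.0]))
```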

Strengths of k-means
• Simple: easy to understand and to implement.
• Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
  – Since both k and t are usually small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note: it terminates at a local optimum if SSE is used; the global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
  – For categorical data, k-modes can be used: the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
  – Outliers are data points that are very far away from other data points.
  – Outliers could be errors in the data recording or special data points with very different values.
Weaknesses of k-means: problems with outliers
[Figure: an outlier pulling a centroid away from its natural cluster]

Weaknesses of k-means: dealing with outliers
• One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points.
  – To be safe, we may want to monitor these possible outliers over a few iterations and then decide whether to remove them.
• Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
  – Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
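The first method might look like the following sketch (our own thresholding rule, with a made-up `factor` parameter): points far beyond the cluster's average centroid distance are flagged as candidates for removal.

```python
def flag_outliers(cluster, centroid, factor=2.0):
    """Return points more than `factor` times the mean
    centroid distance away (candidates for removal)."""
    dists = [abs(x - centroid) for x in cluster]
    mean_d = sum(dists) / len(dists)
    return [x for x, d in zip(cluster, dists) if d > factor * mean_d]

cluster = [1.0, 1.2, 0.8, 1.1, 25.0]
centroid = sum(cluster) / len(cluster)
print(flag_outliers(cluster, centroid))  # the far-away point
```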
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.

Weaknesses of k-means (cont …)
• If we use different seeds, we may get good results.
• There are some methods to help choose good seeds.

• The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).

K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency,
  – and other clustering algorithms have their own lists of weaknesses.
• There is no clear evidence that any other clustering algorithm performs better in general,
  – although some may be more suitable for specific types of data or applications.
• Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!

Common ways to represent clusters
• Use the centroid of each cluster to represent the cluster;
  – compute the radius and
  – standard deviation of the cluster to determine its spread in each dimension.
• The centroid representation alone works well if the clusters are of hyper-spherical shape.
• If clusters are elongated or of other shapes, centroids are not sufficient.

Using a classification model
• All the data points in a cluster are regarded as having the same class label, e.g., the cluster ID.
  – Run a supervised learning algorithm on the data to find a classification model.
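A minimal sketch of this idea (made-up data; the "model" here is a single threshold, i.e., a decision stump, standing in for a real supervised learner): the cluster IDs become class labels for training.

```python
def fit_stump(points, labels):
    """Find the 1-D threshold that best separates label 0 from label 1."""
    best = None
    for t in points:                      # candidate thresholds
        errors = sum((x > t) != bool(y) for x, y in zip(points, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = [0, 0, 0, 1, 1, 1]              # cluster IDs from a clustering step
threshold = fit_stump(points, labels)
print(threshold)                         # the learned split
print(int(7.0 > threshold))              # cluster predicted for a new point
```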

Use frequent values to represent the cluster
• This method is mainly for clustering of categorical data (e.g., k-modes clustering).
• It is the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.

Clusters of arbitrary shapes
• Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroids together with spreads.
• Irregular-shape clusters are hard to represent. They may not be useful in some applications.
  – Using centroids is not suitable (upper figure) in general.
  – K-means clusters may be more useful (lower figure), e.g., for making 2 sizes of T-shirts.

Hierarchical Clustering
• Produces a nested sequence of clusters, i.e., a tree, also called a dendrogram.

Types of hierarchical clustering
• Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level, and
  – merges the most similar (or nearest) pair of clusters,
  – stops when all the data points are merged into a single cluster (i.e., the root cluster).
• Divisive (top-down) clustering: starts with all data points in one cluster, the root.
  – Splits the root into a set of child clusters; each child cluster is recursively divided further.
  – Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.

Agglomerative clustering
• It is more popular than divisive methods.
• At the beginning, each data point forms its own cluster (also called a node).
• Merge the nodes/clusters that have the least distance.
• Continue merging.
• Eventually all nodes belong to one cluster.
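The merging steps above can be sketched in plain Python for 1-D points using single-link distance (the closest pair between two clusters). This is an illustrative sketch written for this note, not the slides' algorithm.

```python
def agglomerate(points, target_clusters=1):
    clusters = [[x] for x in points]      # each point starts as its own cluster
    while len(clusters) > target_clusters:
        # find the pair of clusters with the least single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)    # merge the nearest pair of clusters
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1], target_clusters=2))
```

Stopping at `target_clusters=2` corresponds to cutting the dendrogram at two clusters; with the default of 1, merging continues until all points form the root cluster.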