
Unsupervised Learning

Road map
• Basic concepts
• Hierarchical clustering
• K-means algorithm
• Representation of clusters
• Distance functions

Supervised learning vs. unsupervised learning
• Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
  – These patterns are then utilized to predict the values of the target attribute in future data instances.
• Unsupervised learning: the data have no target attribute.
  – We want to explore the data to find some intrinsic structures in them.
Clustering
• Clustering is a technique for finding similarity groups in data, called clusters. I.e.,
  – it groups data instances that are similar to (near) each other in one cluster, and data instances that are very different (far away) from each other into different clusters.
• Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning.
• For historical reasons, clustering is often considered synonymous with unsupervised learning.
  – In fact, association rule mining is also unsupervised.

An illustration
• The data set has three natural groups of data points, i.e., 3 natural clusters.


What is clustering for?
• Let us see some real-life examples.
• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
  – Tailor-made for each person: too expensive.
  – One-size-fits-all: does not fit all.
• Example 2: in marketing, segment customers according to their similarities,
  – to do targeted marketing.

What is clustering for? (cont…)
• Example 3: given a collection of text documents, we want to organize them according to their content similarities,
  – to produce a topic hierarchy.
• In fact, clustering is one of the most utilized data mining techniques.
  – It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
  – In recent years, due to the rapid increase of online documents, text clustering has become important.

K-means clustering
• K-means is a partitional clustering algorithm.
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
  – Each cluster has a cluster center, called the centroid.
  – k is specified by the user.

K-means algorithm
• Given k, the k-means algorithm works as follows:
  1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
  2) Assign each data point to the closest centroid.
  3) Re-compute the centroids using the current cluster memberships.
  4) If a convergence criterion is not met, go to 2).

K-means algorithm – (cont …)

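The original slide presents this loop as a figure. Below is a minimal runnable sketch in Python (NumPy is assumed; all function and variable names are illustrative), using “no re-assignments of data points” — criterion 1 on the next slide — as the convergence test:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch. X: (n, r) array of data points."""
    rng = np.random.default_rng(seed)
    # 1) Randomly choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # 2) Assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4) Stop when no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3) Re-compute each centroid as the mean of its current members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

For example, `centroids, labels = kmeans(points, k=3)`.
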
Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),

    SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2    (1)

• Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.

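Equation (1) translates directly into code; a small helper, under the same assumptions as the sketch above:

```python
def sse(X, centroids, labels):
    """Sum of squared distances of each point to its own centroid, eq. (1)."""
    return sum(
        np.sum((X[labels == j] - centroids[j]) ** 2)
        for j in range(len(centroids))
    )
```
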
An example

[Figure: data points with two centroids, marked “+”, during the iterations.]

An example (cont …)

[Figure: the example continued over further iterations.]

Strengths of k-means
• Strengths:
  – Simple: easy to understand and to implement.
  – Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
  – Since both k and t are small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.

Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
  – For categorical data, k-modes is used instead: the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
  – Outliers are data points that are very far away from other data points.
  – Outliers could be errors in the data recording or some special data points with very different values.

Weaknesses of k-means: Problems with outliers

[Figure: an outlier pulls a centroid away from its natural cluster, producing an undesirable clustering.]


Weaknesses of k-means: To deal with outliers
• One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points.
  – To be safe, we may want to monitor these possible outliers over a few iterations and only then decide to remove them. (A sketch of this test follows below.)
• Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
  – Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.

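A sketch of the first method’s core test, continuing the NumPy code from earlier (the 3x-average-distance threshold is an arbitrary illustrative choice, not from the slides):

```python
def flag_outliers(X, centroids, labels, factor=3.0):
    """Flag points whose distance to their own centroid exceeds `factor`
    times the average such distance (the threshold is illustrative)."""
    d = np.linalg.norm(X - centroids[labels], axis=1)
    return d > factor * d.mean()
```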


Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.

[Figure: with poorly chosen initial seeds, the resulting clusters are poor.]


Weaknesses of k-means (cont …)
• If we use different (better) seeds, we get good results.
• There are some methods to help choose good seeds.


Weaknesses of k-means (cont …)
• The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).

[Figure: two non-spherical clusters that k-means cannot separate correctly.]


K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency.
  – Other clustering algorithms have their own lists of weaknesses.
• There is no clear evidence that any other clustering algorithm performs better in general,
  – although one may be more suitable for some specific types of data or applications.
• Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!

K-means example
• K-means is an unsupervised machine learning technique that allows us to cluster data points.
• This enables us to find patterns in the data that can help us analyse it more effectively.
• K-means is an iterative algorithm: it converges to a stable clustering over time (a local optimum, as noted earlier).



To run a k-means clustering
1. Specify the number of clusters you want (usually referred to as k).
2. Randomly initialise the centroids for each cluster (the centroid is the data point at the centre of the cluster).
3. Determine which data points belong to which cluster by finding the closest centroid to each data point.
4. Update the centroids based on the geometric means of all the data points in the cluster.
5. Run 3 and 4 until the centroids stop changing. Each run is referred to as an iteration.

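In practice, one can also use an off-the-shelf implementation such as scikit-learn’s KMeans instead of hand-coding the loop. Note that scikit-learn updates centroids with the arithmetic mean (as in the textbook algorithm earlier), not the geometric mean used in this walk-through:

```python
from sklearn.cluster import KMeans

# k=3 is an arbitrary example value; X is any (n_samples, n_features) array.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each data point
print(km.cluster_centers_)  # final centroids
```
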


Reading dataset

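The slide shows a screenshot of this step; an equivalent sketch, assuming a pandas-readable CSV of player statistics (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("players.csv")  # hypothetical file name
df.head()                        # inspect the first few rows
```
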


Check for missing values in our data, keep only the columns we are interested in, then display the result as given below.

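A sketch of those steps (the chosen columns are hypothetical):

```python
print(df.isnull().sum())             # missing values per column
cols = ["overall", "wage", "value"]  # hypothetical columns of interest
data = df[cols].dropna().copy()      # keep those columns, drop incomplete rows
data.head()
```
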


Implementing the K-means algorithm
1. Scale the data.
2. Initialise random centroids.
3. Label each data point (based on its distance from each centroid).
4. Update the centroids.
5. Repeat steps 3 and 4 until the centroids stop changing.
Each step is sketched in code on the following slides.


1. Scaling: min-max scaling
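A one-line sketch; the target range [1, 10] is an assumption, chosen so every value stays strictly positive for the geometric-mean update later:

```python
# Map every column to [1, 10]; strictly positive values matter later
# because the geometric mean takes logs (the exact range is an assumption).
data = (data - data.min()) / (data.max() - data.min()) * 9 + 1
```
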


Use head to check which players have the highest overall rating, wage, etc.

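For example (column name hypothetical):

```python
# Top five players by overall rating in the original (unscaled) frame.
df.sort_values("overall", ascending=False).head()
```
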


Initialise random centroids

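One simple way, echoing the earlier slide’s “randomly choose k data points (seeds)”: sample k rows of the scaled data (a sketch; k = 3 is arbitrary):

```python
k = 3  # arbitrary example value
# Each sampled row becomes one seed; transpose so that each column
# of `centroids` is one centroid.
centroids = data.sample(n=k, random_state=0).T
```
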


Label each data point according to cluster centers

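A sketch using pandas broadcasting, assuming `centroids` holds one centroid per column as set up above:

```python
import numpy as np

def get_labels(data, centroids):
    # Euclidean distance from every data point to every centroid; `apply`
    # runs once per centroid column, and idxmin then picks, for each row,
    # the column (centroid) with the smallest distance.
    distances = centroids.apply(
        lambda c: np.sqrt(((data - c) ** 2).sum(axis=1))
    )
    return distances.idxmin(axis=1)

labels = get_labels(data, centroids)
```
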


Geometric mean

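Step 4 updates each centroid to the geometric mean of its cluster’s points. The geometric mean of positive values v1, …, vm is (v1 · … · vm)^(1/m), which can be computed as exp of the mean of logs. A sketch of the update, continuing the names above:

```python
def new_centroids(data, labels):
    # exp(mean(log(x))) equals the geometric mean (x1 * ... * xm) ** (1/m);
    # it requires strictly positive values, hence the [1, 10] scaling above.
    return data.groupby(labels).apply(
        lambda g: np.exp(np.log(g).mean())
    ).T

centroids = new_centroids(data, labels)
```
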
Check the players and their attributes

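Finally, map the cluster labels back to the original rows to see which players fell into each cluster (column names hypothetical):

```python
# Show a few players from each cluster alongside their raw attributes.
for cluster, players in df.loc[data.index].groupby(labels):
    print(f"Cluster {cluster}:")
    print(players[["name", "overall", "wage"]].head())
```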
