Unsupervised Learning Update
Road map
Basic concepts
Hierarchical clustering
K-means algorithm
Representation of clusters
Distance functions
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the
data that relate data attributes to a target
(class) attribute.
These patterns are then utilized to predict the
values of the target attribute in future data
instances.
Unsupervised learning: The data have no
target attribute.
We want to explore the data to find some intrinsic
structures in them.
Clustering
Clustering is a technique for finding similarity groups
in data, called clusters. That is,
it groups data instances that are similar to (near) each other
into one cluster, and data instances that are very different (far
away) from each other into different clusters.
Clustering is often called an unsupervised learning
task because no class values denoting an a priori grouping
of the data instances are given, as there would be in
supervised learning.
For historical reasons, clustering is often
considered synonymous with unsupervised learning;
in fact, association rule mining is also unsupervised.
An illustration
The data set has three natural groups of data points,
i.e., 3 natural clusters.
What is clustering for? (cont…)
Example 3: Given a collection of text
documents, we want to organize them
according to their content similarities,
to produce a topic hierarchy.
In fact, clustering is one of the most widely used
data mining techniques.
It has a long history and is used in almost every
field, e.g., medicine, psychology, botany,
sociology, biology, archeology, marketing,
insurance, libraries, etc.
In recent years, due to the rapid growth of online
documents, text clustering has become important.
K-means clustering
K-means is a partitional clustering algorithm
Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued
space X ⊆ R^r, and r is the number of attributes
(dimensions) in the data (a concrete array sketch follows below).
The k-means algorithm partitions the given
data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user
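For concreteness, such a data set can be held as an n × r array. A minimal sketch in Python/NumPy, with made-up values:

    import numpy as np

    # A hypothetical data set D with n = 5 data points and r = 2 attributes;
    # each row is one point xi = (xi1, xi2) in R^2.
    D = np.array([[1.0, 2.0],
                  [1.5, 1.8],
                  [5.0, 8.0],
                  [8.0, 8.0],
                  [1.0, 0.6]])
    n, r = D.shape  # n = 5 points, r = 2 dimensions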
K-means algorithm
Given k, the k-means algorithm works as
follows:
1) Randomly choose k data points (seeds) to be the
initial centroids (cluster centers).
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current
cluster memberships.
4) If a convergence criterion is not met, go to 2).
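These four steps can be sketched in a few lines of Python/NumPy. This is a minimal illustration, not the canonical implementation: the names are mine, it uses Euclidean distance, and it assumes no cluster ever becomes empty.

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1) Randomly choose k data points (seeds) as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # 2) Assign each data point to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3) Re-compute each centroid as the mean of its current members
            #    (assumes every cluster keeps at least one member).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4) Stop when the centroids no longer move (one convergence criterion).
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids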
Stopping/convergence criterion
1. no (or minimum) re-assignments of data
points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared
error (SSE),
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2 \quad (1)
where Cj is the jth cluster, mj is the centroid of cluster Cj
(the mean vector of all the data points in Cj), and
dist(x, mj) is the distance between data point x
and centroid mj.
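Assuming Euclidean distance, equation (1) can be computed directly from the cluster assignments; a small sketch matching the kmeans() function above:

    import numpy as np

    def sse(X, labels, centroids):
        # Equation (1): for each point, the squared distance to the
        # centroid of its own cluster, summed over all points.
        diffs = X - centroids[labels]
        return float((diffs ** 2).sum())

A run can then be stopped once the decrease in this value between iterations falls below a threshold.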
An example
[Figure: k-means iterations on a two-dimensional data set; '+' marks the centroids.]
Strengths of k-means
Strengths:
Simple: easy to understand and to implement
Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
Since both k and t are usually small, k-means is considered a
linear algorithm.
K-means is the most popular clustering algorithm.
Note that it terminates at a local optimum if SSE is
used; the global optimum is hard to find due to the
complexity of the problem.
Weaknesses of k-means
The algorithm is only applicable if the mean is
defined.
For categorical data, the k-modes variant represents the
centroid by the most frequent value of each attribute (see the sketch after this list).
The user needs to specify k.
The algorithm is sensitive to outliers
Outliers are data points that are very far away from
other data points.
Outliers could be errors in the data recording or
some special data points with very different values.
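To illustrate the k-modes idea named above, a mode-based "centroid" for categorical records might be computed as follows (a sketch only; full k-modes also needs a matching-based distance, which is not shown):

    from collections import Counter

    def mode_centroid(rows):
        # Take the most frequent value of each attribute
        # (ties are broken arbitrarily by Counter).
        return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]

    # e.g. mode_centroid([("red", "S"), ("red", "M"), ("blue", "M")])
    # returns ['red', 'M']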
Weaknesses of k-means: Problems with outliers
[Exercise: check the players and their attributes (data table omitted).]