Digital Computer Concept and Practice: Unsupervised Learning

The document discusses unsupervised learning techniques, focusing on clustering methods such as K-means, agglomerative clustering, and DBSCAN. It details the principles of the K-means algorithm, including its advantages, disadvantages, and the importance of initial centroid selection. Additionally, it outlines methods for determining the optimal number of clusters, such as the elbow method and the silhouette plot.

035.001 Spring, 2024

Digital Computer Concept and Practice


Unsupervised Learning (1)

Soohyun Yang

College of Engineering
Department of Civil and Environmental Engineering
Types of ML techniques – All learning is learning!

1) Supervised learning : "Presence of labels"
• Regression (e.g., advertisement popularity)
• Classification (e.g., spam classification, face recognition)

2) Unsupervised learning : "Absence of labels" => Our scope
• Recommender systems (YT)
• Clustering : buying habits (group customers), grouping user logs

3) Reinforcement learning : "Behavior-driven : feedback loop"
• Learning to play games (AlphaGo)
• Industrial simulation
• Resource management

https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f
Unsupervised learning
[Figure: illustrations of unsupervised learning — Doshi (2020); https://twitter.com/athena_schools/status/1063013435779223553]
Clustering
 Aim : To find a natural grouping in data so that items in the same cluster
are more similar to each other than to those from different clusters.
=> High within-cluster similarity & Low inter-cluster similarity

 In Unsupervised Learning :
• Work with unlabeled samples
• No need to split the data into training and test sets

 Representative algorithms :
1) K-means clustering (K-평균 군집) => Prototype-based clustering
2) Agglomerative clustering (병합 군집) => Hierarchical-based clustering
3) DBSCAN (Density-Based Spatial Clustering of Applications with Noise) => Density-based clustering
K-means algorithm
 Prototype-based clustering :
Each cluster is represented by a prototype, which is the centroid
(average) of features of samples within the same cluster.

 Advantages :
• Very easy to implement
• Computationally very efficient compared to other clustering algorithms
 Disadvantages :
• The number of clusters, k, must be specified in advance
• An inappropriate choice of k can lead to poor clustering performance
K-means algorithm : Principles
Example : Randomly generated 12 samples with 2 features

1. Define the number of clusters (k) to group the data (here, k=3)
2. Select k random points within the data => initial centroids!

   Initial centroids :
   Feature 1   Feature 2
   1.47         0.32
   8.40         9.47
   2.10        -0.22

3. Calculate the (Euclidean) distance between individual centroids and other points
4. Assign each point to the nearest centroid
   (e.g., a point with distances d0, d1, d2 to the three centroids joins cluster C1 when d1 < d0 < d2)
5. Calculate the center of each cluster and move the centroids => update!
6. Repeat steps 3-5 until the centroids no longer change.
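The update loop in steps 3-6 can be written compactly. Below is a minimal from-scratch sketch in Python/NumPy; the function name kmeans, the iteration cap, and the toy data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means; X is an (n_samples, n_features) array.
    A sketch only: edge cases such as empty clusters are ignored."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3-4: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: 12 random samples with 2 features, as in the slides
X = np.random.default_rng(1).random((12, 2)) * 10
labels, centroids = kmeans(X, k=3)
print(labels)
print(centroids)
```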
Clustering depends on the initial centroids
 Let's take the same 12 samples, but with different initial centroids.

   Feature 1   Feature 2
   3.50         5.10
   10.10        8.10
   2.10         7.80
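One way to see this dependence is to run a single K-means pass from several random starts. A hedged sketch (the seed values and toy data are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).random((12, 2)) * 10  # same toy data as above

# n_init=1 disables restarts, so the result depends entirely on the
# single random initialization selected via random_state.
for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    km.fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.3f}  labels={km.labels_}")
```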
How to better control initial centroids in python?
 'KMeans' class
 'init' parameter => 'k-means++' (default), or 'random'
• 'k-means++' : Smarter initialization of the centroids & quicker convergence.
⇒ 1) Randomly select the first centroid from the data points.
⇒ 2) Compute the distance between each point and the nearest previously chosen centroid.
⇒ 3) Select the next centroid from the data points, favoring points far from their nearest centroid (formally, with probability proportional to the squared distance).
⇒ 4) Repeat steps 2-3 until k centroids are determined.
• 'random' : Choose n_clusters observations (rows) at random from the data for the initial centroids.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Let’s check the result from the KMeans class
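The slide shows the result as a screenshot; below is a minimal sketch of an equivalent call, assuming synthetic blob data rather than the slide's exact 12 samples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-feature data standing in for the slide's example
X, _ = make_blobs(n_samples=12, centers=3, n_features=2, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
y_km = km.fit_predict(X)

print("Cluster labels :", y_km)
print("Final centroids:\n", km.cluster_centers_)
print("Inertia (SSE)  :", km.inertia_)
```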
How to find the optimal number of clusters?
>> No perfect solution exists for finding the optimal number of clusters!
 1) Elbow method (엘보우 방법) : To identify the number of clusters at which the decrease in the inertia* slows down most sharply (the "elbow" of the curve).
(*The sum of squared errors (SSE) between each sample and the centroid of the cluster it is assigned to)

https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/c71ea970-0f3c-4973-8d3a-b09a7a6553c1.xhtml
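A sketch of the elbow computation on illustrative blob data (the data and the range of k are assumptions, not the slide's figures):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)   # SSE of samples to their assigned centroids

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (SSE)")
plt.title("Elbow method")
plt.show()
```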
How to find the optimal number of clusters? (con't)
>> No perfect solution exists for finding the optimal number of clusters!
 2) Silhouette plot (실루엣 그래프) : To find the number of clusters where the average silhouette coefficient s is close to 1.

s(i) = (b(i) - a(i)) / max{a(i), b(i)},   where -1 ≤ s(i) ≤ 1

• The cluster cohesion, a(i), is the average distance between a sample, x(i), and all other points in the same cluster.
=> A greater a(i) indicates worse within-cluster similarity of the sample.

• The cluster separation, b(i), is the average distance between the sample, x(i), and all samples in the nearest neighboring cluster.
=> A greater b(i) indicates better separation from the nearest cluster (i.e., lower inter-cluster similarity).
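A sketch using scikit-learn's silhouette utilities; the blob data stands in for the slide's example, and the plotting layout is a simplified assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
y_km = km.fit_predict(X)

avg_s = silhouette_score(X, y_km)          # mean s(i) over all samples
print("Average silhouette coefficient:", avg_s)

# Per-sample coefficients, grouped and sorted by cluster, as in a silhouette plot
s_vals = silhouette_samples(X, y_km)
y_lower = 0
for c in np.unique(y_km):
    c_vals = np.sort(s_vals[y_km == c])
    plt.barh(np.arange(y_lower, y_lower + len(c_vals)), c_vals, height=1.0)
    y_lower += len(c_vals)
plt.axvline(avg_s, color="red", linestyle="--")  # average as a reference line
plt.xlabel("Silhouette coefficient s(i)")
plt.ylabel("Samples (grouped by cluster)")
plt.show()
```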
In-class exercise 1: K-means clustering (KMC)
 Let's solve a clustering problem via the KMC algorithm.
 1. Data preparation & import : InClassData_Weight_Dist_USL.csv
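The slide's code screenshot is not reproduced here; a minimal import sketch, assuming the CSV sits in the working directory (its column names are not given in the slides):

```python
import pandas as pd

# Load the exercise data; columns follow whatever the CSV header defines
df = pd.read_csv("InClassData_Weight_Dist_USL.csv")
print(df.head())

X = df.values  # use the numeric feature columns for clustering
```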
In-class exercise 1: KMC (con't)
 2. Conduct the feature scaling
 3. Set up the K-means CA (clustering algorithm)
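A sketch of steps 2-3, assuming the feature matrix X from the import sketch above:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Step 2: standardize the features so that the Euclidean distance
# is not dominated by the feature with the largest scale
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Step 3: set up the K-means estimator (k itself is tuned in step 4)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
```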
In-class exercise 1: KMC (con’t)
 4. Execute the elbow method & the silhouette plot by varying the number
of clusters
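A combined sketch for step 4, reusing X_std from the previous sketch; the candidate range k = 2..10 is an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X_std)
    print(f"k={k:2d}  inertia={km.inertia_:10.3f}  "
          f"avg silhouette={silhouette_score(X_std, labels):.3f}")
```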
In-class exercise 1: KMC (con’t)
 5. Identify the optimal k based on the visualized results
Take-home points (THPs)
-
-
-
…
