Digital Computer Concept and Practice: Unsupervised Learning
Soohyun Yang
College of Engineering
Department of Civil and Environmental Engineering
Types of ML techniques – All learning is learning!
• Supervised learning (“Presence of labels”) : Regression, Classification
  e.g., Advertisement popularity, Spam classification, Face recognition
• Unsupervised learning (“Absence of labels”) : Clustering <= Our scope
  e.g., Recommender systems (YT), Buying habits (group customers), Grouping user logs
• Reinforcement learning (“Behavior-driven : feedback loop”)
  e.g., Learning to play games (AlphaGo), Industrial simulation, Resource management
https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f
Unsupervised learning
[Figure: cartoon illustrating unsupervised learning; sources: Doshi (2020), https://twitter.com/athena_schools/status/1063013435779223553]
Clustering
Aim : To find a natural grouping in data, such that items in the same cluster
are more similar to each other than to items from different clusters.
=> High within-cluster similarity & low inter-cluster similarity
In unsupervised learning :
• We work with unlabeled samples
• No need to split the data into training and test sets
Representative algorithms :
1) K-means clustering => Prototype-based clustering
2) Agglomerative clustering => Hierarchical clustering
3) DBSCAN (Density-Based Spatial Clustering of Applications with Noise) => Density-based clustering
K-means algorithm
Prototype-based clustering :
Each cluster is represented by a prototype, which is the centroid
(average) of features of samples within the same cluster.
Advantages :
• Very easy to implement
• Computationally very efficient compared to other clustering algorithms
Disadvantages :
• Have to specify the number of clusters, k, in advance
• An inappropriate choice of k can lead to poor clustering performance
K-means algorithm : Principles
Example : 12 randomly generated samples with 2 features
1. Define the number of clusters, k, to group the data (here, k=3)
2. Select k random points within the data => initial centroids!
3. Calculate the (Euclidean) distance between each centroid and the other points
[Figure: scatter plot of the 12 samples, showing the distances d0, d1, d2 from one sample to the three initial centroids]
Initial centroids :
Feature 1    Feature 2
1.47         0.32
8.40         9.47
2.10         -0.22
4. Assign each point to the nearest centroid
   (e.g., the highlighted point joins cluster C1, because d1 < d0 < d2)
5. Calculate the center of each cluster and move the centroids => update!
6. Repeat steps 3-5 until the centroids no longer change.
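The loop above is short enough to sketch directly. Below is a minimal NumPy implementation of steps 1-6, written for this lecture as an illustration; the function and variable names (kmeans, X, centroids, labels) are assumptions, not part of any library.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: returns (labels, centroids) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every point to every centroid -> shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to its nearest centroid
        labels = dists.argmin(axis=1)
        # Step 5: move each centroid to the mean of its assigned points
        # (assumes no cluster becomes empty, for simplicity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```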
Clustering depends on the initial centroids
Let’s take the same 12 samples, but with different initial centroids.
Initial centroids :
Feature 1    Feature 2
3.50         5.10
10.10        8.10
2.10         7.80
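To reproduce a run like this, scikit-learn’s KMeans accepts an ndarray of starting centroids through its init parameter. A minimal sketch, assuming the centroid table above and a stand-in X for the 12 samples:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((12, 2)) * 10   # stand-in for the 12 example samples

# The initial centroids from the table above, shape (k, n_features)
init_centroids = np.array([[3.50, 5.10],
                           [10.10, 8.10],
                           [2.10, 7.80]])

# n_init=1 because the starting points are fixed (no random restarts needed)
km = KMeans(n_clusters=3, init=init_centroids, n_init=1)
labels = km.fit_predict(X)
```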
How to better control initial centroids in Python?
‘KMeans’ class
‘init’ parameter => ‘k-means++’ (default), or ‘random’
• ‘k-means++’: Smarter initialization of the centroids & quicker convergence.
⇒ 1) Randomly select the first centroid from the data points.
⇒ 2) Compute the distance between each point and the nearest previously
chosen centroid.
⇒ 3) Select the next centroid from the data points, favoring points far from the
centroids already chosen (sampled with probability proportional to that squared distance).
⇒ 4) Repeat steps 2-3 until k centroids are determined.
• ‘random’: Choose n_clusters observations (rows) at random from the data for the
initial centroids.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Let’s check the result from the KMeans class
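A minimal sketch of such a check, assuming a hypothetical feature array X; labels_, cluster_centers_, and inertia_ are the attributes the fitted KMeans object exposes:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((12, 2)) * 10   # hypothetical feature array

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)      # cluster index (0..k-1) for each sample

print(labels)                   # cluster assignment per sample
print(km.cluster_centers_)      # final centroid coordinates
print(km.inertia_)              # within-cluster sum of squared distances
```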
How to find the optimal number of clusters?
>> No perfect solution exists for finding the optimal number of clusters!
1) Elbow method : To identify the number of clusters at which the inertia*
stops decreasing rapidly (the “elbow” of the curve); see the sketch below.
(*Inertia: the sum of squared errors (SSE) between each sample and the centroid
of its cluster, summed over all clusters)
https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/c71ea970-0f3c-4973-8d3a-b09a7a6553c1.xhtml
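A minimal sketch of the elbow method, assuming the same hypothetical X; the range of k values (1-10) is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2)) * 10  # hypothetical feature array

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)    # SSE for this choice of k

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia (SSE)')
plt.show()                          # look for the "elbow" where the drop flattens
```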
How to find the optimal number of clusters? (con’t)
>> No perfect solution exists for finding the optimal number of clusters!
2) Silhouette plot : To find the number of clusters for which the average
silhouette coefficient, s, is close to 1 (see the sketch below).
s(i) = (b(i) - a(i)) / max(a(i), b(i)), where -1 ≤ s(i) ≤ 1
• The cluster cohesion, a(i), is the average distance between a sample, x(i), and all other points
in the same cluster.
=> A greater a(i) indicates worse within-cluster similarity for the sample.
• The cluster separation, b(i), is the average distance between the sample, x(i), and all samples
in the nearest neighboring cluster.
=> A greater b(i) indicates better separation of the sample from the nearest cluster.
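A minimal sketch of computing silhouette coefficients with scikit-learn, assuming the same hypothetical X; silhouette_samples returns s(i) per sample and silhouette_score their average:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(0).random((200, 2)) * 10  # hypothetical feature array

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)

s = silhouette_samples(X, labels)       # s(i) for every sample, each in [-1, 1]
print(silhouette_score(X, labels))      # average s: the closer to 1, the better
```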
In-class exercise 1: K-means clustering (KMC)
Let’s solve a clustering problem
via the KMC algorithm
1. Data preparation & import :
InClassData_Weight_Dist_USL.csv
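A minimal sketch of step 1, assuming the CSV file sits in the working directory and that pandas is used for the import; the exact column layout of the file is not shown on the slides:

```python
import pandas as pd

# Load the in-class dataset (file name from the slide; path is an assumption)
df = pd.read_csv('InClassData_Weight_Dist_USL.csv')
print(df.head())     # inspect the first rows and column names
X = df.values        # feature matrix for clustering
```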
In-class exercise 1: KMC (con’t)
2. Conduct the feature scaling
3. Set up the K-means clustering algorithm (see the sketch below)
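A minimal sketch of steps 2-3, assuming StandardScaler for the feature scaling (the slides do not specify which scaler) and the X array loaded in step 1:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Step 2: standardize each feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)   # X from the import sketch above

# Step 3: set up and fit the K-means clustering algorithm (k to be tuned in steps 4-5)
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X_std)
```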
In-class exercise 1: KMC (con’t)
4. Execute the elbow method & the silhouette plot by varying the number
of clusters
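One way to run both diagnostics in a single loop; a sketch assuming the scaled X_std from the previous step:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 11)                   # the silhouette needs at least 2 clusters
inertias, sil_scores = [], []
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels = km.fit_predict(X_std)  # X_std: scaled features from step 2
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_std, labels))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, inertias, marker='o'); ax1.set(xlabel='k', ylabel='Inertia (SSE)')
ax2.plot(ks, sil_scores, marker='o'); ax2.set(xlabel='k', ylabel='Average silhouette')
plt.show()
```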
In-class exercise 1: KMC (con’t)
5. Identify the optimal k based on the visualized results
Take-home points (THPs)
-
-
-
…