
Machine Learning

Chapter 2 Clustering
Dr. Minhhuy Le
EEE, Phenikaa University
Chapter 2: Clustering
1. Decision tree review
2. Clustering intuition
3. K-means algorithm
4. Summary
1. Random Forest Review
Example:
f: <Outlook, Temperature, Humidity, Wind> => PlayTennis?



1. Random Forest Review
ID3 approach: a natural greedy approach to growing a decision tree top-down, from the root to the leaves, by repeatedly replacing an existing leaf with an internal node.

Algorithm:
• Pick “best” attribute to split at the root based on training data.
• Recurse on children that are impure (e.g., contain both Yes and No examples).

Key question: Which attribute is best?



1. Random Forest Review
ID3 approach: Select attribute with highest information gain (IG)

The information gain of attribute A is the expected reduction in the entropy of the target variable Y for data sample S, due to sorting S on attribute A.
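In the standard ID3 notation (where S_v denotes the subset of S for which attribute A takes value v), this is:

\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \qquad H(S) = -\sum_{c} p_c \log_2 p_c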



1. Random Forest Review
ID3 steps:

1. Calculate the entropy of the target variable.
2. Calculate the information gain (IG) of each feature.
3. Choose the feature with the largest IG as the root node.
4. If a branch has entropy = 0 it becomes a leaf; if ≠ 0 it is split further.
5. Repeat until all data are classified.
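A minimal Python sketch of steps 1–2, the entropy and information-gain computations; the helper names and the toy PlayTennis-style data below are illustrative, not taken from the slides:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (base 2) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, labels):
    # Expected reduction in entropy of `labels` from splitting on `attribute`.
    # `examples` is a list of dicts mapping attribute names to values.
    n = len(labels)
    by_value = {}
    for ex, y in zip(examples, labels):
        by_value.setdefault(ex[attribute], []).append(y)
    remainder = sum(len(subset) / n * entropy(subset) for subset in by_value.values())
    return entropy(labels) - remainder

# Toy usage (hypothetical data):
X = [{"Outlook": "Sunny", "Wind": "Weak"},
     {"Outlook": "Sunny", "Wind": "Strong"},
     {"Outlook": "Overcast", "Wind": "Weak"},
     {"Outlook": "Rain", "Wind": "Weak"}]
y = ["No", "No", "Yes", "Yes"]
print(information_gain(X, "Outlook", y))  # the attribute with the largest IG becomes the root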

Hyperparameters



1. Random Forest Review
Random Forest steps:

1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.
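As a usage sketch (assuming scikit-learn is available; the toy dataset is illustrative), these steps roughly correspond to what RandomForestClassifier does internally:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the training data;
# the forest combines the trees' votes to produce the final prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))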



2. Clustering Intuition
Supervised learning: the training set contains labeled examples (inputs x with labels y).
Unsupervised learning: the training set contains only inputs x; there is no label data (y).



2. Clustering Intuition
Clustering: finding structure in the data
• by isolating groups of examples that are similar in some well-defined sense
• unsupervised learning: only input data, no label information

How many clusters is best? That depends on the measure of similarity (or distance) between the data points to be clustered.



3. K-means algorithm
Clustering methods:
• Hierarchical clustering methods
• Spectral clustering
• Semi-supervised clustering
• Clustering by dynamics
• Flat clustering methods: k-means clustering
• Etc.



3. K-means algorithm
Procedure:
1. Pick k arbitrary centroids (cluster means).
2. Assign each sample to its closest centroid.
3. Adjust each centroid to be the mean of the examples assigned to it.
4. Repeat steps 2–3 until the assignments no longer change.

With this stopping rule, the K-means algorithm is guaranteed to converge in a finite number of iterations.
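A minimal NumPy sketch of this loop (the "pick k random samples" initialization and the toy data are assumptions for illustration, not from the slides):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k arbitrary centroids: here, k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = None
    for _ in range(max_iter):
        # 2. Assign each sample to its closest centroid (squared Euclidean distance).
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignments = distances.argmin(axis=1)
        # 4. Stop when the assignments no longer change.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # 3. Move each centroid to the mean of the samples assigned to it.
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)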



3. K-means algorithm
Procedure illustration: (figures on the original slides)


3. K-means algorithm
How to choose k?

The clustering could be repeated several times (e.g., with different initializations) and the best solution kept.



3. K-means algorithm
How to choose k? “Elbow” method
• As k grows (smaller clusters), the distortion score J decreases; however, the model easily overfits.
• k should be chosen at the “elbow”, beyond which increasing k no longer reduces J significantly.

J: distortion score, i.e., the sum of squared distances from each sample to its assigned centroid.
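A minimal sketch of the elbow computation, assuming scikit-learn, where the fitted model's inertia_ attribute plays the role of the distortion score J; the toy data are illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, (50, 2)) for loc in (0, 5, 10)])  # three toy blobs

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # plot k vs. J and pick k at the "elbow"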



4. Summary

• K-means is a parametric method, where the parameters are the prototypes (centroids).
• Inflexible: the decision boundaries between clusters are linear.
• Fast! The update steps can be parallelized.
• There are several variations on the basic K-means algorithm:
  1. K-means++ gives a more principled way to initialize the cluster centroids.
  2. K-medoids chooses the centermost datapoint in the cluster as the prototype instead of the centroid. (The centroid may not correspond to a datapoint.)
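For reference, scikit-learn's KMeans supports the K-means++ initialization (it is the default); a minimal sketch with toy data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))  # illustrative data
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the learned prototypes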

