
UNSUPERVISED LEARNING

AND CLUSTERING
Jeff Robble, Brian Renzenbrink, Doug Roberts
Unsupervised Procedures
A procedure that uses unlabeled data in its classification process.
Why would we use these?
• Collecting and labeling large data sets can be costly.
• Users sometimes wish to group data first and label the groupings second.
• In some applications the pattern characteristics change over time; unsupervised procedures can handle these situations.
• Unsupervised procedures can be used to find useful features for classification.
• Unsupervised learning can provide insight into the structure of the data that helps in designing a classifier.
Unsupervised vs. Supervised
Unsupervised learning can be thought of as finding patterns in the data
above and beyond what would be considered pure unstructured noise. How
does it compare to supervised learning?

With unsupervised learning it is possible to learn larger and more
complex models than with supervised learning. In supervised learning
one tries to find the connection between two sets of observations,
while unsupervised learning tries to identify the latent variables that
caused a single set of observations.

The difference between supervised and unsupervised learning can be
thought of as the difference between discriminant analysis and cluster
analysis.
Mixture Densities
We assume that p(x|ωj) can be represented in a functional form that is
determined by the value of a parameter vector θj.

For example, p(x|ωj) ~ N(µj, Σj), where N denotes a normal (Gaussian)
distribution and θj consists of the components µj and Σj that
characterize the mean and covariance of the distribution.

We need the probability of ωj given x, but we do not know the exact
values of the θ components that go into making the decision. We need to
solve

P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}

but instead of p(x|ωj) we have p(x|ωj, θj). The denominator is the
mixture density

p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)        (1)
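To make Eq. 1 concrete, here is a small Python sketch (not from the
slides) that evaluates a two-component Gaussian mixture density and the
corresponding posterior; the priors, means, and covariances are
invented for illustration.

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component mixture in 2-D: theta_j = (mu_j, Sigma_j),
# with mixing priors P(omega_j) that sum to one.
priors = [0.4, 0.6]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def mixture_density(x):
    """Eq. 1: p(x | theta) = sum_j p(x | omega_j, theta_j) P(omega_j)."""
    return sum(P * multivariate_normal.pdf(x, mean=mu, cov=S)
               for P, mu, S in zip(priors, means, covs))

def posterior(x):
    """Bayes rule: P(omega_j | x) = p(x | omega_j, theta_j) P(omega_j) / p(x | theta)."""
    joint = np.array([P * multivariate_normal.pdf(x, mean=mu, cov=S)
                      for P, mu, S in zip(priors, means, covs)])
    return joint / joint.sum()

print(mixture_density([1.0, 1.0]))   # a single density value
print(posterior([1.0, 1.0]))         # posterior over the two classes, sums to 1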
Mixture Densities
p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)        (1)

The p(x|ωj, θj) are the component densities and the P(ωj) are the
mixing parameters.

We make the following assumptions:
• The samples come from a known number c of classes.
• The prior probabilities P(ωj) for each class are known, j = 1…c.
• The forms of the class-conditional probability densities p(x|ωj, θj)
are known, j = 1…c.
• The values of the c parameter vectors θ1, …, θc are unknown.
• The category labels are unknown → unsupervised learning.

Consider the following mixture density, where x is binary:

P(x \mid \theta) = \tfrac{1}{2}\, \theta_1^{x} (1 - \theta_1)^{1-x} + \tfrac{1}{2}\, \theta_2^{x} (1 - \theta_2)^{1-x}
Identifiability: Estimate Unknown Parameter Vector

1
1 x 1 x  2 (θ1 + θ 2 ) if x = 1
P(x | θ) = θ1 (1− θ1 )1−x + θ 2 (1− θ 2 )1−x =
2 2 1− 1 (θ1 + θ 2 ) if x = 0
 2
Suppose we had an unlimited number of samples and use
nonparametric methods to determine p(x|θ) such that P(x=1|θ)=.6
and P(x=0|θ)=.4:

Try to solve for θ1 and θ2:
1
(θ1 + θ 2 ) = .6 We discover that the mixture distribution is
2 completely unidentifiable. We cannot infer the
 1  individual parameters of θ.
−1− (θ1 + θ 2 ) = .4
 2 
A mixture density, p(x|θ) is identifiable if we can
-1 + θ1 + θ 2 = .2 recover a unique θ such that p(x|θ) ≠ p(x|θ’).
θ1 + θ 2 = 1.2
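A few lines of Python make the unidentifiability concrete: every θ with
θ1 + θ2 = 1.2 produces exactly the same distribution over the binary x.
The parameter values below are arbitrary examples.

def binary_mixture_pmf(x, theta1, theta2):
    """P(x | theta) = 1/2 theta1^x (1-theta1)^(1-x) + 1/2 theta2^x (1-theta2)^(1-x)."""
    return 0.5 * theta1**x * (1 - theta1)**(1 - x) + 0.5 * theta2**x * (1 - theta2)**(1 - x)

# Three different parameter vectors, all with theta1 + theta2 = 1.2 ...
for theta in [(0.9, 0.3), (0.7, 0.5), (0.6, 0.6)]:
    print(theta, binary_mixture_pmf(1, *theta), binary_mixture_pmf(0, *theta))
# ... every one gives P(x=1) = 0.6 and P(x=0) = 0.4, so theta cannot be
# recovered from the data.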
Maximum Likelihood Estimates
The posterior probability becomes

P(\omega_i \mid x_k, \theta) = \frac{p(x_k \mid \omega_i, \theta_i)\, P(\omega_i)}{p(x_k \mid \theta)}        (6)

We make the following assumptions:
• The elements of θi and θj are functionally independent if i ≠ j.
• p(D|θ) is a differentiable function of θ, where D = {x1, …, xn} is a
set of n independently drawn unlabeled samples.

The search for a maximum of p(D|θ) over θ and the P(ωj) is constrained
so that

P(\omega_i) \ge 0, \quad i = 1, \ldots, c, \qquad \sum_{i=1}^{c} P(\omega_i) = 1

Let P̂(ωi) be the maximum likelihood estimate of P(ωi), and let θ̂i be
the maximum likelihood estimate of θi. If P̂(ωi) ≠ 0 for any i, then

\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})        (11)

Maximum Likelihood Estimates
\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})        (11)

The MLE of the prior probability of a category is the average, over the
entire data set, of the estimates derived from each sample (each sample
weighted equally).

\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}        (13)

This is Bayes' theorem. When estimating the probability for ωi, the
numerator depends only on θ̂i and not on the full θ̂.
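The two estimates can be sketched directly in Python. Gaussian
component densities are assumed here for concreteness (Eqs. 11 and 13
apply to any parametric component family), and the function and
variable names are ours.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, priors, means, covs):
    """Eq. 13: P-hat(omega_i | x_k, theta-hat) for every sample x_k and class omega_i."""
    n, c = len(X), len(priors)
    post = np.zeros((n, c))
    for i in range(c):
        post[:, i] = priors[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
    post /= post.sum(axis=1, keepdims=True)   # divide by the mixture density p(x_k | theta)
    return post

def estimate_priors(post):
    """Eq. 11: P-hat(omega_i) is the responsibility averaged over the data set."""
    return post.mean(axis=0)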


Maximum Likelihood Estimates
The gradient must vanish at the value of θi that maximizes the
logarithm of the likelihood, so the MLE θ̂i must satisfy the following
conditions:

\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0, \qquad i = 1, \ldots, c        (12)

Consider one sample, so n = 1. Since we assumed P̂ ≠ 0, Equation 12
reduces to ∇θi ln p(xk|ωi, θ̂i) = 0, i.e., the log-likelihood is
maximized as a function of θi. Because the logarithm is monotonic, the
value of θ̂i that maximizes ln p(·) also maximizes p(·).


Applying MLE to Normal Mixtures
Case 1: The only unknown quantities are the mean vectors µi, so θi
consists of the components of µi.

The log-likelihood of a particular sample is

\ln p(x_k \mid \omega_i, \mu_i) = -\ln\big[(2\pi)^{d/2} |\Sigma_i|^{1/2}\big] - \tfrac{1}{2}(x_k - \mu_i)^{T} \Sigma_i^{-1} (x_k - \mu_i)

and its derivative with respect to µi is

\nabla_{\mu_i} \ln p(x_k \mid \omega_i, \mu_i) = \Sigma_i^{-1} (x_k - \mu_i)

Thus, according to Equation 8 in the book, the MLE estimate µ̂i must
satisfy

\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})\, \Sigma_i^{-1} (x_k - \hat{\mu}_i) = 0

where P̂(ωi|xk, µ̂) is given by Eq. 13 with θ̂ replaced by µ̂.
Applying MLE to Normal Mixtures
If we multiply the above equation by the covariance matrix Σi and
rearrange terms, we obtain the equation for the maximum likelihood
estimate of the mean vector:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})}

However, µ̂i appears implicitly on both sides (through P̂(ωi|xk, µ̂)),
so we still need to compute it explicitly. If we have a good initial
estimate µ̂i(0), we can use a hill-climbing procedure to iteratively
improve the estimates.
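The hill-climbing loop for Case 1 can be sketched as follows, assuming
the covariances and priors are known (as this case requires); the
initialization and iteration count are arbitrary choices for the
illustration.

import numpy as np
from scipy.stats import multivariate_normal

def update_means(X, priors, covs, mu_init, n_iters=50):
    """Case 1 hill climbing: only the mean vectors are unknown."""
    mu = [np.asarray(m, dtype=float) for m in mu_init]
    for _ in range(n_iters):
        # responsibilities P-hat(omega_i | x_k, mu-hat), as in Eq. 13
        post = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, mean=mu[i], cov=covs[i])
            for i in range(len(mu))])
        post /= post.sum(axis=1, keepdims=True)
        # each mean becomes the responsibility-weighted average of the samples
        mu = [post[:, i] @ X / post[:, i].sum() for i in range(len(mu))]
    return mu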
Applying MLE to Normal Mixtures
Case 2: The mean vectors µi, the covariance matrices Σi, and the prior
probabilities P(ωi) are all unknown.

In this case the maximum likelihood principle can yield singular
solutions, which are usually unusable. However, if we restrict our
attention to the largest of the finite local maxima of the likelihood
function, we can still find meaningful results.

Using P̂(ωi), µ̂i, and Σ̂i derived from Equations 11-13, we can evaluate
the likelihood of a sample under class ωi using the normal density

p(x_k \mid \omega_i, \hat{\theta}_i) = (2\pi)^{-d/2}\, |\hat{\Sigma}_i|^{-1/2} \exp\big[-\tfrac{1}{2}(x_k - \hat{\mu}_i)^{T} \hat{\Sigma}_i^{-1} (x_k - \hat{\mu}_i)\big]
Applying MLE to Normal Mixtures
Differentiating the logarithm of the previous density with respect to
the elements of µi and Σi gives

\nabla_{\mu_i} \ln p(x_k \mid \omega_i, \theta_i) = \Sigma_i^{-1} (x_k - \mu_i)

and a corresponding expression for ∂ ln p(xk|ωi, θi)/∂σpq(i), where δpq
is the Kronecker delta, xp(k) is the pth element of xk, µp(i) is the
pth element of µi, and σpq(i) is the pqth element of Σi.
Applying MLE to Normal Mixtures
Using the above derivatives along with Equation 12, we obtain the
following equations (Eqs. 24-26) for the MLEs of P(ωi), µi, and Σi:

\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}

\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, (x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^{T}}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}
Applying MLE to Normal Mixtures
These equations are coupled through

\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}        (27)

To solve for the MLE, we again start with an initial estimate, evaluate
Equation 27, and then use Equations 24-26 to update the estimates,
iterating until convergence.
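The coupled updates of Eqs. 24-27 amount to an EM-style iteration for a
Gaussian mixture. Below is an illustrative Python sketch of that loop;
the random initialization, iteration count, and the small diagonal term
(added to keep the covariance estimates away from the singular
solutions mentioned above) are our own choices, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def fit_normal_mixture(X, c, n_iters=100, seed=0):
    """Iterate Eqs. 24-27: compute responsibilities, then update the priors,
    means, and covariances, and repeat."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    priors = np.full(c, 1.0 / c)
    means = X[rng.choice(n, c, replace=False)].astype(float)   # initial estimates
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(c)])
    for _ in range(n_iters):
        # Eq. 27: responsibilities P-hat(omega_i | x_k, theta-hat)
        post = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
            for i in range(c)])
        post /= post.sum(axis=1, keepdims=True)
        # Eqs. 24-26: updated priors, means, and covariances
        weights = post.sum(axis=0)
        priors = weights / n
        means = (post.T @ X) / weights[:, None]
        covs = np.array([
            ((post[:, i, None] * (X - means[i])).T @ (X - means[i])) / weights[i]
            + 1e-6 * np.eye(d)      # keep estimates away from singular solutions
            for i in range(c)])
    return priors, means, covs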
k-Means Clustering
Clusters numerical data in which each cluster has a center called the
mean.
The number of clusters c is assumed to be fixed.
The goal of the algorithm is to find the c mean vectors µ1, µ2, …, µc.
The number of clusters c
• may be guessed, or
• may be assigned based on the final application.


k-Means Clustering
The following pseudocode shows the basic functionality of the k-Means
algorithm (a runnable sketch follows below):

begin initialize n, c, µ1, µ2, …, µc
    do classify n samples according to nearest µi
       recompute µi
    until no change in µi
    return µ1, µ2, …, µc
end
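A minimal Python sketch of the same loop, assuming Euclidean distance
and random initialization from the data (neither is specified on the
slide); empty clusters are not handled.

import numpy as np

def k_means(X, c, seed=0):
    """Basic k-Means: classify samples by the nearest mean, recompute the
    means, and stop when the assignments no longer change."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), c, replace=False)].astype(float)  # initial mu_1..mu_c
    labels = None
    while True:
        # classify the n samples according to the nearest mu_i (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return means, labels                                   # no change in mu_i
        labels = new_labels
        # recompute each mu_i as the centroid of the samples assigned to it
        means = np.array([X[labels == i].mean(axis=0) for i in range(c)])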
k-Means Clustering
Two-dimensional example with c = 3 clusters:

Shows the initial cluster centers and their associated Voronoi
tessellation.

Each of the three Voronoi cells is used to calculate a new cluster
center.
Fuzzy k-Means
The algorithm assumes that each sample xj has a fuzzy membership in one
or more clusters.
The algorithm seeks a minimum of a heuristic global cost function

J_{fuz} = \sum_{i=1}^{c} \sum_{j=1}^{n} \big[\hat{P}(\omega_i \mid x_j)\big]^{b}\, \|x_j - \mu_i\|^{2}

Where:
• b is a free parameter chosen to adjust the “blending” of clusters.
• b > 1 allows each pattern to belong to multiple clusters (fuzziness).
Fuzzy k-Means
Probabilities of cluster membership for each point are normalized as

\sum_{i=1}^{c} \hat{P}(\omega_i \mid x_j) = 1, \qquad j = 1, \ldots, n        (30)

Cluster centers are calculated using Eq. 32:

\mu_i = \frac{\sum_{j=1}^{n} \big[\hat{P}(\omega_i \mid x_j)\big]^{b}\, x_j}{\sum_{j=1}^{n} \big[\hat{P}(\omega_i \mid x_j)\big]^{b}}        (32)

Where the memberships themselves are recomputed using Eq. 33:

\hat{P}(\omega_i \mid x_j) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{r=1}^{c} (1/d_{rj})^{1/(b-1)}}, \qquad d_{ij} = \|x_j - \mu_i\|^{2}        (33)
Fuzzy k-Means
The following is the pseudocode for the Fuzzy k-Means algorithm (a
Python sketch follows below):

begin initialize n, c, b, µ1, …, µc, P̂(ωi|xj), i = 1,…,c; j = 1,…,n
    normalize P̂(ωi|xj) by Eq. 30
    do recompute µi by Eq. 32
       recompute P̂(ωi|xj) by Eq. 33
    until small change in µi and P̂(ωi|xj)
    return µ1, µ2, …, µc
end
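An illustrative Python version of the same loop, using the forms of
Eqs. 30, 32, and 33 given above; the random initialization, tolerance,
and iteration cap are arbitrary choices of this sketch.

import numpy as np

def fuzzy_k_means(X, c, b=2.0, n_iters=100, tol=1e-5, seed=0):
    """Fuzzy k-Means sketch: alternate the center update (Eq. 32) and the
    membership update (Eq. 33) until the memberships stop changing."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # random memberships, normalized so each sample's memberships sum to 1 (Eq. 30)
    P = rng.random((c, n))
    P /= P.sum(axis=0, keepdims=True)
    for _ in range(n_iters):
        W = P ** b
        mu = (W @ X) / W.sum(axis=1, keepdims=True)                # Eq. 32
        d = ((X[None, :, :] - mu[:, None, :]) ** 2).sum(axis=2)    # d_ij = ||x_j - mu_i||^2
        d = np.maximum(d, 1e-12)                                   # guard against a zero distance
        P_new = (1.0 / d) ** (1.0 / (b - 1.0))
        P_new /= P_new.sum(axis=0, keepdims=True)                  # Eq. 33
        if np.abs(P_new - P).max() < tol:
            return mu, P_new
        P = P_new
    return mu, P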
Fuzzy k-Means
Illustrates the progress of the algorithm:

Means lie near the center of the data during the first iteration,
since each point has non-negligible “membership” in every cluster.

Points near the cluster boundaries can have membership in more than
one cluster.
x-Means
In k-Means the number of clusters is chosen before the algorithm is
applied.

In x-Means the Bayesian information criterion (BIC) is used globally
and locally to find the best number of clusters k.

BIC is used globally to choose the best model the algorithm encounters,
and locally to guide each centroid split.
x-Means
The algorithm is supplied:
• A data set D = {x1, x2, …, xn} containing n objects in d-dimensional
space.
• A set of alternative models Mj = {C1, C2, …, Ck}, which correspond to
solutions with different values of k.
• Posterior probabilities P(Mj | D) are used to score the models.
x-Means
The BIC is defined as

BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log n

Where:
• l̂j(D) is the loglikelihood of D according to the jth model, taken at
the maximum likelihood point.
• pj is the number of parameters in Mj.

Under the identical spherical Gaussian assumption used by x-Means, the
maximum likelihood estimate of the variance is

\hat{\sigma}^{2} = \frac{1}{n - k} \sum_{i=1}^{n} \|x_i - \mu_{(i)}\|^{2}

Where µ(i) is the centroid associated with xi.


x-Means
The point probabilities are

\hat{P}(x_i) = \frac{R_{(i)}}{n} \cdot \frac{1}{(2\pi)^{d/2}\, \hat{\sigma}^{d}} \exp\Big(-\frac{\|x_i - \mu_{(i)}\|^{2}}{2\hat{\sigma}^{2}}\Big)

where R(i) is the number of points assigned to the centroid of xi.
Finally, the loglikelihood of the data is

l(D) = \sum_{i=1}^{n} \log \hat{P}(x_i)


x-Means
Basic functionality of the algorithm (a BIC-scoring sketch follows
below):
• Given a range for k, [kmin, kmax].
• Start with k = kmin.
• Continue to add centroids as needed until kmax is reached.
• Centroids are added by splitting existing centroids in two according
to BIC.
• The centroid set with the best score is used as the final output.
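As a simplified illustration of the scoring step only, the sketch below
runs a flat search over k and keeps the solution with the best BIC
computed from the formulas above; the real x-Means instead grows k by
splitting centroids. The helper names (bic_score, choose_k, cluster_fn)
are ours, and cluster_fn is any function returning (means, labels),
e.g. the k_means sketch shown earlier.

import numpy as np

def bic_score(X, means, labels):
    """BIC under the spherical-Gaussian model above: loglikelihood of the data
    minus (p_j / 2) * log n, with p_j = k*d centroid coordinates plus
    (k - 1) mixing weights plus one shared variance."""
    n, d = X.shape
    k = len(means)
    resid = ((X - means[labels]) ** 2).sum()
    sigma2 = resid / max(n - k, 1)                       # variance MLE
    counts = np.bincount(labels, minlength=k)
    loglik = (np.log(counts[labels] / n).sum()           # sum of log point probabilities
              - 0.5 * n * d * np.log(2 * np.pi * sigma2)
              - resid / (2 * sigma2))
    p_j = k * d + (k - 1) + 1
    return loglik - 0.5 * p_j * np.log(n)

def choose_k(X, k_min, k_max, cluster_fn):
    """Flat search over k: cluster with each k and keep the best BIC score."""
    best = None
    for k in range(k_min, k_max + 1):
        means, labels = cluster_fn(X, k)
        score = bic_score(X, means, labels)
        if best is None or score > best[0]:
            best = (score, k, means)
    return best   # (best BIC, chosen k, its centroids)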
References
Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification, 2nd
ed. John Wiley & Sons, 2001.

Gan, G., Ma, C., and Wu, J. Data Clustering: Theory, Algorithms, and
Applications. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2007.

Samet, H. K-Nearest Neighbor Finding Using MaxNearestDist. IEEE Trans.
Pattern Anal. Mach. Intell. 30, 2 (Feb. 2008), 243-252.

Qiao, Y.-L., Pan, J.-S., and Sun, S.-H. Improved Partial Distance
Search for k Nearest-Neighbor Classification. IEEE International
Conference on Multimedia and Expo, June 2004, 1275-1278.

Ghahramani, Z. Unsupervised Learning. Advanced Lectures on Machine
Learning, 2003, 72-112.
