Unsupervised Learning and Clustering
Jeff Robble, Brian Renzenbrink, Doug Roberts
Unsupervised Procedures
A procedure that uses unlabeled data in its classification process.
Why would we use these?
Collecting and labeling large data sets can be costly
For example, we may have p(x|ωj) ~ N(µj, Σj), where N is a normal
(Gaussian) distribution and θj consists of the components µj and Σj
that characterize the mean and covariance of the distribution.
P(x \mid \theta) = \frac{1}{2}\theta_1^{x}(1-\theta_1)^{1-x} + \frac{1}{2}\theta_2^{x}(1-\theta_2)^{1-x} = \begin{cases} \frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 1 \\ 1 - \frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 0 \end{cases}
Suppose we have an unlimited number of samples and use
nonparametric methods to determine p(x|θ), finding P(x=1|θ) = 0.6
and P(x=0|θ) = 0.4.
Try to solve for θ1 and θ2:
\frac{1}{2}(\theta_1 + \theta_2) = 0.6

1 - \frac{1}{2}(\theta_1 + \theta_2) = 0.4

Subtracting the second equation from the first gives

-1 + \theta_1 + \theta_2 = 0.2, \qquad \text{so} \qquad \theta_1 + \theta_2 = 1.2

We discover that the mixture distribution is completely unidentifiable:
both equations determine only the sum θ1 + θ2, so we cannot infer the
individual parameters of θ.

A mixture density p(x|θ) is identifiable if distinct parameters yield
distinct distributions, i.e., if θ ≠ θ′ implies p(x|θ) ≠ p(x|θ′), so that
a unique θ can be recovered.
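A quick numerical check of this unidentifiability, as a minimal Python sketch (the particular parameter pairs are arbitrary choices of ours):

```python
# P(x = 1 | theta) for the two-component Bernoulli mixture above:
# each component is weighted by 1/2, so only theta1 + theta2 matters.
def p_x_equals_1(theta1, theta2):
    return 0.5 * (theta1 + theta2)

# Two different parameter vectors with theta1 + theta2 = 1.2
# induce exactly the same distribution over x.
print(p_x_equals_1(0.8, 0.4))  # 0.6
print(p_x_equals_1(0.3, 0.9))  # 0.6 -- same p(x | theta), different theta
```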
Maximum Likelihood Estimates
The posterior probability becomes:

P(\omega_i \mid x_k, \theta) = \frac{p(x_k \mid \omega_i, \theta_i)\, P(\omega_i)}{p(x_k \mid \theta)} \qquad (6)
We make the following assumptions:
The elements of θi and θj are functionally independent if i ≠ j.
p(D|θ) is a differentiable function of θ, where D = {x1, …, xn} is a set
of n independently drawn unlabeled samples.
The search for a maximum value of p(D|θ) extending over θ and P(ωj)
is constrained so that:

P(\omega_i) \geq 0, \quad i = 1, \dots, c \qquad \text{and} \qquad \sum_{i=1}^{c} P(\omega_i) = 1
The MLE of the probability of a category is the average over the entire
data set of the estimate derived from each sample (weighted equally)
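In symbols (a reconstruction from the surrounding notation):

\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})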
\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)} \qquad (13)
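A small Python sketch of the normalization in Equation 13 (the density and prior values below are made up for illustration):

```python
import numpy as np

# Made-up class-conditional densities p(x_k | w_i, theta_i) for one
# sample x_k over c = 3 classes, and made-up priors P(w_i).
densities = np.array([0.05, 0.20, 0.10])
priors = np.array([0.5, 0.3, 0.2])

# Equation 13: multiply density by prior, then normalize over classes.
joint = densities * priors
posteriors = joint / joint.sum()
print(posteriors, posteriors.sum())  # posteriors sum to 1
```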
Maximum Likelihood Estimates
The gradient must vanish at the value of θi that maximizes the logarithm of the
likelihood, so the MLE θ̂i must satisfy the following conditions:

\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0, \quad i = 1, \dots, c \qquad (12)
Applying MLE to Normal Mixtures
Case 1: The only unknown quantities are the mean vectors µi, so θ
consists of the components of µ1, …, µc.
The likelihood of a particular sample is

\ln p(x \mid \omega_i, \mu_i) = -\ln\left[(2\pi)^{d/2} \lvert\Sigma_i\rvert^{1/2}\right] - \frac{1}{2}(x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i)

where its gradient with respect to µi is

\nabla_{\mu_i} \ln p(x \mid \omega_i, \mu_i) = \Sigma_i^{-1} (x - \mu_i)
Applying MLE to Normal Mixtures
If we multiply the above equation by the covariance matrix Σi
and rearrange terms, we obtain the equation for the
maximum likelihood estimate of the mean vector:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})}

and

\hat{P}(\omega_i \mid x_k, \hat{\mu}) = \frac{p(x_k \mid \omega_i, \hat{\mu}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\mu}_j)\, \hat{P}(\omega_j)}
To solve the equation for the MLE, we should again start with
an initial estimate, use it to evaluate Equation 27, and use
Equations 24-26 to update these estimates.
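A minimal Python sketch of this iterative scheme, simplified by assuming identity covariance matrices and equal priors P(ωi) = 1/c (the function name and those simplifications are our assumptions, not from the slides):

```python
import numpy as np

def mle_means(X, mu, n_iters=50):
    """Iteratively re-estimate mixture means for the unknown-means case.

    X:  (n, d) array of unlabeled samples.
    mu: (c, d) array of initial mean estimates.
    """
    for _ in range(n_iters):
        # log N(x_k | mu_i, I) up to an additive constant shared by all i.
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = -0.5 * sq_dist
        # Posterior P(w_i | x_k, mu): with equal priors, normalize over i.
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)      # shape (n, c)
        # Update each mean as a posterior-weighted average of the samples.
        mu = (post.T @ X) / post.sum(axis=0)[:, None]
    return mu
```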
k-Means Clustering
Clusters numerical data in which each cluster has a center
called the mean
The number of clusters c is assumed to be fixed
The goal of the algorithm is to find the c mean vectors µ1,
µ2, …, µc
The number of clusters c may be guessed.
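A minimal Python sketch of the k-Means loop just described (the random initialization and the convergence test are our choices):

```python
import numpy as np

def k_means(X, c, n_iters=100, seed=0):
    """Cluster (n, d) data X around c mean vectors."""
    rng = np.random.default_rng(seed)
    # Initialize the means with c distinct random samples.
    mu = X[rng.choice(len(X), size=c, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assign each sample to its nearest mean.
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each mean from its assigned samples
        # (keep the old mean if a cluster goes empty).
        new_mu = np.array([X[labels == i].mean(axis=0)
                           if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):
            break  # means stopped moving
        mu = new_mu
    return mu, labels
```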
Fuzzy k-Means
Fuzzy k-Means minimizes a criterion function in which each pattern may
belong partially to several clusters:

J_{fuz} = \sum_{i=1}^{c} \sum_{j=1}^{n} \left[\hat{P}(\omega_i \mid x_j)\right]^{b} \lVert x_j - \mu_i \rVert^{2}

Where:
b is a free parameter chosen to adjust the “blending” of clusters
b > 1 allows each pattern to belong to multiple clusters (fuzziness)
Fuzzy k-Means
Probabilities of cluster membership for each point are normalized
as

\hat{P}(\omega_i \mid x_j) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{r=1}^{c} (1/d_{rj})^{1/(b-1)}}

Where:
d_{ij} = \lVert x_j - \mu_i \rVert^{2} is the squared distance between pattern x_j and cluster mean µ_i
Fuzzy k-Means
The following is the pseudo code for the Fuzzy k-Means algorithm
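Rendered as a minimal Python sketch using the membership and mean updates above (the default fuzziness b = 2 and the stopping test are our choices):

```python
import numpy as np

def fuzzy_k_means(X, c, b=2.0, n_iters=100, eps=1e-9, seed=0):
    """Fuzzy k-Means: soft cluster memberships with fuzziness b > 1."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iters):
        # Squared distances d_ij = ||x_j - mu_i||^2, shape (n, c);
        # eps guards against division by zero at an exact mean.
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) + eps
        # Normalized memberships: (1/d_ij)^(1/(b-1)) over the row sum.
        inv = d ** (-1.0 / (b - 1.0))
        P = inv / inv.sum(axis=1, keepdims=True)
        # Mean update weighted by memberships raised to the power b.
        W = P ** b
        new_mu = (W.T @ X) / W.sum(axis=0)[:, None]
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, P
```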
Candidate models can be compared with a Bayesian information criterion
style score, which approximates the evidence for the jth model Mj as

\ln p(D \mid M_j) \approx L_j(D) - \frac{p_j}{2} \ln n

Where:
L_j(D) is the loglikelihood of D according to the jth model, taken at
the maximum likelihood point
p_j is the number of parameters in M_j
n is the number of samples
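A tiny Python sketch of applying this score to choose a model (the log-likelihoods and parameter counts are invented for illustration):

```python
import numpy as np

# Invented values: maximized log-likelihoods L_j(D) and parameter
# counts p_j for three candidate models fit to n samples.
log_liks = np.array([-1210.0, -1165.0, -1158.0])
n_params = np.array([4, 8, 12])
n = 500

# Score each model by L_j(D) - (p_j / 2) * ln(n) and keep the largest.
scores = log_liks - 0.5 * n_params * np.log(n)
print(scores, "-> choose model", scores.argmax())
```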
The maximum likelihood estimate is