Week 8. GMM

The document discusses Gaussian mixture models (GMM) for clustering data. GMM extends K-means clustering with a probabilistic model that assumes data points are generated from a mixture of Gaussian distributions with unknown parameters. GMM uses the Expectation-Maximization algorithm to estimate the parameters, assigning soft membership probabilities rather than hard assignments. The document provides examples of using GMM for clustering in Python and notes its advantages over K-means, such as the ability to model non-spherical clusters.

Gaussian Mixture Models

Jasman Pardede
Introduction

 Clustering is an important part of data analysis.
 K-means clustering is the simplest and most easily understood clustering technique. K-means clustering works well for simple data.
 K-means clustering makes hard assignments: each point is assigned to exactly one cluster center. Open problems: how to determine the number of clusters, whether the clusters actually avoid overlapping, and how to handle more dispersed data.
 K-means clustering is non-probabilistic, so it performs poorly in many real-world situations.
 Gaussian mixture models (GMM) are an extension of K-means.
K-Means

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.datasets import make_blobs   # samples_generator was removed in newer scikit-learn
from sklearn.cluster import KMeans

def main():
    X, y_true = make_blobs(n_samples=400, centers=4,
                           cluster_std=0.60, random_state=0)
    X = X[:, ::-1]  # flip axes for better plotting
    print(X)
    kmeans = KMeans(n_clusters=4, random_state=0)
    labels = kmeans.fit(X).predict(X)
    centroids = kmeans.cluster_centers_
    print("Centroid")
    print(centroids)
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
    plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
    plt.show()

if __name__ == "__main__":
    main()

Output:

Centroid
[[ 2.84849883 -1.61366997]
 [ 7.75608144 -1.2689694 ]
 [ 0.83945671  1.95662677]
 [ 4.36874542  0.95041055]]
Plot circular

from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))
K-means (Non-circular ~ poor fit)

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))

def main():
    X, y_true = make_blobs(n_samples=400, centers=4,
                           cluster_std=0.60, random_state=0)
    X = X[:, ::-1]  # flip axes for better plotting

    # stretch the data so the true clusters are no longer circular
    rng = np.random.RandomState(13)
    X_stretched = np.dot(X, rng.randn(2, 2))

    kmeans = KMeans(n_clusters=4, random_state=0)
    plot_kmeans(kmeans, X_stretched)
    plt.show()

if __name__ == "__main__":
    main()
GMM (Gaussian Mixture Models)

 Tries to find the best model of the given input dataset in terms of a mixture of multi-dimensional Gaussian distributions.
 In the simplest case, GMM can be used to find the same clusters as k-means, but with soft membership probabilities rather than hard assignments (see the sketch below).
 GMM finds its best model using Expectation-Maximization (EM).
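Unlike K-means, a fitted mixture model exposes per-point membership probabilities. A minimal sketch using scikit-learn's GaussianMixture (the current name of the estimator; the GMM class used later in these slides is its deprecated predecessor):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)

# Hard labels, comparable to K-means output
labels = gmm.predict(X)

# Soft assignments: one probability per component, each row sums to 1
probs = gmm.predict_proba(X)
print(probs[:5].round(3))   # shape (n_samples, n_components)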
Advantages of EM

 The EM algorithm is numerically stable: the log-likelihood increases at every iteration.
 Under general conditions, the EM algorithm converges to a reliable value. Starting from an arbitrary value θ(0), it almost always converges to a local maximizer, unless the initial θ(0) is chosen badly.
 The EM algorithm tends to be easy to apply, since it relies on complete-data computations.
 The EM algorithm is easy to program, since it involves neither integrals nor derivatives of the likelihood.
 The EM algorithm needs only a little disk space and memory, since it uses neither a matrix nor its inverse at each iteration.
 Analysis is easier than with other methods.
 By watching the monotone increase of the likelihood over the iterations, it is easy to monitor convergence and to catch programming errors (see the sketch after this list).
 It can be used to estimate the values of missing data.
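The monotonicity property can be checked empirically. A hedged sketch that logs the mean log-likelihood after each EM iteration by driving GaussianMixture one step at a time with warm_start (an illustrative pattern, not from the slides):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

# With warm_start=True, each fit() continues from the previous parameters,
# so repeated fit() calls with max_iter=1 run EM one iteration at a time.
gmm = GaussianMixture(n_components=4, max_iter=1, warm_start=True,
                      init_params='random', random_state=0)

log_liks = []
for _ in range(20):
    gmm.fit(X)                     # one EM iteration (may warn about convergence)
    log_liks.append(gmm.score(X))  # mean per-sample log-likelihood

# The sequence should be non-decreasing, as the theory promises.
print(np.round(log_liks, 4))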
Disadvantages of EM

 It does not provide a procedure for estimating the covariance matrix of the parameter estimates.
 The EM algorithm may converge slowly when there is too much incomplete information.
 The EM algorithm does not guarantee convergence to a global maximum when multiple maxima exist.
 In some problems, the E-step may be analytically intractable.

https://zhiyzuo.github.io/EM/
GMM Steps

 Guess the locations and shapes of the appropriate clusters.
 Repeat the E-step and M-step until the fit converges, as in the sketch below.
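To make the two steps concrete, here is a minimal, didactic EM loop for a one-dimensional, two-component mixture (an illustrative sketch, not the library implementation):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])

# Initial guesses for the weights, means, and variances
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of each component for each point
    dens = w * gauss_pdf(x[:, None], mu, var)      # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(np.round(w, 3), np.round(mu, 3), np.round(var, 3))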
GMM Cluster

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture   # named GMM in old scikit-learn versions
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    sns.set()
    print("GMM example")
    X, y_true = make_blobs(n_samples=400, centers=4,
                           cluster_std=0.60, random_state=0)
    X = X[:, ::-1]  # flip axes for better plotting
    gmm = GaussianMixture(n_components=4).fit(X)
    labels = gmm.predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
    plt.show()

if __name__ == "__main__":
    main()
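One advantage of the probabilistic formulation is a principled answer to the earlier question of how to choose the number of clusters: fit models with different n_components and compare an information criterion. A minimal sketch using GaussianMixture's built-in bic (this example goes beyond the slides):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

# Lower BIC is better; the minimum suggests a reasonable component count.
n_components = range(1, 9)
bics = [GaussianMixture(n_components=n, random_state=0).fit(X).bic(X)
        for n in n_components]
print(dict(zip(n_components, np.round(bics, 1))))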
GMM plot

gmm = GaussianMixture(n_components=4, covariance_type='full',
                      random_state=42)
plot_gmm(gmm, X_stretched)

https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html
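The plot_gmm helper is not defined in these slides; the following sketch, adapted from the handbook linked above, draws confidence ellipses from each component's mean and covariance (it assumes covariance_type='full', so each covariance is a full 2x2 matrix):

from matplotlib.patches import Ellipse
import matplotlib.pyplot as plt
import numpy as np

def draw_ellipse(position, covariance, ax=None, **kwargs):
    """Draw an ellipse at a given position with a given covariance."""
    ax = ax or plt.gca()
    # Convert the covariance matrix to principal axes
    U, s, Vt = np.linalg.svd(covariance)
    angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
    width, height = 2 * np.sqrt(s)
    # Draw the 1-sigma, 2-sigma, and 3-sigma contours
    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                             angle=angle, **kwargs))

def plot_gmm(gmm, X, ax=None):
    ax = ax or plt.gca()
    labels = gmm.fit(X).predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    ax.axis('equal')
    # Make each ellipse's opacity proportional to its component weight
    w_factor = 0.2 / gmm.weights_.max()
    for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
        draw_ellipse(pos, covar, ax=ax, alpha=w * w_factor, zorder=1)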
