
Chapter 9.

Unsupervised Learning
Although most applications of machine learning today are based on supervised learning, the vast majority of the available data is unlabeled: we have the inputs X, but we don't have the labels y.

In this chapter, we will look at two unsupervised learning tasks:

• Clustering: the goal is to group similar instances together into clusters. Clustering is a
great tool for data analysis, customer segmentation, recommender systems, search
engines, image segmentation, semi-supervised learning, and more.
• Anomaly Detection: the task of estimating the probability density function of the
random process that generated the dataset. In anomaly detection, instances located in
very low-density regions are likely to be anomalies.

Clustering
Clustering is the task of identifying groups containing similar objects. Applications of clustering:

• Image similarity search: cluster the available images; when a user provides a new
image, assign it to a cluster with the same algorithm and return the other images from that cluster.
• Image segmentation: cluster pixels according to their color, then replace each pixel's
color with the mean color of its cluster.
• Data Analysis: it is often useful to cluster the instances and analyze each cluster separately.
• Dimensionality Reduction: replace each instance's features with its affinities to the
clusters.
• Anomaly Detection: any instance that has a low affinity to all clusters is likely to be an
outlier.
• Semi-supervised Learning: If we have a few labels, we can perform clustering and
propagate the available labels to other instances within the clusters.

Let's consider the example of unsupervised clustering of the Iris dataset. If we remove its labels,
we can't use classification algorithms. A clustering algorithm instead uses all the available
features to locate clusters and assigns every instance to one of them. A trained Gaussian
Mixture Model misassigns only 5 of the 150 data points.

There are different types of clustering algorithms and there isn't a universal definition of what a
cluster is.

In this section, we will look at two widely used unsupervised learning clustering algorithms, the
first one is KMeans and the second is DBSCAN.

K-Means
Let's start by making some data:
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

X, y = datasets.make_blobs(n_samples=1000, n_features=2, centers=5,
                           cluster_std=[0.5, 0.5, 0.5, 1, 1])

plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], s=2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

Let's train a k-means clusterer on this dataset.


from sklearn.cluster import KMeans

k = 5

kmeans = KMeans(n_clusters=k)

y_pred = kmeans.fit_predict(X)

Note that you have to specify the number of clusters to be found.

y_pred is kmeans.labels_

True

We can also take a look at the five centroids the algorithm found:

kmeans.cluster_centers_

array([[ 2.52199713,  9.1888182 ],
       [ 8.34208939, -7.04883137],
       [ 9.95457856,  9.42558685],
       [-0.8520734 ,  7.28411512],
       [ 4.35515004, -6.69262124]])

We can easily predict the cluster of new instances:

X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])

kmeans.predict(X_new)

array([3, 3, 3, 3], dtype=int32)

By plotting the algorithm's decision boundaries, we get a Voronoi tessellation:

# Plotting decision regions


x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
np.arange(y_min, y_max, 0.1))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 8))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=20, edgecolor='k')
plt.show()
The vast majority of the instances were clearly assigned to their original cluster.

All KMeans cares about is the distance between instances and the centroids. Instead of assigning
each instance to a single cluster (hard clustering), it can be better to give each instance a per-cluster
score (soft clustering). The score can be the distance between the instance and each centroid
(which can also serve as a dimensionality reduction technique).

In sklearn, the transform method measures the distance between each instance and the
centroids.

kmeans.transform(X_new)

array([[ 7.61837099, 12.30738821, 12.4190569 ,  5.35237346,  9.72260232],
       [ 7.20469249, 10.5080573 , 10.17376543,  6.53913925,  8.79761875],
       [ 8.29421021, 15.15328359, 14.46061105,  4.7924139 , 12.16738011],
       [ 8.67368095, 14.82643491, 14.68961738,  5.24417259, 11.77295704]])

KMeans Algorithm
Let's suppose we were given the centroids. We could easily label all the instances in the dataset
by assigning each of them to the cluster of the closest centroid. Conversely, if we were given
all the instance labels, we could easily locate the centroids by computing the mean of the
instances within each cluster. But we are given neither the labels nor the centroids, so how can
we proceed?

Just pick the centroids randomly: pick k instances at random and use their locations as the
initial centroids. Then label the instances, update the centroids, label the instances again,
update the centroids again, and so on until the centroids stop moving.

The algorithm is guaranteed to converge in a finite number of steps (usually quite small); it will
not oscillate forever.
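
To make this loop concrete, here is a minimal NumPy sketch of the procedure (purely illustrative, not scikit-learn's implementation; it assumes no cluster ever becomes empty):

import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=None):
    rng = np.random.default_rng(seed)
    # Pick k instances at random and use their locations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Label each instance with the index of its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the instances assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels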

The computational complexity of the algorithm is generally linear with regard to the number of
instances m, the number of clusters k, and the number of dimensions n. However, this is only
true when the data has a clustering structure; the worst-case complexity is exponential when the
instances are not structured in clusters.

KMeans is generally one of the fastest clustering algorithms. Even though the algorithm is
guaranteed to converge, it may not converge to the right solution (it may converge to a local
optimum), this depends on the initialization step.

Let's look at how we can mitigate this:

Centroid Initialization Methods


If you happen to know approximately where the centroids should be (using DBSCAN for
example), then you can initialize KMeans properly:

good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])

kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)

Another solution is to run the algorithm multiple times with different random initializations and
keep the best solution; this is controlled by the n_init hyperparameter. Scikit-learn will
keep the best solution for you by running the algorithm n_init times. To decide which solution
is best, it uses a performance metric called the inertia, which is the mean squared distance
between each instance and its closest centroid.
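
As a quick check, the fitted estimator exposes this metric directly; here is a small sketch using the kmeans model and the blob data X from above (note that scikit-learn's inertia_ attribute stores the sum, not the mean, of the squared distances):

kmeans.fit(X)

# Sum of squared distances between each instance and its closest centroid.
kmeans.inertia_

# Equivalent manual computation from the distances returned by transform():
d = kmeans.transform(X)           # distance of every instance to every centroid
(np.min(d, axis=1) ** 2).sum()    # should match kmeans.inertia_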

An important improvement to the KMeans algorithm is K-Means++, which introduces a smarter
initialization strategy that tends to select centroids that are distant from one another. This
improvement makes the algorithm much less likely to converge to a suboptimal solution. The
extra computation is well worth it, since it allows a much smaller n_init value.

Here are the K-Means++ initialization steps:

1. Take one centroid $c^{(1)}$, chosen uniformly at random from the dataset.
2. Take a new centroid $c^{(i)}$, choosing an instance $x^{(i)}$ with probability $D(x^{(i)})^2 / \sum_{j=1}^{m} D(x^{(j)})^2$, where $D(x^{(i)})$ is the distance between $x^{(i)}$ and the closest centroid that was already chosen.
3. Repeat the previous step until all k centroids have been chosen.
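
For illustration only, here is a minimal NumPy sketch of this initialization (scikit-learn already uses it by default via init='k-means++'):

import numpy as np

def kmeans_plus_plus_init(X, k, seed=None):
    rng = np.random.default_rng(seed)
    # Step 1: the first centroid is chosen uniformly at random from the dataset.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # D(x)^2: squared distance from each instance to its closest chosen centroid.
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                   axis=2) ** 2, axis=1)
        # Step 2: pick the next centroid with probability proportional to D(x)^2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)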

Accelerated K-Means and Mini-Batch K-Means

Another important improvement to the K-Means algorithm was proposed that considerably
accelerates it by avoiding many unnecessary distance calculations. This is achieved by
exploiting the triangle inequality (a straight line is always the shortest distance between
two points): the algorithm keeps track of lower and upper bounds for the distances between
instances and centroids. This improvement is also implemented in sklearn.

Another paper proposed using mini-batches instead of the whole dataset at each iteration. This
speeds up the algorithm, typically by a factor of three or four, and makes it possible to cluster
huge datasets that don't fit into memory.

Let's use it in sklearn:

from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)

minibatch_kmeans.fit(X)

MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
                init_size=None, max_iter=100, max_no_improvement=10,
                n_clusters=5, n_init=3, random_state=None,
                reassignment_ratio=0.01, tol=0.0, verbose=0)

If our dataset can't fit in memory, we can use memmap with the partial_fit() method.
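
A minimal sketch of that idea (the file name, shape, and batch size below are made up for this example):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Assume the data was previously saved to disk as a raw float32 array of shape
# (n_samples, n_features); the file name here is hypothetical.
X_mm = np.memmap("blobs.dat", dtype="float32", mode="r", shape=(1000, 2))

mbk = MiniBatchKMeans(n_clusters=5)
batch_size = 100
for start in range(0, len(X_mm), batch_size):
    # Each call updates the centroids using only this mini-batch.
    mbk.partial_fit(X_mm[start:start + batch_size])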

The advantage of MiniBatchKMeans becomes significant when we use a large number of clusters k:
mini-batching becomes much faster while the performance stays roughly the same:

Finding the Optimal Number of Clusters


We might think that we can just pick the model with the lowest inertia. This poses a
problem, because increasing k will always give us a lower inertia.

Let's visualize the inertia as a function of k:
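
A minimal sketch to produce this curve, refitting KMeans on the blob data X for each k:

inertias = []
ks = range(1, 10)
for k in ks:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)   # always decreases as k grows

plt.plot(list(ks), inertias, "o-")
plt.xlabel("Number of Clusters k")
plt.ylabel("Inertia")
plt.show()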


As we can see, the inertia drops quickly as k goes from 3 to 4, but then it decreases much more
slowly as we continue to increase k. This curve has roughly the shape of an arm, with an
elbow at k = 4, so we pick k = 4.

This method of choosing the optimal number of clusters is rather coarse. A more precise but
computationally expensive approach is to use the silhouette score, which is the mean silhouette
coefficient over all the instances.

An instance's silhouette coefficient is equal to $(b - a) / \max(a, b) \in [-1, 1]$, where:
• a : mean distance to other instances in the same cluster.
• b : mean distance to instances in the next closest cluster.
• +1 means the instance is well inside its own cluster and far from other clusters.
• 0 means the instance is sitting on the edge between two clusters.
• −1 means the instance may have been assigned to the wrong cluster.
The silhouette score thus measures how dense and well separated the clusters are.

Let's compute it using sklearn:

from sklearn.metrics import silhouette_score

silhouette_score(X, minibatch_kmeans.labels_)

0.7217271740816389

Let's compare the silhouette score for different numbers of clusters:

scores = list()
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    scores.append(silhouette_score(X, kmeans.labels_))
    del kmeans

plt.plot(list(range(2, 10)), scores)


plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()
This is a better visualization because it makes it easier to compare different numbers of clusters.
An even more informative visualization, called a silhouette diagram, is obtained by plotting every
instance's silhouette coefficient, sorted by the cluster it is assigned to and by the value of the coefficient.

The dashed line indicates the mean silhouette coefficient:
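
A minimal sketch of such a silhouette diagram for one value of k, using silhouette_samples:

from sklearn.metrics import silhouette_samples

k = 5
kmeans = KMeans(n_clusters=k).fit(X)
coeffs = silhouette_samples(X, kmeans.labels_)

plt.figure(figsize=(8, 6))
pos = 0
for cluster in range(k):
    # Sort the coefficients of the instances assigned to this cluster.
    c = np.sort(coeffs[kmeans.labels_ == cluster])
    plt.fill_betweenx(np.arange(pos, pos + len(c)), 0, c)
    pos += len(c) + 10                       # small gap between clusters
plt.axvline(silhouette_score(X, kmeans.labels_), color="k", linestyle="--")
plt.xlabel("Silhouette Coefficient")
plt.ylabel("Instances (grouped by cluster)")
plt.show()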

When most of the instances in a cluster have a lower coefficient than this mean score, the cluster
is rather bad, since its instances are much too close to other clusters. Here both k = 4 and k = 5
look fine, and it seems like a good idea to use k = 5 to get clusters of similar sizes.

Limits of K-Means
KMeans is not perfect, so it is necessary to run the algorithm multiple times to avoid suboptimal
solutions.

Another limitation of the algorithm is that we need to specify the number of clusters.

KMeans also doesn't behave very well when the clusters have varying sizes, different densities,
or non-spherical shapes.

Depending on the data, different clustering algorithms may perform better (like DBSCAN or
Gaussian Mixtures).

Scaling the inputs is also a must with KMeans.

Let's look at a few ways we can benefit from clustering:


Using Clustering for Image Segmentation
Image segmentation is the task of partitioning an image into multiple segments. In semantic
segmentation, all pixels that are part of the same object type get assigned to the same segment.

In instance segmentation, all pixels that are part of the same individual object get assigned to the
same segment. The state of the art in semantic or instance segmentation today is achieved using
complex architectures based on convolutional neural networks.

Here, we are going to do something much simpler: color segmentation. We will simply assign
pixels to the same segment if they have a similar color. For some applications this is enough;
for example, if we want to assess the forest cover in a satellite image, color segmentation may be sufficient.

Let's do it:

from matplotlib.image import imread


import os

image = imread(fname=os.path.join("static", "imgs", "butterfly.jpg"))

image.shape

(850, 1280, 3)

plt.figure(figsize=(12, 8))
plt.imshow(image.astype(int))
plt.axis('off')
plt.show()
X = image.reshape((-1, 3))
X.shape

(1088000, 3)

kmeans = KMeans(n_clusters=8).fit(X)

segmented_image = kmeans.cluster_centers_[kmeans.labels_]

segmented_image = segmented_image.reshape(image.shape)

plt.figure(figsize=(12, 8))
plt.imshow(segmented_image.astype(int))
plt.axis('off')
plt.show()
Pixels that are far apart in the image may end up in the same cluster because KMeans is not aware
of spatial position: its inputs are 3D RGB points, and it clusters them purely according to how
close their colors are.

Now let's take a look at using clustering for preprocessing:

Using Clustering for Preprocessing


Clustering can be an efficient approach to dimensionality reduction, in particular as a
preprocessing step before a supervised learning algorithm.

As an example of using clustering for dimensionality reduction, let's tackle the small-digits
dataset:

from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
X.shape, y.shape

((1797, 64), (1797,))

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)


X_train.shape, y_train.shape, X_test.shape, y_test.shape
((1347, 64), (1347,), (450, 64), (450,))

Next, fit a logistic regression model:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='liblinear', multi_class='auto')

log_reg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

log_reg.score(X_test, y_test)

0.9711111111111111

That's our baseline: about 97.1% accuracy.

We create a pipeline that will first cluster the training set into 50 clusters and replace the
images with their distances to these 50 clusters, then apply a logistic regression model:

from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
("kmeans", KMeans(n_clusters=50)),
("log_reg", LogisticRegression(solver='liblinear',
multi_class='auto'))
])

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('kmeans',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=50, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001, verbose=0)),
                ('log_reg',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='liblinear', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

pipeline.score(X_test, y_test)

0.9777777777777777

By reducing the dimensionality of the input, we removed much of the noise, and the instances
became easier for the logistic regression model to recognize. However, we chose the number of
clusters arbitrarily; we can surely do better.

We can use GridSearchCV to find the optimal number of clusters based on the final scoring by
Logistic Regression:

from sklearn.model_selection import GridSearchCV

param_dict = dict(kmeans__n_clusters=range(75,125))

grid_clf = GridSearchCV(pipeline, param_dict, cv=3, verbose=2)

grid_clf.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits

[CV] kmeans__n_clusters=75 ...........................................
[CV] ............................ kmeans__n_clusters=75, total=   0.7s
[CV] kmeans__n_clusters=75 ...........................................
[CV] ............................ kmeans__n_clusters=75, total=   0.6s
...
[CV] kmeans__n_clusters=124 ..........................................
[CV] ........................... kmeans__n_clusters=124, total=   0.9s

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 2.3min finished

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('kmeans',
                                        KMeans(algorithm='auto', copy_x=True,
                                               init='k-means++', max_iter=300,
                                               n_clusters=50, n_init=10,
                                               n_jobs=None,
                                               precompute_distances='auto',
                                               random_state=None, tol=0.0001,
                                               verbose=0)),
                                       ('log_reg',
                                        LogisticRegression(C=1.0, class_weight=None,
                                                           dual=False, fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None, max_iter=100,
                                                           multi_class='auto', n_jobs=None,
                                                           penalty='l2', random_state=None,
                                                           solver='liblinear', tol=0.0001,
                                                           verbose=0, warm_start=False))],
                                verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'kmeans__n_clusters': range(75, 125)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)

grid_clf.best_params_

{'kmeans__n_clusters': 77}

grid_clf.score(X_test, y_test)

0.9888888888888889
With k = 77, we got a significant accuracy boost just by reducing the dimensionality of the
dataset with unsupervised clustering before training the logistic regression model.

Using Clustering for Semi-Supervised Learning


Semi-supervised learning is the setting where we have plenty of unlabeled instances and very few labeled instances.

Let's train a logistic regression model on a sample of 50 labeled instances from the digits
dataset:

n_labeled = 50

log_reg = LogisticRegression(solver='liblinear', multi_class='auto')

log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

log_reg.score(X_test, y_test)

0.8644444444444445

It should come as no surprise that this is much lower than earlier, because we only used 50
training instances.

Let's cluster the training set into 50 clusters, then for each cluster, let's find the image closest to
the centroid.

We will call these images the representative images:

k = 50

kmeans = KMeans(n_clusters=k)

X_digits_dist = kmeans.fit_transform(X_train)

import numpy as np

representative_digit_idx = np.argmin(X_digits_dist, axis=0)

X_representative_digits = X_train[representative_digit_idx]

Let's look at each image and manually label it:

# In a real project we would look at each representative image and label it by
# hand; here we simply reuse the known labels for convenience.
y_representative_digits = y_train[representative_digit_idx].copy()
Now we have a dataset with just 50 labeled instances, but instead of being random instances,
each of them is a representative image of its cluster.

Let's see if the performance is any better:

log_reg = LogisticRegression(solver='liblinear', multi_class='auto')

log_reg.fit(X_representative_digits, y_representative_digits)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

log_reg.score(X_test, y_test)

0.9088888888888889

We've made a big jump in performance even though we are still training on the same number
of data points. We achieved this by manually labeling representative instances, which we
obtained by running unsupervised clustering with k clusters and taking the instance closest to
each of the k centroids.

However, what if we propagated the labels to all the other instances in the same cluster? This is
called label propagation:

y_train_propagated = np.empty(len(X_train), dtype=np.int32)

for i in range(k):
    y_train_propagated[kmeans.labels_ == i] = y_representative_digits[i]

Let's train the model and look at its performance:

log_reg = LogisticRegression(solver='liblinear', multi_class='auto')

log_reg.fit(X_train, y_train_propagated)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

log_reg.score(X_test, y_test)
0.9355555555555556

We've got a reasonable boost, but nothing really astounding.

The problem is that we propagated each representative instance's label to all the instances in
the same cluster, including the instances located close to the cluster boundaries, which are
more likely to be mislabeled.

Let's see what happens if we only propagate the labels to the 20% of instances that are closest
to the centroids:

percentile_closest = 20.

X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]

for i in range(k):
    in_cluster = (kmeans.labels_ == i)
    cluster_dist = X_cluster_dist[in_cluster]
    cutoff_distance = np.percentile(cluster_dist, percentile_closest)
    above_cutoff = (X_cluster_dist > cutoff_distance)
    X_cluster_dist[in_cluster & above_cutoff] = -1

partially_propagated = (X_cluster_dist != -1)

X_train_partially_propagated = X_train[partially_propagated]

y_train_partially_propagated = y_train[partially_propagated]

Now let's train the model on this partially propagated dataset:

log_reg = LogisticRegression(solver='liblinear', multi_class='auto')

log_reg.fit(X_train_partially_propagated,
y_train_partially_propagated)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

log_reg.score(X_test, y_test)

0.9333333333333333

To continue improving our model and training set, the next step could be to do a few rounds of
active learning, which is when a human expert interacts with the learning algorithm, providing
labels for specific instances when the algorithm requests them.
There are many strategies for active learning, one of the most common ones is called
uncertainty sampling. Here is how uncertainty sampling works:

1. The model is trained on the labeled instances gathered so far, and this model is used to make
predictions on all of the unlabeled instances.
2. The instances for which the model is most uncertain are given to the expert to be labeled.
3. We iterate this process until the increase in performance is no longer worth the labeling effort.
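
A minimal sketch of a single uncertainty-sampling round, assuming log_reg has been trained on the labeled pool and X_unlabeled holds the remaining instances (both names are illustrative):

import numpy as np

# Probability of the most likely class for each unlabeled instance;
# a low maximum probability means the model is uncertain about that instance.
proba = log_reg.predict_proba(X_unlabeled)
uncertainty = 1 - proba.max(axis=1)

# Send the 10 most uncertain instances to a human expert for labeling.
query_idx = np.argsort(uncertainty)[-10:]
X_to_label = X_unlabeled[query_idx]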

Other strategies include labeling the instances that would result in the largest model change, or
the largest drop in validation error, or instances that different models disagree on.

Before moving on to Gaussian mixture models, we will take a look at DBSCAN, which uses local
density estimation to identify clusters of arbitrary shapes.

DBSCAN
This algorithm defines clusters as continuous regions of high density, here is how it works:

For each instance, we count how many instances are located within a small distance ϵ from it
(This region is called the ϵ -neighborhood). If an instance has ≥ min_samples instances in its ϵ -
neighborhood, then it is considered a core instance. Core instances are those that are located in
dense regions.

All instances in the neighborhood of a core instance belong to the same cluster. This
neighborhood may include other core instances. Therefore, a long sequence of neighboring core
instances forms a single cluster.

Any instance that is not a core instance and does not have one in its neighborhood is considered
an anomaly.

DBSCAN works well if all the clusters are dense enough and if they are well separated by low-
density regions.

Let's use it:

from sklearn.cluster import DBSCAN


from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05)
X.shape, y.shape

((1000, 2), (1000,))

dbscan = DBSCAN(eps=0.05, min_samples=5)

dbscan.fit(X)

DBSCAN(algorithm='auto', eps=0.05, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=5, n_jobs=None, p=None)

We access the labels as follows:


dbscan.labels_[:5]

array([0, 0, 1, 0, 2])

Instances with label -1 are considered anomalies by the algorithm.

We can also access core instances like this:

len(dbscan.core_sample_indices_)

792

# actual core instances coords


dbscan.components_

array([[ 1.97735134,  0.16746005],
       [ 1.73191549, -0.24587221],
       [ 0.04933691,  0.09386341],
       ...,
       [ 0.9702814 ,  0.18676075],
       [-0.77970719,  0.60591176],
       [ 0.39840368,  0.90286737]])

Let's plot our clusters:

plt.figure(figsize=(12, 8))
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.axis('off')
plt.show()
Let's increase ϵ and replot:

dbscan = DBSCAN(eps=0.2, min_samples=5)

dbscan.fit(X)

DBSCAN(algorithm='auto', eps=0.2, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=5, n_jobs=None, p=None)

plt.figure(figsize=(12, 8))
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.axis('off')
plt.show()
DBSCAN doesn't have a predict() method for new instances; instead, it lets us train a
classifier on its training output to classify new instances.

Let's implement it:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50)

knn.fit(dbscan.components_,
dbscan.labels_[dbscan.core_sample_indices_])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=50, p=2,
                     weights='uniform')

X_new = np.array([[-0.5, 0.], [0, 0.5], [1, -0.1], [2, 1]])

knn.predict(X_new)

array([1, 0, 1, 0])

knn.predict_proba(X_new)
array([[0.22, 0.78],
[1. , 0. ],
[0.18, 0.82],
[1. , 0. ]])

Note that we only trained the classifier on the core instances, but we could also have chosen to
train it on all instances, or on all instances except the anomalies; it depends on the final task.

Because we didn't train the classifier on anomalies, any new instance will be assigned to a cluster.
It is fairly straightforward to introduce a maximum distance, in which case the two instances that
are far from both clusters get classified as anomalies:

y_dist, y_pred_idx = knn.kneighbors(X_new, n_neighbors=1)

y_pred = dbscan.labels_[dbscan.core_sample_indices_][y_pred_idx]

y_pred[y_dist > 0.2] = -1

y_pred.ravel()

array([-1, 0, 1, -1])

In short, DBSCAN is a very simple yet powerful algorithm capable of identifying any number of
clusters of any shape. It is robust to outliers and has just two hyperparameters (eps and
min_samples).

If the densities vary significantly across the clusters, however, it can be impossible for it to
capture all the clusters properly. Its computational complexity is roughly $O(m \log m)$, making it
close to linear with regard to the number of instances, but the scikit-learn implementation can
require up to $O(m^2)$ memory if eps is large.

Other Clustering Algorithms


• Agglomerative Clustering
• BIRCH
• Mean-Shift: complexity of $O(m^2)$
• Affinity Propagation: complexity of $O(m^2)$
• Spectral Clustering: does not scale well to large numbers of instances and doesn't behave
well when the clusters have very different sizes.

Gaussian Mixtures
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the instances were
generated from a mixture of several Gaussian distributions whose parameters are unknown. All
the instances generated from a single Gaussian distribution form a cluster that typically looks
like an ellipsoid. Each cluster can have a different ellipsoidal shape, size, density, and orientation.

There are several GMM variants. In its simplest form (the one implemented in sklearn), the
algorithm requires you to specify the number of Gaussian distributions in advance.

To give some context, let's consider a dataset X that is assumed to have been generated
through the following probabilistic process. Here is a graphical model that represents the
structure of the conditional dependencies between the random variables:

For each instance, a cluster is picked at random from among k clusters. The probability of choosing the
j-th cluster is defined by the cluster's weight $\phi^{(j)}$. The index of the cluster chosen for the i-th
instance is noted $z^{(i)}$. If $z^{(i)} = j$, then the location $x^{(i)}$ of the instance is sampled at random
from the Gaussian distribution with mean $\mu^{(j)}$ and covariance matrix $\Sigma^{(j)}$, i.e.
$x^{(i)} \sim \mathcal{N}(\mu^{(j)}, \Sigma^{(j)})$.
We describe the different symbols in the above figure:

• squares: constants.
• rectangular containers: plates; their content is repeated several times.
• m, k: how many times the plate's content is repeated.
• solid arrow: conditional dependency.
• squiggly arrow: a switch (depending on the value of $z^{(i)}$, the instance $x^{(i)}$ is sampled from a different Gaussian distribution).

Given the dataset X, we typically want to start by estimating the weights $\phi^{(1)} \dots \phi^{(k)}$ and all the
distribution parameters $\mu^{(1)} \dots \mu^{(k)}$ and $\Sigma^{(1)} \dots \Sigma^{(k)}$.

sklearn makes this super easy:

X, y = datasets.make_blobs(n_samples=1000, n_features=2, centers=5,
                           cluster_std=[0.5, 0.5, 0.5, 1, 1])

plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], s=2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=5, n_init=10)

gm.fit(X)

GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
                means_init=None, n_components=5, n_init=10,
                precisions_init=None, random_state=None, reg_covar=1e-06,
                tol=0.001, verbose=0, verbose_interval=10, warm_start=False,
                weights_init=None)
gm.weights_

array([0.19939177, 0.2 , 0.2 , 0.19966467, 0.20094356])

gm.means_

array([[-5.43170033, -4.52582333],
[ 6.95589923, -4.83376105],
[-0.05573972, 3.97562412],
[-8.84512929, -7.56458988],
[-3.12203312, -6.46201983]])

gm.covariances_

array([[[ 1.18679153,  0.03475038],
        [ 0.03475038,  0.93651411]],

       [[ 1.09591031, -0.066672  ],
        [-0.066672  ,  1.01747048]],

       [[ 0.2399561 , -0.03119187],
        [-0.03119187,  0.2720167 ]],

       [[ 0.31072852, -0.01550297],
        [-0.01550297,  0.25520011]],

       [[ 0.2652748 ,  0.0243359 ],
        [ 0.0243359 ,  0.28724611]]])

But how did it work?

This class relies on the Expectation-Maximization (EM) algorithm, which has many similarities
with the K-Means algorithm. We list its steps:

1. Initialize the cluster parameters randomly.
2. Assign instances to clusters (the expectation step).
3. Update the clusters (the maximization step), then repeat steps 2 and 3 until convergence.

We can think of EM as a generalization of K-Means that not only finds the cluster centers
($\mu^{(1)} \dots \mu^{(k)}$), but also their size, shape, and orientation ($\Sigma^{(1)} \dots \Sigma^{(k)}$), as well as their relative sampling
weights ($\phi^{(1)} \dots \phi^{(k)}$).

Unlike KMeans though, EM uses soft cluster assignments (with probabilities), not hard
assignments.

During the expectation step, EM estimates the probability that each instance belongs to each
cluster. During the maximization step, each cluster is updated using all the instances in the
dataset. These probabilities are also called the responsibilities of the clusters for the instances.
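
To make the two steps concrete, here is a minimal, illustrative sketch of a single EM iteration for a GMM with full covariance matrices (not scikit-learn's implementation; X is an (m, n) array and the current parameters are passed in):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covariances):
    m, k = len(X), len(weights)
    # Expectation step: responsibilities r[i, j] = P(z_i = j | x_i).
    densities = np.column_stack([
        multivariate_normal.pdf(X, mean=means[j], cov=covariances[j])
        for j in range(k)])
    weighted = densities * weights
    r = weighted / weighted.sum(axis=1, keepdims=True)
    # Maximization step: update each cluster using all instances, weighted by r.
    nk = r.sum(axis=0)
    new_weights = nk / m
    new_means = (r.T @ X) / nk[:, None]
    new_covs = []
    for j in range(k):
        diff = X - new_means[j]
        new_covs.append((r[:, j, None] * diff).T @ diff / nk[j])
    return new_weights, new_means, np.array(new_covs)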

We can check if the algorithm converged & how many iterations it took like this:
gm.converged_

True

gm.n_iter_

Let's visualize hard vs. soft clustering with our trained model:

gm.predict(X)[:15]

array([1, 1, 1, 0, 3, 1, 1, 1, 2, 1, 1, 3, 2, 1, 2])

np.round(gm.predict_proba(X)[:15], 5)

array([[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],


[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[9.3636e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00, 6.3640e-02],
[1.0000e-05, 0.0000e+00, 0.0000e+00, 9.9999e-01, 0.0000e+00],
[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[1.0000e-05, 0.0000e+00, 0.0000e+00, 9.9999e-01, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00]])

A Gaussian mixture model is a generative model, meaning we can sample new instances from its
learned distribution:

X_new, y_new = gm.sample(6)

X_new

array([[ 7.86882656, -4.71892846],
       [-9.26802379, -7.26607116],
       [-9.64629111, -7.67137044],
       [-4.16578781, -6.36179836],
       [-3.58226792, -5.75477351],
       [-2.71443066, -7.11024713]])

y_new

array([1, 3, 3, 4, 4, 4])
It is also possible to estimate the density of the model at any given location. The greater the
score, the higher the density:

gm.score_samples(X)[:10]

array([-4.15234724, -3.67533915, -4.6936191 , -4.85724922, -2.31739792,
       -3.57682272, -4.23676386, -4.73767025, -2.3685467 , -4.43374727])

If we take the exponential of these scores, we get the value of the probability density function
(PDF) at the location of the given instances.

These results aren't probabilities but probability densities. If we want the probability that an
instance falls within a region, we have to integrate the PDF over that region.

Let's visualize PDF:

# Plotting decision regions


x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
np.arange(y_min, y_max, 0.1))

Z = np.exp(gm.score_samples(np.c_[xx.ravel(), yy.ravel()]))
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 8))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k', alpha=0.1)
plt.show()
Nice, the algorithm clearly found an excellent solution. However, this was easy because we
generated the data from actual Gaussian distributions and gave EM the correct number of
clusters. When there are many dimensions, many clusters, or few instances, EM can struggle
to converge to the optimal solution. In that case we need to regularize it. One way to do this is to
limit the number of shapes and orientations the distributions can take by constraining the space
of possible covariance matrices.

We can set the covariance_type hyperparameter to one of the following values (the default is "full"):

• "spherical"
• "diag"
• "tied"
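
For instance, here is a quick, illustrative way to refit the same data with a constrained model (the "tied" setting shares one covariance matrix across all clusters):

gm_tied = GaussianMixture(n_components=5, n_init=10, covariance_type="tied")
gm_tied.fit(X)

# With covariance_type="tied" there is a single shared covariance matrix,
# so covariances_ has shape (n_features, n_features) instead of (k, n, n).
gm_tied.covariances_.shape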

Examples of hyperparameter values over the same dataset:


A Gaussian mixture's computational complexity depends on m, n, k and on the constraints on the
covariance matrices. If covariance_type is "spherical" or "diag", the complexity is $O(kmn)$,
assuming the data has a clustering structure. If it is "tied" or "full", the complexity is
$O(kmn^2 + kn^3)$, so it will not scale to large numbers of features.

Anomaly Detection using Gaussian Mixtures


Anomaly detection is the task of detecting instances that deviate strongly from the norm. These
instances are called anomalies or outliers, while the rest are called inliers.

Anomaly detection is useful in a wide range of applications, such as:

• Fraud detection
• Detecting defective products in manufacturing
• Removing outliers from a dataset before training a model, which can significantly
improve the performance of the resulting model.

In the context of a GMM, any instance located in a low-density region can be considered an
anomaly, but we must define the density threshold we want to use.

Here is how we would flag as outliers the instances whose estimated density is below the 4th
percentile (i.e., roughly the 4% of instances located in the lowest-density regions):

densities = gm.score_samples(X)

density_threshold = np.percentile(densities, 4)

anomalies = X[densities < density_threshold]

anomalies.shape

(40, 2)

A closely related task is novelty detection. It differs from anomaly detection in that the
algorithm is assumed to be trained on a clean dataset, free of outliers; unusual new instances
are then treated as rare novelties rather than as anomalies present in the training data.

Just like KMeans, the GaussianMixture algorithm requires us to specify the number of
clusters. So how can we find it?

Selecting the Number of Clusters


We can try to find the model that minimizes a theoretical information criterion, such as:

• the Bayesian information criterion (BIC)


• Akaike information criterion (AIC)

$$BIC = \log(m)\,p - 2\log(\hat{L}) \\ AIC = 2p - 2\log(\hat{L})$$

Where:

• m is the number of instances.
• p is the number of parameters learned by the model.
• L̂ is the maximized value of the likelihood function of the model.

BIC and AIC both penalize models that have more parameters to learn (e.g. more clusters) and reward models that fit the data well; they often end up selecting the same model. When they differ, the model selected by BIC tends to be simpler than the one selected by AIC, but it tends not to fit the data quite as well.

The difference between probability and likelihood: given a statistical model M_θ with parameters θ, the probability of an outcome x given the model's parameters is P_θ(X = x), whereas the likelihood of θ = ϕ given the observation X = x is P_{X=x}(θ = ϕ).

So:

$$\hat{L}=\max_{\phi}\{P_{X=x}(\theta=\phi)\} \\ \hat{L}=\max_{\phi}f(\theta=\phi;X)$$

The PDF is a function of x with θ fixed, while the likelihood is a function of θ with x fixed. It is important to understand that the likelihood function is not a probability distribution.

Given a dataset X, a common task is to estimate the most likely values of the model parameters. To do this, we must find the value of θ that maximizes the likelihood function given X; this is maximum likelihood estimation (MLE).

Maximum a posteriori (MAP) estimation: when we have a prior probability distribution g(θ) over θ, it is possible to take it into account by maximizing L(θ | x) g(θ) instead of L(θ | x).

Because log is an increasing function, maximizing the likelihood function is equivalent to maximizing its log. The log likelihood is generally easier to maximize: if we observe several independent instances, the likelihood is a product over all instances, and since log(ab) = log(a) + log(b), the log turns that product into a sum, which is much easier to work with.
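Concretely, for m independent instances the log turns the likelihood's product into a sum:

$$\log L(\theta; x^{(1)}, \dots, x^{(m)}) = \log \prod_{i=1}^{m} f(x^{(i)}; \theta) = \sum_{i=1}^{m} \log f(x^{(i)}; \theta)$$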

Once we have found the value θ̂ that maximizes the likelihood function, we can compute L̂ = L(θ̂, X), which is what we need for the AIC and BIC.

Let's do it in sklearn:

gm.bic(X)

7515.986587277572

gm.aic(X)

7373.66168418709
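We can sanity-check these values against the formulas above (a sketch; it assumes gm uses the default full covariance matrices and was fit on our 2-dimensional blob dataset):

m, n = X.shape
k = gm.n_components

# Learned parameters for a full-covariance mixture:
# k*n means, k*n*(n+1)/2 covariance terms, and k-1 free weights.
p = k * n + k * n * (n + 1) // 2 + (k - 1)

log_likelihood = gm.score(X) * m  # score() returns the mean log-likelihood per instance

bic = np.log(m) * p - 2 * log_likelihood
aic = 2 * p - 2 * log_likelihood
bic, aic  # should match gm.bic(X) and gm.aic(X)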

The following figure shows the values of BIC & AIC for different cluster numbers:

As we can see, AIC & BIC are lowest when k =3 so it is most likely the best choice.
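A figure like this could be produced with a loop along these lines (a sketch; the range of k values is an arbitrary choice):

from sklearn.mixture import GaussianMixture

ks = range(1, 11)
bics, aics = [], []
for k in ks:
    gm_k = GaussianMixture(n_components=k, n_init=10).fit(X)
    bics.append(gm_k.bic(X))
    aics.append(gm_k.aic(X))

plt.plot(ks, bics, label='BIC')
plt.plot(ks, aics, label='AIC')
plt.xlabel('n_components: k')
plt.ylabel('Information criterion')
plt.legend()
plt.show()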

Bayesian Gaussian Mixture Models


Rather than manually searching for the optimal number of clusters, you can use the
BayesianGaussianMixture class, which is capable of giving weights equal (or close) to zero
to unnecessary clusters.
For example, let's set the number of clusters to 10 and see what happens:

from sklearn.mixture import BayesianGaussianMixture

bgm = BayesianGaussianMixture(n_components=10, n_init=10)

bgm.fit(X)

BayesianGaussianMixture(covariance_prior=None, covariance_type='full',
degrees_of_freedom_prior=None,
init_params='kmeans',
max_iter=100, mean_precision_prior=None,
mean_prior=None, n_components=10, n_init=10,
random_state=None, reg_covar=1e-06, tol=0.001,
verbose=0, verbose_interval=10,
warm_start=False,
weight_concentration_prior=None,

weight_concentration_prior_type='dirichlet_process')

np.round(bgm.weights_, 2)

array([0.2, 0.2, 0.2, 0.2, 0.2, 0. , 0. , 0. , 0. , 0. ])

The algorithm automatically detected that only 5 clusters are needed.

In this model, the cluster parameters (weights, means, covariance matrices) are not treated as fixed model parameters anymore, but as latent random variables, like the cluster assignments. So z now includes both the cluster parameters and the cluster assignments.

The beta distribution is commonly used to model random variables whose values lie within a
fixed range. In this case, the range is from 0 to 1.

The stick-breaking process is a good model for datasets where new instances are more likely to
join large clusters than small clusters. E.g. people love to move to larger cities.

If the concentration α is high, then the ϕ values will likely be close to 0, and the SBP will generate more clusters. The Wishart distribution is used to sample covariance matrices; its parameters d and V control the distribution of cluster shapes.

Prior knowledge about the latent variables z can be encoded in a probability distribution p(z) called the prior. For example, we may have a prior belief that the clusters are likely to be few (low concentration).

The more data we have, the less the prior matters.

Bayes' theorem tells us how to update the probability distribution over the latent variables after we observe some data X. It computes the posterior distribution p(z | X), which is the conditional probability of z given X:

$$p(z \mid X) = \frac{p(X \mid z)\,p(z)}{p(X)}$$
The evidence p(X) is often intractable:

$$p(X) = \int p(X \mid z)\,p(z)\,dz$$
This intractability is one of the central problems in Bayesian statistics, and there are several approaches to solving it. One of them is variational inference, which picks a family of distributions q(z; λ) (where λ is called the variational parameter) and then optimizes these parameters to make q(z) ≈ p(z | X). This is achieved by finding the value of λ that minimizes the KL divergence from q(z) to p(z | X), noted D_KL(q ‖ p).
The KL divergence can be rewritten as the log of the evidence minus the evidence lower bound (ELBO):

$$D_{KL}(q \,\|\, p) = \log p(X) - ELBO$$

Since the log of the evidence does not depend on q, minimizing the KL divergence is equivalent to maximizing the ELBO. In practice, there are different techniques for maximizing the ELBO.
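For reference, the ELBO itself can be written as an expectation under q (a standard identity):

$$ELBO = \mathbb{E}_{q(z)}[\log p(X, z)] - \mathbb{E}_{q(z)}[\log q(z)]$$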

In mean field variational inference, the family of distributions q(z; λ) and the prior p(z) must be picked very carefully to ensure that the ELBO simplifies to a form that can actually be computed; unfortunately, there is no general way to do this.

Another approach to maximizing the ELBO is black box stochastic variational inference (BBSVI): at each iteration, a few samples are drawn from q and used to estimate the gradients of the ELBO with regard to the variational parameters λ, which are then used in a gradient ascent step.

This approach makes it possible to use Bayesian inference with any differentiable system, including deep neural networks (this is called Bayesian deep learning).

Now let's take a look at other algorithms capable of detecting anomalies and novelties.

Other Algorithms for Anomaly & Novelty Detection


• PCA: if we compare the reconstruction error of a normal instance with the reconstruction error of an anomaly, the latter will usually be much larger.
• Fast-MCD: it assumes that the normal instances are generated from a single Gaussian distribution (while the outliers are not), and it is good at ignoring the outliers when estimating that distribution.
• Isolation Forest: builds a forest of random trees that split instances using random features and thresholds; anomalies tend to get isolated from the other instances in fewer splits (see the sketch after this list).
• Local Outlier Factor: it compares the density around an instance to the densities around its neighbors; an anomaly is often much more isolated than its neighbors.
• One-class SVM: it tries to separate the instances in a high-dimensional space from the origin, which corresponds to finding a small region that encompasses all the instances in the original space. If a new instance does not fall within this region, it is an anomaly.
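As a quick illustration, here is how one of these detectors (Isolation Forest) could be applied to our blob dataset from earlier (a sketch; the contamination value is an arbitrary choice):

from sklearn.ensemble import IsolationForest

iforest = IsolationForest(contamination=0.04)
labels = iforest.fit_predict(X)  # -1 for anomalies, 1 for inliers

iforest_anomalies = X[labels == -1]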
Exercises
1. How would you define clustering? Can you name a few clustering algorithms?

Clustering is the task of uncovering groups (or clusters) of similar instances within unlabeled data. It may rely on distance or density measures.

Algorithms: KMeans, DBSCAN, Gaussian mixture models, ...

2. What are some of the main applications of clustering algorithms?

Customer segmentation, fraud detection, novelty detection, social network analysis (detecting communities), data analysis/visualization, dimensionality reduction, ...

3. Describe 2 techniques to select the right number of clusters when using K-Means?

Plotting the silhouette score as a function of the number of clusters, then choosing the number of clusters that maximizes the score (a sketch follows below).

The elbow rule: plotting the inertia as a function of the number of clusters and choosing the point where the inertia stops decreasing rapidly (the "elbow").
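A minimal sketch of the silhouette approach, assuming the blob dataset X from earlier in the chapter:

from sklearn.metrics import silhouette_score

sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k).fit(X)
    sil_scores.append(silhouette_score(X, km.labels_))

best_k = 2 + int(np.argmax(sil_scores))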

4. What is label propagation? Why would you implement it? and How?

Label propagation is useful in the context of semi-supervised learning where we have few
labeled points and a lot of unlabeled instances and we want to propagate the labels from the
annotated samples to the unlabeled ones.

We would want to implement it if we have a downstream supervised learning task that needs more labeled data.

A simple strategy is distance-based propagation: first train an unsupervised clustering algorithm, then propagate each labeled instance's label to the unlabeled instances that belong to the same cluster (a sketch follows below).
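A rough sketch of that strategy; X_all, y_labeled and n_labeled are hypothetical names (all instances, the integer labels of the first n_labeled instances, and their count):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical inputs: X_all holds every instance; the first n_labeled of them
# have known integer labels stored in y_labeled.
km = KMeans(n_clusters=10).fit(X_all)
clusters = km.labels_

y_propagated = np.full(len(X_all), -1)  # -1 means "no label could be propagated"
for c in np.unique(clusters):
    labeled_in_cluster = y_labeled[clusters[:n_labeled] == c]
    if len(labeled_in_cluster) > 0:
        # Propagate the most common label among the cluster's labeled members.
        y_propagated[clusters == c] = np.bincount(labeled_in_cluster).argmax()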

5. Can you name two clustering algorithms that can scale to large datasets? and two that look
for regions of high density?

Scale to large datasets: KMeans and BIRCH (and DBSCAN too, provided ϵ is small enough).

Look for regions of high density: DBSCAN and Mean-Shift.

6. Can you think of a use case when active learning would be useful? How would you
implement it?

It would be useful when we have a small labeled dataset, a lot of unlabeled instances, and labeling is costly.

First we train a model on the available labeled instances. Then we predict on the unlabeled instances and hand the annotator the instances the model is least sure about. We repeat this loop until the increase in performance is no longer worth the labeling effort; this strategy is called uncertainty sampling (a sketch follows below).
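A sketch of a single uncertainty-sampling round; X_labeled, y_labeled and X_unlabeled are hypothetical names, and the classifier choice is arbitrary:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: X_labeled / y_labeled are the annotated instances,
# X_unlabeled holds the instances waiting for annotation.
clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# The classifier is least sure about instances whose top class probability is low.
probas = clf.predict_proba(X_unlabeled)
uncertainty = 1 - probas.max(axis=1)
to_annotate = np.argsort(uncertainty)[-10:]  # the 10 most uncertain instances
# These indices would be handed to the human annotator, then the loop repeats.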

7. What is the difference between Anomaly Detection & Novelty Detection?

Anomaly detection: the training data may itself contain outliers, and the goal is to detect instances that deviate strongly from the rest of the data.
Novelty detection: the algorithm is assumed to be trained on a clean dataset, and the goal is to detect new instances that differ from that training distribution.

8. What is a Gaussian Mixture? What Tasks can you use it for?

A Gaussian mixture is a probabilistic model that assumes the instances were generated from a mixture of several Gaussian distributions, each with its own parameters (μ and Σ), together with a set of weights (ϕ) describing the probability that an instance was generated by each distribution.

In our case, it was used to uncover clusters in unlabeled data, assuming that the clusters correspond to Gaussian distributions. As seen in this chapter, it can also be used for density estimation, anomaly detection, and sampling new instances.

9. Can you find two techniques to find the right number of clusters when using a Gaussian
mixture model?

Plotting the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, then choosing the number that minimizes it.

Using a Bayesian Gaussian mixture model, which drives the weights of unnecessary clusters to (or close to) zero, so we don't have to search for the number of clusters manually.

10. The classic Olivetti faces dataset contains 400 gray-scale 64 × 64-pixel images of faces. Each image is flattened to a 1D vector of size 4,096. 40 different people were photographed (10 times each), and the usual task is to train a model to predict which person is represented in each picture. Load the dataset using the sklearn.datasets.fetch_olivetti_faces() function, then split it into a training set, a validation set and a test set. Since the dataset is quite small, you probably want to use stratified sampling to ensure that there is the same number of images per person in each set.

of = datasets.fetch_olivetti_faces()

X, y = of.data, of.target
X.shape, y.shape

((400, 4096), (400,))

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,


stratify=y_train)

X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape,


y_test.shape

((225, 4096), (225,), (75, 4096), (75,), (100, 4096), (100,))

Cluster the images using KMeans, and make sure that you have a good number of clusters. Visualize the clusters: do you see similar faces in each cluster?

kms = KMeans()
kms.fit(X_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,


n_clusters=8, n_init=10, n_jobs=None,
precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)

inertias = list()
for k in range(1, 20):
    kms = KMeans(n_clusters=k)
    kms.fit(X_train)
    inertias.append(kms.inertia_)

plt.plot(range(1, 20), inertias)
plt.xlabel('n_clusters')
plt.ylabel('Inertia')
plt.show()

Based on the figure we choose 5 as the number of clusters.

Visualize the clusters: Do you see similar faces in each cluster?

Yes, let's perform TSNE and visualize the images:

from sklearn.manifold import TSNE

tsne = TSNE()
X_train_reduced = tsne.fit_transform(X_train)
X_train_reduced.shape

(225, 2)

from matplotlib.offsetbox import AnnotationBbox, OffsetImage

fig, ax = plt.subplots(figsize=(15, 15))

ax.scatter(X_train_reduced[:, 0], X_train_reduced[:, 1])

for idx in range(len(X_train_reduced)):
    ab = AnnotationBbox(OffsetImage(X_train[idx].reshape(64, 64), zoom=0.7),
                        (X_train_reduced[idx, 0], X_train_reduced[idx, 1]),
                        frameon=False)
    ax.add_artist(ab)

plt.axis('off')
plt.show()
11. Continuing with the Olivetti faces dataset, train a classifier to predict which person is
represented in each picture and evaluate it on the validation set

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)

rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini',
max_depth=None, max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False,
random_state=None,
verbose=0, warm_start=False)

rf.score(X_val, y_val)

0.88

Next, use KMeans as a dimensionality reduction tool, and train a classifier on the reduced set.
Search for the number of clusters that allows the classifier to get the best performance: What
performance can you reach?

Around 25 clusters give us about 68% accuracy.

clusters, val_accs = list(), list()

for k in range(1, 40):
    kms = KMeans(n_clusters=k).fit(X_train)
    X_train_tmp = kms.transform(X_train)
    X_val_tmp = kms.transform(X_val)
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X_train_tmp, y_train)
    clusters.append(k)
    val_accs.append(rf.score(X_val_tmp, y_val))

plt.plot(clusters, val_accs)
plt.xlabel('clusters: k')
plt.ylabel('validation accuracy')
plt.show()
What if you append the features from the reduced set to the original features?

26 to 32 clusters seem good (though we would need a bigger validation set to be sure), reaching >= 91% accuracy.

clusters, val_accs = list(), list()

for k in range(1, 40):
    kms = KMeans(n_clusters=k).fit(X_train)
    X_train_tmp = np.concatenate((X_train, kms.transform(X_train)), axis=1)
    X_val_tmp = np.concatenate((X_val, kms.transform(X_val)), axis=1)
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X_train_tmp, y_train)
    clusters.append(k)
    val_accs.append(rf.score(X_val_tmp, y_val))

plt.plot(clusters, val_accs)
plt.xlabel('clusters: k')
plt.ylabel('validation accuracy')
plt.show()
12. Train a Gaussian Mixture Model on the Olivetti faces dataset. To speed up the algorithm,
you should probably reduce the dataset's dimensionality (Use PCA, preserving 99% of the
variance)

from sklearn.decomposition import PCA

pca = PCA(n_components=.99)

X_train_reduced = pca.fit_transform(X_train)

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=25)

gm.fit(X_train_reduced)

GaussianMixture(covariance_type='full', init_params='kmeans',
max_iter=100,
means_init=None, n_components=25, n_init=1,
precisions_init=None, random_state=None, reg_covar=1e-
06,
tol=0.001, verbose=0, verbose_interval=10,
warm_start=False,
weights_init=None)

Use the model to generate some new faces and visualize them

generated_samples, _ = gm.sample(n_samples=100)
generated_faces = pca.inverse_transform(generated_samples)

tsne = TSNE()
generated_faces_reduced = tsne.fit_transform(generated_faces)
generated_faces_reduced.shape

(100, 2)

fig, ax = plt.subplots(figsize=(15, 15))

ax.scatter(generated_faces_reduced[:, 0], generated_faces_reduced[:, 1])

for idx in range(len(generated_faces)):
    ab = AnnotationBbox(OffsetImage(generated_faces[idx].reshape(64, 64), zoom=0.7),
                        (generated_faces_reduced[idx, 0], generated_faces_reduced[idx, 1]),
                        frameon=False)
    ax.add_artist(ab)

plt.axis('off')
plt.show()
13. Some dimensionality reduction techniques can also be used for anomaly detection. For
example, take the Olivetti dataset and reduce it with PCA, preserving 99% of the variance.

pca = PCA(n_components=.99)
X_train_reduced = pca.fit_transform(X_train)

Then compute the reconstruction error for each image

X_train_reconstructed = pca.inverse_transform(X_train_reduced)

X_train_reconstruction_error = ((X_train_reconstructed - X_train) ** 2).mean(axis=1)

import seaborn as sns

sns.distplot(X_train_reconstruction_error)

<matplotlib.axes._subplots.AxesSubplot at 0x1a44766e10>

Let's look at the anomaly faces:

anomaly_indices = X_train_reconstruction_error.argsort()[-5:][::-1]

tsne = TSNE()
X_train_reduced = tsne.fit_transform(X_train)

fig, ax = plt.subplots(figsize=(15, 15))

ax.scatter(X_train_reduced[:, 0], X_train_reduced[:, 1])

for idx in range(len(X_train)):
    ab = AnnotationBbox(OffsetImage(X_train[idx].reshape(64, 64),
                                    cmap='binary_r', zoom=0.7),
                        (X_train_reduced[idx, 0], X_train_reduced[idx, 1]),
                        frameon=False)
    ax.add_artist(ab)

for idx in anomaly_indices:
    ab = AnnotationBbox(OffsetImage(X_train[idx].reshape(64, 64), zoom=1.5),
                        (X_train_reduced[idx, 0], X_train_reduced[idx, 1]),
                        frameon=False)
    ax.add_artist(ab)

plt.axis('off')
plt.show()
