09. Unsupervised Learning
Although most of the applications of machine learning today are based on supervised learning, the vast majority of the available data is unlabeled: we have the input X, but we don't have the labels y.
• Clustering: the goal is to group similar instances together into clusters. Clustering is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, and more.
• Anomaly Detection & Density Estimation: density estimation is the task of estimating the probability density function of the random process that generated the dataset. In anomaly detection, instances located in very low-density regions are likely to be anomalies.
Clustering
Clustering is the task of identifying groups containing similar objects. Applications of clustering:
• Image similarity search: cluster the available images; when a new image is provided by the user, assign it to a cluster with the same algorithm and return the N images closest to that cluster's center.
• Image segmentation: cluster pixels according to their color, then replace each pixel's color with the mean color of its cluster.
• Data Analysis: it is often useful to cluster the instances and analyze each cluster separately.
• Dimensionality Reduction: replace each instance's features with its affinities to the clusters.
• Anomaly Detection: any instance that has a low affinity to all clusters is likely to be an outlier.
• Semi-supervised Learning: if we have a few labels, we can perform clustering and propagate the available labels to the other instances within the same clusters.
Let's consider the example of unsupervised clustering of the Iris dataset. If we remove its labels, we can't use classification algorithms, but clustering can make use of all the available features to locate groups and assign every instance to one of them. A trained Gaussian Mixture Model wrongly assigns only 5 of the 150 data points.
There are different types of clustering algorithms and there isn't a universal definition of what a
cluster is.
In this section, we will look at two widely used clustering algorithms: K-Means and DBSCAN.
K-Means
Let's start by making some data:
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# The original data-generation code is not shown; make_blobs is used here as an illustrative stand-in.
X, y = datasets.make_blobs(n_samples=1000, centers=5, random_state=42)
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], s=2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
y_pred is kmeans.labels_
True
We can also take a look at the five centroids the algorithm found:
kmeans.cluster_centers_
# X_new was not defined in the original; a few illustrative points are used here.
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
kmeans.predict(X_new)

# Plot the decision boundaries (the meshgrid construction was missing in the original).
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 500),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 500))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 8))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=20, edgecolor='k')
plt.show()
The vast majority of the instances were clearly assigned to their original cluster.
All K-Means cares about is the distance between instances and the centroids. Instead of assigning each instance to a single cluster (hard clustering), it can be better to give each instance a score per cluster (soft clustering). The score can be the distance between the instance and each centroid; these distances can also serve as a dimensionality reduction technique.
In sklearn, the transform method measures the distance between each instance and the
centroids.
kmeans.transform(X_new)
KMeans Algorithm
Let's suppose we were given the centroids. We could easily label all the instances in the dataset by assigning each of them to the cluster with the closest centroid. Conversely, if we were given all the instance labels, we could easily locate the centroids by computing the mean of the instances within each cluster. But we are given neither the labels nor the centroids, so how can we proceed?
By just picking centroids randomly: pick k instances at random and use their locations as centroids, then label the instances, update the centroids, label the instances, update the centroids, and so on until the centroids stop moving.
The algorithm is guaranteed to converge in a finite number of steps (usually very few); it will not oscillate forever.
The computational complexity of the algorithm is generally linear with regard to the number of instances m, the number of clusters k, and the number of dimensions n. However, this is only true when the data has a clustering structure; the complexity can become exponential if the instances are not structured in clusters.
K-Means is generally one of the fastest clustering algorithms. Even though the algorithm is guaranteed to converge, it may not converge to the right solution (it may converge to a local optimum); this depends on the initialization step. One solution is to provide good initial centroid locations, if we have an idea of roughly where they should be:
good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
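If we pass such centroids to sklearn and run the algorithm only once, they are used directly as the initialization. A minimal sketch, reusing the good_init array above:
kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)
kmeans.fit(X)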
Another solution is to run the algorithm multiple times with different random initializations and keep the best solution; this is controlled by the n_init hyperparameter. Scikit-learn keeps the best solution for you by running the algorithm n_init times. For the algorithm to know what "best" means, it uses a performance metric called the inertia, which is the mean squared distance between each instance and its closest centroid.
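As a quick check, sklearn exposes the inertia of the fitted model directly, and score() returns the negative inertia (so that greater is better, as for every sklearn score):
kmeans.inertia_
kmeans.score(X)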
Another important improvement to the K-Means algorithm (Elkan's algorithm) considerably accelerates it by avoiding many unnecessary distance calculations. This is achieved by exploiting the triangle inequality: a straight line is always the shortest distance between two points. The algorithm keeps track of lower and upper bounds for the distances between instances and centroids. This improvement is also implemented in sklearn.
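As a hedged sketch, the accelerated variant can be requested explicitly through the algorithm hyperparameter (the set of accepted values varies a bit across sklearn versions):
kmeans_elkan = KMeans(n_clusters=5, algorithm="elkan")
kmeans_elkan.fit(X)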
Another paper proposed using mini-batches instead of the whole dataset at each iteration. This speeds up the algorithm, typically by a factor of three or four, and makes it possible to cluster huge datasets that don't fit into memory.
from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
minibatch_kmeans.fit(X)
If our dataset can't fit in memory, we can use a memory-mapped array (np.memmap) together with the partial_fit() method, as sketched below.
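Here is a minimal sketch of that idea; the file name, dtype, shape, and batch size are illustrative assumptions, not values from the original notes:
X_mm = np.memmap("data.dat", dtype="float32", mode="r", shape=(1_000_000, 2))  # assumes the file already exists
minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
batch_size = 10_000
for i in range(0, X_mm.shape[0], batch_size):
    minibatch_kmeans.partial_fit(X_mm[i:i + batch_size])  # one mini-batch at a time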
The advantage of using MiniBatchKMeans becomes significant when the number of clusters k is large: mini-batch training becomes much faster while the performance stays roughly the same.
Choosing the number of clusters by plotting the inertia as a function of k and looking for an "elbow" is rather coarse. A more precise (and more computationally expensive) approach is to use the silhouette score, which is the mean silhouette coefficient over all the instances.
An instance's silhouette coefficient is equal to (b − a) / max(a, b) ∈ [−1, 1], where:
• a : mean distance to other instances in the same cluster.
• b : mean distance to instances in the next closest cluster.
• +1 means the instance is well inside its own cluster and far from other clusters.
• 0 means the instance is sitting on the edge between two clusters.
• −1 means the instance may have been assigned to the wrong cluster.
The silhouette score thus measures how dense and well separated the clusters are.
from sklearn.metrics import silhouette_score

silhouette_score(X, minibatch_kmeans.labels_)
0.7217271740816389
scores = list()
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    scores.append(silhouette_score(X, kmeans.labels_))
    del kmeans
When most of the instances in a cluster have a lower coefficient than the overall silhouette score, the cluster is rather bad, since this means its instances are much too close to other clusters. Here, k = 4 and k = 5 look fine, and it seems like a good idea to use k = 5 to get clusters of similar sizes.
Limits of K-Means
K-Means is not perfect: it is necessary to run the algorithm multiple times to avoid suboptimal solutions. Another limiting factor is that we need to specify the number of clusters. K-Means also doesn't behave very well when the clusters have varying sizes, different densities, or non-spherical shapes.
Depending on the data, different clustering algorithms may perform better (like DBSCAN or
Gaussian Mixtures).
In instance segmentation, all pixels that are part of the same object are assigned to the same segment. The state of the art in semantic and instance segmentation today is achieved with complex architectures based on convolutional neural networks.
Here, we are going to do something much simpler: color segmentation. We will simply assign pixels to the same segment if they have a similar color. Application: if we want to assess forest cover in a satellite image, color segmentation may be enough.
Let's do it:
# `image` is assumed to have been loaded beforehand as an (H, W, 3) RGB array
# (the loading code is not shown in the original notes).
image.shape
(850, 1280, 3)
plt.figure(figsize=(12, 8))
plt.imshow(image.astype(int))
plt.axis('off')
plt.show()
X = image.reshape((-1, 3))
X.shape
(1088000, 3)
kmeans = KMeans(n_clusters=8).fit(X)
segmented_image = kmeans.cluster_centers_[kmeans.labels_]
segmented_image = segmented_image.reshape(image.shape)
plt.figure(figsize=(12, 8))
plt.imshow(segmented_image.astype(int))
plt.axis('off')
plt.show()
Far-away pixels may end up in the same cluster because K-Means is not aware of spatial positioning: its inputs are 3D RGB points, and it clusters them according to how close they are to each other in color space.
As an example of using clustering for dimensionality reduction, let's tackle the small-digits
dataset:
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  # split not shown in the original
log_reg = LogisticRegression(solver='liblinear', multi_class='auto')        # model construction assumed
log_reg.fit(X_train, y_train)
log_reg.score(X_test, y_test)
0.9711111111111111
We create a pipeline that will first cluster the training set into 50 clusters and replace each image with its distances to these 50 clusters, then apply a logistic regression model:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression(solver='liblinear', multi_class='auto'))
])
pipeline.fit(X_train, y_train)
Pipeline(memory=None,
         steps=[('kmeans',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=50, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001, verbose=0)),
                ('log_reg',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='liblinear', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
pipeline.score(X_test, y_test)
0.9777777777777777
By reducing the dimensionality of the input, we removed much of the noise, and the instances became easier for the logistic regression model to recognize. But we chose the number of clusters arbitrarily; we can surely do better.
We can use GridSearchCV to find the optimal number of clusters based on the final scoring by
Logistic Regression:
from sklearn.model_selection import GridSearchCV

param_dict = dict(kmeans__n_clusters=range(75, 125))
grid_clf = GridSearchCV(pipeline, param_dict, cv=3, verbose=2)  # construction not shown in the original
grid_clf.fit(X_train, y_train)
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('kmeans',
                                        KMeans(algorithm='auto', copy_x=True,
                                               init='k-means++', max_iter=300,
                                               n_clusters=50, n_init=10,
                                               n_jobs=None,
                                               precompute_distances='auto',
                                               random_state=None, tol=0.0001,
                                               verbose=0)),
                                       ('log_reg',
                                        LogisticRegression(C=1.0, class_weight=None,
                                                           dual=False, fit_intercept=True,
                                                           intercept_scaling=1, l1_ratio=None,
                                                           max_iter=100, multi_class='auto',
                                                           n_jobs=None, penalty='l2',
                                                           random_state=None, solver='liblinear',
                                                           tol=0.0001, verbose=0,
                                                           warm_start=False))],
                                verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'kmeans__n_clusters': range(75, 125)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
grid_clf.best_params_
{'kmeans__n_clusters': 77}
grid_clf.score(X_test, y_test)
0.9888888888888889
With k = 77, we got a significant accuracy boost just by reducing the dimensionality of the dataset with unsupervised clustering before training the classifier.
Let's train a logistic regression model on a sample of 50 labeled instances from the digits
dataset:
n_labeled = 50
log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])
log_reg.score(X_test, y_test)
0.8644444444444445
It should come as no surprise that this is much lower than earlier, because we only used 50 training points.
Let's cluster the training set into 50 clusters, then for each cluster, let's find the image closest to
the centroid.
k = 50
kmeans = KMeans(n_clusters=k)
X_digits_dist = kmeans.fit_transform(X_train)
# The index-selection line was missing in the original: for each cluster, take the
# training image closest to its centroid.
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
X_representative_digits = X_train[representative_digit_idx]
y_representative_digits = y_train[representative_digit_idx].copy()
Now we have a dataset with just 50 labeled instances, but instead of being random instances,
each of them is a representative image of its cluster.
log_reg.fit(X_representative_digits, y_representative_digits)
log_reg.score(X_test, y_test)
0.9088888888888889
We've made a big jump in performance even though we are still training on the same number of data points. We could do this because we manually labeled representative instances: we obtained them by running unsupervised clustering with k clusters and taking, for each cluster, the instance closest to its centroid.
However, what if we propagated the labels to all the other instances in the same clusters? This is called label propagation:
y_train_propagated = np.empty(len(X_train), dtype=np.int64)  # initialization missing in the original
for i in range(k):
    y_train_propagated[kmeans.labels_ == i] = y_representative_digits[i]

log_reg.fit(X_train, y_train_propagated)
log_reg.score(X_test, y_test)
0.9355555555555556
The problem is that we propagated each representative instance's label to all the instances in the same cluster, including the instances located close to the cluster boundaries, which are more likely to be mislabeled.
Let's see what happens if we only propagate the labels to the 20% of instances that are closest to the centroids:
percentile_closest = 20.

X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]
for i in range(k):
    in_cluster = (kmeans.labels_ == i)
    cluster_dist = X_cluster_dist[in_cluster]
    cutoff_distance = np.percentile(cluster_dist, percentile_closest)
    above_cutoff = (X_cluster_dist > cutoff_distance)
    X_cluster_dist[in_cluster & above_cutoff] = -1  # mark the instances beyond the cutoff

partially_propagated = (X_cluster_dist != -1)  # this mask was missing in the original
X_train_partially_propagated = X_train[partially_propagated]
y_train_partially_propagated = y_train_propagated[partially_propagated]  # use the propagated labels

log_reg.fit(X_train_partially_propagated, y_train_partially_propagated)
log_reg.score(X_test, y_test)
0.9333333333333333
To continue improving our model and training set, the next step could be to do a few rounds of active learning, which is when a human expert interacts with the learning algorithm, providing labels for specific instances when the algorithm requests them.
There are many strategies for active learning; one of the most common is called uncertainty sampling. Here is how it works (a code sketch follows the list):
1. The model is trained on the labeled instances gathered so far and is used to make predictions on all of the unlabeled instances.
2. The instances for which the model is most uncertain are given to the expert to be labeled.
3. This process is iterated until the increase in performance is no longer worth the effort of manual labeling.
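A minimal sketch of one round of uncertainty sampling; the pool split and the batch size of 10 are illustrative assumptions, not from the original notes:
X_labeled, y_labeled = X_train[:n_labeled], y_train[:n_labeled]  # hypothetical labeled pool
X_unlabeled = X_train[n_labeled:]                                # hypothetical unlabeled pool
log_reg.fit(X_labeled, y_labeled)                 # 1. train on the labels gathered so far
probas = log_reg.predict_proba(X_unlabeled)       #    and predict on the unlabeled pool
uncertainty = 1 - probas.max(axis=1)              # 2. low top-class probability = high uncertainty
query_idx = np.argsort(uncertainty)[-10:]         #    hand these 10 instances to the expert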
Other strategies include labeling the instances that would result in the largest model change, the largest drop in validation error, or the instances on which different models disagree.
Before moving on to Gaussian mixture models, we will take a look at DBSCAN, which uses local density estimation to identify clusters of arbitrary shapes.
DBSCAN
This algorithm defines clusters as continuous regions of high density. Here is how it works:
For each instance, we count how many instances are located within a small distance ϵ from it
(This region is called the ϵ -neighborhood). If an instance has ≥ min_samples instances in its ϵ -
neighborhood, then it is considered a core instance. Core instances are those that are located in
dense regions.
All instances in the neighborhood of a core instance belong to the same cluster. This
neighborhood may include other core instances. Therefore, a long sequence of neighboring core
instances forms a single cluster.
Any instance that is not a core instance and does not have one in its neighborhood is considered
an anomaly.
DBSCAN works well if all the clusters are dense enough and if they are well separated by low-density regions.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05)
X.shape, y.shape
dbscan = DBSCAN(eps=0.05, min_samples=5)  # construction missing in the original; eps value assumed
dbscan.fit(X)
dbscan.labels_[:5]  # statement reconstructed to match the output below
array([0, 0, 1, 0, 2])
len(dbscan.core_sample_indices_)
792
plt.figure(figsize=(12, 8))
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.axis('off')
plt.show()
Let's increase ϵ, refit, and replot:
dbscan = DBSCAN(eps=0.2, min_samples=5)  # a larger eps is assumed; the exact value was not shown
dbscan.fit(X)
plt.figure(figsize=(12, 8))
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.axis('off')
plt.show()
DBSCAN doesn't have a predict() method for new instances; instead, it lets us train a classifier of our choice on its output to classify new instances.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=50, p=2,
                     weights='uniform')
X_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])  # X_new not defined in the original; illustrative points
knn.predict(X_new)
array([1, 0, 1, 0])
knn.predict_proba(X_new)
array([[0.22, 0.78],
[1. , 0. ],
[0.18, 0.82],
[1. , 0. ]])
Note that we only trained the classifier on the core instances, but we could also have chosen to train it on all instances, or on all but the anomalies; it depends on the final task.
Because we didn't train our classifier on anomalies, any new instance will be put into a cluster. It is fairly straightforward to introduce a maximum distance, in which case the two instances that are far away from both clusters are classified as anomalies:
y_dist, y_pred_idx = knn.kneighbors(X_new, n_neighbors=1)  # the lookup and the 0.2 threshold below are assumptions
y_pred = dbscan.labels_[dbscan.core_sample_indices_][y_pred_idx]
y_pred[y_dist > 0.2] = -1
y_pred.ravel()
array([-1, 0, 1, -1])
In short, DBSCAN is a very simple yet powerful algorithm capable of identifying any number of
clusters of any shape. It is robust to outliers and has just two hyper-parameters (eps and
min_samples).
If the densities vary significantly across the clusters, it can be impossible for DBSCAN to capture all of them properly. Its computational complexity is roughly O(m log m), making it pretty close to linear with regard to the number of instances. However, its sklearn implementation can require up to O(m²) memory if eps is large.
Gaussian Mixtures
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid (a bell curve in 1D, an ellipse in 2D, an ellipsoid in 3D). Each cluster can have a different ellipsoidal shape, size, density, and orientation.
There are several GMM variants. In its simplest form (the one implemented in sklearn), the algorithm requires the number of Gaussian distributions to be specified in advance.
To give some context, let's consider a dataset X that is assumed to have been generated through the following probabilistic process. Here is a graphical model that represents the structure of the conditional dependencies between the random variables:
For each instance, a cluster is picked randomly among the k clusters. The probability of choosing the j-th cluster is defined by the cluster's weight ϕ^(j). The index of the cluster chosen for the i-th instance is noted z^(i). Then x^(i), the location of the instance, is sampled randomly from the Gaussian distribution with mean μ^(j) and covariance matrix Σ^(j); that is, x^(i) ∼ N(μ^(j), Σ^(j)).
We describe the different symbols in the figure above:
• squares: constants.
• rectangular containers: plates; their content is repeated several times.
• m, k: how many times the plate's content is repeated.
• solid arrow: a conditional dependency.
• squiggly arrow: a switch.
Given the dataset X, we typically want to start by estimating the weights ϕ^(1)…ϕ^(k) and all the distribution parameters μ^(1)…μ^(k) and Σ^(1)…Σ^(k).
# A new blob-like 2D dataset X is assumed to have been generated here (its generation code is not shown).
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], s=2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=5, n_init=10)
gm.fit(X)
GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
                means_init=None, n_components=5, n_init=10,
                precisions_init=None, random_state=None, reg_covar=1e-06,
                tol=0.001, verbose=0, verbose_interval=10, warm_start=False,
                weights_init=None)
gm.weights_
gm.means_
array([[-5.43170033, -4.52582333],
[ 6.95589923, -4.83376105],
[-0.05573972, 3.97562412],
[-8.84512929, -7.56458988],
[-3.12203312, -6.46201983]])
gm.covariances_
array([...,
       [[ 1.09591031, -0.066672  ],
        [-0.066672  ,  1.01747048]],

       [[ 0.2399561 , -0.03119187],
        [-0.03119187,  0.2720167 ]],

       [[ 0.31072852, -0.01550297],
        [-0.01550297,  0.25520011]],

       [[ 0.2652748 ,  0.0243359 ],
        [ 0.0243359 ,  0.28724611]]])
This class relies on the expectation-maximization (EM) algorithm, which has many similarities with the K-Means algorithm.
We can think of EM as a generalization of K-Means that not only finds the cluster centers (μ^(1)…μ^(k)), but also their size, shape, and orientation (Σ^(1)…Σ^(k)) as well as their relative sampling weights (ϕ^(1)…ϕ^(k)).
Unlike K-Means, though, EM uses soft cluster assignments (probabilities), not hard assignments.
During the expectation step, EM estimates the probability that each instance belongs to each cluster; these probabilities are called the responsibilities of the clusters for the instances. During the maximization step, each cluster is updated using all the instances in the dataset, weighted by these responsibilities.
We can check if the algorithm converged & how many iterations it took like this:
gm.converged_
True
gm.n_iter_
Let's visualize hard vs. soft clustering with our trained model:
gm.predict(X)[:15]
array([1, 1, 1, 0, 3, 1, 1, 1, 2, 1, 1, 3, 2, 1, 2])
np.round(gm.predict_proba(X)[:15], 5)
A Gaussian mixture model is a generative model, meaning we can sample new instances from its learned distribution:
X_new, y_new = gm.sample(6)  # the sampling call was missing in the original; 6 samples matches the output below
X_new
y_new
array([1, 3, 3, 4, 4, 4])
It is also possible to estimate the density of the model at any given location. The greater the
score, the higher the density:
gm.score_samples(X)[:10]
If we calculate the exponential of these scores, we will get the value of the PDF at the location of
the given instances.
The results aren't probabilities but probability densities. If we want the probability that an instance falls within a region, we have to integrate the PDF over that region, as in the sketch below.
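For instance, here is a rough sketch (assuming the 2D gm model fitted above) that approximates the probability of falling inside the square [−1, 1] × [−1, 1] with a Riemann sum over a fine grid:
resolution = 200
grid_x, grid_y = np.meshgrid(np.linspace(-1, 1, resolution), np.linspace(-1, 1, resolution))
pdf_values = np.exp(gm.score_samples(np.c_[grid_x.ravel(), grid_y.ravel()]))  # densities on the grid
cell_area = (2 / resolution) ** 2                                             # area of one grid cell
probability = pdf_values.sum() * cell_area                                    # approximate integral of the PDF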
Z = np.exp(gm.score_samples(np.c_[xx.ravel(), yy.ravel()]))
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 8))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k', alpha=0.1)
plt.show()
Nice, the algorithm clearly found an excellent solution. However, it was easy because we generated the data from actual Gaussian distributions and gave EM the correct number of clusters. When there are many dimensions, or many clusters, or few instances, EM can struggle to converge to the optimal solution. In that case, we need to regularize it. One way to do this is to limit the number of shapes and orientations the distributions can take, by constraining the space of possible covariance matrices.
We may want to set the covariance_type hyperparameter to one of the following values (a usage sketch follows this list):
• "spherical": all clusters must be spherical, but they can have different diameters (different variances).
• "diag": clusters can take any ellipsoidal shape of any size, but the ellipsoid's axes must be parallel to the coordinate axes (the covariance matrices must be diagonal).
• "tied": all clusters must share the same ellipsoidal shape, size, and orientation (the same covariance matrix).
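A minimal usage sketch (the option strings are sklearn's; the rest mirrors the model fitted earlier):
gm_spherical = GaussianMixture(n_components=5, n_init=10, covariance_type="spherical").fit(X)
gm_diag = GaussianMixture(n_components=5, n_init=10, covariance_type="diag").fit(X)
gm_tied = GaussianMixture(n_components=5, n_init=10, covariance_type="tied").fit(X)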
Gaussian mixtures can also be used for anomaly detection. Typical applications include:
• Fraud detection.
• Detecting defective products in manufacturing.
• Removing outliers from a dataset before training a model, which can significantly improve the performance of the resulting model.
In the context of a GMM, any instance located in a low-density region can be considered an anomaly, but we must define what density threshold we want to use.
Here is how we would identify the outliers, using the 4th-percentile lowest density as the threshold:
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4)
anomalies = X[densities < density_threshold]  # the selection line was missing in the original
anomalies.shape
(40, 2)
A closely related task is novelty detection. It differs from anomaly detection in that the algorithm is assumed to be trained on a clean dataset, so the unusual instances it later flags are rare novelties rather than outliers that contaminated the training data.
Just like KMeans, the GaussianMixture algorithm requires us to specify the number of
clusters. So how can we find it?
One approach is to minimize a theoretical information criterion, such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC):
$$BIC = \log(m)\,p - 2\log(\hat{L})$$
$$AIC = 2p - 2\log(\hat{L})$$
where m is the number of instances, p is the number of parameters learned by the model, and L̂ is the maximized value of the likelihood function of the model.
BIC and AIC both penalize models that have more parameters to learn (e.g., more clusters) and reward models that fit the data well. They often end up selecting the same model; when they differ, the model selected by BIC tends to be simpler than the one selected by AIC, but it tends not to fit the data quite as well.
The difference between probability and likelihood: given a statistical model M with parameters θ, the probability of an outcome x given the parameters θ is P_θ(X = x), while the likelihood of θ = ϕ given X = x is P_{X=x}(θ = ϕ).
So:
$$\hat{L}=\max_{\phi}\{P_{X=x}(\theta=\phi)\} \\ \hat{L}=\max_{\phi}f(\theta=\phi;X)$$
The PDF is a function of x while θ is fixed, and the likelihood is a function of θ while x is fixed. It
is important to understand that the likelihood function is not a probability distribution.
Given a dataset X , a common task is to try to estimate the most likely values for the model
parameters, to do this, we must find θ that maximizes the likelihood function, given X .
Once we find the θ̂ that maximizes the likelihood function, we compute L̂ = L(θ̂, X), which is what we need for AIC and BIC.
Let's do it in sklearn:
gm.bic(X)
7515.986587277572
gm.aic(X)
7373.66168418709
The following figure (which can be produced with the sketch below) shows the values of BIC and AIC for different numbers of clusters. As we can see, AIC and BIC are lowest when k = 3, so it is most likely the best choice.
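A sketch of how such a comparison could be computed; the range of candidate k values is an assumption:
ks = range(1, 11)
bics, aics = [], []
for k in ks:
    gm_k = GaussianMixture(n_components=k, n_init=10).fit(X)
    bics.append(gm_k.bic(X))
    aics.append(gm_k.aic(X))
plt.plot(ks, bics, label='BIC')
plt.plot(ks, aics, label='AIC')
plt.xlabel('k')
plt.legend()
plt.show()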
Rather than searching for the number of clusters manually, we can use the BayesianGaussianMixture class, which is capable of giving weights equal (or close) to zero to unnecessary clusters:
from sklearn.mixture import BayesianGaussianMixture

bgm = BayesianGaussianMixture(n_components=10, n_init=10)  # construction not shown in the original; matches the repr below
bgm.fit(X)
BayesianGaussianMixture(covariance_prior=None, covariance_type='full',
                        degrees_of_freedom_prior=None, init_params='kmeans',
                        max_iter=100, mean_precision_prior=None,
                        mean_prior=None, n_components=10, n_init=10,
                        random_state=None, reg_covar=1e-06, tol=0.001,
                        verbose=0, verbose_interval=10, warm_start=False,
                        weight_concentration_prior=None,
                        weight_concentration_prior_type='dirichlet_process')
np.round(bgm.weights_, 2)
In this model, the cluster parameters (weights, means, covariance matrices) are not treated as fixed model parameters anymore, but as latent random variables, like the cluster assignments. So z now includes both the cluster parameters and the cluster assignments.
The Beta distribution is commonly used to model random variables whose values lie within a fixed range; in this case, the range is from 0 to 1.
The stick-breaking process (SBP) is a good model for datasets where new instances are more likely to join large clusters than small clusters, e.g., people are more likely to move to larger cities.
If the concentration α is high, then the sampled ϕ values will likely be close to 0 and the SBP will generate more clusters (a small sketch of the process follows). The Wishart distribution is used to sample covariance matrices; its parameters d and V control the distribution of cluster shapes.
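For intuition only, here is a small NumPy sketch of stick-breaking weights (not from the original notes): each weight is a Beta(1, α) fraction of whatever is left of the stick.
def stick_breaking_weights(n_clusters, alpha, seed=42):
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_clusters)                    # fractions broken off the stick
    remaining = np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))  # what is left before each break
    return betas * remaining                                         # cluster weights (sum to at most 1)

stick_breaking_weights(10, alpha=0.1)  # low concentration: most of the mass on a few clusters
stick_breaking_weights(10, alpha=5.0)  # high concentration: mass spread over many clusters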
Prior knowledge about the latent variables z can be encoded in a probability distribution p ( z )
called the prior. For example, we may have a prior belief that the clusters are likely to be few
(Low concentration).
Bayes' theorem tells us how to update the probability distribution over the latent variables after we observe some data X: it computes the posterior distribution p(z|X), which is the conditional probability of z given X.
$$p(z \mid X) = \frac{p(X \mid z)\, p(z)}{p(X)}$$
The evidence p(X) is often intractable:
$$p(X) = \int p(X \mid z)\, p(z)\, dz$$
This intractability is one of the central problems in Bayesian statistics, and there are several approaches to solving it. One of them is variational inference, which picks a family of distributions q(z; λ) (where λ is the variational parameter) and then optimizes these parameters to make q(z) ≈ p(z|X). This is achieved by finding the λ that minimizes the KL divergence from q(z) to p(z|X), noted D_KL(q ‖ p).
The KL divergence can be rewritten as the log of the evidence minus the evidence lower bound (ELBO):
$$D_{KL}(q \parallel p) = \log p(X) - \mathrm{ELBO}$$
Since the log of the evidence doesn't depend on q, minimizing the KL divergence is equivalent to maximizing the ELBO. In practice, there are different techniques for maximizing the ELBO.
In mean-field variational inference, it is necessary to pick the family of distributions q(z; λ) and the prior p(z) very carefully to ensure that the equation for the ELBO simplifies to a form that can be computed; unfortunately, there is no general way to do this.
Another approach to maximizing the ELBO is called black box stochastic variational inference (BBSVI): at each iteration, a few samples are drawn from q and used to estimate the gradients of the ELBO with regard to the variational parameters λ, which are then used in a gradient ascent step.
This approach makes it possible to use Bayesian inference with any differentiable system, even deep neural networks (this is called Bayesian deep learning).
Now let's take a look at other algorithms capable of dealing with arbitrary cluster shapes.
Clustering is the process of uncovering groups (or clusters) of similar instances within unlabeled data. It may use distance or density measures.
3. Describe two techniques to select the right number of clusters when using K-Means.
Plotting the silhouette score as a function of the number of clusters and choosing the number that maximizes the score.
The elbow rule: plotting the inertia as a function of the number of clusters and choosing the value beyond which the inertia stops decreasing rapidly.
4. What is label propagation? Why would you implement it, and how?
Label propagation is useful in the context of semi-supervised learning, where we have a few labeled points and a lot of unlabeled instances, and we want to propagate the labels from the annotated samples to the unlabeled ones.
We would want to implement it if we have a downstream supervised learning task that needs more labeled data.
A simple strategy is distance-based propagation: first train an unsupervised clustering algorithm, then propagate each annotated instance's label to the unlabeled instances that fall in the same cluster.
5. Can you name two clustering algorithms that can scale to large datasets? And two that look for regions of high density?
K-Means (if the dataset has a clustering structure) and BIRCH scale well to large datasets. DBSCAN looks for regions of high density, and it also scales reasonably well if ϵ is small enough.
6. Can you think of a use case where active learning would be useful? How would you implement it?
It would be useful when we have a small labeled dataset and a lot of unlabeled instances. First we train the model on the available labeled instances, then we predict on the unlabeled instances and give the annotator the instances the model is most unsure about. We loop until the increase in performance is no longer noticeable (this is uncertainty sampling).
Anomaly Detection: detecting instances that deviate strongly from the rest of the data; the training set itself may contain such outliers.
Novelty Detection: the algorithm is assumed to be trained on a clean dataset, and the task is to detect new instances that differ from everything seen during training.
In our case, it was used to uncover clusters in unlabeled data assuming that all of the clusters
correspond to gaussian distributions.
9. Can you find two techniques to find the right number of clusters when using a Gaussian mixture model?
Plotting the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, then choosing the number that minimizes either.
Using a Bayesian Gaussian mixture model to avoid the search altogether, relying on priors instead.
10. The classic Olivetti faces dataset contains 400 gray-scale 64 × 64-pixel images of faces. Each image is flattened to a 1D vector of size 4,096. 40 different people were photographed (10 times each), and the usual task is to train a model to predict which person is represented in each picture. Load the dataset using the sklearn.datasets.fetch_olivetti_faces() function, then split it into a training set, a validation set, and a test set. Since the dataset is quite small, you probably want to use stratified sampling to ensure that there are the same number of images per person in each set.
of = datasets.fetch_olivetti_faces()
X, y = of.data, of.target
X.shape, y.shape
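A minimal sketch of a stratified split; the proportions are assumptions chosen to match the set sizes that appear later in these notes (225 training, 75 validation, and 100 test images):
from sklearn.model_selection import train_test_split

X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, stratify=y_train_full, random_state=42)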
Cluster the images using KMeans, and make sure that you have a good number of clusters. Visualize the clusters: do you see similar faces in each cluster?
kms = KMeans()
kms.fit(X_train)

inertias = list()
for k in range(1, 20):
    kms = KMeans(n_clusters=k)
    kms.fit(X_train)
    inertias.append(kms.inertia_)
from sklearn.manifold import TSNE

tsne = TSNE()
X_train_reduced = tsne.fit_transform(X_train)
X_train_reduced.shape
(225, 2)
plt.scatter(X_train_reduced[:, 0], X_train_reduced[:, 1], c=kms.labels_, s=10)  # scatter call missing in the original
plt.axis('off')
plt.show()
11. Continuing with the Olivetti faces dataset, train a classifier to predict which person is
represented in each picture and evaluate it on the validation set
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)
rf.score(X_val, y_val)
0.88
Next, use KMeans as a dimensionality reduction tool, and train a classifier on the reduced set. Search for the number of clusters that allows the classifier to get the best performance: what performance can you reach? (A sketch of the search loop follows.)
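A hedged sketch of a search loop producing the clusters and val_accs arrays plotted below; the candidate range and the choice of a random forest on the distance features are assumptions:
clusters, val_accs = [], []
for k in range(2, 60):
    kms = KMeans(n_clusters=k).fit(X_train)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(kms.transform(X_train), y_train)               # distances to the k centroids as features
    clusters.append(k)
    val_accs.append(clf.score(kms.transform(X_val), y_val))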
plt.plot(clusters, val_accs)
plt.xlabel('clusters: k')
plt.ylabel('validation accuracy')
plt.show()
What if you append the features from the reduced set to the original features? (A sketch follows.)
26 to 32 clusters seems good (but we need bigger validation sets), with >= 91% accuracy.
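A sketch for the appended-features variant (same assumptions as above); the cluster distances are simply concatenated to the original pixels:
clusters, val_accs = [], []
for k in range(2, 60):
    kms = KMeans(n_clusters=k).fit(X_train)
    X_train_ext = np.c_[X_train, kms.transform(X_train)]   # original pixels + cluster distances
    X_val_ext = np.c_[X_val, kms.transform(X_val)]
    clf = RandomForestClassifier(n_estimators=100).fit(X_train_ext, y_train)
    clusters.append(k)
    val_accs.append(clf.score(X_val_ext, y_val))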
plt.plot(clusters, val_accs)
plt.xlabel('clusters: k')
plt.ylabel('validation accuracy')
plt.show()
12. Train a Gaussian mixture model on the Olivetti faces dataset. To speed up the algorithm, you should probably reduce the dataset's dimensionality (use PCA, preserving 99% of the variance).
from sklearn.decomposition import PCA

pca = PCA(n_components=.99)
X_train_reduced = pca.fit_transform(X_train)
gm = GaussianMixture(n_components=25)
gm.fit(X_train_reduced)
GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
                means_init=None, n_components=25, n_init=1,
                precisions_init=None, random_state=None, reg_covar=1e-06,
                tol=0.001, verbose=0, verbose_interval=10, warm_start=False,
                weights_init=None)
Use the model to generate some new faces and visualize them
generated_samples, _ = gm.sample(n_samples=100)
generated_faces = pca.inverse_transform(generated_samples)
tsne = TSNE()
generated_faces_reduced = tsne.fit_transform(generated_faces)
generated_faces_reduced.shape
(100, 2)
plt.scatter(generated_faces_reduced[:, 0], generated_faces_reduced[:, 1], s=10)  # plotting call missing in the original
plt.axis('off')
plt.show()
13. Some dimensionality reduction techniques can also be used for anomaly detection. For
example, take the Olivetti dataset and reduce it with PCA, preserving 99% of the variance.
import seaborn as sns

pca = PCA(n_components=.99)
X_train_reduced = pca.fit_transform(X_train)
X_train_reconstructed = pca.inverse_transform(X_train_reduced)
# The error computation was missing in the original; the per-image mean squared reconstruction error is assumed.
X_train_reconstruction_error = np.mean((X_train - X_train_reconstructed) ** 2, axis=1)
sns.distplot(X_train_reconstruction_error)
<matplotlib.axes._subplots.AxesSubplot at 0x1a44766e10>
anomaly_indices = X_train_reconstruction_error.argsort()[-5:][::-1]
tsne = TSNE()
X_train_reduced = tsne.fit_transform(X_train)
plt.scatter(X_train_reduced[:, 0], X_train_reduced[:, 1], s=10)  # scatter call missing in the original
plt.axis('off')
plt.show()