Machine Learning for Humans, Part 3: Unsupervised Learning
by Vishal Maini
The two unsupervised learning tasks we will explore are clustering the data
into groups by similarity and reducing dimensionality to compress the data
while maintaining its structure and usefulness.
Clustering
An interesting example of clustering in the real world is marketing data
provider Acxiom’s life stage clustering system, Personicx. This service
segments U.S. households into 70 distinct clusters within 21 life stage groups
that are used by advertisers when targeting Facebook ads, display ads, direct
mail campaigns, etc.
Their white paper reveals that they used centroid clustering and principal
component analysis, both of which are techniques covered in this section.
You can imagine how having access to these clusters is extremely useful for
advertisers who want to (1) understand their existing customer base and (2)
use their ad spend effectively by targeting potential new customers with
relevant demographics, interests, and lifestyles.
You can actually find out which cluster you personally would belong to by answering a few simple questions in
Acxiom’s “What’s My Cluster?” tool.
k-means clustering
“And k rings were given to the race of Centroids, who above all else, desire power.”
The goal of clustering is to create groups of data points such that points in
different clusters are dissimilar while points within a cluster are similar.
With k-means clustering, we want to cluster our data points into k groups. A
larger k creates smaller groups with more granularity, while a lower k means
larger groups and less granularity.
The output of the algorithm would be a set of “labels” assigning each data
point to one of the k groups. In k-means clustering, the way these groups are
defined is by creating a centroid for each group. The centroids are like the
heart of the cluster: they “capture” the points closest to them and add them
to the cluster.
Think of these as the people who show up at a party and soon become the
centers of attention because they’re so magnetic. If there’s just one of them,
everyone will gather around; if there are lots, many smaller centers of
activity will form.
That, in short, is how k-means clustering works! Check out this visualization
of the algorithm — read it like a comic book. Each point in the plane is
colored according to the centroid it is closest to at each moment. You’ll
notice that the centroids (the larger blue, red, and green circles) start
randomly and then quickly adjust to capture their respective clusters.
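If you’d like to see that assign-then-update loop spelled out, here is a minimal from-scratch sketch in Python with NumPy (my own illustration, not code from the visualization): each point is assigned to its nearest centroid, each centroid then moves to the mean of the points it captured, and the two steps repeat until the centroids stop moving.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal k-means: alternate between assigning points to their
    nearest centroid and moving each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    # Start the centroids at k randomly chosen data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with the index of the closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points it "captured".
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids
```

Real implementations (for example, scikit-learn’s KMeans) add smarter initialization such as k-means++, but the core loop is the same.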
Another real-life application of k-means clustering is classifying handwritten
digits. Suppose we represent each image of a digit as a long vector of pixel
brightnesses. Let’s say the images are black and white and are 64x64 pixels.
Each pixel represents a dimension. So the world these images live in has
64x64=4,096 dimensions. In this 4,096-dimensional world, k-means
clustering allows us to group the images that are close together and assume
they represent the same digit, which can achieve pretty good results for digit
recognition.
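As a hedged sketch of that digit-clustering idea, here is what it might look like with scikit-learn. Note that scikit-learn’s built-in digits dataset uses 8x8 images (64 dimensions) rather than 64x64, but the principle is identical; the dataset choice and the majority-vote evaluation are my own assumptions, not part of the original.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# 8x8 grayscale digits: each image is a point in a 64-dimensional pixel space.
digits = load_digits()
X, y = digits.data, digits.target          # X has shape (1797, 64)

# Group the images into 10 clusters purely by pixel similarity (no labels used).
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# The clusters are unlabeled, so assign each cluster the digit it mostly contains,
# then check how often that guess matches the true label.
predicted = np.zeros_like(y)
for cluster in range(10):
    mask = kmeans.labels_ == cluster
    predicted[mask] = np.bincount(y[mask]).argmax()
print("fraction matching true digit:", (predicted == y).mean())
```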
Hierarchical clustering
“Let’s make a million options become seven options. Or five. Or twenty? Meh, we
can decide later.”
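The quip hints at the key property of hierarchical (agglomerative) clustering: the algorithm builds the full tree of merges (a dendrogram) once, and you choose how many clusters to cut it into afterward. Here is a minimal sketch with SciPy, my own illustration rather than anything from the article:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))

# Build the full merge tree (dendrogram) once...
tree = linkage(points, method="ward")

# ...then "decide later": cut it into 7, 5, or 20 clusters without refitting.
for k in (7, 5, 20):
    labels = fcluster(tree, t=k, criterion="maxclust")
    print(k, "requested ->", len(np.unique(labels)), "clusters found")
```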
Dimensionality reduction
“It is not the daily increase, but the daily decrease. Hack away at the unessential.”
— Bruce Lee
You’re familiar with the coordinate plane with origin O(0,0) and basis vectors
i(1,0) and j(0,1). It turns out you can choose a completely different basis and
still have all the math work out. For example, you can keep O as the origin
and choose the basis vectors i’=(2,1) and j’=(1,2). If you have the patience
for it, you’ll convince yourself that the point labeled (2,2) in the i’, j’
coordinate system is labeled (6, 6) in the i, j system.
Plotted using Mathisfun’s “Interactive Cartesian Coordinates”
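You can also check that claim in a couple of lines of NumPy (a small illustration of my own): stack i’ and j’ as the columns of a change-of-basis matrix and multiply it by the coordinates expressed in the new basis.

```python
import numpy as np

# Columns are the new basis vectors i' = (2, 1) and j' = (1, 2).
B = np.array([[2, 1],
              [1, 2]])

coords_new_basis = np.array([2, 2])      # the point labeled (2, 2) in the i', j' system
coords_standard = B @ coords_new_basis   # 2*i' + 2*j'
print(coords_standard)                   # [6 6]: the same point is (6, 6) in the i, j system
```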
This means we can change the basis of a space. Now imagine much higher-
dimensional space. Like, 50K dimensions. You can select a basis for that
space, and then select only the 200 most significant vectors of that basis.
These basis vectors are called principal components, and the subset you
select constitutes a new space that is smaller in dimensionality than the
original space but maintains as much of the complexity of the data as
possible.
Another way of thinking about this is that PCA remaps the space in which
our data exists to make it more compressible. The transformed dimension is
smaller than the original dimension.
By making use of the first several dimensions of the remapped space only,
we can start gaining an understanding of the dataset’s organization. This is
the promise of dimensionality reduction: reduce complexity (dimensionality
in this case) while maintaining structure (variance). Here’s a fun paper
Samer wrote on using PCA (and diffusion mapping, another technique) to try
to make sense of the Wikileaks cable release.
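As a hedged sketch of what “select only the most significant vectors” looks like in code, here is scikit-learn’s PCA applied to the digits dataset from earlier; the choice of dataset and of keeping 10 components is mine, purely for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                 # 1797 images, each a 64-dimensional pixel vector

# Keep only the 10 most significant directions (principal components).
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)           # shape (1797, 10): same images, far fewer dimensions

# Fraction of the original variance (structure) those 10 components retain:
print(pca.explained_variance_ratio_.sum())
```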
To examine what that means more precisely, let’s work with this image of a
dog:
We’ll use the code written in Andrew Gibiansky’s post on SVD. First, we show
that if we rank the singular values (the values of the matrix Σ) by magnitude,
the first 50 singular values contain 85% of the magnitude of the whole matrix
Σ.
We can use this fact to discard the remaining 225 values of sigma (i.e., set them to
0) and just keep a “rank 50” version of the image of the dog. Here, we create a
rank 200, 100, 50, 30, 20, 10, and 3 dog. Obviously, the lower-rank pictures lose some detail, but
let’s agree that the rank 30 dog is still good. Now let’s see how much
compression we achieve with this dog. The original image matrix is 305*275
= 83,875 values. The rank 30 dog is 305*30+30+30*275=17,430 — almost 5
times fewer values with very little loss in image quality. The reason for the
calculation above is that we also discard the parts of the matrix U and V that
get multiplied by zeros when the operation UΣ’Vᵀ is carried out (where Σ’ is
the modified version of Σ that only has the first 30 values in it).
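The rank-k trick can be written in a few lines of NumPy. This is a generic sketch of truncated SVD on any grayscale image array, not the exact code from Gibiansky’s post, and the random array standing in for the dog image is an obvious placeholder.

```python
import numpy as np

def rank_k_approximation(image, k):
    """Keep only the k largest singular values of a grayscale image matrix."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    # Discard everything past the first k singular values (and the matching
    # columns of U / rows of Vt that would only be multiplied by zeros anyway).
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# For a 305x275 image and k = 30, storage drops from 305*275 = 83,875 values
# to 305*30 + 30 + 30*275 = 17,430 values (the kept columns of U, the kept
# singular values, and the kept rows of Vt).
dog = np.random.rand(305, 275)        # stand-in for the dog image's pixel matrix
dog_rank_30 = rank_k_approximation(dog, 30)
```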
Unsupervised learning is often used to preprocess the data. Usually, that
means compressing it in some meaning-preserving way like with PCA or
SVD before feeding it to a deep neural net or another supervised learning
algorithm.
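In practice that preprocessing step often looks like a pipeline that compresses the inputs with PCA before handing them to a supervised model. A hedged sketch with scikit-learn (the dataset, component count, and classifier are my choices for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unsupervised compression (PCA) feeding a supervised classifier.
model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```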
Onwards!
Now that you’ve finished this section, you’ve earned an awful, horrible,
never-to-be-mentioned-again joke about unsupervised learning. Here goes…
Person-in-joke-#2: Y? there’s no Y.
3a — k-means clustering
Play around with this clustering visualization to build intuition for how the
algorithm works. Then, take a look at this implementation of k-means clustering
for handwritten digits and the associated tutorial.
3b — SVD
For a good reference on SVD, look no further than Andrew Gibiansky’s post.
On Twitter? So are we. Feel free to keep in touch — Vishal and Samer 🙌🏽