NB 14
• Lloyd's Algorithm
• Start with an initial guess for the centroids.
• Data points are assigned to the closest centroid (each point x joins its nearest cluster).
• The centroid of each cluster is recalculated.
• The assignment and recalculation steps are repeated until convergence to a local minimum.
Clustering via k-means
We previously studied the classification problem using the logistic regression algorithm. Since we had labels for each data point, we may regard
that problem as one of supervised learning. However, in many applications, the data have no labels but we wish to discover possible labels (or other hidden
structures). This problem is one of unsupervised learning. How can we approach such problems?
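Clustering is one class of unsupervised learning methods. In this lab, we'll consider the following form of the clustering task. Suppose you are
given a set of $m$ data points, $x_0, \ldots, x_{m-1}$, each of dimension $d$, and a target number of clusters, $k$. The task is to partition the points into $k$ clusters, $C_0, \ldots, C_{k-1}$, which are
disjoint, i.e., $i \neq j \implies C_i \cap C_j = \emptyset$;
but also complete, i.e., $C_0 \cup C_1 \cup \cdots \cup C_{k-1} = \{x_0, \ldots, x_{m-1}\}$.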
Intuitively, each cluster should reflect some "sensible" grouping. Thus, we need to specify what constitutes such a grouping.
Setup: Dataset
The following cell will download the data you'll need for this lab. Run it now.
import os

def on_vocareum():
    return os.path.exists('.voc')

if on_vocareum():
    URL_BASE = "https://cse6040.gatech.edu/datasets/kmeans/"
    DATA_PATH = "../resource/asnlib/publicdata/"
else:
    URL_BASE = "https://github.com/cse6040/labs-fa17/raw/master/datasets/kmeans/"
    DATA_PATH = ""
'logreg_points_train.csv' is ready!
'y_test3.npy' is ready!
'compute_d2_soln.npy' is ready!
'assign_cluster_labels_soln.npy' is ready!
'centers_test3_soln.npy' is ready!
'assign_cluster_labels_S.npy' is ready!
'centers_initial_testing.npy' is ready!
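The within-cluster sum of squares (WCSS) of a clustering $C = \{C_0, \ldots, C_{k-1}\}$ is
$$\mathrm{WCSS}(C) \equiv \sum_{j=0}^{k-1} \sum_{x \in C_j} \|x - \mu_j\|^2,$$
where $\mu_j$ is the center of $C_j$. This center may be computed simply as the mean of all points in $C_j$, i.e.,
$$\mu_j \equiv \frac{1}{|C_j|} \sum_{x \in C_j} x.$$
Then, our objective is to find the "best" clustering, $C_*$, which is the one that has a minimum WCSS.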
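Step 1: Assignment. Given a fixed set of $k$ centers, $\mu_0, \ldots, \mu_{k-1}$, assign each point $x_i$ to the nearest center:
$$y_i \equiv \arg\min_{j} \|x_i - \mu_j\|^2.$$
Step 2: Update. Recompute the $k$ centers ("centroids") by averaging all the data points belonging to each cluster, i.e., taking their mean:
$$\mu_j \equiv \frac{1}{|C_j|} \sum_{x \in C_j} x, \quad \text{where } C_j = \{x_i : y_i = j\}.$$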
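The assignment step needs the squared distance from every point to every center. The following is a minimal sketch of one way to compute it with NumPy broadcasting. The name compute_d2 is the one this lab uses later, but the body here is an assumption, not the reference solution, and the tiny arrays are made up for illustration.

```python
import numpy as np

def compute_d2(X, centers):
    """Return S where S[i, j] is the squared distance from X[i] to centers[j]."""
    # Broadcasting: (m, 1, d) - (1, k, d) -> (m, k, d)
    diff = X[:, np.newaxis, :] - centers[np.newaxis, :, :]
    return (diff ** 2).sum(axis=2)

# Illustrative data: three 2-D points and two centers
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 3.0]])
centers = np.array([[0.0, 0.0], [4.0, 0.0]])

S = compute_d2(X, centers)   # shape (3, 2)
labels = S.argmin(axis=1)    # index of the nearest center for each point
print(S)
print(labels)
```

Broadcasting materializes an m-by-k-by-d intermediate array, which is fine for small inputs but may be memory-hungry for large ones.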
In the code that follows, it will be convenient to use our usual "data matrix" convention, that is, each row of a data matrix is one of $m$ observations and
each column (coordinate) is one of $d$ predictors. However, we will not need a dummy column of ones since we are not fitting a function.
%matplotlib inline
We will use the following data set which some of you may have seen previously.
In [3]: df = pd.read_csv('{}logreg_points_train.csv'.format(DATA_PATH))
df.head()
Out[3]:
x_1 x_2 label
0 -0.234443 -1.075960 1
1 0.730359 -0.918093 0
2 1.432270 -0.439449 0
3 0.026733 1.050300 0
4 1.879650 0.207743 0
a == [0, 0, 1, 1, 0, 1, 1]
b == [1, 1, 0, 0, 1, 0, 0]
In [5]: make_scatter_plot(df)
Let's extract the data points as a data matrix, points, and the labels as a vector, labels. Note that the k-means algorithm you will implement should not
reference labels -- that's the solution we will try to predict given only the point coordinates (points) and target number of clusters (k).
Note that the labels should not be used in the k-means algorithm. We use them here only as ground truth for later verification.
Exercise 1 (2 points). Complete the following function, init_centers(X, k), so that it randomly selects k of the given observations to serve as centers. It
should return a Numpy array of size k-by-d, where d is the number of columns of X.
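One possible approach, shown as a sketch rather than the reference solution, is to draw k distinct row indices and slice the data matrix with them. The seed argument is a hypothetical addition used only to make the demo reproducible, and X_demo is made-up data.

```python
import numpy as np

def init_centers(X, k, seed=None):
    # `seed` is a hypothetical extra argument, not part of the exercise's signature.
    rng = np.random.default_rng(seed)
    # Choose k distinct row indices so that no observation is picked twice.
    rows = rng.choice(X.shape[0], size=k, replace=False)
    return X[rows, :]

# Illustrative data: 10 observations with 2 coordinates each
X_demo = np.arange(20, dtype=float).reshape(10, 2)
centers_demo = init_centers(X_demo, 3, seed=0)
print(centers_demo)
```

Sampling without replacement avoids starting with two identical centers, which would leave one cluster empty after the first assignment.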
centers_initial = init_centers(points, k)
print("Initial centers:\n", centers_initial)
assert type(centers_initial) is np.ndarray, "Your function should return a Numpy array instead of a {}".format(type(centers_initial))
assert centers_initial.shape == (k, d), "Returned centers do not have the right shape ({} x {})".format(k, d)
assert (sum(centers_initial[0, :] == points) == [1, 1]).all(), "The centers must come from the input."
assert (sum(centers_initial[1, :] == points) == [1, 1]).all(), "The centers must come from the input."
print("\n(Passed!)")
Initial centers:
[[ 0.428191 -1.9734 ]
[ 0.75525 2.03587 ]]
(Passed!)
return S
centers_initial_testing = np.load("{}centers_initial_testing.npy".format(DATA_PATH))
compute_d2_soln = np.load("{}compute_d2_soln.npy".format(DATA_PATH))
print("\n(Passed!)")
(Passed!)
Exercise 3 (2 points). Write a function that uses the (squared) distance matrix to assign a "cluster label" to each point.
That is, consider the $m \times k$ squared distance matrix $S$. For each point $i$, if $s_{ij}$ is the minimum squared distance for point $i$, then the index $j$ is $i$'s cluster label.
In other words, your function should return a (column) vector $y$ of length $m$ such that
$$y_i = \arg\min_{j \in \{0, \ldots, k-1\}} s_{ij}.$$
Hint: Judicious use of Numpy's argmin() (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmin.html) makes for a nice one-line solution.
In [11]: def assign_cluster_labels(S):
             ### BEGIN SOLUTION
             return np.argmin(S, axis=1)
             ### END SOLUTION
# Cluster labels: 0 1
S_test1 = np.array([[0.3, 0.2], # --> cluster 1
[0.1, 0.5], # --> cluster 0
[0.4, 0.2]]) # --> cluster 1
y_test1 = assign_cluster_labels(S_test1)
print("You found:", y_test1)
You found: [1 0 1]
S_test2 = np.load("{}assign_cluster_labels_S.npy".format(DATA_PATH))
y_test2_soln = np.load("{}assign_cluster_labels_soln.npy".format(DATA_PATH))
y_test2 = assign_cluster_labels(S_test2)
assert (y_test2 == y_test2_soln).all()
print("\n(Passed!)")
(Passed!)
Exercise 4 (2 points). Given a clustering (i.e., a set of points and assignment of labels), compute the center of each cluster.
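As a sketch of the idea (not the reference solution), the centers can be recomputed by selecting the rows that carry each label and averaging them. This assumes the labels are integers 0 through k-1 and that every cluster has at least one point. X_demo and y_demo are made-up inputs.

```python
import numpy as np

def update_centers(X, y):
    # Assumption: labels are integers 0..k-1 and no cluster is empty.
    k = int(y.max()) + 1
    centers = np.empty((k, X.shape[1]))
    for j in range(k):
        # Mean of all rows of X whose label is j
        centers[j, :] = X[y == j, :].mean(axis=0)
    return centers

X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
y_demo = np.array([0, 0, 1])
centers_demo = update_centers(X_demo, y_demo)
print(centers_demo)
```

If a cluster could become empty, its mean is undefined, so a more defensive version would need a policy for that case, such as keeping the old center.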
y_test3 = np.load("{}y_test3.npy".format(DATA_PATH))
centers_test3_soln = np.load("{}centers_test3_soln.npy".format(DATA_PATH))
centers_test3 = update_centers(points, y_test3)
print("\n(Passed!)")
(Passed!)
Exercise 5 (2 points). Given the squared distances, return the within-cluster sum of squares.
def WCSS(S):
    ...
S = np.array([[0.3, 0.2],
[0.1, 0.5],
[0.4, 0.2]])
# Quick test:
print("S ==\n", S_test1)
WCSS_test1 = WCSS(S_test1)
print("\nWCSS(S) ==", WCSS(S_test1))
S ==
[[0.3 0.2]
[0.1 0.5]
[0.4 0.2]]
WCSS(S) == 0.5
(Passed!)
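One way to obtain the value in the quick test, sketched here under the assumption that S holds squared distances to the current centers: each point contributes the smallest entry in its row, and the WCSS is the sum of those minima.

```python
import numpy as np

def WCSS(S):
    # Each point contributes the squared distance to its nearest center.
    return np.sum(np.min(S, axis=1))

S_demo = np.array([[0.3, 0.2],
                   [0.1, 0.5],
                   [0.4, 0.2]])
wcss_demo = WCSS(S_demo)  # row minima are 0.2, 0.1, 0.2
print(wcss_demo)
```

The row minima sum to 0.5 up to floating-point rounding, which agrees with the output shown for the quick test.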
Lastly, here is a function to check whether the centers have "moved," given two instances of the center values. It accounts for the fact that the order of
centers may have changed.
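A minimal sketch of such a check, assuming exact equality is the intended test: convert each set of centers to a Python set of row tuples, so the comparison ignores row order. The name has_converged matches the call used later in this lab, but the body is an assumption.

```python
import numpy as np

def has_converged(old_centers, centers):
    # Compare as sets of rows so that the order of the centers does not matter.
    return {tuple(c) for c in old_centers} == {tuple(c) for c in centers}

a = np.array([[0.0, 1.0], [2.0, 3.0]])
b = a[::-1].copy()   # same centers, rows swapped
c = a + 0.5          # every center has moved

print(has_converged(a, b))
print(has_converged(a, c))
```

Exact comparison works because Lloyd's updates reproduce identical means once the assignments stop changing. A tolerance-based check such as np.allclose is an alternative, but it would need the rows matched up first.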
Exercise 6 (3 points). Put all of the preceding building blocks together to implement Lloyd's k-means algorithm.
converged = False
labels = np.zeros(len(X))
i = 1
while (not converged) and (i <= max_steps):
    old_centers = centers
    ### BEGIN SOLUTION
    # Given the current set of centers, compute the squared distance from each point in X to each center
    S = compute_d2(X, centers)
    # Assign each point to a cluster
    labels = assign_cluster_labels(S)
    # Given the clusters, recalculate the centroids
    centers = update_centers(X, labels)
    # Check whether the centers have moved
    converged = has_converged(old_centers, centers)
    ### END SOLUTION
    print("iteration", i, "WCSS = ", WCSS(S))
    i += 1
return labels
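To see the whole loop run end to end, here is a compact, self-contained sketch under stated assumptions. It inlines the distance, assignment, and update steps instead of calling the lab's helper functions, takes the initial centers as an argument, and stops when the centers stop changing according to np.allclose. The function name kmeans_sketch and the synthetic two-blob data are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, centers, max_steps=100):
    """Lloyd's iteration from the given initial centers; returns (labels, centers)."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_steps):
        # Assignment: squared distances, shape (m, k), then nearest center per point
        S = ((X[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=2)
        labels = S.argmin(axis=1)
        # Update: mean of each cluster (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic data: two tight, well-separated blobs of 50 points each
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(50, 2))
X_demo = np.vstack([blob_a, blob_b])

# Start from one point in each blob
labels_demo, centers_demo = kmeans_sketch(X_demo, X_demo[[0, 50]].copy())
print(centers_demo)
```

Because Lloyd's algorithm only reaches a local minimum, a less favorable starting guess can give a different clustering on harder data.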
df['clustering'] = clustering
centers = update_centers(points, clustering)
make_scatter_plot(df, hue='clustering', centers=centers)
Applying k-means to an image. In this section of the notebook, you will apply k-means to an image, for the purpose of doing a "stylized recoloring" of it.
(You can view this example as a primitive form of artistic style transfer (http://genekogan.com/works/style-transfer/), which state-of-the-art methods
accomplish using neural networks (https://medium.com/artists-and-machine-intelligence/neural-artistic-style-transfer-a-comprehensive-look-f5…).)
In particular, let's take an input image and cluster pixels based on the similarity of their colors. Maybe it can become the basis of your own Instagram filter
(https://blog.hubspot.com/marketing/instagram-filters)!
# Assumed imports (not shown in this cell of the original)
from PIL import Image
from matplotlib.pyplot import imshow

def read_img(path):
    """
    Read image and store it as an array, given the image path.
    Returns the 3 dimensional image array.
    """
    img = Image.open(path)
    img_arr = np.array(img, dtype='int32')
    img.close()
    return img_arr

def display_image(arr):
    """
    Display the image.
    Input: 3 dimensional array
    """
    arr = arr.astype(dtype='uint8')
    img = Image.fromarray(arr, 'RGB')
    imshow(np.asarray(img))
img_arr = read_img("../resource/asnlib/publicdata/football.bmp")
display_image(img_arr)
print("Shape of the matrix obtained by reading the image")
print(img_arr.shape)
Note that the image is stored as a "3-D" matrix. It is important to understand how matrices help to store an image. Each pixel corresponds to an intensity value
for each of Red, Green and Blue. If you note the properties of the image, its resolution is 620 x 412. The image width is 620 pixels and height is 412 pixels, and each
pixel has three values - R, G, B. This makes it a 412 x 620 x 3 matrix.
Exercise 7 (1 point). Write some code to reshape the matrix into "img_reshaped" by transforming "img_arr" from a "3-D" matrix to a flattened "2-D" matrix,
which has 3 columns corresponding to the RGB values for each pixel. In this form, the flattened matrix must contain all pixels and their corresponding RGB
intensity values. Remember that in the previous modules we discussed a C type indexing style and a Fortran type indexing style. In this problem, use the
C type indexing style. The numpy reshape function may be of help here.
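The reshape can be sketched on a small synthetic array, so that it runs without the image file. Here img_demo is a hypothetical 4 x 5 RGB stand-in for img_arr. With C ordering, the pixel at row i and column j of the image lands in row i * c + j of the flattened matrix.

```python
import numpy as np

# Hypothetical 4 x 5 "image" with 3 channels, standing in for img_arr
img_demo = np.arange(4 * 5 * 3, dtype='int32').reshape(4, 5, 3)

r, c, l = img_demo.shape
# Flatten the two pixel dimensions into one; keep the 3 color columns
img_reshaped_demo = np.reshape(img_demo, (r * c, l), order='C')

print(img_reshaped_demo.shape)
print(img_reshaped_demo[1 * c + 2], img_demo[1, 2])  # same pixel, two views
```

Reshaping back with the same order recovers the original array, which is what makes it possible to display the clustered pixels as an image afterwards.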
Passed
Exercise 8 (1 point). Now use the k-means function that you wrote above to divide the image into 3 clusters. The result would be a vector which assigns a
label to each pixel.
Exercise 9 (2 points). Write code to calculate the mean of each cluster and store it in a dictionary as label:array(cluster_center). For 3 clusters, the dictionary
should have three keys as the labels and their corresponding cluster centers as values, i.e. {0:array(center0), 1: array(center1), 2:array(center2)}.
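A dictionary of this shape can be built with a comprehension over the distinct labels. The sketch below uses four made-up pixels and two clusters. The names pixels_demo, labels_demo, and centers_dict_demo are hypothetical, not the names the notebook's tests expect.

```python
import numpy as np

# Made-up flattened pixels (one RGB row each) and their cluster labels
pixels_demo = np.array([[255, 0, 0],
                        [250, 10, 0],
                        [0, 0, 255],
                        [0, 10, 245]], dtype=float)
labels_demo = np.array([0, 0, 1, 1])

# Map each label to the mean color of the pixels carrying that label
centers_dict_demo = {int(j): pixels_demo[labels_demo == j].mean(axis=0)
                     for j in np.unique(labels_demo)}
print(centers_dict_demo)
```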
In [26]: print("Free points here! But you need to implement the above section correctly for you to see what we want you to see later.")
         print("\nPassed!")
Free points here! But you need to implement the above section correctly for you to see what we want you to see later.

Passed!
Below, we have written code to generate a matrix "img_clustered" of the same dimensions as img_reshaped, where each pixel is replaced by the cluster
center to which it belongs.
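One way such a matrix could be produced, sketched with made-up labels and centers rather than the notebook's own code, is to stack the centers into a lookup table and index it by the label vector.

```python
import numpy as np

# Made-up labels for four pixels and a label -> center dictionary
labels_demo = np.array([0, 1, 1, 0])
centers_dict_demo = {0: np.array([252.5, 5.0, 0.0]),
                     1: np.array([0.0, 5.0, 250.0])}

# Stack the centers into a (k, 3) table ordered by label, then look up each pixel's center
center_table = np.array([centers_dict_demo[j] for j in sorted(centers_dict_demo)])
img_clustered_demo = center_table[labels_demo]
print(img_clustered_demo)
```

Integer-array indexing returns one center row per label, so the result has the same shape as the flattened pixel matrix.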
Let us display the clustered image and see how k-means works on the image.
In [28]: r, c, l = img_arr.shape
img_disp = np.reshape(img_clustered, (r, c, l), order="C")
display_image(img_disp)
You can visually inspect the original image and the clustered image to get a sense of what k-means is doing here. You can also try to vary the number of
clusters to see how the output image changes.
Built-in k-means
The preceding exercises walked you through how to implement k-means, but as you might have imagined, there are existing implementations as well. The
following shows you how to use Scipy's implementation, which should yield similar results. If you are asked to use k-means in a future lab (or exam), you
can use this one.
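As a hedged sketch of what such a call might look like, the example below assumes the scipy.cluster.vq module, whose kmeans function returns a codebook of centers together with a distortion value and whose vq function assigns each observation to its nearest center. The synthetic data and the variable names ending in _demo are made up. Passing an array as the second argument to kmeans uses it as the initial guess, which keeps the example deterministic.

```python
import numpy as np
from scipy.cluster import vq

# Synthetic data: two tight, well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal([0.0, 0.0], 0.1, size=(50, 2)),
                    rng.normal([5.0, 5.0], 0.1, size=(50, 2))])

# An array as the second argument is treated as the initial centers
centers_vq_demo, distortion_vq_demo = vq.kmeans(X_demo, X_demo[[0, 50]].copy())

# Assign each observation to its nearest center
clustering_vq_demo, dists_demo = vq.vq(X_demo, centers_vq_demo)

print("Centers:\n", centers_vq_demo)
print("Distortion:", distortion_vq_demo)
```

SciPy documents the distortion as the mean (non-squared) Euclidean distance between observations and their centers, so its value need not equal the WCSS defined in this lab.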
print("Centers:\n", centers_vq)
print("\nCompare with your method:\n", centers, "\n")
print("Distortion (WCSS):", distortion_vq)
df['clustering_vq'] = clustering_vq
make_scatter_plot(df, hue='clustering_vq', centers=centers_vq)
Centers:
[[-0.37382602 -1.18565619]
[ 0.64980076 0.4667703 ]]
Fin! That marks the end of this notebook. Don't forget to submit it!