Lab6 Instruction
Learning Goal: Practice using an unsupervised machine learning model to cluster image data.
Background: We will use the MNIST dataset for this lab. Please refer to Lab 3 instructions for
information about the MNIST dataset.
b. First, let's load the MNIST dataset and check its size, namely the number of training
images, the number of testing images, the size of each image, and the minimum and
maximum values of the training data.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)
print(x_test.shape)
print(x_train.min())
print(x_train.max())
d. We convert the data to float type and normalize the pixel values from the range 0-255 to
the range 0-1 for computational efficiency. We then check the minimum and maximum
values again after normalization.
# Conversion to float
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalization
x_train = x_train/255.0
x_test = x_test/255.0
# Checking the minimum and maximum values of x_train
print(x_train.min())
print(x_train.max())
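The effect of the conversion and scaling can be checked on a tiny array before running it on the full dataset. The sketch below uses a made-up uint8 array (`raw` is a hypothetical stand-in for x_train, not MNIST data) and also shows why the explicit float32 cast is worth keeping: dividing the uint8 array directly would produce float64, which takes twice the memory.

```python
import numpy as np

# Hypothetical stand-in for x_train: a few uint8 pixel values in 0-255
raw = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)

# Same two steps as in the lab: cast to float32, then scale to 0-1
scaled = raw.astype('float32') / 255.0

# For comparison: dividing the uint8 array directly gives float64
direct = raw / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
print(direct.dtype)
```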
e. The original input data has 3 dimensions: (60000, 28, 28) for the training data and
(10000, 28, 28) for the testing data. The K-means clustering algorithm expects 2-dimensional
input, so we flatten each image into a vector. After reshaping, the training data has shape
(60000, 784) and the testing data (10000, 784), since 28x28=784.
# Reshaping input data
X_train = x_train.reshape(len(x_train),-1)
X_test = x_test.reshape(len(x_test),-1)
# Checking the shape
print(X_train.shape)
print(X_test.shape)
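As a quick illustration of what the reshape does, the sketch below flattens a small synthetic stack of images (`images` is made-up data, not MNIST) and confirms that no pixel values are lost: each 28x28 image becomes one row of 784 values in row-major order, and reshaping back recovers the original array.

```python
import numpy as np

# Synthetic stand-in for x_train: 3 "images" of 28x28 with distinct values
images = np.arange(3 * 28 * 28, dtype=np.float32).reshape(3, 28, 28)

# Same call as in the lab: keep the sample axis, flatten everything else
flat = images.reshape(len(images), -1)
print(flat.shape)

# Row-major flattening: element [row 1, col 0] of an image lands at index 28
print(flat[0, 28] == images[0, 1, 0])

# Reshaping back to (n, 28, 28) recovers the original images
restored = flat.reshape(-1, 28, 28)
print(np.array_equal(restored, images))
```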
f. Now we are ready to apply K-means. First, we define a helper function that maps each
cluster label to the most frequent class label (from y_train) in that cluster. Then we initialize
the K-means model with 10 clusters, using the mini-batch version of K-means (MiniBatchKMeans).
def retrieve_info(cluster_labels,y_train):
    # Map each cluster index to the most frequent true label among its members
    reference_labels = {}
    # Loop over each cluster index found in cluster_labels
    for i in range(len(np.unique(cluster_labels))):
        index = np.where(cluster_labels == i,1,0)
        num = np.bincount(y_train[index==1]).argmax()
        reference_labels[i] = num
    return reference_labels
total_clusters = len(np.unique(y_train))
# Initialize the K-Means model
kmeans = MiniBatchKMeans(n_clusters = total_clusters)
# Fitting the model to training set
kmeans.fit(X_train)
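To see what the helper function computes, it can help to run the same majority-vote idea on a handful of hand-made values. The sketch below is an illustration only: `map_clusters_to_labels` and the toy arrays are hypothetical names and data, not part of the lab code.

```python
import numpy as np

def map_clusters_to_labels(cluster_labels, true_labels):
    """Return {cluster index: most frequent true label in that cluster}."""
    mapping = {}
    for c in np.unique(cluster_labels):
        # True labels of the samples assigned to cluster c
        members = true_labels[cluster_labels == c]
        # bincount counts each label; argmax picks the most frequent one
        mapping[int(c)] = int(np.bincount(members).argmax())
    return mapping

# Toy data (made up for illustration): 8 samples in 3 clusters
toy_clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_truth = np.array([7, 7, 1, 3, 3, 7, 9, 9])

toy_mapping = map_clusters_to_labels(toy_clusters, toy_truth)
print(toy_mapping)
```

Note that nothing forces the mapping to be one-to-one: two clusters may receive the same digit, and some digits may not be assigned to any cluster.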
g. Next, we map each cluster label to a digit label and compare the first 20 of our K-means
predictions with the true labels.
reference_labels = retrieve_info(kmeans.labels_,y_train)
number_labels = np.random.rand(len(kmeans.labels_))
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]
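One possible way to carry out the comparison and turn it into an accuracy score is sketched below on toy values. The arrays `cluster_ids`, `reference`, and `y_true` are hypothetical stand-ins for kmeans.labels_, reference_labels, and y_train. The lookup-array indexing is an alternative to the explicit loop, and accuracy is taken here to be the fraction of mapped labels that match the true labels.

```python
import numpy as np

# Hypothetical stand-ins for kmeans.labels_, reference_labels, and y_train
cluster_ids = np.array([0, 1, 1, 2, 0, 2, 1, 0])
reference = {0: 4, 1: 9, 2: 1}
y_true = np.array([4, 9, 4, 1, 4, 1, 9, 7])

# Build a lookup array so that lookup[c] is the digit assigned to cluster c,
# then translate all cluster ids at once with fancy indexing
lookup = np.array([reference[c] for c in range(len(reference))])
predicted = lookup[cluster_ids]

# Compare the first 20 predictions with the true labels (only 8 exist here)
print(predicted[:20])
print(y_true[:20])

# Accuracy: fraction of samples whose mapped label equals the true label
accuracy = float(np.mean(predicted == y_true))
print(accuracy)
```

On the real data, the same comparison would apply to number_labels and y_train; scikit-learn's accuracy_score(y_train, number_labels) computes the same fraction.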
i. Now let’s increase the number of clusters (the k value) to 50, and check whether the accuracy
improves.
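# Increase to 50 clusters, and fit the model
kmeans = MiniBatchKMeans(n_clusters = 50)
kmeans.fit(X_train)
# Recompute the cluster-to-digit mapping for the 50 new clusters
reference_labels = retrieve_info(kmeans.labels_,y_train)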
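The whole fit, map, and score cycle can be wrapped in a function so that different values of k are evaluated the same way. The sketch below is a minimal version under stated assumptions: it uses scikit-learn's small built-in digits dataset (8x8 images, loaded with load_digits) instead of MNIST so it runs quickly, and `majority_vote_accuracy` is a hypothetical helper name, not part of the lab code. Results on MNIST will differ.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_digits

def majority_vote_accuracy(X, y, k, seed=0):
    """Fit MiniBatchKMeans with k clusters and score the majority-vote mapping."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
    labels = km.fit_predict(X)
    predicted = np.empty_like(y)
    # Assign every sample the most frequent true label of its cluster;
    # iterating over np.unique skips clusters that ended up empty
    for c in np.unique(labels):
        mask = labels == c
        predicted[mask] = np.bincount(y[mask]).argmax()
    return float(np.mean(predicted == y))

# Assumption: the small scikit-learn digits dataset stands in for MNIST
digits = load_digits()
X = digits.data / 16.0  # pixel values in this dataset range from 0 to 16
y = digits.target

results = {k: majority_vote_accuracy(X, y, k) for k in (10, 50)}
for k, acc in results.items():
    print(k, round(acc, 3))
```

Fixing random_state makes the runs repeatable, which matters when comparing cluster counts, because MiniBatchKMeans starts from random initial centers.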
j. Finally, we can visualize the cluster centers to get a better idea about the algorithm.
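# Cluster centroids are stored in 'centroids'
centroids = kmeans.cluster_centers_
print(centroids.shape)
# Reshape each 784-dimensional centroid back into a 28x28 image and rescale to 0-255
centroids = centroids.reshape(50,28,28)
centroids = centroids * 255
plt.figure(figsize = (10,10))
bottom = 0.35
# Pass 'bottom' by keyword; the first positional argument of subplots_adjust is 'left'
plt.subplots_adjust(bottom = bottom)
for i in range(50):
    plt.subplot(5,10,i+1)
    plt.title('Num:{}'.format(reference_labels[i]),fontsize = 10)
    plt.imshow(centroids[i])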
Homework Question 1 (1pt): Compare the accuracy with 10 clusters to the accuracy with 50
clusters. Which one is better?
Homework Question 2 (1pt): Inspect the centroids in step j and discuss why increasing the
number of clusters has a positive or negative impact on model performance in this case.
Homework Question 3 (1pt): Comment on the performance of K-means in MNIST image
clustering. What insight(s) can we draw?
Competition Question 1 (2pt): Describe your steps including data preprocessing and modeling
approaches.
Competition Question 2 (2pt): Evaluate your model performance compared to the baseline
model.
Reference:
https://medium.com/@joel_34096/k-means-clustering-for-image-classification-a648f28bdc47