
Lab6 Instruction

This lab focuses on using K-Means clustering to analyze the MNIST dataset, guiding students through data loading, normalization, reshaping, and clustering. Students will compare clustering accuracy with different numbers of clusters and visualize the cluster centers. Homework and competition questions encourage reflection on model performance and data preprocessing steps.

Uploaded by

dave1304963270

BU.330.775 Machine Learning: Design and Deployment


Lab 6. Image Clustering using K-Means

Learning Goal: Practice using an unsupervised machine learning model to cluster image data.

Background: We will use the MNIST dataset for this lab. Please refer to Lab 3 instructions for
information about the MNIST dataset.

a. Import the required packages.


from keras.datasets import mnist
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np

b. First, let’s load the MNIST dataset and check its size: the number of training images, the
number of testing images, the size of each image, and the minimum and maximum values of
the training data.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)
print(x_test.shape)
print(x_train.min())
print(x_train.max())

c. Then we will plot 9 sample images from the dataset.


plt.gray() # B/W Images
plt.figure(figsize = (10,9)) # Adjusting figure size
# Displaying a grid of 3x3 images
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(x_train[i])

d. We convert the data to float type and normalize the pixel values from the 0-255 range to the
0-1 range for computational efficiency. We will check the minimum and maximum values again
after normalization.
# Conversion to float
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalization
x_train = x_train/255.0
x_test = x_test/255.0
# Checking the minimum and maximum values of x_train
print(x_train.min())
print(x_train.max())
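
To see what step d does without downloading MNIST, here is a minimal sketch on made-up data. The array `fake_images` is a hypothetical stand-in for a few MNIST images and is not part of the lab; only NumPy is assumed.

```python
import numpy as np

# Hypothetical stand-in for a tiny batch of MNIST images:
# 4 images of 28x28 pixels stored as unsigned 8-bit integers (0-255).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
fake_images[0, 0, 0] = 0      # make sure both extremes are present
fake_images[0, 0, 1] = 255

# Same two steps as in the lab: cast to float32, then divide by 255.
scaled = fake_images.astype('float32') / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```

The shape of the array is unchanged; only the data type and the value range differ after this step.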
e. The original input data has 3 dimensions: (60000, 28, 28) for the training data and
(10000, 28, 28) for the testing data. The K-means algorithm expects a 2-dimensional input with
one row per image, so we flatten each image into a vector. After reshaping, the training data
will have shape (60000, 784) and the testing data (10000, 784), since 28x28=784.
# Reshaping input data
X_train = x_train.reshape(len(x_train),-1)
X_test = x_test.reshape(len(x_test),-1)
# Checking the shape
print(X_train.shape)
print(X_test.shape)
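
The reshaping in step e can be checked on a small example. The sketch below uses a hypothetical array `images` of three 28x28 "images" (not MNIST) to show that flattening keeps every pixel and can be undone.

```python
import numpy as np

# Hypothetical mini-dataset: 3 "images" of 28x28, filled with distinct values.
images = np.arange(3 * 28 * 28, dtype='float32').reshape(3, 28, 28)

# Flatten each image into one row; -1 lets NumPy infer 28*28 = 784.
flat = images.reshape(len(images), -1)

# Reshaping back recovers the original images (this is what step j relies on).
restored = flat.reshape(-1, 28, 28)

print(flat.shape)
print(np.array_equal(images, restored))
```

Pixel (row r, column c) of an image ends up at position r*28 + c of the flattened row, because NumPy reshapes in row-major order by default.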

f. Now we are ready to apply K-means. First, we will define a helper function that maps each
cluster label to the most frequent class label (from y_train) in that cluster. Then we will
initialize the K-means model with 10 clusters, using the mini-batch version of K-means.
def retrieve_info(cluster_labels,y_train):
    # Dictionary mapping each cluster id to its most frequent true label
    reference_labels = {}
    # Loop over every cluster id present in cluster_labels
    for i in range(len(np.unique(cluster_labels))):
        index = np.where(cluster_labels == i,1,0)
        num = np.bincount(y_train[index==1]).argmax()
        reference_labels[i] = num
    return reference_labels

total_clusters = len(np.unique(y_train))
# Initialize the K-Means model
kmeans = MiniBatchKMeans(n_clusters = total_clusters)
# Fitting the model to training set
kmeans.fit(X_train)
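
The majority-vote mapping in step f is easier to follow on a handful of points. The sketch below uses made-up arrays `cluster_ids` and `true_digits` (both hypothetical, chosen only for illustration) and applies the same `np.bincount(...).argmax()` idea.

```python
import numpy as np

# Hypothetical toy data: 9 images, the cluster each one was assigned to,
# and the true digit written in each image.
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
true_digits = np.array([7, 7, 1, 3, 3, 1, 1, 1, 7])

mapping = {}
for c in np.unique(cluster_ids):
    members = true_digits[cluster_ids == c]   # true digits inside cluster c
    counts = np.bincount(members)             # counts[d] = how often digit d occurs
    mapping[int(c)] = int(counts.argmax())    # most frequent digit wins

print(mapping)
```

Cluster 0 contains two 7s and one 1, so it is labeled 7; the single 1 in that cluster will later count as a misclassification.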

g. After that, we can map every cluster label to a digit and compare the first 20 predictions
from K-means with the true labels.
reference_labels = retrieve_info(kmeans.labels_,y_train)
# Array that will hold the mapped digit for each image
number_labels = np.zeros(len(kmeans.labels_), dtype=int)
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]

# Comparing Predicted values and Actual values


print(number_labels[:20].astype('int'))
print(y_train[:20])

h. We can calculate the overall accuracy score.


# Calculating accuracy score
print(accuracy_score(y_train,number_labels))
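
The accuracy in step h is simply the fraction of images whose mapped digit equals the true digit. As a sketch, the toy arrays below (hypothetical, not lab data) show the same computation by hand, with a lookup array replacing the explicit loop from step g.

```python
import numpy as np

# Hypothetical toy data and cluster-to-digit mapping.
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
true_digits = np.array([7, 7, 1, 3, 3, 1, 1, 1, 7])
mapping = {0: 7, 1: 3, 2: 1}

# lookup[c] is the digit assigned to cluster c; indexing with the whole
# cluster_ids array translates every cluster id in one step.
lookup = np.array([mapping[c] for c in range(len(mapping))])
predicted = lookup[cluster_ids]

# Accuracy = share of positions where prediction and truth agree.
accuracy = np.mean(predicted == true_digits)
print(predicted)
print(accuracy)
```

Seven of the nine toy images are labeled correctly here; the two errors are exactly the minority members of clusters 0 and 2.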

i. Now let’s increase the number of clusters (the k value) to 50, and check whether the accuracy
improves.
# Increase to 50 clusters, and fit the model
kmeans = MiniBatchKMeans(n_clusters = 50)
kmeans.fit(X_train)

# Calculating the reference_labels


reference_labels = retrieve_info(kmeans.labels_,y_train)
# 'number_labels' is an array holding the digit assigned to each image
number_labels = np.zeros(len(kmeans.labels_), dtype=int)
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]
print('Accuracy score : {}'.format(accuracy_score(y_train,number_labels)))
print('\n')
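
One way to build intuition for step i is a small synthetic experiment. The sketch below is not the lab's method: it uses a plain NumPy K-means with a deterministic farthest-point initialization instead of MiniBatchKMeans, and made-up 2-D data in which each of two classes has two separate "styles" (modes). All function and variable names are hypothetical.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic seeding: start at X[0], then keep adding the point
    that is farthest from all centers chosen so far."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(X[d.min(axis=1).argmax()])
    return np.array(centers, dtype=float)

def simple_kmeans(X, k, n_iter=20):
    """Plain (full-batch) K-means: assign points, then recompute the means."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def majority_vote_accuracy(cluster_labels, y_true):
    """Label each cluster with its most frequent class, then score."""
    y_pred = np.empty_like(y_true)
    for j in np.unique(cluster_labels):
        mask = cluster_labels == j
        y_pred[mask] = np.bincount(y_true[mask]).argmax()
    return float(np.mean(y_pred == y_true))

# Four tight blobs along the x-axis; the classes alternate 0, 1, 0, 1,
# so each class consists of two distinct "styles".
rng = np.random.default_rng(0)
mode_x = np.array([0.0, 10.0, 20.0, 30.0])
mode_class = np.array([0, 1, 0, 1])
n_per_mode = 50
X = np.vstack([rng.normal(loc=[x, 0.0], scale=0.5, size=(n_per_mode, 2))
               for x in mode_x])
y = np.repeat(mode_class, n_per_mode)

acc_k2 = majority_vote_accuracy(simple_kmeans(X, 2), y)
acc_k4 = majority_vote_accuracy(simple_kmeans(X, 4), y)
print('k=2:', acc_k2)
print('k=4:', acc_k4)
```

With k equal to the number of classes, each cluster on this synthetic data merges two neighboring blobs of different classes, so the majority vote can only be right for half of the points. With k equal to the number of modes, every cluster is pure. Whether MNIST behaves the same way is what the homework asks you to check.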

j. Finally, we can visualize the cluster centers to get a better idea about the algorithm.
# Cluster centroids are stored in 'centroids'
centroids = kmeans.cluster_centers_
print(centroids.shape)
centroids = centroids.reshape(50,28,28)
centroids = centroids * 255
plt.figure(figsize = (10,10))
bottom = 0.35
# Pass 'bottom' by keyword; a positional argument would set 'left' instead
plt.subplots_adjust(bottom=bottom)
for i in range(50):
    plt.subplot(5,10,i+1)
    plt.title('Num:{}'.format(reference_labels[i]),fontsize = 10)
    plt.imshow(centroids[i])
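
When reading the centroid images, it helps to remember that a K-means centroid is (approximately, for the mini-batch version) the average of the images assigned to that cluster. The sketch below illustrates this with two hypothetical 4x4 "images" rather than real 28x28 digits.

```python
import numpy as np

# Two hypothetical 4x4 images: a vertical bar, and the same bar shifted right.
vertical_bar = np.zeros((4, 4))
vertical_bar[:, 1] = 1.0
shifted_bar = np.zeros((4, 4))
shifted_bar[:, 2] = 1.0

# Flatten the images as in step e, average them, and reshape the mean
# back into an image as in step j.
members = np.stack([vertical_bar, shifted_bar]).reshape(2, -1)   # shape (2, 16)
centroid = members.mean(axis=0).reshape(4, 4)

print(centroid)
```

The averaged image has two half-intensity columns instead of one sharp bar, which is why centroids of clusters that mix different writing styles tend to look blurry.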

Homework Question 1 (1pt): Compare the accuracy with 10 clusters to the accuracy with 50
clusters. Which one is better?
Homework Question 2 (1pt): Inspect the centroids in step j and discuss why increasing the
number of clusters has a positive or negative impact on model performance in this case.
Homework Question 3 (1pt): Comment on the performance of K-means in MNIST image
clustering. What insight(s) can we draw?
Competition Question 1 (2pt): Describe your steps including data preprocessing and modeling
approaches.
Competition Question 2 (2pt): Evaluate your model performance compared to the baseline
model.

Submission: Complete and submit on Canvas by the beginning of Class 7. Use
homework6_yourname.ipynb and Competition_yourname.ipynb, respectively, as the file names.

Reference:
https://medium.com/@joel_34096/k-means-clustering-for-image-classification-a648f28bdc47
