Tutorial 8
8.1 K-means Clustering

import pandas as pd

# Ratings (1-5) given by six users to four movies.
# The DataFrame construction below is reconstructed from the truncated source;
# harry's ratings are assumed values consistent with the text.
ratings = [['john',5,5,2,1],['mary',4,5,3,2],['bob',4,4,4,3],
           ['lisa',2,2,4,5],['lee',1,2,3,4],['harry',2,1,5,5]]
titles = ['user','Jaws','Star Wars','Exorcist','Omen']
movies = pd.DataFrame(ratings, columns=titles)
movies
In this example dataset, the first 3 users liked action movies (Jaws and Star Wars) while the last 3
users enjoyed horror movies (Exorcist and Omen). Our goal is to apply k-means clustering on the
users to identify groups of users with similar movie preferences.
The example below shows how to apply k-means clustering (with k=2) on the movie ratings data.
We must remove the “user” column first before applying the clustering algorithm. The cluster
assignment for each user is displayed as a dataframe object.
from sklearn import cluster

# Drop the non-numeric "user" column before clustering.
data = movies.drop('user', axis=1)

# Fit k-means with k=2 on the rating data.
k_means = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)
k_means.fit(data)

# Display the cluster assignment of each user.
labels = k_means.labels_
pd.DataFrame(labels, index=movies.user, columns=['Cluster ID'])
[2]: Cluster ID
user
john 1
mary 1
bob 1
lisa 0
lee 0
harry 0
The k-means clustering algorithm assigns the first three users to one cluster and the last three users
to the second cluster. The results are consistent with our expectation. We can also display the
centroid for each of the two clusters.
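The centroid-display cell is not reproduced in this document; a minimal sketch, using the fitted estimator's cluster_centers_ attribute and the column names of the rating data, is shown below.

# Each row is the centroid (mean rating vector) of one cluster.
centroids = k_means.cluster_centers_
pd.DataFrame(centroids, index=['Cluster 0', 'Cluster 1'], columns=data.columns)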
Observe that cluster 0 has higher ratings for the horror movies whereas cluster 1 has higher ratings
for action movies. The cluster centroids can be applied to other users to determine their cluster
assignments.
import numpy as np

# Ratings from five new users (same four movies, no cluster labels yet).
testData = np.array([[4,5,1,2],[3,2,4,4],[2,3,4,1],[3,2,3,3],[5,4,1,4]])

# Assign each new user to the nearest centroid of the fitted model.
labels = k_means.predict(testData)
labels = labels.reshape(-1,1)
usernames = np.array(['paul','kim','liz','tom','bill']).reshape(-1,1)

# Combine usernames, ratings, and cluster assignments into one table.
cols = movies.columns.tolist()
cols.append('Cluster ID')
newusers = pd.DataFrame(np.concatenate((usernames, testData, labels), axis=1), columns=cols)
newusers
To determine the number of clusters in the data, we can apply k-means with the number of clusters varying from 1 to 6 and compute the corresponding sum of squared errors (SSE), as shown in the example below. The "elbow" in the plot of SSE versus number of clusters can be used to estimate the number of clusters.
import matplotlib.pyplot as plt

# Fit k-means for k = 1..6 and record the SSE (inertia) of each solution.
numClusters = [1,2,3,4,5,6]
SSE = []
for k in numClusters:
    k_means = cluster.KMeans(n_clusters=k)
    k_means.fit(data)
    SSE.append(k_means.inertia_)

plt.plot(numClusters, SSE)
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
(Figure: SSE versus number of clusters.)
8.2 Hierarchical Clustering
This section demonstrates examples of applying hierarchical clustering to the vertebrate dataset
used in Module 6 (Classification). Specifically, we illustrate the results of using 3 hierarchical
clustering algorithms provided by the Python scipy library: (1) single link (MIN), (2) complete
link (MAX), and (3) group average. Other hierarchical clustering algorithms provided by the
library include centroid-based and Ward’s method.
# Load the vertebrate dataset used in Module 6 (Classification).
data = pd.read_csv('vertebrate.csv', header='infer')
data
(Output: the vertebrate dataset, one row per animal with its name, binary attributes, and class label.)
names = data['Name']
Y = data['Class']
X = data.drop(['Name','Class'], axis=1)

8.2.1 Single Link (MIN)

from scipy.cluster import hierarchy

# Single-link (MIN) agglomerative clustering, visualized as a dendrogram.
Z = hierarchy.linkage(X.to_numpy(), 'single')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')
8.2.2 Complete Link (MAX)
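The code cell for this example is not reproduced in this document; a minimal sketch, reusing X and names from above, differs from the single-link case only in the linkage criterion.

# Complete-link (MAX) agglomerative clustering.
Z = hierarchy.linkage(X.to_numpy(), 'complete')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')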
8.2.3 Group Average
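Again, the code cell is not shown here; a sketch along the same lines uses the 'average' linkage criterion.

# Group-average agglomerative clustering.
Z = hierarchy.linkage(X.to_numpy(), 'average')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')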
8.3 Density-Based Clustering

We apply the DBSCAN clustering algorithm to the data by setting the neighborhood radius (eps) to 15.5 and the minimum number of points (min_samples) to 5. The resulting cluster IDs range from 0 to 8, while noise points are assigned a cluster ID of -1.
from sklearn.cluster import DBSCAN

# 'data' here is a two-dimensional point set (columns x and y) used for this
# section; its loading cell is not shown in this document.
# Fit DBSCAN with neighborhood radius eps=15.5 and min_samples=5.
db = DBSCAN(eps=15.5, min_samples=5).fit(data)

# Mark which points are core points of their clusters.
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

# Attach the cluster IDs (-1 = noise) to the data and plot the result.
labels = pd.DataFrame(db.labels_, columns=['Cluster ID'])
result = pd.concat((data, labels), axis=1)
result.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')
8.4 Spectral Clustering
One of the main limitations of the k-means clustering algorithm is its tendency to seek globular-shaped clusters. As a result, it performs poorly on datasets with arbitrarily shaped clusters or with clusters whose centroids overlap one another. Spectral clustering can overcome this limitation by exploiting properties of the similarity graph constructed from the data. To illustrate this, consider the following two-dimensional datasets.
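The cells that load and plot the two datasets are not reproduced in this document. A minimal sketch is given below; the file names are hypothetical placeholders, and the only assumption carried forward is that each dataset is a DataFrame named data1 or data2 with columns x and y.

# Hypothetical file names; substitute the actual dataset files.
data1 = pd.read_csv('dataset1.csv')   # first two-dimensional dataset (columns x, y)
data2 = pd.read_csv('dataset2.csv')   # second two-dimensional dataset (columns x, y)

# Plot the two datasets side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
data1.plot.scatter(x='x', y='y', ax=ax1)
data2.plot.scatter(x='x', y='y', ax=ax2)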
Below, we demonstrate the results of applying k-means to the datasets (with k=2).
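The corresponding code cell is not shown here; a sketch of how k-means (k=2) might be applied to each dataset and its assignments plotted, following the same scatter-plot pattern as the DBSCAN example, is:

# k-means with k=2 on the first dataset.
k_means1 = cluster.KMeans(n_clusters=2, random_state=1)
k_means1.fit(data1)
res1 = pd.concat((data1, pd.DataFrame(k_means1.labels_, columns=['Cluster ID'])), axis=1)
res1.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')

# k-means with k=2 on the second dataset.
k_means2 = cluster.KMeans(n_clusters=2, random_state=1)
k_means2.fit(data2)
res2 = pd.concat((data2, pd.DataFrame(k_means2.labels_, columns=['Cluster ID'])), axis=1)
res2.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')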
The plots above show the poor performance of k-means clustering. Next, we apply spectral clustering to the datasets. Spectral clustering converts the data into a similarity graph and applies the
normalized cut graph partitioning algorithm to generate the clusters. In the example below, we
use the Gaussian radial basis function as our affinity (similarity) measure. Users need to tune the
kernel parameter (gamma) value in order to obtain the appropriate clusters for the given dataset.
# Spectral clustering with an RBF (Gaussian) affinity; gamma controls the kernel width.
spectral = cluster.SpectralClustering(n_clusters=2, random_state=1, affinity='rbf', gamma=5000)
spectral.fit(data1)
labels1 = pd.DataFrame(spectral.labels_, columns=['Cluster ID'])
result1 = pd.concat((data1, labels1), axis=1)

# The second dataset uses a different kernel parameter (gamma=100).
spectral2 = cluster.SpectralClustering(n_clusters=2, random_state=1, affinity='rbf', gamma=100)
spectral2.fit(data2)
labels2 = pd.DataFrame(spectral2.labels_, columns=['Cluster ID'])
result2 = pd.concat((data2, labels2), axis=1)
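The notebook's plots of result1 and result2 are not included here; the assignments can be visualized in the same way as the earlier scatter plots, assuming the x and y column names used above.

# Scatter plots of the spectral clustering assignments.
result1.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')
result2.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')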
8.5 Summary
This tutorial illustrates examples of using different Python implementations of clustering algorithms. Algorithms such as k-means, spectral clustering, and DBSCAN are designed to create disjoint partitions of the data, whereas the single-link, complete-link, and group-average algorithms are designed to generate a hierarchy of cluster partitions.