Tutorial 8
8.1 K-means Clustering

import pandas as pd

# Ratings (1-5) given by six users to four movies.
# The DataFrame construction below is reconstructed from the truncated source;
# harry's ratings are assumed values consistent with the text.
ratings = [['john',5,5,2,1],['mary',4,5,3,2],['bob',4,4,4,3],
           ['lisa',2,2,4,5],['lee',1,2,3,4],['harry',2,1,5,5]]
titles = ['user','Jaws','Star Wars','Exorcist','Omen']
movies = pd.DataFrame(ratings, columns=titles)
movies
In this example dataset, the first 3 users liked action movies (Jaws and Star Wars) while the last 3
users enjoyed horror movies (Exorcist and Omen). Our goal is to apply k-means clustering on the
users to identify groups of users with similar movie preferences.
The example below shows how to apply k-means clustering (with k=2) on the movie ratings data.
We must remove the “user” column first before applying the clustering algorithm. The cluster
assignment for each user is displayed as a dataframe object.
from sklearn import cluster

# Drop the non-numeric "user" column before clustering.
data = movies.drop('user', axis=1)

# Fit k-means with k=2 on the rating data.
k_means = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)
k_means.fit(data)

# Display the cluster assignment of each user.
labels = k_means.labels_
pd.DataFrame(labels, index=movies.user, columns=['Cluster ID'])
[2]: Cluster ID
user
john 1
mary 1
bob 1
lisa 0
lee 0
harry 0
The k-means clustering algorithm assigns the first three users to one cluster and the last three users
to the second cluster. The results are consistent with our expectation. We can also display the
centroid for each of the two clusters.
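The centroid-display cell is not reproduced in this document; a minimal sketch, using the fitted estimator's cluster_centers_ attribute and the column names of the rating data, is shown below.

# Each row is the centroid (mean rating vector) of one cluster.
centroids = k_means.cluster_centers_
pd.DataFrame(centroids, index=['Cluster 0', 'Cluster 1'], columns=data.columns)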
Observe that cluster 0 has higher ratings for the horror movies whereas cluster 1 has higher ratings
for action movies. The cluster centroids can be applied to other users to determine their cluster
assignments.
import numpy as np

# Ratings from five new users (same four movies, no cluster labels yet).
testData = np.array([[4,5,1,2],[3,2,4,4],[2,3,4,1],[3,2,3,3],[5,4,1,4]])

# Assign each new user to the nearest centroid of the fitted model.
labels = k_means.predict(testData)
labels = labels.reshape(-1,1)
usernames = np.array(['paul','kim','liz','tom','bill']).reshape(-1,1)

# Combine usernames, ratings, and cluster assignments into one table.
cols = movies.columns.tolist()
cols.append('Cluster ID')
newusers = pd.DataFrame(np.concatenate((usernames, testData, labels), axis=1), columns=cols)
newusers
To determine the number of clusters in the data, we can apply k-means with the number of clusters varying from 1 to 6 and compute the corresponding sum of squared errors (SSE), as shown in the example below. The "elbow" in the plot of SSE versus number of clusters can be used to estimate the number of clusters.
import matplotlib.pyplot as plt

# Fit k-means for k = 1..6 and record the SSE (inertia) of each solution.
numClusters = [1,2,3,4,5,6]
SSE = []
for k in numClusters:
    k_means = cluster.KMeans(n_clusters=k)
    k_means.fit(data)
    SSE.append(k_means.inertia_)

plt.plot(numClusters, SSE)
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
(Figure: SSE versus number of clusters.)
8.2 Hierarchical Clustering
This section demonstrates examples of applying hierarchical clustering to the vertebrate dataset
used in Module 6 (Classification). Specifically, we illustrate the results of using 3 hierarchical
clustering algorithms provided by the Python scipy library: (1) single link (MIN), (2) complete
link (MAX), and (3) group average. Other hierarchical clustering algorithms provided by the
library include centroid-based and Ward’s method.
# Load the vertebrate dataset used in Module 6 (Classification).
data = pd.read_csv('vertebrate.csv', header='infer')
data
(Output: the vertebrate dataset, one row per animal with its name, binary attributes, and class label.)
names = data['Name']
Y = data['Class']
X = data.drop(['Name','Class'], axis=1)

8.2.1 Single Link (MIN)

from scipy.cluster import hierarchy

# Single-link (MIN) agglomerative clustering, visualized as a dendrogram.
Z = hierarchy.linkage(X.to_numpy(), 'single')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')
8.2.2 Complete Link (MAX)
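The code cell for this example is not reproduced in this document; a minimal sketch, reusing X and names from above, differs from the single-link case only in the linkage criterion.

# Complete-link (MAX) agglomerative clustering.
Z = hierarchy.linkage(X.to_numpy(), 'complete')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')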
8.2.3 Group Average
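Again, the code cell is not shown here; a sketch along the same lines uses the 'average' linkage criterion.

# Group-average agglomerative clustering.
Z = hierarchy.linkage(X.to_numpy(), 'average')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')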
8.3 Density-Based Clustering

We apply the DBSCAN clustering algorithm to the data by setting the neighborhood radius (eps) to 15.5 and the minimum number of points (min_samples) to 5. The resulting cluster IDs range from 0 to 8, while noise points are assigned a cluster ID of -1.
from sklearn.cluster import DBSCAN

# 'data' here is a two-dimensional point set (columns x and y) used for this
# section; its loading cell is not shown in this document.
# Fit DBSCAN with neighborhood radius eps=15.5 and min_samples=5.
db = DBSCAN(eps=15.5, min_samples=5).fit(data)

# Mark which points are core points of their clusters.
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

# Attach the cluster IDs (-1 = noise) to the data and plot the result.
labels = pd.DataFrame(db.labels_, columns=['Cluster ID'])
result = pd.concat((data, labels), axis=1)
result.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')
8.4 Spectral Clustering
One of the main limitations of the k-means clustering algorithm is its tendency to seek globular-shaped clusters. As a result, it performs poorly on datasets with arbitrarily shaped clusters or with clusters whose centroids overlap one another. Spectral clustering can overcome this limitation by exploiting properties of the similarity graph constructed from the data. To illustrate this, consider the following two-dimensional datasets.
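The cells that load and plot the two datasets are not reproduced in this document. A minimal sketch is given below; the file names are hypothetical placeholders, and the only assumption carried forward is that each dataset is a DataFrame named data1 or data2 with columns x and y.

# Hypothetical file names; substitute the actual dataset files.
data1 = pd.read_csv('dataset1.csv')   # first two-dimensional dataset (columns x, y)
data2 = pd.read_csv('dataset2.csv')   # second two-dimensional dataset (columns x, y)

# Plot the two datasets side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
data1.plot.scatter(x='x', y='y', ax=ax1)
data2.plot.scatter(x='x', y='y', ax=ax2)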
Below, we demonstrate the results of applying k-means to the datasets (with k=2).
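The corresponding code cell is not shown here; a sketch of how k-means (k=2) might be applied to each dataset and its assignments plotted, following the same scatter-plot pattern as the DBSCAN example, is:

# k-means with k=2 on the first dataset.
k_means1 = cluster.KMeans(n_clusters=2, random_state=1)
k_means1.fit(data1)
res1 = pd.concat((data1, pd.DataFrame(k_means1.labels_, columns=['Cluster ID'])), axis=1)
res1.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')

# k-means with k=2 on the second dataset.
k_means2 = cluster.KMeans(n_clusters=2, random_state=1)
k_means2.fit(data2)
res2 = pd.concat((data2, pd.DataFrame(k_means2.labels_, columns=['Cluster ID'])), axis=1)
res2.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')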
The plots above show the poor performance of k-means clustering. Next, we apply spectral clustering to the datasets. Spectral clustering converts the data into a similarity graph and applies the
normalized cut graph partitioning algorithm to generate the clusters. In the example below, we
use the Gaussian radial basis function as our affinity (similarity) measure. Users need to tune the
kernel parameter (gamma) value in order to obtain the appropriate clusters for the given dataset.
# Spectral clustering with an RBF (Gaussian) affinity; gamma controls the kernel width.
spectral = cluster.SpectralClustering(n_clusters=2, random_state=1, affinity='rbf', gamma=5000)
spectral.fit(data1)
labels1 = pd.DataFrame(spectral.labels_, columns=['Cluster ID'])
result1 = pd.concat((data1, labels1), axis=1)

# The second dataset uses a different kernel parameter (gamma=100).
spectral2 = cluster.SpectralClustering(n_clusters=2, random_state=1, affinity='rbf', gamma=100)
spectral2.fit(data2)
labels2 = pd.DataFrame(spectral2.labels_, columns=['Cluster ID'])
result2 = pd.concat((data2, labels2), axis=1)
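The notebook's plots of result1 and result2 are not included here; the assignments can be visualized in the same way as the earlier scatter plots, assuming the x and y column names used above.

# Scatter plots of the spectral clustering assignments.
result1.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')
result2.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')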
8.5 Summary
This tutorial illustrates examples of using different Python implementations of clustering algorithms. Algorithms such as k-means, spectral clustering, and DBSCAN are designed to create disjoint partitions of the data, whereas the single-link, complete-link, and group-average algorithms are designed to generate a hierarchy of cluster partitions.