FMLASS3Q7 - Jupyter Notebook

This document discusses using hierarchical clustering on customer data from an online retail store. It loads the data, selects features for clustering, calculates distances between data points, performs agglomerative clustering with different linkage methods (single, complete, average), and plots the resulting dendrograms. The average linkage dendrogram provides the most informative clustering of customers into three groups with distinct purchasing behaviors. Hierarchical clustering can help understand customer segmentation.

Uploaded by

ch20b049

10/18/23, 10:41 PM FMLASS3Q7 - Jupyter Notebook

In [49]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform  # needed for the distance matrix in In [53]
from sklearn.cluster import AgglomerativeClustering

In [50]: # Load the Online Retail data


df = pd.read_csv('OnlineRetailPreProcessed.csv')

In [51]: df

Out[51]:
      Unnamed: 0  CustomerID      Price   Quantity      Delay
0              0     12346.0  -0.233043  -0.391699   2.322023
1              1     12347.0   0.292637   0.378244  -0.893733
2              2     12348.0  -0.008126  -0.258357  -0.169196
3              3     12349.0  -0.018406  -0.082001  -0.725005
4              4     12350.0  -0.196820  -0.340082   2.163220
...          ...         ...        ...        ...        ...
4367        4367     18280.0  -0.208356  -0.357288   1.845615
4368        4368     18281.0  -0.222891  -0.387397   0.882873
4369        4369     18282.0  -0.209090  -0.348685  -0.834182
4370        4370     18283.0   0.026963   2.868727  -0.873883
4371        4371     18287.0  -0.007600  -0.090604  -0.486801

4372 rows × 5 columns

In [52]: def plot_clusters(data, labels=None, title_cluster="Agglomerative Clustering"):
    """Scatter the first three columns of data in 3D, colored by cluster label."""
    fig = plt.figure(figsize=(16, 9))
    ax = plt.axes(projection="3d")
    ax.scatter3D(data[:, 0], data[:, 1], data[:, 2], c=labels)
    ax.set_title(title_cluster)
    plt.show()

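In [53]: # Select a subset of features for clustering
features = ['Price', 'Quantity', 'Delay']

# Compute the pairwise Euclidean distance matrix.
# pdist (scipy.spatial.distance) returns a condensed vector of distances;
# squareform expands it into the full symmetric n x n matrix.
X = df[features].values
D = squareform(pdist(X))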
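The two distance representations used in this cell can be illustrated on a toy example. This is a minimal sketch on four made-up points (X_toy is hypothetical, not the retail data): pdist returns the n(n-1)/2 pairwise distances as a flat "condensed" vector, and squareform converts between that vector and the square matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data (hypothetical): 4 points in 3 dimensions
X_toy = np.array([[0.0, 0.0, 0.0],
                  [3.0, 4.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])

condensed = pdist(X_toy)         # flat vector with n*(n-1)/2 = 6 entries
square = squareform(condensed)   # 4 x 4 symmetric matrix with a zero diagonal

print(condensed.shape, square.shape)
print(square[0, 1])              # Euclidean distance between points 0 and 1

# squareform also converts a square matrix back to the condensed form
roundtrip = squareform(square)
print(np.allclose(roundtrip, condensed))
```

For the 4372 customers here the square matrix holds 4372 x 4372 float64 values (about 150 MB), while the condensed vector holds roughly half as many.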

localhost:8888/notebooks/FMLASS3Q7.ipynb 1/6

In [54]: agglo_cluster_single = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='single')
agglo_cluster_single.fit(X)

Out[54]: AgglomerativeClustering(linkage='single', metric='euclidean', n_clusters=3)

In [55]: plot_clusters(X, agglo_cluster_single.labels_, title_cluster="Agglomerative C…")  # title string is cut off in the source export

In [56]: agglo_cluster_comp = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='complete')
agglo_cluster_comp.fit(X)

Out[56]: AgglomerativeClustering(linkage='complete', metric='euclidean', n_clusters=3)


In [57]: plot_clusters(X, agglo_cluster_comp.labels_, title_cluster="Agglomerative Clu…")  # title string is cut off in the source export

In [58]: agglo_cluster_avg = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='average')
agglo_cluster_avg.fit(X)

Out[58]: AgglomerativeClustering(linkage='average', metric='euclidean', n_clusters=3)


In [59]: plot_clusters(X, agglo_cluster_avg.labels_, title_cluster="Agglomerative Clus…")  # title string is cut off in the source export

In [60]: # Perform hierarchical clustering with different linkage methods.
# linkage() expects either the raw observations or a condensed distance vector.
# Passing the square matrix D makes SciPy treat every row of D as an observation
# and emits a ClusterWarning, so the observations X are passed instead
# (linkage then uses Euclidean distances by default).
linkage_methods = ['single', 'complete', 'average']
linkages = [linkage(X, method=method) for method in linkage_methods]
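The input convention of linkage() can be checked on a small example. This is a minimal sketch on synthetic data (X_demo and the other names are hypothetical stand-ins, not the retail data): the raw observations and the condensed pdist vector should give the same merge tree, while a square distance matrix is read as a set of observations and gives different merge heights.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3))    # synthetic stand-in for the feature matrix

d_condensed = pdist(X_demo)          # condensed Euclidean distances
D_square = squareform(d_condensed)   # 30 x 30 square matrix

Z_obs = linkage(X_demo, method='average')        # raw observations
Z_cond = linkage(d_condensed, method='average')  # condensed distances

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the ClusterWarning for this demo
    Z_square = linkage(D_square, method='average')  # rows of D read as observations

same_tree = np.allclose(Z_obs, Z_cond)
same_heights = np.allclose(Z_obs[:, 2], Z_square[:, 2])
print(same_tree, same_heights)
```

If a square matrix is all that is available, squareform(D_square) converts it back to the condensed form that linkage() accepts.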


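In [61]: # Plot the dendrograms
# The loop variable is named Z so that it does not shadow the imported linkage() function
for method, Z in zip(linkage_methods, linkages):
    fig = plt.figure(figsize=(10, 6))
    ax = fig.add_subplot(111)
    dendrogram(Z, ax=ax)
    ax.set_xlabel('Customer (row index)')  # leaf labels are row positions, not CustomerID values
    ax.set_ylabel('Distance')
    ax.set_title('Dendrogram for {} linkage'.format(method))
    plt.show()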
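With several thousand leaves a full dendrogram can be hard to read. One option, sketched here on synthetic data (X_demo is a hypothetical stand-in and p=12 is an arbitrary choice), is the truncate_mode='lastp' argument of SciPy's dendrogram(), which keeps only the last p merged clusters as leaves.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))   # synthetic stand-in for the feature matrix

Z = linkage(X_demo, method='average')

# no_plot=True only computes the layout; drop it and pass ax=... to draw the figure.
# With truncate_mode='lastp', contracted subtrees are labeled with their size in parentheses.
info = dendrogram(Z, truncate_mode='lastp', p=12, no_plot=True)

print(len(info['ivl']))   # number of leaves shown
print(info['ivl'])        # leaf labels
```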


Use cases
The choice of linkage method depends on the application. Single linkage merges clusters on
their closest pair of points, so it tends to chain nearby points together and leaves isolated
points as small clusters that merge late; this is why it is often used in anomaly detection.
Complete linkage merges on the farthest pair of points, which favors compact clusters, and is
often used in image segmentation. Average linkage uses the mean pairwise distance between
two clusters and is a general-purpose compromise between the two.
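The three criteria can be worked through by hand on two tiny clusters. This is a minimal sketch on made-up points (A, B and the other names are hypothetical): the single, complete and average distances between the clusters are the minimum, maximum and mean of the pairwise distances, and they should match the height of the last merge reported by SciPy's linkage().

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist

# Two toy clusters on a line (hypothetical data)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

pairwise = cdist(A, B)   # 2 x 2 matrix of distances between points of A and B

cluster_distance = {
    'single': pairwise.min(),     # closest pair
    'complete': pairwise.max(),   # farthest pair
    'average': pairwise.mean(),   # mean over all pairs
}

points = np.vstack([A, B])
last_merge_height = {}
for method, d in cluster_distance.items():
    Z = linkage(points, method=method)
    last_merge_height[method] = Z[-1, 2]   # height at which A and B are joined
    print(method, d, last_merge_height[method])
```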

For the Online Retail data, the average linkage dendrogram appears to be the most
informative. Cut at three clusters (the n_clusters=3 setting used above), it suggests that the
customers fall into groups with different purchasing behavior. For example, one cluster may
consist of customers who frequently purchase large quantities of low-priced items, while
another may consist of customers who infrequently purchase high-priced items.
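A reading like this can be checked by profiling each cluster on the features that were clustered. This is a minimal sketch on synthetic data (the demo frame, its three well-separated groups and the n_customers column are hypothetical, not results from the retail data), using the per-cluster mean of each feature and the cluster sizes.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in: three well-separated groups of 50 "customers" each
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
X_demo = np.vstack([c + 0.3 * rng.normal(size=(50, 3)) for c in centers])
demo = pd.DataFrame(X_demo, columns=['Price', 'Quantity', 'Delay'])

# Average linkage with the default Euclidean distance
model = AgglomerativeClustering(n_clusters=3, linkage='average')
demo['cluster'] = model.fit_predict(X_demo)

# Mean of each feature per cluster, plus the number of customers in each cluster
profile = demo.groupby('cluster')[['Price', 'Quantity', 'Delay']].mean()
profile['n_customers'] = demo.groupby('cluster').size()
print(profile.round(2))
```

On real data the cluster sizes are worth reading first: if one cluster holds nearly all customers and the others hold a handful, the split mostly isolates outliers rather than describing distinct segments.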
