0% found this document useful (0 votes)
127 views

Cluster Analysis in Python Chapter1 PDF

This document discusses unsupervised learning and cluster analysis in Python. It begins by explaining the differences between labeled and unlabeled data, with unlabeled data being the focus of unsupervised learning techniques. Unsupervised learning algorithms like clustering are used to find patterns in unlabeled data and group similar items together. The document then covers hierarchical and k-means clustering algorithms in Python using SciPy and demonstrates how to perform each type of clustering on sample Pokémon sighting data. Finally, it discusses the importance of preparing data for clustering through techniques like normalization prior to analyzing the data.

Uploaded by

Fgpeqw
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
127 views

Cluster Analysis in Python Chapter1 PDF

This document discusses unsupervised learning and cluster analysis in Python. It begins by explaining the differences between labeled and unlabeled data, with unlabeled data being the focus of unsupervised learning techniques. Unsupervised learning algorithms like clustering are used to find patterns in unlabeled data and group similar items together. The document then covers hierarchical and k-means clustering algorithms in Python using SciPy and demonstrates how to perform each type of clustering on sample Pokémon sighting data. Finally, it discusses the importance of preparing data for clustering through techniques like normalization prior to analyzing the data.

Uploaded by

Fgpeqw
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 31

Unsupervised

learning: basics
C L U S T E R A N A LY S I S I N P Y T H O N

Shaumik Daityari
Business Analyst
Everyday example: Google news
How does Google News classify articles?

Unsupervised Learning Algorithm: Clustering

Match frequent terms in articles to nd


similarity

CLUSTER ANALYSIS IN PYTHON


Labeled and unlabeled data
Data with no labels Point 1: (1, 2)

Point 2: (2, 2)

Point 3: (3, 1)

Data with labels Point 1: (1, 2), Label: Danger Zone

Point 2: (2, 2), Label: Normal Zone

Point 3: (3, 1), Label: Normal Zone

CLUSTER ANALYSIS IN PYTHON


What is unsupervised learning?
A group of machine learning algorithms that nd patterns in data

Data for algorithms has not been labeled, classi ed or characterized

The objective of the algorithm is to interpret any structure in the data

Common unsupervised learning algorithms: clustering, neural networks, anomaly detection

CLUSTER ANALYSIS IN PYTHON


What is clustering?
The process of grouping items with similar characteristics

Items in groups similar to each other than in other groups

Example: distance between points on a 2D plane

CLUSTER ANALYSIS IN PYTHON


Plotting data for clustering - Pokemon sightings
from matplotlib import pyplot as plt

x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]

plt.scatter(x_coordinates, y_coordinates)
plt.show()

CLUSTER ANALYSIS IN PYTHON


CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
Up next - some
practice
C L U S T E R A N A LY S I S I N P Y T H O N
Basics of cluster
analysis
C L U S T E R A N A LY S I S I N P Y T H O N

Shaumik Daityari
Business Analyst
What is a cluster?
A group of items with similar characteristics

Google News: articles where similar words and


word associations appear together

Customer Segments

CLUSTER ANALYSIS IN PYTHON


Clustering algorithms
Hierarchical clustering

K means clustering

Other clustering algorithms: DBSCAN, Gaussian Methods

CLUSTER ANALYSIS IN PYTHON


CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
Hierarchical clustering in SciPy
from scipy.cluster.hierarchy import linkage, fcluster
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,


10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates,
'y_coordinate': y_coordinates})

Z = linkage(df, 'ward')
df['cluster_labels'] = fcluster(Z, 3, criterion='maxclust')

sns.scatterplot(x='x_coordinate', y='y_coordinate',
hue='cluster_labels', data = df)
plt.show()

CLUSTER ANALYSIS IN PYTHON


CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
CLUSTER ANALYSIS IN PYTHON
K-means clustering in SciPy
from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

import random
random.seed((1000,2000))

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,


10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate': y_coordinates})

centroids,_ = kmeans(df, 3)
df['cluster_labels'], _ = vq(df, centroids)

sns.scatterplot(x='x_coordinate', y='y_coordinate',
hue='cluster_labels', data = df)
plt.show()

CLUSTER ANALYSIS IN PYTHON


CLUSTER ANALYSIS IN PYTHON
Next up: hands-on
exercises
C L U S T E R A N A LY S I S I N P Y T H O N
Data preparation for
cluster analysis
C L U S T E R A N A LY S I S I N P Y T H O N

Shaumik Daityari
Business Analyst
Why do we need to prepare data for clustering?
Variables have incomparable units (product dimensions in cm, price in $)

Variables with same units have vastly different scales and variances (expenditures on cereals, travel)

Data in raw form may lead to bias in clustering

Clusters may be heavily dependent on one variable

Solution: normalization of individual variables

CLUSTER ANALYSIS IN PYTHON


Normalization of data
Normalization: process of rescaling data to a standard deviation of 1

x_new = x / std_dev(x)

from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]

scaled_data = whiten(data)
print(scaled_data)

[2.73, 0.55, 1.64, 1.64, 1.09, 1.64, 1.64, 4.36, 0.55, 1.09, 1.09, 1.64, 2.73]

CLUSTER ANALYSIS IN PYTHON


Illustration: normalization of data
# Import plotting library
from matplotlib import pyplot as plt

# Initialize original, scaled data


plt.plot(data,
label="original")
plt.plot(scaled_data,
label="scaled")

# Show legend and display plot


plt.legend()
plt.show()

CLUSTER ANALYSIS IN PYTHON


Next up: some DIY
exercises
C L U S T E R A N A LY S I S I N P Y T H O N

You might also like