Cluster Analysis in Python Chapter1 PDF

This document discusses unsupervised learning and cluster analysis in Python. It begins by explaining the differences between labeled and unlabeled data, with unlabeled data being the focus of unsupervised learning techniques. Unsupervised learning algorithms like clustering are used to find patterns in unlabeled data and group similar items together. The document then covers hierarchical and k-means clustering algorithms in Python using SciPy and demonstrates how to perform each type of clustering on sample Pokémon sighting data. Finally, it discusses the importance of preparing data for clustering through techniques like normalization prior to analyzing the data.

Unsupervised learning: basics
CLUSTER ANALYSIS IN PYTHON

Shaumik Daityari
Business Analyst

Everyday example: Google News
How does Google News classify articles?

Unsupervised learning algorithm: clustering

Match frequent terms in articles to find similarity


Labeled and unlabeled data
Data with no labels:
Point 1: (1, 2)
Point 2: (2, 2)
Point 3: (3, 1)

Data with labels:
Point 1: (1, 2), Label: Danger Zone
Point 2: (2, 2), Label: Normal Zone
Point 3: (3, 1), Label: Normal Zone


What is unsupervised learning?
A group of machine learning algorithms that find patterns in data

Data for algorithms has not been labeled, classified or characterized

The objective of the algorithm is to interpret any structure in the data

Common unsupervised learning algorithms: clustering, neural networks, anomaly detection


What is clustering?
The process of grouping items with similar characteristics

Items in a group are more similar to each other than to items in other groups

Example: distance between points on a 2D plane
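
As a minimal sketch of the distance idea, the snippet below computes the straight-line (Euclidean) distance between every pair of the three unlabeled points listed earlier. The helper name euclidean is an illustrative assumption, not something defined in the slides.

```python
import math

# The three unlabeled points from the earlier example
points = [(1, 2), (2, 2), (3, 1)]

def euclidean(p, q):
    """Straight-line distance between two 2D points (hypothetical helper)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Print the distance for every pair of points
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        print(points[i], points[j], round(euclidean(points[i], points[j]), 3))
```

Points that are a short distance apart are candidates for the same cluster; points that are far apart are not.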


Plotting data for clustering - Pokemon sightings
from matplotlib import pyplot as plt

# Coordinates of the 15 reported sightings
x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]

# Draw each sighting as a point and display the figure
plt.scatter(x_coordinates, y_coordinates)
plt.show()

Up next - some practice

Basics of cluster analysis

What is a cluster?
A group of items with similar characteristics

Google News: articles where similar words and word associations appear together

Customer Segments


Clustering algorithms
Hierarchical clustering

K-means clustering

Other clustering algorithms: DBSCAN, Gaussian Methods

Hierarchical clustering in SciPy
from scipy.cluster.hierarchy import linkage, fcluster
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates,
                   'y_coordinate': y_coordinates})

# Build the hierarchy with Ward linkage, then cut it into at most 3 flat clusters
Z = linkage(df, 'ward')
df['cluster_labels'] = fcluster(Z, 3, criterion='maxclust')

# Color each point by its assigned cluster
sns.scatterplot(x='x_coordinate', y='y_coordinate',
                hue='cluster_labels', data=df)
plt.show()
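
To see what linkage and fcluster return without plotting, a small sketch like the following may help. It reuses the sighting coordinates from the slide; the variable names points, labels and cluster_sizes are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

# One row per sighting, one column per coordinate
points = np.column_stack([x_coordinates, y_coordinates])

# Each row of Z records one merge step, so n points give n - 1 rows
Z = linkage(points, method='ward')
print(Z.shape)  # (14, 4)

# fcluster labels start at 1, not 0
labels = fcluster(Z, 3, criterion='maxclust')
print(labels)

# Number of points assigned to each cluster label
cluster_sizes = np.bincount(labels)[1:]
print(cluster_sizes)
```

Checking the cluster sizes this way is a quick sanity test before relying on the scatter plot.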

K-means clustering in SciPy
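from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

# kmeans picks its starting centroids at random; seeding NumPy's generator
# (not the standard-library one) is intended to make runs repeatable
from numpy import random
random.seed((1000, 2000))

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate': y_coordinates})

# Compute 3 centroids, then assign each point to its nearest centroid
centroids, _ = kmeans(df, 3)
df['cluster_labels'], _ = vq(df, centroids)

# Color each point by its assigned cluster
sns.scatterplot(x='x_coordinate', y='y_coordinate',
                hue='cluster_labels', data=df)
plt.show()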
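
The underscores in the code above discard the second return value of kmeans and vq. The sketch below keeps them to show what they contain; the names points, distortion, labels and distances are illustrative assumptions.

```python
import numpy as np
from numpy import random
from scipy.cluster.vq import kmeans, vq

random.seed((1000, 2000))

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

# One row per sighting, one column per coordinate
points = np.column_stack([x_coordinates, y_coordinates])

# kmeans returns the centroids and the distortion (the mean distance
# between each observation and its nearest centroid)
centroids, distortion = kmeans(points, 3)
print(centroids)
print(distortion)

# vq returns, for every point, the index of its nearest centroid
# (starting at 0) and the distance to that centroid
labels, distances = vq(points, centroids)
print(labels)
print(distances)
```

Note that these labels start at 0, whereas the labels from fcluster in the hierarchical example start at 1.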

Next up: hands-on exercises

Data preparation for cluster analysis

Why do we need to prepare data for clustering?
Variables have incomparable units (product dimensions in cm, price in $)

Variables with the same units can have vastly different scales and variances (expenditures on cereals, travel)

Data in raw form may lead to bias in clustering

Clusters may be heavily dependent on one variable

Solution: normalization of individual variables
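
The effect can be sketched with a few made-up products measured in centimeters and dollars. All values and the helper dist are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical products: [length in cm, price in $] (illustrative values only)
products = np.array([
    [10.0, 1000.0],
    [12.0, 5000.0],
    [50.0, 1200.0],
    [55.0, 5200.0],
])

def dist(a, b):
    """Euclidean distance between two rows (hypothetical helper)."""
    return float(np.linalg.norm(a - b))

# Raw data: products 0 and 1 differ mainly in price, products 0 and 2
# mainly in length, yet the price difference dwarfs the length difference
raw_price_gap = dist(products[0], products[1])
raw_length_gap = dist(products[0], products[2])
print(raw_price_gap, raw_length_gap)

# After whitening, each column has a standard deviation of 1,
# so both variables contribute on a comparable scale
scaled = whiten(products)
scaled_price_gap = dist(scaled[0], scaled[1])
scaled_length_gap = dist(scaled[0], scaled[2])
print(scaled.std(axis=0))
print(scaled_price_gap, scaled_length_gap)
```

On the raw values a distance-based algorithm would group products almost entirely by price; after scaling, length and price carry similar weight.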


Normalization of data
Normalization: process of rescaling data to a standard deviation of 1

x_new = x / std_dev(x)

from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]

# Divide every value by the standard deviation of the data
scaled_data = whiten(data)
print(scaled_data)

[2.73, 0.55, 1.64, 1.64, 1.09, 1.64, 1.64, 4.36, 0.55, 1.09, 1.09, 1.64, 2.73]
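
As a quick check of the formula, the sketch below applies x / std_dev(x) by hand with NumPy and compares the result with whiten. The variable name manual is an illustrative assumption, and the standard deviation used here is NumPy's default (population) standard deviation.

```python
import numpy as np
from scipy.cluster.vq import whiten

data = np.array([5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5], dtype=float)

# Apply the formula directly: divide each value by the standard deviation
manual = data / data.std()

# whiten should give the same values
scaled_data = whiten(data)

print(np.round(manual, 2))
print(np.allclose(manual, scaled_data))

# The rescaled data has a standard deviation of 1
print(scaled_data.std())
```

Note that whiten only rescales the data; it does not subtract the mean, so the values are not centered on zero.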


Illustration: normalization of data
# Import plotting library
from matplotlib import pyplot as plt

# Plot the original and scaled data
plt.plot(data, label="original")
plt.plot(scaled_data, label="scaled")

# Show legend and display plot
plt.legend()
plt.show()


Next up: some DIY exercises
