Clustering

Clustering is the process of grouping similar documents together, with applications in information retrieval, summarization, topic segmentation, and lexical semantics. The document outlines various clustering techniques, including K-means and hierarchical clustering, and discusses their properties, advantages, and computational complexities. It emphasizes the importance of clustering in organizing information and improving retrieval effectiveness.

What is clustering?

Applications of clustering in information retrieval
K-means algorithm

Introduction to hierarchical clustering

Single-link and complete-link clustering


Definition
(Document) clustering is the process of grouping a set of documents into clusters of similar documents.

Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.

Clustering is the most common form of unsupervised learning.

Unsupervised = there are no labeled or annotated data.


Difference between classification & clustering

Classification: supervised learning; classes are human-defined and part of the input to the learning algorithm; output = membership in class only.

Clustering: unsupervised learning; clusters are inferred from the data without human input; output = membership in cluster + distance from centroid (“degree of cluster membership”).
Cluster hypothesis

Documents in the same cluster behave similarly with respect to relevance to information needs.

All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis.

Van Rijsbergen’s original wording (1979): “closely associated documents tend to be relevant to the same requests”.
Applications of Clustering
IR: presentation of results (clustering of documents)

Summarisation: clustering of similar documents for multi-document summarisation; clustering of similar sentences for re-generation of sentences


Topic Segmentation: clustering of similar paragraphs (adjacent or non-adjacent) for detection of topic structure/importance

Lexical semantics: clustering of words by cooccurrence patterns


Clustering search results
Clustering news articles
Clustering Words

https://colah.github.io/posts/2015-01-Visualizing-Representations/
Types of Clustering
Hard clustering v. soft clustering

Hard clustering: every object is member in only one cluster

Soft clustering: objects can be members in more than one cluster

Hierarchical v. non-hierarchical clustering

Hierarchical clustering: pairs of most-similar clusters are iteratively linked until all objects are in a clustering relationship

Non-hierarchical clustering results in flat clusters of “similar” documents


Desiderata for clustering
General goal: put related docs in the same cluster, put unrelated docs in different clusters.
We’ll see different ways of formalizing this.
The number of clusters should be appropriate for the data set we are clustering.
Initially, we will assume the number of clusters K is given.
There also exist semiautomatic methods for determining K
Secondary goals in clustering
Avoid very small and very large clusters
Define clusters that are easy to explain to the user
Many others . . .
Non-hierarchical (partitioning) clustering
Partitional clustering algorithms produce a set of k non-nested partitions corresponding to k clusters of n objects.

Advantage: it is not necessary to compare each object to every other object; only comparisons between objects and cluster centroids are needed.

Optimal partitioning clustering algorithms are O(kn)

Main algorithm: K-means


K-means: Basic idea
K-means algorithm
Worked Example (figures): starting from a set of points and random seeds, the algorithm alternately assigns points to the closest centroid and recomputes the cluster centroids, repeating these two steps until the centroids and assignments no longer change (convergence).
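As a concrete stand-in for the worked-example figures, here is a minimal NumPy sketch of the two alternating steps; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch. X: (n, m) array of n vectors; k: number of clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random seeds: pick k distinct points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = None
    for _ in range(max_iter):
        # Reassignment step: assign each vector to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # convergence: assignments no longer change
        assignment = new_assignment
        # Recomputation step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = X[assignment == j]
            if len(members):  # keep the old centroid if a cluster went empty
                centroids[j] = members.mean(axis=0)
    return assignment, centroids
```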
K-means is guaranteed to converge: Proof
RSS decreases during each reassignment step, because each vector is moved to a closer centroid.
RSS decreases during each recomputation step: this follows from the definition of a centroid, since the new centroid is the vector for which RSSk reaches its minimum.
There is only a finite number of clusterings.
Thus: we must reach a fixed point.
Finite set & monotonically decreasing evaluation function ⇒ convergence
Assumption: ties are broken consistently
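The proof refers to RSS (residual sum of squares), which is not spelled out in this excerpt. In the usual K-means notation (assumed here), with clusters ω_1, …, ω_K and centroid μ(ω_k):

```latex
\mathrm{RSS}_k = \sum_{\vec{x} \in \omega_k} \lVert \vec{x} - \vec{\mu}(\omega_k) \rVert^2,
\qquad
\mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k
```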
Other properties of K-means
Fast convergence

K-means typically converges in around 10–20 iterations (if we don’t care about a few documents switching back and forth)

However, complete convergence can take many more iterations.

Non-optimality

K-means is not guaranteed to find the optimal solution.

If we start with a bad set of seeds, the resulting clustering can be horrible.

Dependence on initial centroids

Solution 1: run K-means i times with different random seeds and choose the clustering with the lowest RSS

Solution 2: use a prior hierarchical clustering step to find seeds with good coverage of the document space
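Solution 1 can be tried out directly with scikit-learn, assuming it is available: the n_init parameter runs K-means that many times from different random seeds and keeps the run with the lowest RSS (called "inertia" there). The data below is a random stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 5))  # stand-in for document vectors

# n_init=10: run K-means from 10 different random seed sets,
# keep the clustering with the lowest RSS ("inertia" in scikit-learn).
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.inertia_)      # RSS of the best of the 10 runs
print(km.labels_[:10])  # cluster membership of the first 10 points
```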
Time complexity of K-means
Reassignment step: O(KNM) (we need to compute KN document–centroid distances, each of which costs O(M))

Recomputation step: O(NM) (we need to add each of the document’s < M values to one of the centroids)

Assume number of iterations bounded by I

Overall complexity: O(IKNM) – linear in all important dimensions


Hierarchical clustering
Imagine we now want to create a hierarchy in the form of a binary tree.

Assumes a similarity measure for determining the similarity of two clusters.

Up to now, our similarity measures were for documents.

We will look at different cluster similarity measures.

Main algorithm: HAC (hierarchical agglomerative clustering)


HAC: Basic algorithm
Start with each document in a separate cluster

Then repeatedly merge the two clusters that are most similar

Until there is only one cluster.

The history of merging is a hierarchy in the form of a binary tree.

The standard way of depicting this history is a dendrogram.


A dendrogram
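The dendrogram figure itself is not reproduced here. A minimal sketch of producing one with SciPy and Matplotlib (assumed available; the toy data is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))  # toy stand-ins for document vectors

# linkage() records the HAC merge history: each row is one merge of two clusters.
Z = linkage(X, method='single')
dendrogram(Z, labels=[f"doc{i}" for i in range(len(X))])
plt.show()
```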
Term–document matrix to document–document matrix
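One way to carry out this conversion, sketched in NumPy (the matrix below is a made-up example): normalise each document column to unit length, and the document–document cosine similarities are then the pairwise dot products.

```python
import numpy as np

# Hypothetical term-document matrix A: rows = terms, columns = documents.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 1.0, 0.0]])

A_hat = A / np.linalg.norm(A, axis=0)  # unit-length document columns
S = A_hat.T @ A_hat                    # S[i, j] = cosine similarity of docs i and j
```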
Hierarchical clustering: agglomerative (BottomUp, greedy)
Computational complexity of the basic algorithm
Hierarchical clustering: similarity functions
Example: hierarchical clustering; similarity functions
Single Link is O(n²)
Clustering Result under Single Link
Complete Link
Clustering result under complete link
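To see how the two criteria differ in practice, both can be run on the same data with SciPy (an illustrative sketch: 'single' merges by the closest pair of points between clusters, 'complete' by the farthest):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).normal(size=(10, 2))  # toy data

Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')

# Cut each tree into 2 flat clusters and compare the assignments.
print(fcluster(Z_single, t=2, criterion='maxclust'))
print(fcluster(Z_complete, t=2, criterion='maxclust'))
```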
Example: gene expression data
An example from biology: cluster genes by function
Survey 112 rat genes which are suspected to participate in development of CNS
Take 9 data points: 5 embryonic (E11, E13, E15, E18, E21), 3 postnatal (P0, P7, P14) and one adult
Measure the expression of each gene (how much mRNA is in the cell?)
These measures are normalised logs; for our purposes, we can consider them as weights
Cluster analysis determines which genes operate at the same time
Rat CNS gene expression data (excerpt)
Rat CNS gene clustering – single link
Rat CNS gene clustering – complete link
Rat CNS gene clustering – group average link
Flat or hierarchical clustering?
When a hierarchical structure is desired: hierarchical algorithm

Humans are bad at interpreting hierarchical clusterings (unless cleverly visualised)

For high efficiency, use flat clustering

For deterministic results, use HAC

HAC can also be applied if K cannot be predetermined (it can start without knowing K)
Take-away
Partitional clustering
Provides less information but is more efficient (best: O(kn))
K-means
Hierarchical clustering
Best algorithms have O(n²) complexity
Single-link vs. complete-link (vs. group-average)

Hierarchical and non-hierarchical clustering fulfil different needs
