Unsupervised Algorithms Unit3

Uploaded by

monishar9895

Unsupervised Algorithms: Clustering

• Input data: a set of documents to classify; no class labels are
provided
• Task of the classifier: separate the documents into subsets
(clusters) automatically; this separating procedure is called
clustering
Applications of Clustering
• IR: presentation of results (clustering of documents)
• Summarisation:
1. clustering of similar documents for multi-document summarisation
2. clustering of similar sentences for re-generation of sentences
• Topic Segmentation: clustering of similar paragraphs (adjacent or non-
adjacent) for detection of topic structure/importance
• Lexical semantics: clustering of words by cooccurrence patterns
Example
• Class labels can be generated automatically, but they usually differ
from the labels a human would specify.
• Thus, solving the whole classification problem with no human
intervention is hard. If class labels are provided, clustering is more
effective.
The Cluster Hypothesis
• “Similar documents tend to be relevant to the same requests”
• Issues:
1. Variants: “Documents that are relevant to the same topics are
similar”
2. Simple vs. complex topics
3. Evaluation, prediction
• The cluster hypothesis is the main motivation behind document
clustering
Similarity Coefficients
1. Simple matching:
2. Dice’s Coefficient:
3. Cosine Coefficient:
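A minimal sketch of the three coefficients, assuming the standard set-based definitions in which documents X and Y are treated as sets of terms: simple matching is |X ∩ Y|, Dice is 2|X ∩ Y| / (|X| + |Y|), and cosine is |X ∩ Y| / sqrt(|X| · |Y|). The function names and the two sample documents are illustrative, not taken from the slides.

```python
import math

def simple_matching(x: set, y: set) -> int:
    # Number of terms the two documents share.
    return len(x & y)

def dice(x: set, y: set) -> float:
    # Shared terms, normalised by the combined size of both sets.
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x: set, y: set) -> float:
    # Shared terms, normalised by the geometric mean of the set sizes.
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

# Hypothetical documents, represented as sets of index terms.
doc_a = {"cluster", "document", "similarity"}
doc_b = {"cluster", "document", "centroid", "vector"}

print(simple_matching(doc_a, doc_b))  # 2 shared terms
print(dice(doc_a, doc_b))             # 4 / 7
print(cosine(doc_a, doc_b))           # 2 / sqrt(12)
```

Simple matching is unnormalised, so long documents score higher; Dice and cosine both correct for document length.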
Document-document similarity
• Document representative
  > Select features to characterize the document: terms, phrases,
    citations
  > Select a weighting scheme for these features:
a) Binary, raw/relative frequency, divergence measure
b) Title / body / abstract, controlled vocabulary, selected topics,
taxonomy
• Similarity / association coefficient or dissimilarity / distance metric
Clustering methods
• Non-hierarchic methods
  => partitions
  > High efficiency, low effectiveness
• Hierarchic methods
  => hierarchic structures: small clusters of highly similar documents
  nested within larger clusters of less similar documents
  > Divisive => monothetic classifications
  > Agglomerative => polythetic classifications !!
Partitioning method
• Generic procedure:
• The first object becomes the first cluster
• Each subsequent object is matched against existing clusters
1. It is assigned to the most similar cluster if the similarity measure is
above a set threshold
2. Otherwise it forms a new cluster
• Re-shuffling of documents into clusters can be done iteratively to
increase cluster similarity
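The generic procedure above can be sketched as a single pass over the documents. This is only an illustration under stated assumptions: documents are dense term-weight vectors, each cluster is represented by the centroid of its members, and similarity is the cosine. The slides do not fix any of these choices, and the names single_pass, cosine and centroid are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    # Component-wise mean of a non-empty list of vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_pass(docs, threshold):
    clusters = []  # each cluster is a list of document vectors
    for doc in docs:
        # Find the most similar existing cluster above the threshold.
        best_idx, best_sim = None, threshold
        for i, members in enumerate(clusters):
            sim = cosine(doc, centroid(members))
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx is None:
            clusters.append([doc])          # start a new cluster
        else:
            clusters[best_idx].append(doc)  # join the best match
    return clusters

# Hypothetical two-term document vectors.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = single_pass(docs, threshold=0.8)
print([len(c) for c in clusters])  # [2, 2]
```

The first document always starts the first cluster because no cluster exists yet to match against. The result depends on the order of the documents, which is why an iterative re-shuffling step is often added afterwards.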
Representation of clustered hierarchies
Kohonen feature maps on text
• Clustering is used in information retrieval systems to enhance the
efficiency and effectiveness of the retrieval process.
• Clustering is achieved by partitioning the documents in a collection
into classes such that documents that are associated with each other
are assigned to the same cluster.
Types of Clustering
Desiderata for clustering
Non-hierarchical (partitioning) clustering
• Partitional clustering algorithms produce a set of k non-nested
partitions corresponding to k clusters of n objects.
• Advantage: it is not necessary to compare each object with every
other object; only comparisons between objects and cluster centroids
are needed
• Partitioning algorithms such as K-means need O(kn) object-centroid
comparisons per iteration
• Main algorithm: K-means
K-means Clustering

• Input: number K of clusters to be generated
• Each cluster is represented by the centroid of its documents
• K-means algorithm:
  • partition the documents among the K clusters
  • assign each document to the cluster with the closest centroid
  • recompute the centroids
  • repeat the process until the centroids do not change
• Vector space model:
  • As in vector space classification, we measure relatedness between
  vectors by Euclidean distance, which is almost equivalent to cosine
  similarity (for length-normalized vectors the two give the same
  ranking).
• Each cluster in K-means is defined by a centroid.
• Objective/partitioning criterion: minimize the average squared
difference from the centroid
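A short check of why Euclidean distance and cosine similarity are nearly interchangeable here: for vectors normalised to unit length, the squared Euclidean distance equals 2 · (1 − cosine), so a smaller distance always means a larger cosine. The sample vectors and function names below are illustrative assumptions.

```python
import math

def normalize(v):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors, length-normalised first.
u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

lhs = squared_euclidean(u, v)
rhs = 2 * (1 - cosine(u, v))
print(abs(lhs - rhs) < 1e-12)  # True
```

Without length normalisation the identity does not hold, and the two measures can rank documents differently.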
K-means: Basic idea
K-means algorithm
• We try to find the minimum average squared difference by iterating
two steps: 
• reassignment: assign each vector to its closest centroid
• recomputation: recompute each centroid as the average of the
vectors that were assigned to it in reassignment
• K-means can start by selecting K randomly chosen objects, the seeds,
as the initial cluster centers.
• It then moves the cluster centers around in space in order to minimize
RSS. The Residual Sum of Squares measures how well the centroids
represent the members of their clusters: it is the squared distance of
each vector from its centroid, summed over all vectors. This is done
iteratively by repeating two steps (reassignment, recomputation)
until a stopping criterion is met.
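Written out, the Residual Sum of Squares described above takes the following form. The notation is an assumption made for illustration: ω_k is the k-th cluster, μ(ω_k) its centroid, and x ranges over the document vectors.

```latex
\mu(\omega_k) = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} \vec{x}
\qquad
\mathrm{RSS}_k = \sum_{\vec{x} \in \omega_k} \left\| \vec{x} - \mu(\omega_k) \right\|^2
\qquad
\mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k
```

Reassignment cannot increase RSS because each vector moves to its closest centroid, and recomputation cannot increase it because the mean minimises the sum of squared distances within a cluster.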
• Algorithm
• Input:
  • K: number of clusters
  • D: data set containing n objects
• Output: a set of K clusters
• Steps:
• 1. Arbitrarily choose K objects from D as the initial cluster centers
• 2. Repeat
• 3. Reassign each object to the cluster to which it is most similar,
based on the distance measure
• 4. Recompute the centroid of each newly formed cluster
• 5. Until no change
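The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the optional seeds argument, the max_iter cap and the toy points are assumptions added so that the run is reproducible, and Euclidean distance is taken as the distance measure.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    # Component-wise mean of a non-empty list of points.
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(data, k, seeds=None, max_iter=100):
    # Step 1: choose k objects from the data as initial cluster centers.
    centroids = list(seeds) if seeds is not None else random.sample(data, k)
    assignment = None
    for _ in range(max_iter):  # Step 2: repeat
        # Step 3: reassign each object to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in data
        ]
        # Step 5: stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute the centroid of each cluster.
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return assignment, centroids

# Hypothetical toy data: two well-separated pairs of points.
data = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = kmeans(data, k=2, seeds=[(0, 0), (10, 10)])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

With random seeds the cluster numbering and, on harder data, the final clusters can differ between runs, so it is common to run the algorithm several times and keep the result with the lowest RSS.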
• Worked example: applying the k-means algorithm to the given data.
• Given data: Medicine A: (1,1), Medicine B: (2,1), Medicine C: (4,3),
Medicine D: (5,4)
• We want to cluster these medicines into k=2 clusters based on their
attributes (weight index and pH).
• Step 1: Initialisation: choose k=2 of the medicines as the initial
centroids.
• Step 2: Assignment step:
  • Calculate the distance of each medicine to each centroid using
  Euclidean distance.
  • Assign each medicine to the nearest centroid.
  • Form clusters based on the assignments.
• Step 3: Update step: recompute each centroid as the mean of the
medicines assigned to it.
• Step 4: Repeat:
  • Repeat steps 2 and 3 until convergence.
  • This example needs only two iterations, so we can take the result
  as final.
• So, the final clusters are:
  • Cluster 1: Medicine A, Medicine B (Centroid: (1.5,1))
  • Cluster 2: Medicine C, Medicine D (Centroid: (4.5,3.5))
• This is how the k-means clustering algorithm works on the given
data.
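One way to confirm that the final result above has converged is to run one more assignment and update step from the stated centroids and check that nothing changes. The sketch below does exactly that; the helper names assign and recompute are hypothetical, and cluster indices 0 and 1 stand for Cluster 1 and Cluster 2.

```python
import math

medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1.5, 1.0), (4.5, 3.5)]  # final centroids from the example

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centroids):
    # Map each medicine to the index of its nearest centroid.
    return {
        name: min(range(len(centroids)),
                  key=lambda j: euclidean(p, centroids[j]))
        for name, p in points.items()
    }

def recompute(points, assignment, k):
    # Mean of the points assigned to each cluster.
    result = []
    for j in range(k):
        members = [points[n] for n, a in assignment.items() if a == j]
        result.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return result

assignment = assign(medicines, centroids)
new_centroids = recompute(medicines, assignment, k=2)

print(assignment)                  # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
print(new_centroids == centroids)  # True: the centroids no longer move
```

Because the recomputed centroids equal the ones we started from, a further iteration would produce the same clusters, which is the "no change" stopping condition.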
Hierarchical Clustering
• Goal: to create a hierarchy of clusters by either decomposing a large
cluster into smaller ones, or agglomerating previously defined clusters
into larger ones
• The tree-based hierarchical taxonomy built from a set of documents is
called a dendrogram
• There are two types of hierarchical clustering:
  • Agglomerative
  • Divisive
• The method used for computing cluster distances defines three variants
of the algorithm:
• 1. single-linkage
• 2. complete-linkage
• 3. average-linkage
Methods to find closest pair of clusters:
• Single linkage:
• In single linkage hierarchical clustering, the distance between two
clusters is defined as the shortest distance between a point in one
cluster and a point in the other.
• For example, the distance between clusters “r” and “s” in the figure
is equal to the length of the arrow between their two closest points.
• Complete linkage:
• In complete linkage hierarchical clustering, the distance between two
clusters is defined as the longest distance between a point in one
cluster and a point in the other. For example, the distance between
clusters “r” and “s” in the figure is equal to the length of the arrow
between their two furthest points.
• Average linkage:
• In average linkage hierarchical clustering, the distance between two
clusters is defined as the average distance between each point in one
cluster and every point in the other cluster. For example, the distance
between clusters “r” and “s” in the figure is equal to the average
length of the arrows connecting the points of one cluster to the points
of the other.
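The three linkage criteria, and the way an agglomerative method uses them to pick the closest pair of clusters, can be sketched as below. This is a naive illustration assuming Euclidean distance between points; the function names, the num_clusters stopping rule and the sample clusters r and s are assumptions, and the slides do not prescribe any particular implementation.

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(r, s):
    # Shortest distance between a point of r and a point of s.
    return min(euclidean(p, q) for p, q in product(r, s))

def complete_linkage(r, s):
    # Longest distance between a point of r and a point of s.
    return max(euclidean(p, q) for p, q in product(r, s))

def average_linkage(r, s):
    # Mean distance over all cross-cluster pairs of points.
    return sum(euclidean(p, q) for p, q in product(r, s)) / (len(r) * len(s))

def agglomerate(points, linkage, num_clusters):
    # Start with one cluster per point; repeatedly merge the closest pair.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical clusters "r" and "s".
r = [(0.0, 0.0), (0.0, 1.0)]
s = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(r, s))    # 3.0
print(complete_linkage(r, s))  # sqrt(17)
print(average_linkage(r, s))   # (3 + 4 + sqrt(10) + sqrt(17)) / 4
print(agglomerate(r + s, single_linkage, 2))
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, gives the dendrogram. The loop recomputes every pairwise linkage on each pass, so it is only practical for small inputs.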
