
Assignment No: 2

Aim: Implementation of Single-pass Algorithm for Clustering.

Objective: To study,
1. Concept of Clustering.
2. Single-pass clustering Algorithm.
3. Measure of Association.

Theory:

Clustering:-

Sometimes several classifications may naturally be associated with each other, having
many concepts in common. These classifications form a cluster. The clustering may be based
on physical location.

In functional clustering, classifications are centered around functions. In data clustering, classifications are centered around data, while in object-based clustering, classifications are centered around objects. The object-based paradigm uses the concept of class hierarchies to naturally express clustering.

Encapsulation is a concept similar to clustering. In encapsulation, an enclosure is placed around a grouping, and interactions are allowed only through well-defined openings. A cluster may not have such a well-defined enclosure.

The Goals of Clustering:-

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs. For instance, a user could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).

Clustering Requirements:-
The main requirements that a clustering algorithm should satisfy are:
• Scalability;
• Dealing with different types of attributes;
• Discovering clusters with arbitrary shape;
• Minimal requirements for domain knowledge to determine input parameters;
• Ability to deal with noise and outliers;
• Insensitivity to order of input records;
• High dimensionality;
• Interpretability and usability.

Single Pass Clustering:-

Single-pass clustering is a quick way to cluster stream data incrementally. It provides a simple yet flexible technique for handling a data stream. Given a collection of clusters and a threshold value h, a new document n is appended to the cluster to which it has the highest similarity, provided that similarity exceeds h; if no such cluster exists, a new cluster is created containing only the document n. Single-pass clustering is clearly suitable for incremental clustering of temporal data (or a data stream) since, once a document is assigned to a cluster, the assignment is not changed later.

Single-pass methods:-

 Similarity: Computed between the input document and all representatives of existing clusters
 Time: O(N log N)
 Space: O(M)
 Advantages: Simple, requiring only one pass through the data; may be useful as a starting point for reallocation methods
 Disadvantages: Produces large clusters early in the process; clusters formed are dependent on the order of the input data

Single-pass Algorithm:-

1. Assign the first document D1 as the representative of cluster C1.
2. Calculate the similarity Sj between document Di and each existing cluster, keeping track of the largest value, Smax.
3. If Smax is greater than Sthreshold, add the document to the corresponding cluster; otherwise create a new cluster with Di as its centroid.
4. If documents remain, repeat from step 2.
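A minimal Python sketch of these steps is given below. It assumes documents are represented as term-frequency dictionaries and uses cosine similarity as the measure of association; the function names (single_pass, cosine_sim) and the sample data are illustrative, not part of the original assignment.

import math

def cosine_sim(a, b):
    # Cosine similarity between two term-frequency dictionaries.
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def single_pass(documents, threshold):
    # Each document joins the most similar cluster if that similarity
    # exceeds the threshold; otherwise it starts a new cluster.
    # Cluster representatives are centroids (mean term weights).
    clusters = []    # list of lists of document indices
    centroids = []   # one centroid (term -> weight) per cluster
    for i, doc in enumerate(documents):
        if not clusters:
            clusters.append([i])
            centroids.append(dict(doc))
            continue
        # Step 2: similarity between the document and every cluster representative.
        sims = [cosine_sim(doc, c) for c in centroids]
        best = max(range(len(sims)), key=lambda k: sims[k])
        # Step 3: assign to the closest cluster or start a new one.
        if sims[best] > threshold:
            clusters[best].append(i)
            members = clusters[best]
            centroid = {}
            for m in members:
                for term, w in documents[m].items():
                    centroid[term] = centroid.get(term, 0.0) + w / len(members)
            centroids[best] = centroid
        else:
            clusters.append([i])
            centroids.append(dict(doc))
    return clusters

# Illustrative run with made-up term-frequency vectors:
docs = [
    {"data": 2, "cluster": 1},
    {"data": 1, "cluster": 2},
    {"graph": 3, "tree": 1},
]
print(single_pass(docs, threshold=0.59))   # -> [[0, 1], [2]]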

Measures of Association:-
Association is the similarity between objects characterized by discrete-state attributes. A measure of similarity or association is designed to quantify the likeness between objects in such a way that, if an object in a group is more like the other members of that group than it is like any object outside the group, a clustering method can discover such a group structure.

There are five commonly used measures of association in IR.

1. | X ∩ Y |                                   Simple matching coefficient
2. 2 | X ∩ Y | / ( | X | + | Y | )             Dice's coefficient
3. | X ∩ Y | / | X ∪ Y |                       Jaccard's coefficient
4. | X ∩ Y | / ( | X |^(1/2) * | Y |^(1/2) )   Cosine coefficient
5. | X ∩ Y | / min( | X |, | Y | )             Overlap coefficient
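For reference, the five coefficients can be sketched in Python over sets of index terms; the function names and sample sets below are illustrative only.

import math

def simple_matching(x, y):
    return len(x & y)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

# Example: X and Y are the sets of index terms of two documents.
X = {"data", "cluster", "graph"}
Y = {"data", "cluster", "tree", "index"}
print(jaccard(X, Y))   # 2 / 5 = 0.4
print(dice(X, Y))      # 4 / 7 = 0.571...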

Example:-
Objects {1, 2, 3, 4, 5, 6}

Threshold: 0.59

Clusters are:-

Conclusion:- Thus, we have implemented the single-pass algorithm for clustering.

FAQs:-

1. What is clustering?
Ans:-
Sometimes several classifications may naturally be associated with each other, having
many concepts in common. These classifications form a cluster. The clustering may be based
on physical location.
In functional clustering, classifications are centered around functions. In data
clustering, classifications are centered around data, while in object based clustering,
classifications are centered around objects. The object based paradigm uses the concept of
class hierarchies to naturally express clustering.

2. What are the measures of association?


Ans:-
Association is the similarity between objects characterized by discrete-state attributes. There are five commonly used measures of association in IR.

1. | X ∩ Y |                                   Simple matching coefficient
2. 2 | X ∩ Y | / ( | X | + | Y | )             Dice's coefficient
3. | X ∩ Y | / | X ∪ Y |                       Jaccard's coefficient
4. | X ∩ Y | / ( | X |^(1/2) * | Y |^(1/2) )   Cosine coefficient
5. | X ∩ Y | / min( | X |, | Y | )             Overlap coefficient

3. What is distance based clustering?


Ans:-
To identify the clusters into which a data set can be divided, the similarity criterion can be distance: two or more objects belong to the same cluster if they are "close" according to a given distance (in this case, geometrical distance). This is called distance-based clustering.
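As a small illustration, assuming 2-D points, Euclidean distance, and an arbitrary threshold (none of which come from the original example):

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Objects are "close" (same cluster) when their distance is below a threshold.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.1)]
threshold = 1.0
print(euclidean(points[0], points[1]) < threshold)   # True  -> same cluster
print(euclidean(points[0], points[2]) < threshold)   # False -> different clusters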

4. What is conceptual clustering?


Ans:-
Conceptual clustering:- Two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

5. What is the goal of clustering?


Ans:-
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data.

6. What is the single-pass algorithm?


Ans:-
1. Assign the first document D1 as the representative of cluster C1.
2. Calculate the similarity Sj between document Di and each existing cluster, keeping track of the largest value, Smax.
3. If Smax is greater than Sthreshold, add the document to the corresponding cluster; otherwise create a new cluster with Di as its centroid.
4. If documents remain, repeat from step 2.
