
Assignment No: 2

Aim: Implementation of Single-pass Algorithm for Clustering.

Objective: To study,
1. Concept of Clustering.
2. Single-pass clustering Algorithm.
3. Measure of Association.

Theory:

Clustering:-

Sometimes several classifications may naturally be associated with each other, having
many concepts in common. These classifications form a cluster. The clustering may be based
on physical location.

In functional clustering, classifications are centered around functions. In data clustering, classifications are centered around data, while in object-based clustering, classifications are centered around objects. The object-based paradigm uses the concept of class hierarchies to naturally express clustering.

Encapsulation is a concept similar to clustering. In encapsulation, an enclosure is placed around a grouping, and interactions are allowed only through well-defined openings. A cluster may not have such a well-defined enclosure.

The Goals of Clustering:-

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs. For instance, a user could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).

Clustering Requirements:-
The main requirements that a clustering algorithm should satisfy are:
• Scalability;
• Dealing with different types of attributes;
• Discovering clusters with arbitrary shape;
• Minimal requirements for domain knowledge to determine input parameters;
• Ability to deal with noise and outliers;
• Insensitivity to order of input records;
• High dimensionality;
• Interpretability and usability.

Single Pass Clustering:-

Single-pass clustering is a quick way to cluster stream data incrementally. It provides a simple yet flexible technique for handling a data stream. Given a collection of clusters and a threshold value h, a new document n is appended to the cluster to which it has the highest similarity, provided that similarity exceeds h; if no such cluster exists, a new cluster is created containing only the document n. Single-pass clustering is clearly suitable for incremental clustering of temporal data (or a data stream) since, once a document is assigned to a cluster, the assignment is not changed later.

Single-pass methods:-

 Similarity: Computed between the input document and all representatives of existing clusters
 Time: O(N log N)
 Space: O(M)
 Advantages: Simple, requiring only one pass through the data; may be useful as a starting point for reallocation methods
 Disadvantages: Produces large clusters early in the process; clusters formed are dependent on the order of the input data

Single-pass Algorithm:-

1. Assign the first document D1 as the representative of cluster C1.
2. Calculate the similarity Sj between document Di and each existing cluster, keeping track of the largest value, Smax.
3. If Smax is greater than Sthreshold, add the document to the corresponding cluster; otherwise create a new cluster with Di as its centroid.
4. If documents remain, repeat from step 2.
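A minimal Python sketch of these steps is given below. It assumes documents are represented as term-frequency dictionaries and uses cosine similarity as the measure of association; the function names (single_pass, cosine_sim) and the sample data are illustrative, not part of the original assignment.

import math

def cosine_sim(a, b):
    # Cosine similarity between two term-frequency dictionaries.
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def single_pass(documents, threshold):
    # Each document joins the most similar cluster if that similarity
    # exceeds the threshold; otherwise it starts a new cluster.
    # Cluster representatives are centroids (mean term weights).
    clusters = []    # list of lists of document indices
    centroids = []   # one centroid (term -> weight) per cluster
    for i, doc in enumerate(documents):
        if not clusters:
            clusters.append([i])
            centroids.append(dict(doc))
            continue
        # Step 2: similarity between the document and every cluster representative.
        sims = [cosine_sim(doc, c) for c in centroids]
        best = max(range(len(sims)), key=lambda k: sims[k])
        # Step 3: assign to the closest cluster or start a new one.
        if sims[best] > threshold:
            clusters[best].append(i)
            members = clusters[best]
            centroid = {}
            for m in members:
                for term, w in documents[m].items():
                    centroid[term] = centroid.get(term, 0.0) + w / len(members)
            centroids[best] = centroid
        else:
            clusters.append([i])
            centroids.append(dict(doc))
    return clusters

# Illustrative run with made-up term-frequency vectors:
docs = [
    {"data": 2, "cluster": 1},
    {"data": 1, "cluster": 2},
    {"graph": 3, "tree": 1},
]
print(single_pass(docs, threshold=0.59))   # -> [[0, 1], [2]]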

Measures of Association:-
Association is the similarity between objects characterized by discrete-state attributes. A measure of similarity or association is designed to quantify the likeness between objects in such a way that, if an object in a group is more like the other members of that group than it is like any object outside the group, a clustering method can discover such a group structure.

There are five commonly used measures of association in IR.

1. | X ∩ Y |                                   Simple matching coefficient
2. 2 | X ∩ Y | / ( | X | + | Y | )             Dice's coefficient
3. | X ∩ Y | / | X ∪ Y |                       Jaccard's coefficient
4. | X ∩ Y | / ( | X |^(1/2) * | Y |^(1/2) )   Cosine coefficient
5. | X ∩ Y | / min( | X |, | Y | )             Overlap coefficient
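For reference, the five coefficients can be sketched in Python over sets of index terms; the function names and sample sets below are illustrative only.

import math

def simple_matching(x, y):
    return len(x & y)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

# Example: X and Y are the sets of index terms of two documents.
X = {"data", "cluster", "graph"}
Y = {"data", "cluster", "tree", "index"}
print(jaccard(X, Y))   # 2 / 5 = 0.4
print(dice(X, Y))      # 4 / 7 = 0.571...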

Example:-
Objects {1, 2, 3, 4, 5, 6}

Threshold: 0.59

Clusters are:-

Conclusion:- Thus, we have implemented the single-pass algorithm for clustering.

FAQs:-

1. What is clustering?
Ans:-
Sometimes several classifications may naturally be associated with each other, having
many concepts in common. These classifications form a cluster. The clustering may be based
on physical location.
In functional clustering, classifications are centered around functions. In data
clustering, classifications are centered around data, while in object based clustering,
classifications are centered around objects. The object based paradigm uses the concept of
class hierarchies to naturally express clustering.

2. What are the measures of association?


Ans:-
Association is the similarity between objects characterized by discrete-state attributes. There are five commonly used measures of association in IR.

1. | X ∩ Y |                                   Simple matching coefficient
2. 2 | X ∩ Y | / ( | X | + | Y | )             Dice's coefficient
3. | X ∩ Y | / | X ∪ Y |                       Jaccard's coefficient
4. | X ∩ Y | / ( | X |^(1/2) * | Y |^(1/2) )   Cosine coefficient
5. | X ∩ Y | / min( | X |, | Y | )             Overlap coefficient

3. What is distance based clustering?


Ans:-
To identify the clusters into which a data set can be divided, the similarity criterion can be distance: two or more objects belong to the same cluster if they are "close" according to a given distance (in this case, geometrical distance). This is called distance-based clustering.
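As a small illustration, assuming 2-D points, Euclidean distance, and an arbitrary threshold (none of which come from the original example):

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Objects are "close" (same cluster) when their distance is below a threshold.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.1)]
threshold = 1.0
print(euclidean(points[0], points[1]) < threshold)   # True  -> same cluster
print(euclidean(points[0], points[2]) < threshold)   # False -> different clusters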

4. What is conceptual clustering?


Ans:-
Conceptual clustering:- Two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

5. What is the goal of clustering?


Ans:-
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data.

6. What is the single-pass algorithm?


Ans:-
1. Assign the first document D1 as the representative of cluster C1.
2. Calculate the similarity Sj between document Di and each existing cluster, keeping track of the largest value, Smax.
3. If Smax is greater than Sthreshold, add the document to the corresponding cluster; otherwise create a new cluster with Di as its centroid.
4. If documents remain, repeat from step 2.
