UNIT 3 (2 marks) TA

Text clustering involves grouping texts into clusters where texts within a cluster are more similar to each other than texts in other clusters. Text clustering is used for applications like document retrieval, fake news detection, language translation, and spam filtering. Clustering differs from classification in that clustering does not use predefined labels or categories and aims to discover natural groupings within the data.

UNIT 2 (2 marks)

Show the principle behind text clustering.


Text clustering is the task of grouping a set of unlabelled texts in such a way that
texts in the same cluster are more similar to each other than to those in other
clusters. Text clustering algorithms process text and determine if natural clusters
(groups) exist in the data.

Summarize the use cases of text clustering.


Document retrieval
Fake news detection
Language Translation
Spam mail filtering
Taxonomy generation

How does text clustering differ from text classification?

Classification is a supervised learning approach that maps an
input to an output based on example input-output pairs.
Clustering is an unsupervised learning approach.

• Classification: If the predicted value is a category such as
yes/no or positive/negative, the task is a classification
problem in machine learning. The different classes are known
in advance. For example, given a sentence, predict whether it
is a negative or positive review.
• Clustering: Clustering is the task of partitioning the
dataset into groups called clusters. The goal is to split
the data in such a way that points within a single cluster are
very similar and points in different clusters are different. It
determines groupings among unlabelled data.

Compare and Contrast Clustering vs. Categorization
Clustering involves grouping similar data points or objects together based on their inherent
similarities. Clustering algorithms generally use unsupervised learning techniques to group
data points. Categorization involves assigning objects or data points to predefined
categories or classes. Categorization algorithms use supervised learning techniques to
classify new data points.

The key difference between clustering and categorization is that clustering is often used to
identify new patterns and insights in a dataset. By contrast, categorization is used to classify
new data points based on pre-existing knowledge about the categories.

How do I define or extract textual features for clustering?

The mapping from textual data to real-valued vectors is called
feature extraction. One of the simplest techniques to
numerically represent text is Bag of Words (BOW). In BOW, we
make a list of unique words in the text corpus called vocabulary.
Then we can represent each sentence or document as a vector,
with each word represented as 1 for presence and 0 for absence.
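
The binary Bag of Words representation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the function names build_vocabulary and to_binary_vector and the toy corpus are hypothetical and not taken from any library.

```python
def build_vocabulary(documents):
    """Return the sorted list of unique lowercase words in the corpus."""
    vocabulary = set()
    for doc in documents:
        vocabulary.update(doc.lower().split())
    return sorted(vocabulary)


def to_binary_vector(document, vocabulary):
    """Map one document to a 0/1 vector: 1 if the vocabulary word occurs in it."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Toy corpus (assumed for illustration only).
corpus = ["the cat sat", "the dog barked", "the cat barked"]
vocab = build_vocabulary(corpus)
vectors = [to_binary_vector(doc, vocab) for doc in corpus]

print(vocab)    # ['barked', 'cat', 'dog', 'sat', 'the']
print(vectors)  # one 0/1 vector per document, aligned with vocab
```

Each vector has one position per vocabulary word, so documents of different lengths all map to vectors of the same size, which is what a clustering algorithm needs.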

Classify the levels of text clustering.


Document level
Word level
Sentence level
Mention some common text clustering algorithms.
K-means
Hierarchical
Graph based
Mixture model (Gaussian)
Density based clustering

What are the common challenges involved in text clustering?

• Selecting appropriate features of documents that should be
used for clustering.
• Selecting an appropriate similarity measure between
documents.
• Selecting an appropriate clustering method utilising the
above similarity measure.
• Implementing the clustering algorithm in an efficient way that
makes it feasible in terms of memory and CPU resources.

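
As an illustration of the similarity-measure choice listed above, the sketch below computes cosine similarity between document vectors. Cosine similarity is used here only as one common option and is an assumption of this example; the notes do not prescribe a particular measure. The vectors are made-up word-presence vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


# Hypothetical word-presence vectors for three documents.
doc_a = [1, 1, 0, 1]
doc_b = [1, 1, 0, 1]   # same words as doc_a
doc_c = [0, 0, 1, 0]   # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))  # identical word sets give a value of about 1
print(cosine_similarity(doc_a, doc_c))  # no shared words gives 0.0
```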
List some applications of the K-means algorithm.
Spam detection
Document clustering
Image Segmentation
Market Segmentation
Sentiment analysis
Define clustering for document classification.
Clustering is a technique used for document classification that groups similar documents
together based on their content. It involves identifying patterns and similarities in the
documents and using those patterns to group the documents into clusters.

The main steps are:
Preprocessing the documents
Choosing a clustering algorithm
Selecting features
Evaluating the clusters

Define Ward’s method.


Ward's method is an agglomerative hierarchical clustering algorithm used to group data
points or observations into a hierarchy of nested clusters. It is a variance-based method:
at each step it merges the pair of clusters whose merger gives the smallest increase in the
total sum of squared distances between the data points and their cluster centroids. Ward's
method can be computationally intensive and may not be suitable for large datasets.
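
The merge criterion of Ward's method can be illustrated with a short sketch. It computes, for each pair of clusters, how much the within-cluster sum of squared distances would grow if the pair were merged, and picks the cheapest pair. The helper names (centroid, sse, ward_merge_cost) and the toy clusters are assumptions made for this example; a full implementation would repeat this step until one cluster remains.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_merge_cost(cluster_a, cluster_b):
    """Increase in total within-cluster SSE caused by merging two clusters."""
    return sse(cluster_a + cluster_b) - sse(cluster_a) - sse(cluster_b)


# Three toy clusters (assumed data).
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(0.0, 3.0)], [(10.0, 0.0)]]

# One agglomeration step: choose the pair with the smallest SSE increase.
best_pair = min(
    combinations(range(len(clusters)), 2),
    key=lambda ij: ward_merge_cost(clusters[ij[0]], clusters[ij[1]]),
)
best_cost = ward_merge_cost(clusters[best_pair[0]], clusters[best_pair[1]])
print(best_pair, best_cost)  # the first two clusters are merged first
```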

How can I evaluate the efficiency of a text clustering algorithm?
1. Internal evaluation metrics
2. External evaluation metrics
3. Visualization techniques
4. Domain-specific evaluation
5. Human evaluation
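
As one concrete example of an external evaluation metric, the sketch below computes purity: the fraction of documents that carry the majority true label of their cluster. Purity is chosen here only for illustration and is not named in the notes; the cluster assignments and labels are made up.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of items that match the majority true label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, []).append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return majority_total / len(true_labels)


# Hypothetical clustering of six documents with known topics.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["sport", "sport", "politics", "politics", "politics", "sport"]

score = purity(cluster_ids, true_labels)
print(score)  # 4 of the 6 documents agree with their cluster's majority label
```

External metrics like this one need ground-truth labels; internal metrics use only the data and the cluster assignments.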

Can the K-means algorithm be treated as a two-phase approach? Justify it.
Yes, the K-means algorithm can be treated as a two-phase approach, consisting of the
following two phases:

Initialization Phase: In this phase, the initial centroids of the clusters are selected randomly or
based on some prior knowledge of the data.

Iterative Refinement Phase: In this phase, each data point is assigned to the nearest centroid,
and the centroids are updated to the mean of the data points assigned to them. These two
steps repeat until the assignments stop changing.
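
The two phases described above can be sketched as follows. This is a minimal illustration, assuming random initialization from the data points, squared Euclidean distance, and a fixed seed for reproducibility; the function names and the sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Phase 1 (initialization): pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    assignment = None
    # Phase 2 (iterative refinement): assign, then update, until nothing changes.
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return assignment, centroids


# Two well-separated groups of points (assumed data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2)
print(labels)  # the first three points share one label, the last three the other
```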

Disadvantages of k-means Clustering.


The K-means clustering algorithm has the following disadvantages:

• It requires the number of clusters (k) to be specified in advance.
• It is sensitive to noisy data and outliers.
• It is not suitable for identifying clusters with non-convex shapes.

Tell about CLARANS.


CLARANS, or Clustering Large Applications based on RANdomized Search, is a clustering
algorithm that was developed by Raymond T. Ng and Jiawei Han in 1994. It is a partitional
clustering algorithm that is similar to k-means, but it uses medoids (actual data points) as
cluster centres and looks for a good clustering by randomized search. One of the main
advantages is that it is able to handle large datasets efficiently, since at each step it
examines only a random sample of neighbouring solutions rather than all of them.
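
A simplified sketch of the randomized search idea is given below. It assumes the usual medoid-based formulation: a solution is a set of k medoids, a neighbouring solution differs in exactly one medoid, and the search restarts several times. The parameter names num_local and max_neighbor, the Euclidean distance, and the sample data are assumptions of this example, not a faithful reproduction of the original algorithm.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid (by index)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def clarans(points, k, num_local=3, max_neighbor=20, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(points)))
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_local):  # independent restarts
        current = rng.sample(indices, k)
        current_cost = total_cost(points, current)
        failures = 0
        while failures < max_neighbor:
            # A neighbour replaces one medoid with a random non-medoid point.
            swap_out = rng.randrange(k)
            swap_in = rng.choice([i for i in indices if i not in current])
            neighbour = current[:]
            neighbour[swap_out] = swap_in
            neighbour_cost = total_cost(points, neighbour)
            if neighbour_cost < current_cost:
                current, current_cost = neighbour, neighbour_cost
                failures = 0  # improvement found: restart the neighbour count
            else:
                failures += 1
        if current_cost < best_cost:
            best_medoids, best_cost = current, current_cost
    return sorted(best_medoids), best_cost


# Two well-separated groups of points (assumed data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
medoids, cost = clarans(points, k=2)
print(medoids, cost)  # indices of the chosen medoids and the total distance
```

Because only max_neighbor randomly chosen swaps are tried before a solution is accepted as a local optimum, the search never has to enumerate every possible swap.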

List the methods used to measure the dissimilarity between two clusters.


1. Single linkage
2. Complete linkage
3. Average linkage
4. Centroid linkage
5. Ward's method
6. Minimum distance (another name for single linkage)
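
The first four measures in the list can be written directly from their definitions. The sketch below assumes Euclidean distance between points; the function names and the two sample clusters are made up for illustration.

```python
import math


def single_linkage(a, b):
    """Smallest distance between any point of a and any point of b."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Largest distance between any point of a and any point of b."""
    return max(math.dist(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Mean of all pairwise distances between the two clusters."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_linkage(a, b):
    """Distance between the centroids (component-wise means) of the clusters."""
    def centroid(points):
        n = len(points)
        return [sum(p[d] for p in points) / n for d in range(len(points[0]))]
    return math.dist(centroid(a), centroid(b))


# Two small clusters on a line (assumed data).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
print(centroid_linkage(cluster_a, cluster_b))  # 4.5
```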
