UNIT 3 (2marks) TA

Text clustering involves grouping texts into clusters where texts within a cluster are more similar to each other than texts in other clusters. Text clustering is used for applications like document retrieval, fake news detection, language translation, and spam filtering. Clustering differs from classification in that clustering does not use predefined labels or categories and aims to discover natural groupings within the data.

Uploaded by

aathyukthas.ai20001


UNIT 2 (2 marks)

Show the principle behind text clustering.

Text clustering is the task of grouping a set of unlabelled texts in such a way that
texts in the same cluster are more similar to each other than to those in other
clusters. Text clustering algorithms process text and determine if natural clusters
(groups) exist in the data.

Summarize the use cases of text clustering.

Document retrieval
Fake news detection
Language Translation
Spam mail filtering
Taxonomy generation
How does text clustering differ from text classification?

Classification is a supervised learning approach that maps an
input to an output based on example input-output pairs.
Clustering is an unsupervised learning approach.

• Classification: If the predicted value is a category, such as
yes/no or positive/negative, the task is a classification
problem in machine learning. The different classes are known
in advance. For example, given a sentence, predict whether
it is a negative or a positive review.
• Clustering: Clustering is the task of partitioning the
dataset into groups called clusters. The goal is to split up
the data so that points within a single cluster are very
similar and points in different clusters are different. It
discovers groupings among unlabelled data.
Compare and Contrast Clustering vs. Categorization
Clustering involves grouping similar data points or objects together based on their inherent
similarities. Clustering algorithms generally use unsupervised learning techniques to group
data points. Categorization involves assigning objects or data points to predefined
categories or classes. Categorization algorithms use supervised learning techniques to
classify new data points.

The key difference between clustering and categorization is that clustering is often used to
identify new patterns and insights in a dataset. By contrast, categorization is used to classify
new data points based on pre-existing knowledge about the categories.

How do I define or extract textual features for clustering?

The mapping from textual data to real-valued vectors is called
feature extraction. One of the simplest techniques to
numerically represent text is Bag of Words (BOW). In BOW, we
make a list of the unique words in the text corpus, called the
vocabulary. Then we can represent each sentence or document as a
vector with one entry per vocabulary word: 1 if the word is
present and 0 if it is absent.
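
The Bag of Words idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: words are split on whitespace and lowercased, and the helper names build_vocabulary and to_binary_vector and the three-sentence corpus are made up for this example.

```python
def build_vocabulary(documents):
    """Return the sorted list of unique lowercase words in the corpus."""
    words = set()
    for doc in documents:
        words.update(doc.lower().split())
    return sorted(words)


def to_binary_vector(document, vocabulary):
    """Encode one document: 1 if the vocabulary word occurs in it, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical toy corpus, used only for illustration.
corpus = ["the cat sat", "the dog barked", "the cat barked"]
vocabulary = build_vocabulary(corpus)
vectors = [to_binary_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)  # ['barked', 'cat', 'dog', 'sat', 'the']
print(vectors[0])  # [0, 1, 0, 1, 1]
```

Each document becomes a vector of the same length as the vocabulary, which is the form that clustering algorithms expect.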

Classify the levels of text clustering.

Document level
Word level
Sentence level
Mention some common text clustering algorithms.
K-means
Hierarchical
Graph based
Mixture model (Gaussian)
Density based clustering
What are the common challenges involved in text clustering?

• Selecting appropriate features of documents that should be
used for clustering.
• Selecting an appropriate similarity measure between
documents.
• Selecting an appropriate clustering method utilising the
above similarity measure.
• Implementing the clustering algorithm in an efficient way that
makes it feasible in terms of memory and CPU resources.
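
As an illustration of the second challenge, one commonly used similarity measure between document vectors is cosine similarity. The sketch below assumes documents are already represented as equal-length numeric vectors. The choice of cosine similarity is an assumption made for this example, not something the notes prescribe.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two hypothetical binary document vectors sharing two of their words.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 1, 1]
print(cosine_similarity(doc_a, doc_b))
```

A value close to 1 means the two documents use largely the same words, and a value of 0 means they share none.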
List some applications of K Means algorithm.
Spam detection
Document clustering
Image Segmentation
Market Segmentation
Sentiment analysis
Define clustering for document classification.
Clustering is a technique used for document classification that groups similar documents
together based on their content. It involves identifying patterns and similarities in the
documents and using those patterns to group the documents into clusters. The process
typically involves the following steps:

Preprocessing the documents

Choosing a clustering algorithm

Selecting features

Evaluating the clusters

Define Ward’s method.

Ward's method is an agglomerative hierarchical clustering algorithm used to group data points
or observations into a hierarchy of nested clusters. It is a variance-based method: at each
step it merges the pair of clusters whose merger gives the smallest increase in the sum of
squared distances between the data points and their cluster centroids. Ward's method can be
computationally intensive and may not be suitable for large datasets.
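
The variance criterion can be made concrete with a small sketch. It assumes clusters are plain lists of numeric points and measures the cost of a merge as the increase in the sum of squared distances to the centroid. The function names are hypothetical, and the brute-force pair search is for illustration only, not an efficient implementation.

```python
def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_merge_cost(a, b):
    """Increase in total within-cluster SSE caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)


def best_merge(clusters):
    """Indices (i, j) of the pair of clusters with the smallest Ward merge cost."""
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda ij: ward_merge_cost(clusters[ij[0]], clusters[ij[1]]))


# Hypothetical clusters: the first two are close together, the third is far away.
clusters = [[[0, 0]], [[0, 1]], [[5, 5]]]
print(best_merge(clusters))  # (0, 1)
```

Repeating the best merge until one cluster remains yields the nested hierarchy.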
How can I evaluate the efficiency of a text clustering algorithm?
1. Internal evaluation metrics
2. External evaluation metrics
3. Visualization techniques
4. Domain-specific evaluation
5. Human evaluation
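
As one example of an external evaluation metric, purity compares the clusters against known reference labels: each cluster is credited with its most frequent true label, and the credited counts are divided by the number of documents. The notes do not name a specific metric, so the choice of purity, the function name and the toy labels are all assumptions made for this sketch.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Fraction of items that carry the majority true label of their cluster."""
    members_by_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        members_by_cluster.setdefault(cluster, []).append(truth)
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in members_by_cluster.values()
    )
    return majority_total / len(true_labels)


# Hypothetical result: cluster 0 mixes two topics, cluster 1 is pure.
predicted = [0, 0, 0, 1, 1, 1]
reference = ["sport", "sport", "news", "news", "news", "news"]
print(purity(predicted, reference))
```

A purity of 1.0 means every cluster contains documents from a single reference class.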
Can the K-means algorithm be treated as a two-phase approach? Justify it.
Yes, the K-means algorithm can be treated as a two-phase approach, consisting of the
following two phases:

Initialization Phase: In this phase, the initial centroids of the clusters are selected randomly or
based on some prior knowledge of the data.

Iterative Refinement Phase: In this phase, each data point is assigned to the nearest centroid,
and the centroids are updated to the mean of the data points assigned to them. These two
steps repeat until the assignments stop changing.
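
The two phases can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes random initialization from the data points, squared Euclidean distance, and a hypothetical max_iters cap, and it keeps the old centroid if a cluster becomes empty.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Phase 1 (initialization): pick k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Phase 2a (assignment): attach every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Phase 2b (update): move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical data: two well-separated groups of two points each.
data = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, assignments = k_means(data, k=2)
print(sorted(centroids))  # [[0.0, 0.5], [10.0, 10.5]]
```

The result depends on the initial centroids in general, which is why the seed is exposed as a parameter.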

Disadvantages of k-means Clustering.

The K-means clustering algorithm has the following disadvantages:

• It requires the number of clusters (k) to be specified in advance.
• It cannot handle noisy data and outliers well.
• It is not suitable for identifying clusters with non-convex shapes.

Tell about CLARANS.

CLARANS (Clustering Large Applications based on RANdomized Search) is a clustering
algorithm that was developed by Raymond T. Ng and Jiawei Han in 1994. It is a partitional
clustering algorithm that is similar to k-means, but it represents each cluster by a medoid
(an actual data point) and looks for a good set of medoids by randomized search. One of its
main advantages is that it can handle large datasets more efficiently than exhaustive
medoid-based methods, since at each step it examines only a random sample of candidate
medoid swaps rather than all of them.
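
A simplified sketch of the randomized search follows, assuming the usual formulation in which each cluster is represented by a medoid (one of the data points) and a neighbouring solution differs by swapping one medoid for a non-medoid. The parameter names num_local and max_neighbor follow the common description of the algorithm, but the defaults and helper functions here are illustrative assumptions, and the code requires k to be smaller than the number of points.

```python
import random


def distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid (medoids are indices)."""
    return sum(min(distance(p, points[m]) for m in medoids) for p in points)


def clarans(points, k, num_local=3, max_neighbor=20, seed=0):
    rng = random.Random(seed)
    n = len(points)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_local):  # independent restarts
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current)
        failures = 0
        while failures < max_neighbor:
            # Random neighbour: replace one medoid with one non-medoid.
            neighbor = list(current)
            candidates = [i for i in range(n) if i not in current]
            neighbor[rng.randrange(k)] = rng.choice(candidates)
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                failures = 0  # moved to a better solution, restart the count
            else:
                failures += 1
        if current_cost < best_cost:  # keep the best local minimum found so far
            best_medoids, best_cost = current, current_cost
    return sorted(best_medoids), best_cost


# Hypothetical data: two groups of three points each.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
medoids, cost = clarans(data, k=2)
print(medoids, cost)
```

Because only max_neighbor random swaps are tried before a solution is accepted as a local minimum, the search avoids evaluating every possible swap at each step.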

List the methods used to measure the dissimilarity between two clusters.

1. Single linkage
2. Complete Linkage
3. Average linkage
4. Centroid Linkage
5. Ward's method
6. Minimum distance.
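
The first four measures in the list can be written directly from their definitions. The sketch assumes clusters are lists of numeric points compared with Euclidean distance. The function names are made up for this example.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Mean distance over all cross-cluster pairs of points."""
    return sum(euclidean(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(a), centroid(b))


# Hypothetical clusters lying on a line, so the values are easy to check by hand.
left = [[0, 0], [1, 0]]
right = [[3, 0], [5, 0]]
print(single_linkage(left, right))    # 2.0
print(complete_linkage(left, right))  # 5.0
print(average_linkage(left, right))   # 3.5
print(centroid_linkage(left, right))  # 3.5
```

Single linkage takes the minimum pairwise distance, which is why it is also described as the minimum-distance measure.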
