
Text, Web and Social Media Analytics
SE Computer, Sem VIII
Academic Year: 2023 – 24

Clustering and Classification: Text Clustering


Kadambari Deherkar
5th February, 2024
Clustering
The clustering problem is defined to be that of finding groups of similar
objects in the data.
The similarity between the objects is measured with the use of a similarity
function.
Various similarity functions are:
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Cosine Similarity
5. Jaccard Similarity
Clustering in Text

• Objects to be clustered can be of different granularity


✓Documents
✓Paragraph
✓Sentences
✓Terms

Clustering is useful for organizing documents to improve retrieval and support browsing.
Applications of Text Clustering

Document Organization and Browsing:

• Hierarchical organization of documents into coherent categories


• Enables systematic browsing of the document collection
• Documents can be clustered by how similar they are.
• Grouping news articles by topic to improve search results and
recommendation systems
• Organizing customer feedback by theme helps identify critical issues and
improve customer service.
• Clustering research papers by topic assists with literature review and the organization of research work.
Applications of Text Clustering
Corpus summarization:

• Clustering techniques provide a coherent summary of the collection in the form of cluster-digests or word-clusters.
• These can be used in order to provide summary insights into the overall
content of the underlying corpus.
• Variants of these methods, such as sentence clustering, can also be used for
document summarization.
Applications of Text Clustering
Classifying text:
• Clustering can be used as a preprocessing step to put text documents into
categories that have already been set up.

Topic modelling:
• Clustering can be used to find hidden topics in text documents, which can
then determine how the data is organised.
• Topic modeling is an unsupervised machine learning technique that's capable
of scanning a set of documents, detecting word and phrase patterns within
them, and automatically clustering word groups and similar expressions that
best characterize a set of documents.
Text Clustering Algorithms - Concerns
• General purpose algorithms such as K-Means and Hierarchical Clustering can be
extended to perform text clustering.
• Unique properties of text data necessitate consideration of factors such as the dimensionality of the text, representations of the text features such as feature vectors, and other considerations such as frequency normalization, since the number of words in each document can differ.
• These different representations necessitate the design of different classes of
clustering algorithms.
• Text clustering algorithms are divided into a wide variety of types, such as agglomerative clustering algorithms and partitioning algorithms.
• Different clustering algorithms have different tradeoffs in terms of effectiveness
and efficiency.
Preparing Text Data before Clustering
• Text Cleaning: This step eliminates extraneous or unnecessary text, including punctuation, stop words,
special characters, and numbers. By eliminating data noise, this step enhances the clustering algorithm’s
accuracy.

• Tokenization: In this step, the text is broken up into tokens, single words or phrases, so the algorithm can work with discrete units of meaning.

• Stemming/Lemmatization: This step reduces each word to its base form, so that similar words are grouped together and the dataset's dimensionality is reduced.

• Vectorization: In this step, the text is transformed into a numerical representation that the clustering algorithm can use as input, for example bag-of-words or term frequency–inverse document frequency (TF-IDF).

• Dimensionality Reduction: By removing correlated features or projecting the data onto a lower-dimensional
space, this step reduces the number of features. When the dataset has a lot of features and the clustering
algorithm needs to be sped up, this step is especially helpful.
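A minimal end-to-end sketch of these preparation steps is shown below. It assumes scikit-learn and NLTK are available; the tiny raw_docs corpus and the choice of stemmer, TF-IDF weighting, and truncated SVD are illustrative assumptions, not part of the original slides.

```python
# Text-preparation sketch: cleaning, tokenization, stemming, vectorization,
# and dimensionality reduction (scikit-learn and NLTK assumed).
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

raw_docs = ["Clustering groups similar documents together!",
            "Similar documents are grouped by clustering.",
            "Football teams played a great match in 2024."]

stemmer = PorterStemmer()

def clean_and_stem(text):
    # Text cleaning: lowercase, drop punctuation, digits and special characters.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization + stemming: split into tokens and reduce each to a base form.
    return " ".join(stemmer.stem(tok) for tok in text.split())

prepared = [clean_and_stem(d) for d in raw_docs]

# Vectorization: TF-IDF with English stop words removed.
X = TfidfVectorizer(stop_words="english").fit_transform(prepared)

# Dimensionality reduction: project the TF-IDF matrix onto a smaller latent space.
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(X_reduced.shape)   # (number of documents, 2)
```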
Feature Selection

• It is the process of choosing a subset of relevant features/words or terms from the entire set of
features available in a text dataset.

• It's about picking the most important words or elements, those that contribute the most to separating the documents.

• Text data often involves a large number of features (words), and not all of them may be equally
useful for clustering.

• Feature selection helps in reducing the dimensionality of the data, making the clustering
algorithm more efficient and effective.

• It focuses on retaining the most informative words that capture the essence of the text
documents.
Feature Selection
• Reduces Overfitting: Less redundant data means less opportunity to
make decisions based on noise.
• Improves Accuracy: Less misleading data means modeling accuracy
improves.
• Reduces Training Time: fewer features reduce algorithm complexity, and algorithms train faster.
Feature Selection
The quality of any data mining method, such as classification and
clustering, is highly dependent on the noisiness of the features that are
used for the clustering process.

For example, commonly used words such as “the” may not be very useful in improving the clustering quality.

Therefore, it is critical to select the features effectively, so that the noisy words in the corpus are removed before the clustering.
Feature Selection

• Document Frequency Selection Method
• Term Strength
• Entropy-Based Ranking
• Term Contribution
Feature Selection: Document Frequency Selection Method

1. Document Frequency (DF): It tells us how many documents contain a specific word. If a word appears in many documents, it has a high document frequency.
2. Inverse Document Frequency (IDF): A measure that gives more weight to words that are rare across documents. If a word is rare and appears in only a few documents, it gets a higher IDF.
3. This method focuses on the removal of stop words such as “a”, “an”, “the”, “or”, “of”.
4. When dealing with words that appear very often, it is important to remove them if they do not help in telling the difference between different groups of documents.
5. So the focus is on words that are common enough to be relevant but also distinctive enough to help us understand and differentiate between different clusters of documents, as illustrated in the sketch below.
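As an illustration of document-frequency based selection, the sketch below uses scikit-learn's CountVectorizer, whose min_df and max_df thresholds drop very rare terms and stop-word-like terms that appear in almost every document; the corpus and threshold values are assumptions for demonstration only.

```python
# Document-frequency based feature selection sketch (scikit-learn assumed).
# min_df drops very rare terms; max_df drops terms appearing in too many
# documents (stop-word-like terms). Thresholds here are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog"]

vectorizer = CountVectorizer(min_df=2,     # keep terms in at least 2 documents
                             max_df=0.9)   # drop terms in over 90% of documents
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # 'the' is removed, and so are rare terms
```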
Feature Selection: Term Strength

• Term strength is a technique for feature selection in text mining.
• It does not need a pre-defined list of stop words; it discovers them automatically.
• It is therefore also a technique for vocabulary reduction in text retrieval.
• This method estimates term importance based on how often a term appears in "related" documents.
Feature Selection: Term Strength

• The term strength is essentially used to measure how informative a word is for identifying two related documents.
• For example, for two related documents x and y, the term strength s(t) of term t is defined in terms of the following probability:

s(t) = P(t ∈ y | t ∈ x)

that is, for two related documents x and y, the probability that t belongs to y given that it belongs to x.

• It is possible to use automated similarity functions such as the cosine function to define the relatedness of document pairs.
Feature Selection: Term Strength
• A pair of documents is defined to be related if their cosine similarity is above a user-defined threshold.
• In such cases, the term strength s(t) can be estimated over the related pairs as:

s(t) = (number of related pairs (x, y) with t ∈ x and t ∈ y) / (number of related pairs (x, y) with t ∈ x)

• It does not require initial supervision or training data for the feature selection, which is a key requirement in the unsupervised scenario.
• It is more suited to similarity-based clustering.
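A small sketch of how term strength might be estimated in practice is given below, assuming scikit-learn is available; the toy corpus, the cosine threshold of 0.2, and the counting scheme are illustrative choices, not part of the original slides.

```python
# Term-strength sketch: estimate s(t) = P(t in y | t in x) over related
# document pairs (cosine similarity above a user-chosen threshold).
from itertools import permutations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["stock market prices rise",
        "market prices fall on stock news",
        "football team wins the match"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
sims = cosine_similarity(X)
threshold = 0.2

# Ordered related pairs (x, y), x != y, with similarity above the threshold.
related = [(i, j) for i, j in permutations(range(len(docs)), 2)
           if sims[i, j] >= threshold]

terms = vec.get_feature_names_out()
presence = (X > 0).toarray()          # boolean term-presence matrix

for t_idx, term in enumerate(terms):
    pairs_with_t_in_x = [(i, j) for i, j in related if presence[i, t_idx]]
    if pairs_with_t_in_x:
        strength = sum(presence[j, t_idx] for _, j in pairs_with_t_in_x) / len(pairs_with_t_in_x)
        print(f"{term}: {strength:.2f}")
```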
Feature Selection: Entropy Based
• The quality of the term is measured by the entropy reduction when it
is removed.
• The entropy E(t) of the term t in a collection of n documents is defined as follows:

E(t) = − Σ_i Σ_j [ S_ij · log(S_ij) + (1 − S_ij) · log(1 − S_ij) ]

Here S_ij ∈ (0, 1) is the similarity between the i-th and j-th documents in the collection after the term t is removed, and the sums run over all pairs of the n documents.
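The sketch below ranks terms by the entropy reduction obtained when each term is removed, following the formula above; the toy corpus, the clipping constant, and the use of cosine similarity for S_ij are assumptions made for illustration.

```python
# Entropy-based term ranking sketch, using
# E(t) = -sum_ij [ S_ij*log(S_ij) + (1 - S_ij)*log(1 - S_ij) ]
# where S_ij is the pairwise similarity after removing term t.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the market rises", "the market falls", "the team plays football"]

def entropy(sim_matrix, eps=1e-9):
    s = np.clip(sim_matrix, eps, 1 - eps)   # keep similarities strictly inside (0, 1)
    return -np.sum(s * np.log(s) + (1 - s) * np.log(1 - s))

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = vec.get_feature_names_out()
base = entropy(cosine_similarity(X))

for t_idx, term in enumerate(terms):
    X_without_t = np.delete(X, t_idx, axis=1)        # drop the column for term t
    reduction = base - entropy(cosine_similarity(X_without_t))
    print(f"{term}: entropy reduction {reduction:.3f}")
```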
Distance Based Clustering – Cosine Similarity
Cosine similarity is measured mathematically as the dot product of the vectors divided by the product of their magnitudes.
For example, if we have two vectors, A and B, the similarity between them is calculated as:

cos(A, B) = (A · B) / (||A|| × ||B||)
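A tiny NumPy sketch of this computation, with two made-up vectors:

```python
# Cosine similarity of two term vectors: dot product divided by the
# product of their magnitudes.
import numpy as np

A = np.array([1.0, 2.0, 0.0, 1.0])
B = np.array([2.0, 1.0, 1.0, 0.0])

cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cos_sim, 3))   # close to 1 for similar vectors, 0 for orthogonal ones
```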
Distance Based Clustering Algorithms
• These are based on a similarity function to measure the closeness between the text objects.
• The most well-known similarity function commonly used in the text domain is the cosine similarity function.

✓ Hierarchical Agglomerative Clustering
✓ Partition-based Methods
  • K-Medoid Clustering
  • K-Means Clustering (a short K-Means sketch follows below)
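A minimal K-Means sketch over TF-IDF vectors, assuming scikit-learn is available; the corpus and the number of clusters are illustrative:

```python
# Distance-based partitional clustering sketch: K-Means over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stock prices rise", "markets fall sharply",
        "team wins the final", "striker scores twice"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # cluster id assigned to each document
```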
Agglomerative and Hierarchical Clustering Algorithms
• The general concept of agglomerative clustering is to successively merge documents into clusters based on their similarity with one another.

• This approach successively merges groups based on the best pairwise similarity between these groups of documents.

• The method of agglomerative hierarchical clustering is useful to support a variety of searching methods.

• Multiple ways are employed to merge the documents pairwise.

• The variants differ in how this pairwise similarity is computed between the different groups of documents. The similarity between a pair of groups may be computed as the best-case similarity, the average-case similarity, or the worst-case similarity.

• Creates cluster hierarchy/ dendrogram.

• Leaf nodes represent individual documents and intermediate nodes represent clusters.
Methods for Merging Documents in Hierarchical Clustering

• Single Linkage
• Group-Average Linkage
• Complete Linkage
Methods for Merging Documents in Hierarchical Clustering
Single Linkage Clustering
• It merges two groups which are such that their closest pair of documents have
the highest similarity compared to any other pair of groups.
Advantages:
1. Efficient to implement in practice.
2. One can first compute all similarity pairs and sort them in order of decreasing similarity.
Disadvantages:
1. Can lead to the chaining phenomenon, where very different documents end up grouped into the same cluster.
2. If A is similar to B and B is similar to C, it does not always imply that A is similar to C, because of the lack of transitivity in similarity computations.
Methods for Merging Documents in Hierarchical Clustering

Group-Average Linkage Clustering

In group-average linkage clustering, the similarity between two clusters is the average similarity between the pairs of documents in the two clusters.

Advantages
• Does not exhibit the chaining behavior seen in single-linkage clustering
• Forms robust clusters

Disadvantages
• The process is slower than single-linkage clustering
• The average similarity between a large number of pairs must be determined in order to compute the group-wise similarity
Methods for Merging Documents in Hierarchical Clustering

Complete Linkage Clustering

The similarity between two clusters is the worst-case similarity between any pair of documents in the two clusters; equivalently, the distance between two clusters is the maximum distance between members of the two clusters.

Advantages
• Complete-linkage clustering also avoids chaining.
• It also avoids placing any pair of very disparate documents in the same cluster.

Disadvantages
• It is computationally more expensive than the single-linkage method.
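The sketch below runs hierarchical agglomerative clustering with the three linkage strategies discussed above, using SciPy's linkage function on cosine distances; the corpus and the cut into two clusters are illustrative assumptions.

```python
# Hierarchical agglomerative clustering sketch with single, average and
# complete linkage (SciPy and scikit-learn assumed).
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stock prices rise", "markets fall sharply",
        "team wins the final", "striker scores twice"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()
dist = pdist(X, metric="cosine")          # condensed pairwise cosine distances

for method in ("single", "average", "complete"):
    Z = linkage(dist, method=method)      # builds the dendrogram
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
    print(method, labels)
```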
NMF Algorithm
Example:
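The worked example from the original slide is not reproduced here. As a stand-in, the sketch below shows one common way NMF (Non-negative Matrix Factorization) is used for document clustering with scikit-learn: the TF-IDF matrix is factorized into document-topic and topic-term parts, and each document is assigned to its strongest topic. The corpus and the number of components are assumptions for illustration.

```python
# Minimal NMF document-clustering sketch (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["stock prices rise", "markets fall sharply",
        "team wins the final", "striker scores twice"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, random_state=0, max_iter=500)
W = nmf.fit_transform(X)                  # document-topic weights
clusters = W.argmax(axis=1)               # assign each document to its strongest topic
print(clusters)
```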
Word and Phrase based Clustering
• Word and phrase-based clustering involves grouping words or phrases
that are semantically similar or related.
• These clustering techniques are commonly used in natural language
processing (NLP) and text mining to organize and understand the
structure of textual data.
• Two common approaches for word- and phrase-based clustering are described below.
Word Embedding-based Clustering
Objective: Group words based on their semantic similarity in a continuous vector space.
Algorithm:
1. Word Embedding Training: Train a word embedding model (e.g., Word2Vec, GloVe, FastText) on a large corpus to learn vector representations for words.
2. Embedding Similarity: Use the learned word embeddings to compute the similarity or distance between words in the vector space (e.g., cosine similarity).
3. Clustering: Apply a clustering algorithm (e.g., K-Means, hierarchical clustering, DBSCAN) to group words based on their embedding similarities.
Pros:
• Captures semantic relationships between words.
• Works well for words with multiple meanings.
Cons:
• May struggle with rare or out-of-vocabulary words.
• High-dimensional embeddings may require dimensionality reduction before clustering.
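A minimal sketch of this approach, assuming gensim and scikit-learn are installed; the toy sentences, vector size, and number of clusters are illustrative:

```python
# Word-embedding clustering sketch: train Word2Vec, then cluster word vectors.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

sentences = [["stock", "market", "prices", "rise"],
             ["market", "prices", "fall"],
             ["football", "team", "wins", "match"],
             ["team", "scores", "goal"]]

# 1. Train word embeddings on the (tiny) corpus.
model = Word2Vec(sentences, vector_size=50, min_count=1, seed=0)

# 2. Length-normalise the vectors so Euclidean K-Means approximates
#    clustering by cosine similarity.
words = list(model.wv.index_to_key)
vectors = normalize(model.wv[words])

# 3. Cluster the word vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)
```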
Phrase-based Clustering
Objective: Group phrases or multi-word expressions that convey a common semantic
meaning.
Algorithm:
1. Text Preprocessing: Tokenize and preprocess the text to identify phrases or multi-word expressions.
2. Feature Extraction: Represent each phrase as a feature vector using methods like Bag-of-Words or TF-IDF.
3. Clustering: Apply a clustering algorithm (e.g., K-Means, hierarchical clustering) to group similar phrases based on their feature vectors.
Pros:
• Explicitly considers multi-word expressions.
• Suitable for capturing domain-specific terms and phrases.
Cons:
• May require careful preprocessing to identify meaningful phrases.
• Sensitivity to noise in the data.
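A minimal sketch of phrase-based clustering, assuming scikit-learn; the phrase list and number of clusters are illustrative:

```python
# Phrase-based clustering sketch: each multi-word phrase becomes a TF-IDF
# vector, and similar phrases are grouped with K-Means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

phrases = ["machine learning", "deep learning", "supervised learning",
           "stock market", "market crash", "share price index"]

X = TfidfVectorizer().fit_transform(phrases)              # feature extraction
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(X)            # clustering
for phrase, label in zip(phrases, labels):
    print(label, phrase)
```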
Considerations
• Data Quality: The effectiveness of word and phrase-based clustering
depends on the quality and representativeness of the input data.
• Clustering Algorithm: The choice of clustering algorithm and
parameters can significantly impact the results. Experiment with
different algorithms to find the most suitable one for the task.
• Interpretability: Consider the interpretability of the clusters and
whether they align with the semantic relationships or topics of
interest.
Probabilistic Document Clustering
• Probabilistic document clustering involves grouping documents based
on a probabilistic framework that models the likelihood of a
document belonging to a particular cluster.
• These models often use probability distributions to represent the
uncertainty associated with document assignments to clusters.
• One common approach in probabilistic document clustering is the use
of probabilistic topic models. Latent Dirichlet Allocation (LDA) is a
prominent example of such a model.
Latent Dirichlet Allocation (LDA)
Objective: Discover latent topics in a collection of documents and assign
documents to these topics based on the distribution of topics.
Algorithm:
1. Assumptions: Documents are assumed to be mixtures of topics, and each topic is a mixture of words.
2. Model Parameters:
   • K is the number of topics.
   • α is the parameter controlling the document-topic distribution.
   • β is the parameter controlling the topic-word distribution.
3. Document-Topic Assignment: Each document is assumed to be a mix of topics. The distribution of topics for a document is represented by a Dirichlet distribution.
4. Topic-Word Assignment: Each topic is assumed to be a mix of words. The distribution of words for a topic is represented by another Dirichlet distribution.
5. Generative Process: For each document:
   • Choose a distribution of topics.
   • For each word in the document: choose a topic from the document's topic distribution, then choose a word from the word distribution of the chosen topic.
6. Inference: Given a collection of documents, infer the most likely topic distribution for each document and the most likely word distribution for each topic.
7. Document Clustering: Group documents based on their inferred topic distributions.

Pros:
• Provides a probabilistic framework for representing document-topic relationships.
• Automatically discovers topics in an unsupervised manner.
• Documents can belong to multiple topics with different probabilities.
Cons:
• Assumes that each document is a mixture of topics and may not handle overlapping or
non-exclusive document topics well.
• Requires careful tuning of hyperparameters.
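A short LDA sketch using scikit-learn's LatentDirichletAllocation is given below; the corpus, the number of topics K, and the α and β prior values are illustrative assumptions, not values from the slides.

```python
# LDA topic-model sketch for probabilistic document clustering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock market prices rise again",
        "investors fear a market crash",
        "the team wins the football final",
        "striker scores twice in the match"]

X = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,        # K topics
                                doc_topic_prior=0.5,   # alpha
                                topic_word_prior=0.01, # beta
                                random_state=0)
doc_topics = lda.fit_transform(X)          # inferred document-topic distributions
clusters = doc_topics.argmax(axis=1)       # cluster = most probable topic per document
print(clusters)
```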
