Text, Web and Social Media Analytics
SE Computer, Sem VIII, Academic Year: 2023–24
Topic modelling:
• Clustering can be used to find hidden topics in text documents, which can
then guide how the collection is organised.
• Topic modelling is an unsupervised machine learning technique that scans a
set of documents, detects word and phrase patterns within them, and
automatically clusters the word groups and similar expressions that best
characterise the documents.
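A minimal sketch of this idea, assuming scikit-learn's LatentDirichletAllocation as the topic model; the toy corpus, the choice of two topics, and the number of top words shown are illustrative assumptions, not prescriptions from these notes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the stock market fell as investors sold shares",
    "the team won the match with a late goal",
    "shares rallied after the earnings report",
    "the coach praised the players after the game",
]

# Topic models work on word counts (bag-of-words), not raw text.
counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

# Assume 2 hidden topics for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the top words that characterise each discovered topic.
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```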
Text Clustering Algorithms - Concerns
• General purpose algorithms such as K-Means and Hierarchical Clustering can be
extended to perform text clustering.
• The unique properties of text data necessitate consideration of factors such as the
dimensionality of the text, representations of text features such as feature
vectors, and other considerations such as frequency normalization, since the
number of words in each document can differ.
• These different representations necessitate the design of different classes of
clustering algorithms.
• Text clustering algorithms are divided into a wide variety of types, such
as agglomerative clustering algorithms and partitioning algorithms.
• Different clustering algorithms have different tradeoffs in terms of effectiveness
and efficiency.
Preparing Text Data before Clustering
• Text Cleaning: This step eliminates extraneous or unnecessary text, including punctuation, stop words,
special characters, and numbers. By eliminating data noise, this step enhances the clustering algorithm’s
accuracy.
• Tokenization: In this step, the text is broken up into tokens (single words or phrases) so that the
algorithm can work with meaningful units.
• Stemming/Lemmatization: This step reduces each word to its base form, so that similar words are grouped
together, which helps reduce the dataset's dimensionality.
• Vectorization: In this step, the text is transformed into a numerical representation that the clustering
algorithm can use as input. Common representations include bag-of-words and term frequency-inverse document frequency (TF-IDF).
• Dimensionality Reduction: By removing correlated features or projecting the data onto a lower-dimensional
space, this step reduces the number of features. When the dataset has a lot of features and the clustering
algorithm needs to be sped up, this step is especially helpful.
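A minimal sketch of the pipeline above, assuming NLTK's PorterStemmer for stemming and scikit-learn for TF-IDF vectorisation and SVD-based dimensionality reduction; the toy documents and the cleaning regular expression are illustrative.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

stemmer = PorterStemmer()

def clean_and_stem(text):
    # Cleaning: lower-case and drop punctuation/numbers (noise removal).
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization: split into word tokens; stemming: reduce to base form.
    return " ".join(stemmer.stem(tok) for tok in text.split())

docs = ["Cats are chasing mice!", "A cat chased 3 mice.", "Dogs bark loudly."]
prepared = [clean_and_stem(d) for d in docs]

# Vectorization: TF-IDF turns text into numeric feature vectors;
# stop words are removed as part of cleaning.
X = TfidfVectorizer(stop_words="english").fit_transform(prepared)

# Dimensionality reduction: project onto a lower-dimensional space.
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(X_reduced.shape)
```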
Feature Selection
• It is the process of choosing a subset of relevant features/words or terms from the entire set of
features available in a text dataset.
• It is about picking the most important words or terms, i.e., those that contribute most to distinguishing the documents.
• Text data often involves a large number of features (words), and not all of them may be equally
useful for clustering.
• Feature selection helps in reducing the dimensionality of the data, making the clustering
algorithm more efficient and effective.
• It focuses on retaining the most informative words that capture the essence of the text
documents.
Feature Selection
• Reduces Overfitting: Less redundant data means less opportunity to
make decisions based on noise.
• Improves Accuracy: Less misleading data means modeling accuracy
improves.
• Reduces Training Time: fewer features reduce algorithm
complexity, so algorithms train faster.
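A minimal sketch of one common feature selection approach, document-frequency thresholding, using scikit-learn's TfidfVectorizer options; the corpus and all thresholds are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the data mining step finds patterns in the data",
    "the text mining step finds patterns in text documents",
    "the clustering step groups similar documents",
]

# min_df drops very rare terms, max_df drops near-ubiquitous terms
# (here "the" and "step" appear in every document and are removed),
# and max_features keeps only the most frequent remaining terms.
vec = TfidfVectorizer(min_df=1, max_df=0.9, max_features=5)
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # the selected feature subset
```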
Feature Selection
• The quality of any data mining method, such as classification and
clustering, is highly dependent on the noisiness of the features that are
used in the process.
• It does not require initial supervision or training data for the feature
selection, which is essential in the unsupervised scenario.
• It is more suited to similarity-based clustering.
Feature Selection: Entropy Based
• The quality of the term is measured by the entropy reduction when it
is removed.
• The entropy E(t) of the term t in a collection of n documents is
defined as follows:

E(t) = -\sum_{i=1}^{n} \sum_{j=1}^{n} \Big( S_{ij} \cdot \log(S_{ij}) + (1 - S_{ij}) \cdot \log(1 - S_{ij}) \Big)

Here S_{ij} ∈ (0, 1) is the similarity between the ith and jth document in
the collection, after the term t is removed.
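A minimal sketch of the entropy measure above, assuming cosine similarity between TF-IDF vectors as S_ij and a tiny illustrative corpus; clipping keeps S_ij strictly inside (0, 1) so both logarithms are defined.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["apple banana fruit", "banana fruit salad",
        "car engine road", "engine road trip"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = vec.get_feature_names_out()

def entropy(sim):
    # Clip to keep S_ij strictly inside (0, 1) so the logs are defined.
    s = np.clip(sim, 1e-9, 1 - 1e-9)
    return -np.sum(s * np.log(s) + (1 - s) * np.log(1 - s))

for t in range(len(terms)):
    # Remove term t, then evaluate E(t) over the remaining similarities.
    X_wo = np.delete(X, t, axis=1)
    e = entropy(cosine_similarity(X_wo))
    print(f"E({terms[t]!r}) = {e:.3f}")
```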
Distance Based Clustering – Cosine Similarity
Cosine similarity is measured mathematically as the dot product of the
vectors divided by the product of their magnitudes.
For example, if we have two vectors, A and B, the similarity between
them is calculated as:

\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
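A minimal sketch of the formula, computed directly from the dot product and the vector magnitudes; the example vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a, b):
    # dot(A, B) / (|A| * |B|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1.0, 2.0, 0.0])
B = np.array([2.0, 1.0, 1.0])
print(cosine_similarity(A, B))  # 1.0 = same direction, 0.0 = orthogonal
```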
Distance Based Clustering Algorithms
• These algorithms are based on a similarity function that measures the closeness between
text objects.
• The most well-known similarity function, commonly used in
the text domain, is the cosine similarity function.
Agglomerative Hierarchical Clustering
• This approach successively merges groups of documents based on the best pairwise similarity between these groups.
• The method of agglomerative hierarchical clustering is useful to support a variety of searching methods.
• Variants of the method differ in how this pairwise similarity is computed between the different groups of documents. The similarity between a pair of groups may be computed as the:
  - best-case similarity (single linkage)
  - average-case similarity (group-average linkage)
  - worst-case similarity (complete linkage)
• Leaf nodes of the resulting tree represent individual documents and intermediate nodes represent clusters.
Methods for Merging Documents in Hierarchical Clustering
• Single Linkage
• Group – Average Linkage
• Complete Linkage
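A minimal sketch contrasting the three merge criteria, assuming SciPy's hierarchical clustering on cosine distances between TF-IDF vectors; the corpus and the choice of two clusters are illustrative.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "apple banana fruit",
    "banana fruit smoothie",
    "car engine road",
    "engine road trip",
]
X = TfidfVectorizer().fit_transform(docs).toarray()

# Cosine distance = 1 - cosine similarity between document vectors.
dist = pdist(X, metric="cosine")

for method in ("single", "average", "complete"):    # the three linkage criteria
    Z = linkage(dist, method=method)                 # builds the merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two clusters
    print(method, labels)
```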
Methods for Merging Documents in Hierarchical Clustering
Single Linkage Clustering
• It merges the two groups whose closest pair of documents has the highest
similarity compared to any other pair of groups.
Advantages:
1. Efficient to implement in practice.
2. One can first compute all pairwise similarities and sort them in order of decreasing
similarity.
Disadvantages:
1. Can lead to the chaining phenomenon, where quite dissimilar documents can be grouped
into the same cluster through a chain of intermediate documents.
2. If A is similar to B and B is similar to C, it does not always imply that A is similar
to C, because of the lack of transitivity in similarity computations.
Methods for Merging Documents in Hierarchical Clustering
Group-Average Linkage Clustering
• It merges the two groups whose average pairwise document similarity is the highest compared to any other pair of groups.
Advantages:
1. Does not exhibit the chaining behavior seen in single linkage clustering
2. Forms robust clusters
Disadvantages:
1. The process is slower than single linkage clustering
2. Need to determine the average similarity between a large number of pairs in
order to determine group-wise similarity
Probabilistic Topic Modelling
Pros:
• Provides a probabilistic framework for representing document-topic relationships.
• Automatically discovers topics in an unsupervised manner.
• Documents can belong to multiple topics with different probabilities.
Cons:
• Assumes that each document is a mixture of topics and may not handle overlapping or
non-exclusive document topics well.
• Requires careful tuning of hyperparameters.
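A minimal sketch of the first and third points above, assuming scikit-learn's LatentDirichletAllocation as the probabilistic topic model; the corpus and the prior values (the hyperparameters the last con refers to) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks and bonds moved on interest rate news",
    "the striker scored twice in the final match",
    "the club signed a striker despite rising transfer fees and interest",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)

# n_components (number of topics) and the document/topic priors are the
# hyperparameters that typically require careful tuning.
lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.5,
                                topic_word_prior=0.1, random_state=0)
theta = lda.fit_transform(X)  # rows: documents, columns: P(topic | document)

# Each row sums to 1: a document is a probabilistic mixture of topics,
# so it can belong to several topics with different probabilities.
for i, row in enumerate(theta):
    print(f"doc {i}: {row.round(2)}")
```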