ITD253 L6 TextClassificationClustering

The document discusses text classification and clustering techniques, emphasizing the importance of text preprocessing and various learning methods such as supervised, unsupervised, and semi-supervised learning. It covers algorithms like Naïve Bayes, Logistic Regression, and Support Vector Machines for classification, as well as clustering methods like K-means and hierarchical clustering. Additionally, it highlights evaluation metrics such as precision, recall, and F1 score for assessing classifier performance.


1 Text Classification and Clustering
2 Recall Text Analytics Techniques

Text source (social media content, documents)
→ Preprocessing (token format: unigram, bigram etc.)
→ Word representation (vector model, word embedding or feature extraction)
→ Learning methods (classifier, clustering methods)
→ Result / Prediction (target document, summary, sentiment, topic)
3 Text preprocessing

 Recall from last lesson, there is a need to do preprocessing before any analysis.

Source* → 1. Tokenization → 2. Normalization → 3. Stop words removal → 4. Removal → 5. Stemming and/or Lemmatization → 6. Replacing → Tokens

*Source can be a sentence, a paragraph or a document
Processes shown in grey may or may not be needed.
4 Word Representation - VSM
 In vector space model, term vector for an object of interest (paragraph,
document or document collection) is a vector in which each dimension
represents the weight of a given word in the document.

An illustration of term vectors for many documents, containing their TF-IDF values.
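A minimal sketch of how such term vectors can be produced, assuming scikit-learn is available (the example documents are hypothetical):

```python
# Build TF-IDF term vectors: one row per document, one dimension per vocabulary term.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog barked at the cat",
        "stocks fell sharply on friday"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # one row (term vector) per document
print(vec.get_feature_names_out())     # the dimensions: the vocabulary terms
print(X.toarray().round(2))            # each entry is the TF-IDF weight of a term in a document
```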
5 Learning methods

 In order to extract any pattern or information from the word representation,


a set of algorithms is needed for the purpose.
 These algorithms can be categorised into
 Supervised learning (focus of this module)
 Unsupervised learning
 Semi-supervised learning
 Ensemble learning
 Hybrid learning
6 Supervised learning

 Dataset : labelled data – 1) training data, 2) validation data, 3) testing data


 It is a machine learning technique that infers a function or uses a classifier to
learn from the training data in order to predict unseen data
 The following is the basic process:
 Gather data
 Annotate data (most tedious yet important)
 Process data
 Feature generation and engineering
 Choose classifier
 Train data (and use validation data to tune the parameters of the classifier)
 Test data
7 Result of Classifiers

 In general, a classifier handles a two-class dataset. For example, to


classify a tweet to have either positive or negative sentiment
 The classification is called hard, if a label is explicitly assigned (positive
sentiment)
 It is called a soft classification when a probability value is assigned (78%
positive, 22% negative)
 There are also multi-class classifiers that handle data of more than two
classes. For example, strong positive, positive, neutral, negative, strong
negative
 The common evaluation metrics for text classification are precision, recall
and F1 scores.
8 Evaluation Metrics
 Recall vs Precision
 Inversely related
 Recall (sensitivity)
Recall = |{relevant} ∩ {retrieved}| / |{relevant}|

 Ratio of number of relevant records


retrieved to the total number of relevant
records in the database
 Precision (positive predictive value)
Precision = |{relevant} ∩ {retrieved}| / |{retrieved}|

 Ratio of the number of relevant records


retrieved to the total number of irrelevant
and relevant records retrieved
 Recall and precision are usually expressed
as percentages.
https://en.wikipedia.org/wiki/Precision_and_recall
9 Evaluation Metrics

 Balanced F-score or F-measure or F1


 Harmonic mean of precision and recall
 Defined as
F1 = 2 × (Precision × Recall) / (Precision + Recall)

 F-score provides a measure of retrieval accuracy without bias toward


precision or recall
 The higher the F-score, the better the retrieval method or algorithm
10 Confusion Matrix

                       Labelled Positive        Labelled Negative
Predicted Positive     TRUE POSITIVE (TP)       FALSE POSITIVE (FP)
Predicted Negative     FALSE NEGATIVE (FN)      TRUE NEGATIVE (TN)

Assuming that there are 10 labelled positive and 1000 labelled negative records:
TP: 3, FN: 7, TN: 900, FP: 100

Accuracy: (TP + TN) / (TP + FN + TN + FP) => 903/1010 ≈ 0.894

Precision: TP / (TP + FP) => 3/103 ≈ 0.029

Recall: TP / (TP + FN) => 3/10 = 0.3

F1 score: 2 * (precision * recall)/(precision + recall) = 0.0174/0.329 ≈ 0.053
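A minimal Python sketch of the same arithmetic, so the numbers above can be reproduced directly from the confusion-matrix counts:

```python
# Recompute accuracy, precision, recall and F1 from the raw counts above.
tp, fn, tn, fp = 3, 7, 900, 100

accuracy  = (tp + tn) / (tp + fn + tn + fp)           # 903 / 1010 ≈ 0.894
precision = tp / (tp + fp)                            # 3 / 103 ≈ 0.029
recall    = tp / (tp + fn)                            # 3 / 10 = 0.3
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.053

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```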


11 Text Classification

 In text classification, the observations are documents and the classes are
document categories.
 For example, determine if the mail is a spam or not (two categories/classes)
or assign the topic to the news articles (multiple categories)
 How to represent a document?
doc = (w1, w2, …., wn)
 For simplicity, we ignore the order of words, POS, possible concepts but just
focus on the words.
 The representation is called the Bag-of-Words model
 We may assume that some words occur more frequently in a specific
category than others.
 Examples of Text Classifier are, Naïve Bayes Classifier, Logistic Regression,
Support Vector Machine
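A minimal sketch of the Bag-of-Words representation described above, assuming scikit-learn and two hypothetical documents:

```python
# Bag-of-Words: reduce doc = (w1, w2, ..., wn) to word counts, ignoring order and POS.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["buy cheap meds now", "meeting agenda for monday"]   # e.g., spam vs not spam

vec = CountVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())   # the vocabulary
print(X.toarray())                   # word frequencies per document; word order is discarded
```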
12 Naïve Bayes Classifier

 Simplest and most widely used classifier


 Model the distribution of documents in each class using a probabilistic
model assuming that the distribution of different terms are independent
from each other
 The naïve assumption is clearly false in real world application, but Naïve
Bayes works surprisingly well

Under the bag-of-words independence assumption, word order is ignored, so these are treated identically:
“this was a fun party” = “this party was fun” = “party fun was this”
http://www.itshared.org/2015/03/naive-bayes-on-apache-flink.html
13 Being Naïve

 Since Naïve Bayes assumption is all words in doc are independent,

P(a very close game|Sports) = P(a|Sports) x P(very|Sports) x P(close|Sports) x P(game|Sports)

 Assuming the task is to classify if the doc “a very close game” belongs to
“Sports” or “Not Sports” categories
 Based on the labelled data, we will learn the probability P(game|Sports) by
counting how many times the word “game” appears in the Sports data divided
by the total number of words in the Sports category.
 Do the same for the Not Sports category; the category with the bigger
probability is the assigned category.
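A minimal sketch of this word-counting idea. The toy training data is hypothetical (not from the slides), and add-one (Laplace) smoothing is included — a common addition so unseen words do not zero out the product:

```python
import math
from collections import Counter

# Hypothetical toy training data.
train = {
    "Sports":     ["a great game", "very clean match", "a very close game"],
    "Not Sports": ["the election was over", "it was a very close election"],
}

word_counts = {c: Counter(w for d in docs for w in d.lower().split())
               for c, docs in train.items()}
vocab = {w for counts in word_counts.values() for w in counts}

def log_score(doc, c):
    # log P(class) plus the sum of log P(word|class), with add-one smoothing.
    n_docs = sum(len(d) for d in train.values())
    total_words = sum(word_counts[c].values())
    score = math.log(len(train[c]) / n_docs)
    for w in doc.lower().split():
        score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
    return score

doc = "a very close game"
print(max(train, key=lambda c: log_score(doc, c)))   # prints the higher-probability class
```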
14 Logistic Regression

 Logistic regression is an algorithm for binary or two-class classification.


 Predict a score for each doc
 Threshold the score (e.g., cut off at 0.5) -> classification
 φ(z) is known as the logistic sigmoid function and it outputs values between 0
and 1, which we can use to model and predict a class/category.

If (z) if larger than 0.5 (or if


z is larger than 0), the doc
will be classified as class 1
(and class 0, if otherwise)
15 Logistic Regression

 x is essentially the doc vector (either TFIDF or TF weighting)


 The activation function is the logistic sigmoid function

https://www.quora.com/Why-is-logistic-regression-considered-a-linear-model
16 Logistic Regression (LR) – an example

 Assume a doc has 4 words – word1, word2, word3, word4
 And its corresponding TF doc vector is x = [1, 2, 3, 4]
 Assuming the LR weight vector is w = [0.5, 0.5, 0.5, 0.5]
 Let's compute z:
z = w^T x = 1*0.5 + 2*0.5 + 3*0.5 + 4*0.5 = 5

The result: φ(z=5) = 1 / (1 + e^(-5)) ≈ 0.993

 99.3% chance that this doc belongs to class 1!

Logistic regression tends to have higher accuracy with more training data.
Naïve Bayes can have an advantage when the training data is small.
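A minimal sketch of the worked example above in Python:

```python
import math

x = [1, 2, 3, 4]          # TF doc vector
w = [0.5, 0.5, 0.5, 0.5]  # assumed LR weight vector

z = sum(wi * xi for wi, xi in zip(w, x))   # w^T x = 5
phi = 1 / (1 + math.exp(-z))               # logistic sigmoid ≈ 0.993

print(phi)
print("class 1" if phi > 0.5 else "class 0")
```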
17 Support Vector Machine (SVM)
 SVM is a supervised learning method for two-class classification. (It can be
adapted to handle multi-class through iterating the one vs all approach)
 It separates labelled {+1, -1} or annotated training data via a hyperplane
that is maximally distant from the positive and negative samples
respectively.
 This optimally separating hyperplane in the feature space corresponds to a
nonlinear decision boundary in the input space.

 is the
kernel
18 SVM Kernel trick
 The idea is, the data may not be linearly separable in our ‘n’ dimensional
space but may be linearly separable in a higher dimensional space:

 Due to the use of kernel, SVM is quite robust to high dimensionality, i.e.,
learning is almost independent of the dimensionality of the feature space.
 The Vector Space Model used for text data is often sparse with high dimensionality,
and hence text is an ideal fit for SVM.
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78
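A minimal sketch of an SVM text classifier, assuming scikit-learn; the tiny labelled dataset is hypothetical and a linear kernel is used for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs   = ["this party was fun", "terrible boring event",
          "great game tonight", "awful experience"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative (hypothetical labels)

# TF-IDF doc vectors feed a linear SVM; text features are sparse and high-dimensional.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["what a fun game"]))   # hard classification, e.g. [1]
```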
19 Review Questions

 What is the difference between a binary classifier and multi-class


classifier?

 Can you convert a SVM to a multi-class classifier?


20 Unsupervised learning

 Unsupervised learning methods are techniques to find hidden structure out


of unlabelled data.
 Unlike supervised learning, there is no “training phase”
 Clustering and topic modelling are the two commonly used unsupervised
learning algorithms in the context of text data.
 Clustering is the task of segmenting a collection of documents into
partitions where documents in the same group (cluster) are more similar to
each other than those in other clusters.
 In topic modelling, a probabilistic model is used to determine a soft
clustering. A topic is like a cluster and the membership of a document to a
topic is probabilistic.
21 Text Clustering

 Clustering is the task of finding groups of similar documents in a


collection of documents.
 Text clustering can be performed at different levels of granularity, where clusters
can be documents, paragraphs, sentences or terms.
 Main technique used to organise documents to enhance retrieval.
 The types of algorithms are:
 Distanced-based clustering algorithm
 Partitioning algorithm
 Probabilistic clustering algorithm
22 Unique challenges for text clustering

 Text representation has a very large dimensionality, but the


underlying data is sparse. For example, the size of the vocabulary
can be of the order of 10^5 but a document may have only a few
hundred words. Imagine if the document is a tweet!
 How to capture the concepts within a collection of documents?
Words of the vocabulary are commonly correlated with each other.
Algorithms used should take the word correlation into consideration.
 Since words are the “deciding factors” in differentiating documents,
normalising document representations is important
23 Distanced-based Clustering algorithms

 It is based on a similarity function to measure the closeness between text


documents.
 One example is Hierarchical Clustering algorithms:
 Top-down (divisive)
 Start with all documents in one cluster and split it into sub-clusters

 Bottom-up (agglomerative)
 Each document starts as an individual cluster, then similar clusters are merged
 Single Linkage Clustering – highest similarity between any pair of documents from the
two groups
 Group-Average Linkage Clustering – average similarity between pairs of documents
 Complete Linkage Clustering – worst case similarity between any pair of documents
https://www.youtube.com/watch?v=EUQY3hL38cw
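A minimal sketch of bottom-up (agglomerative) clustering over TF-IDF vectors, assuming scikit-learn; the documents and choice of two clusters are hypothetical:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["football match result", "basketball game score",
        "stock market crash", "shares fall sharply"]

X = TfidfVectorizer().fit_transform(docs).toarray()
# linkage="average" corresponds to group-average linkage;
# "single" and "complete" linkage are also available.
labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)
print(labels)   # e.g. [0 0 1 1] – sports docs vs finance docs
```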
24–26 (Slides containing figures only; no text content.)
27 Partitioning Clustering

 K-means clustering algorithm is widely used.


 Partition n documents into k clusters. k is defined by the user.
 Main disadvantage of k-means clustering is the initial choice of k.
 It is an iterative process to find the “optimal centres” and groupings

K Means Clustering -
Georgia Tech - Machine
Learning
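A minimal K-means sketch, assuming scikit-learn; the documents and k = 2 are hypothetical choices for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cheap flights to bali", "hotel deals in phuket",
        "python list comprehension", "pandas dataframe merge"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0)   # k is chosen by the user
labels = km.fit_predict(X)                             # iterative search for the "optimal centres"
print(labels)                         # cluster assignment per document
print(km.cluster_centers_.shape)      # one centre per cluster in the TF-IDF space
```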
28 Probabilistic Clustering

 Topic modelling is one of the most popular probabilistic clustering


algorithms.
 The idea is to create a probabilistic generative model for the corpus of text
documents.

The documents are mixtures of topics, and a topic is a probability distribution over words.
29 Topic modelling

 Unsupervised learning technique that takes in documents (in vector


format) and a parameter – the number of topics (k)

Documents (observable) → Topics (latent) → Words/Tokens (observable)

 Documents are composed of many topics
 Topics are composed of many words (tokens)
 Two main topic modelling methods:
 Probabilistic Latent Semantic Analysis (pLSA)
 Latent Dirichlet Allocation (LDA)
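A minimal LDA sketch, assuming scikit-learn; the documents and k = 2 topics are hypothetical:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the match ended in a draw", "the striker scored a late goal",
        "the bank cut interest rates", "markets fell after the rate decision"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                 # bag-of-words matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # k = 2 topics
doc_topic = lda.fit_transform(X)            # each document is a mixture of topics
print(doc_topic.round(2))                   # soft (probabilistic) membership per document

terms = vec.get_feature_names_out()
for t, dist in enumerate(lda.components_):
    top = [terms[i] for i in dist.argsort()[-3:][::-1]]
    print(f"topic {t}: {top}")              # each topic is a distribution over words
```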
30 Semi-supervised learning

 It is a type of supervised learning but makes use of unlabelled data for


training.
 Since it makes use of supervised learning, it uses a small amount of labelled
data with a large amount of unlabelled data.
 Prerequisite – the distribution of examples, which the unlabelled data will
help to elucidate, must be relevant for the classification problem.
 It works with some assumptions:
 The semi-supervised smoothness assumption – if two points are linked by a path
of high density (e.g., belong to the same cluster), then their outputs are likely to
be close.
 The cluster assumption – if points are in the same cluster, they are likely to be of
the same class (use the labelled points to assign a class to each cluster)
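A minimal self-training sketch, assuming scikit-learn; the documents are hypothetical and -1 marks unlabelled samples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

docs   = ["great fun party", "boring waste of time",
          "really enjoyable night", "terrible dull event"]
labels = [1, 0, -1, -1]   # small labelled set plus unlabelled data (label -1)

# The classifier pseudo-labels confident unlabelled docs and retrains on them.
model = make_pipeline(TfidfVectorizer(),
                      SelfTrainingClassifier(LogisticRegression()))
model.fit(docs, labels)
print(model.predict(["enjoyable party"]))
```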
31 Semi-supervised learning in practice

 In speech recognition, it costs little to record huge amounts of speech, but
labelling it requires a human to listen to it and type a transcript.
 Webpage classification – billions of webpages are available, but classifying
them reliably requires humans to read and verify them.
 Protein function prediction through 3D structure – protein sequences can
be acquired at industrial speed (by genome sequencing etc.), but to
resolve a 3D structure and to determine the function may require years of
scientific work.

Since unlabelled data carry less information than labelled data, they are
required in large amounts to increase prediction accuracy.
32 Ensemble learning

 Ensemble learning helps improve machine learning (supervised learning)


results by combining several models or a diverse set of learners.
 This approach allows the production of better predictive performance
compared to a single model. Commonly used in competitions, like Kaggle.
 Common errors in each learning model:

 Bias error – quantifies how much, on average, the predicted values differ
from the actual values. High bias – the model keeps missing important trends.

 Variance – quantifies how predictions made on the same observation differ
from each other. High variance over-fits and performs badly on unseen data.

https://www.youtube.com/watch?v=XW2YPKXt3n4
33 Ensemble learning

 A champion model should maintain a balance between these two types of


errors. This is known as the trade-off management of bias-variance errors.
 Ensemble learning is one way to execute this trade off analysis.
 Common techniques
 Bagging

Implements similar learners on small sample populations
and takes the average of all the predictions.
34 Ensemble learning

 Boosting :
 Iterative technique which adjusts the weight of an observation based on the
previous classification. If it was classified incorrectly, it tries to increase the weight
of the observation, and vice versa.
 Decreases bias error
 Stacking: Use a learner to combine the outputs of different learners

 Choosing the right ensemble can be an art.
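A minimal sketch of the three ensemble styles, assuming scikit-learn; the base learners chosen here are illustrative, and each ensemble is then fitted on a document-term matrix like any other classifier:

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Bagging: similar learners on resampled data; predictions are averaged.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10)

# Boosting: iteratively reweights misclassified observations (reduces bias).
boosting = AdaBoostClassifier(n_estimators=50)

# Stacking: a final learner combines the outputs of diverse base learners.
stacking = StackingClassifier(
    estimators=[("nb", MultinomialNB()), ("lr", LogisticRegression())],
    final_estimator=LogisticRegression())
```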


35 Hybrid learning

 Combining a knowledge-based approach with machine learning to achieve


better performance.
 What are knowledge-based approaches?
 Use of knowledge base (e.g., database, lexicons)
 Use of complex structured and semi-structured information (e.g., relationship,
concepts)
 Mainly crafted from human knowledge and derived rules (can be more tedious
and is considered “less cool” in this AI (or machine learning) age)
 It still has its value, esp. in text analytics.
36 Hybrid learning

 Simplest sentiment analysis – combine polarity lexicons with a trained
classifier (using labelled training data).
 Applied to scarce-resource languages like Singlish.
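A minimal sketch of such a hybrid, assuming a hypothetical polarity lexicon whose score is appended to the TF-IDF features before training a classifier:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LEXICON = {"shiok": 1, "nice": 1, "sian": -1, "terrible": -1}   # hypothetical entries

def lexicon_score(doc):
    # Knowledge-based feature: sum of word polarities found in the lexicon.
    return sum(LEXICON.get(w, 0) for w in doc.lower().split())

docs   = ["this bag is nice", "so sian today", "shiok food here", "terrible service"]
labels = [1, 0, 1, 0]   # hypothetical sentiment labels

vec = TfidfVectorizer()
X = np.hstack([vec.fit_transform(docs).toarray(),
               np.array([[lexicon_score(d)] for d in docs])])   # ML features + lexicon feature
clf = LogisticRegression().fit(X, labels)
```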
37 Hybrid learning

 English Sentic Pattern:


 polarity reversing rule based on English negation terms such as “not”, “couldn’t”,
“shldnt”
 Handling of adversative terms such as “but”. Only the polarity of the second part
was considered. For example, “this bag is nice but expensive”

A Multilingual Semi-supervised approach in Deriving Singlish Sentic Patterns for Polarity Detection. S. L. Lo, E. Cambria, R. Chiong and D. Cornforth.
Knowledge Based Systems 105 (2016), 236-247
38 Which method(s) to use?

 Predict if the new email is a spam

 Decide which category to assign a document

 Recommend a list of social media followers as target audience


39 Any Questions?

We have covered:
• Supervised and Unsupervised learning
• Supervised learning
• Naïve Bayes
• Logistic Regression
• Support Vector Machine
• Unsupervised learning
• Distance-based Clustering
• Partitioning Clustering
• Probabilistic Clustering
• Ensemble learning
• Hybrid learning
