ITD253 L6 TextClassificationClustering
1 Text Classification and Clustering
2 Recall Text Analytics Techniques
[Pipeline diagram: Text source → Preprocessing (tokens, removal, stemming and/or lemmatization, replacing) → Word representation → Learning methods → Result/Prediction]
An illustration of term vectors for many documents, containing their TF-IDF values.
5 Learning methods
Assume that there are 10 labelled positive and 1000 labelled negative examples in the dataset, and a classifier produces:
TP: 3, FN: 7, TN: 900, FP: 100
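With such imbalance, accuracy alone is misleading; precision and recall expose the weakness on the rare positive class. A minimal sketch using the standard metric definitions and the counts above:

```python
# Worked example: with 10 positives and 1000 negatives, high accuracy
# can coexist with very poor performance on the positive class.
tp, fn, tn, fp = 3, 7, 900, 100

accuracy  = (tp + tn) / (tp + fn + tn + fp)   # (3 + 900) / 1010 ~ 0.894
precision = tp / (tp + fp)                    # 3 / 103 ~ 0.029
recall    = tp / (tp + fn)                    # 3 / 10  = 0.300
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```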
In text classification, the observations are documents and the classes are
document categories.
For example, determine whether an email is spam or not (two categories/classes),
or assign a topic to a news article (multiple categories).
How to represent a document?
doc = (w1, w2, …., wn)
For simplicity, we ignore word order, POS tags and possible concepts, and focus
only on the words themselves.
The representation is called the Bag-of-Words model
We may assume that some words occur more frequently in a specific
category than others.
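A minimal sketch of the Bag-of-Words idea in Python; the whitespace tokenizer here is a stand-in for the real preprocessing pipeline:

```python
from collections import Counter

def bag_of_words(doc: str) -> Counter:
    """Represent a document as unordered word counts (order and POS ignored)."""
    return Counter(doc.lower().split())

# Two toy documents: identical words in any order give the same bag.
print(bag_of_words("a very close game"))
print(bag_of_words("game a close very"))   # same representation
```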
Examples of text classifiers are the Naïve Bayes classifier, Logistic Regression
and the Support Vector Machine.
12 Naïve Bayes Classifier
http://www.itshared.org/2015/03/naive-bayes-on-apache-flink.html
13 Being Naïve
Assume the task is to classify whether the document “a very close game” belongs to
the “Sports” or the “Not Sports” category.
From the labelled data, we learn P(game|Sports) by counting how many times the
word “game” appears in the Sports data and dividing by the total number of words
in the Sports category. Naïvely treating the words as independent, the document’s
score for Sports is the product of its word probabilities and the class prior P(Sports).
Do the same for the Not Sports category; the category with the bigger probability
is the assigned category.
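A minimal sketch of this counting scheme in Python. The toy corpus is made up for illustration (not from the slides), and Laplace (+1) smoothing is a standard extra step so unseen words do not zero out the product:

```python
from collections import Counter
import math

# Hypothetical toy corpus, invented just to exercise the method.
train = [
    ("a great game", "Sports"),
    ("the election was over", "Not Sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Not Sports"),
]

word_counts = {}          # category -> Counter of words
doc_counts = Counter()    # category -> number of documents
vocab = set()
for text, label in train:
    words = text.lower().split()
    word_counts.setdefault(label, Counter()).update(words)
    doc_counts[label] += 1
    vocab.update(words)

def log_posterior(doc, label):
    """log P(label) + sum of log P(word|label), with Laplace smoothing."""
    total = sum(word_counts[label].values())
    logp = math.log(doc_counts[label] / len(train))   # class prior
    for w in doc.lower().split():
        # +1 smoothing avoids zero probability for unseen words
        logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return logp

doc = "a very close game"
scores = {c: log_posterior(doc, c) for c in doc_counts}
print(max(scores, key=scores.get), scores)   # pick the larger posterior
```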
14 Logistic Regression
https://www.quora.com/Why-is-logistic-regression-considered-a-linear-model
16 Logistic Regression (LR) – an example
Logistic regression tends to have higher accuracy with more training data.
Naïve Bayes can have an advantage when the training data is small.
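A minimal sketch of a logistic-regression text classifier, assuming scikit-learn is available; the four training texts and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["win the game tonight", "great match and close game",
          "vote in the election", "the election results are in"]
labels = ["Sports", "Sports", "Not Sports", "Not Sports"]

# TF-IDF features feed a linear (logistic) decision boundary.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["a very close game"]))
```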
17 Support Vector Machine (SVM)
SVM is a supervised learning method for two-class classification (it can be
adapted to multi-class problems by iterating the one-vs-all approach).
It separates labelled {+1, -1} (annotated) training data via a hyperplane
that is maximally distant from the positive and the negative samples
respectively.
This optimally separating hyperplane in the feature space corresponds to a
nonlinear decision boundary in the input space.
K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ) is the kernel, where Φ maps inputs into the feature space.
18 SVM Kernel trick
The idea is that the data may not be linearly separable in the original n-dimensional
space but may become linearly separable in a higher-dimensional space.
Due to the use of kernels, SVM is quite robust to high dimensionality, i.e.
learning is almost independent of the dimensionality of the feature space.
The vector space model used for text data is often sparse and high-dimensional,
which makes text an ideal match for SVM.
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78
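A minimal sketch of the kernel trick, assuming scikit-learn: XOR-style data has no separating line in the input space, yet an RBF kernel separates it in the induced feature space:

```python
import numpy as np
from sklearn.svm import SVC

# XOR labels: no straight line separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

linear = SVC(kernel="linear").fit(X, y)
rbf    = SVC(kernel="rbf").fit(X, y)

print("linear kernel:", linear.predict(X))   # cannot fit XOR perfectly
print("rbf kernel:   ", rbf.predict(X))      # matches y
```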
19 Review Questions
Bottom-up (agglomerative)
Each document starts as an individual cluster; the most similar clusters are then merged repeatedly (a sketch follows the video link below). The merge criterion distinguishes the variants:
Single Linkage Clustering – highest similarity between any pair of documents from the two groups
Group-Average Linkage Clustering – average similarity between pairs of documents from the two groups
Complete Linkage Clustering – worst-case (lowest) similarity between any pair of documents from the two groups
https://www.youtube.com/watch?v=EUQY3hL38cw
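A minimal sketch of agglomerative clustering, assuming SciPy; the 'single', 'average' and 'complete' methods correspond to the three linkage criteria above, and the invented 2-D points stand in for document vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D "document vectors": two obvious groups.
docs = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])

for method in ("single", "average", "complete"):
    Z = linkage(docs, method=method)         # bottom-up merge tree
    print(method, fcluster(Z, t=2, criterion="maxclust"))  # cut into 2 clusters
```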
27 Partitioning Clustering
K Means Clustering – Georgia Tech Machine Learning (video)
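A minimal K-Means sketch, assuming scikit-learn: points are assigned to the nearest of k centroids, which are then recomputed until the assignments stabilise (the 2-D points are invented stand-ins for document vectors):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # learned centroids
```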
28 Probabilistic Clustering
The documents are mixtures of topics, and a topic is a probability distribution over words.
29 Topic modelling
Since unlabelled data carry less information than labelled data, they are
required in large amounts to increase prediction accuracy.
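A minimal topic-modelling sketch with Latent Dirichlet Allocation, assuming scikit-learn; the four-document corpus is invented. Each fitted document becomes a mixture over the two topics, matching the description above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["game match score win", "match win game team",
        "vote election party", "election vote campaign party"]

counts = CountVectorizer().fit_transform(docs)   # word counts per document
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

print(lda.transform(counts))   # per-document topic mixtures
```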
32 Ensemble learning
https://www.youtube.com/watch?v=XW2YPKXt3n4
33 Ensemble learning
Boosting:
An iterative technique that adjusts the weight of an observation based on the
previous classification: if it was classified incorrectly, the weight of the
observation is increased, and vice versa.
Decreases bias error.
Stacking: uses a learner to combine the outputs of different learners.
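A minimal stacking sketch, assuming scikit-learn 0.22+: a logistic-regression meta-learner combines the outputs of two base learners from this lecture; the toy texts are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import make_pipeline

texts  = ["close game tonight", "great match and a win",
          "vote in the election", "party campaign results"]
labels = ["Sports", "Sports", "Not Sports", "Not Sports"]

stack = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=[("nb", MultinomialNB()), ("svm", SVC())],
        final_estimator=LogisticRegression(),  # the combining learner
        cv=2,  # tiny toy set: 2-fold CV keeps one doc per class per fold
    ),
)
stack.fit(texts, labels)
print(stack.predict(["a very close game"]))
```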
A Multilingual Semi-supervised Approach in Deriving Singlish Sentic Patterns for Polarity Detection. S. L. Lo, E. Cambria, R. Chiong and D. Cornforth. Knowledge-Based Systems 105 (2016), 236–247.
38 Which method(s) to use?
We have covered:
• Supervised learning
  • Naïve Bayes
  • Logistic Regression
  • Support Vector Machine
• Unsupervised learning
  • Distance-based Clustering
  • Partitioning Clustering
  • Probabilistic Clustering
• Ensemble learning
• Hybrid learning