Text Classification
Besat Kassaie
Outline
Introduction
Text classification definition
Naive Bayes
Vector Space Classification
Rocchio Classification
KNN Classification
New Applications
Spam Detection
Authorship Identification
Was Mark Twain the writer of the "Quintus Curtius Snodgrass" (QCS) letters?
Mark Twain's role in the Civil War has been a subject of dispute for years.
The evidence for Twain's possible military connection in New Orleans came from the content of ten letters published in the New Orleans Daily Crescent.
In these letters, which had been credited to Twain and were signed "Quintus Curtius Snodgrass" (QCS), the writer described his military adventures.
Brinegar (1963) applied statistical tests to the QCS letters, using word frequency distributions to characterize Mark Twain's writing style.
Conclusion: Mark Twain was not the author of the disputed letters!
Gender Identification
Applications:
Marketing, personalization, legal investigation
Sentiment Analysis
Text Classification: Definition
Input:
a document d
a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class c ∈ C
Manual
Many classification tasks have traditionally been solved manually.
e.g. Books in a library are assigned Library of Congress categories by a librarian
manual classification is expensive to scale
Hand-crafted rules
A rule captures a certain combination of keywords that indicates a class.
e.g. (multicore OR multi-core) AND (chip OR processor OR microprocessor)
good scaling properties
but creating and maintaining them over time is labor-intensive
Supervised Machine Learning
Input:
a document d
a fixed set of classes C = {c1, c2, …, cJ}
a training set of m labeled documents (d1, c1), (d2, c2), …, (dm, cm)
Output: a learned classifier γ(d) = c

For example, given a movie review d: "adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.", the classifier should output γ(d) = positive.
The bag of words representation
γ(bag-of-words(d)) = c: the classifier sees only which words occur in d and how often, not their order or position.
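A minimal sketch of this representation in Python; the whitespace tokenizer is a simplifying assumption (real systems use proper tokenization):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Map a document to unordered word counts; word order is discarded."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return Counter(tokens)

review = "it is fun and whimsical and romantic"
print(bag_of_words(review))
# Counter({'and': 2, 'it': 1, 'is': 1, 'fun': 1, 'whimsical': 1, 'romantic': 1})
```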
Features
By Bayes' rule, for a document d and a class c:

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$$
Naïve Bayes Classifier
$$c_{MAP} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)} \qquad \text{(Bayes rule)}$$

$$= \arg\max_{c \in C} P(d \mid c)\, P(c) \qquad \text{(dropping the denominator, which is the same for all classes)}$$

$$= \arg\max_{c \in C} P(x_1, x_2, \dots, x_n \mid c)\, P(c) \qquad \text{(document } d \text{ represented as features } x_1, \dots, x_n\text{)}$$
Multinomial NB vs. Bernoulli NB

Multinomial NB models the sequence of word tokens w1, …, wn that occur in the document:

$$c_{map} = \arg\max_{c \in C} P(w_1, w_2, \dots, w_n \mid c)\, P(c)$$

Bernoulli NB models a vector of binary occurrence indicators e1, …, ev, one per vocabulary term:

$$c_{map} = \arg\max_{c \in C} P(e_1, e_2, \dots, e_v \mid c)\, P(c)$$

Both rely on two independence assumptions:

Conditional independence: the feature values are independent of each other given the class:

$$P(x_1, x_2, \dots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)$$

Positional independence: the conditional probabilities for a term are the same, independent of its position in the document, e.g. $P(X_{k_1} = t \mid c) = P(X_{k_2} = t \mid c)$.
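To make the contrast concrete, a small illustrative sketch of the two feature views (whitespace tokenization is again an assumption):

```python
from collections import Counter

text = "fun fun whimsical"
tokens = text.lower().split()  # naive tokenization (assumption)

# Multinomial NB: features are token counts (positional independence
# lets us collapse positions into counts).
multinomial_features = Counter(tokens)  # Counter({'fun': 2, 'whimsical': 1})

# Bernoulli NB: features are binary occurrence indicators over the vocabulary.
bernoulli_features = set(tokens)        # {'fun', 'whimsical'}
```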
Learning the parameters (with add-1 smoothing for the word likelihoods):

$$\hat{P}(c_j) = \frac{Count(C = c_j)}{N_{doc}} \qquad \text{(fraction of documents labeled as class } c_j\text{)}$$

$$\hat{P}(w_i \mid c_j) = \frac{Count(w_i, c_j) + 1}{\sum_{w \in V} \left(Count(w, c_j) + 1\right)} = \frac{Count(w_i, c_j) + 1}{\left(\sum_{w \in V} Count(w, c_j)\right) + |V|}$$
Laplace (add-1) smoothing: unknown words
Add one extra word to the vocabulary, the "unknown word" $w_u$:

$$\hat{P}(w_u \mid c_j) = \frac{Count(w_u, c_j) + 1}{\left(\sum_{w \in V} Count(w, c_j)\right) + |V| + 1} = \frac{1}{\left(\sum_{w \in V} Count(w, c_j)\right) + |V| + 1}$$

since $Count(w_u, c_j) = 0$: the unknown word never occurs in the training data.
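Putting the estimation and smoothing formulas together, a minimal multinomial Naive Bayes sketch in Python; the tokenizer, the `<UNK>` token name, and the toy training set are illustrative assumptions, and log probabilities are used to avoid floating-point underflow in the long product:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial NB with add-1 smoothing and an unknown-word slot.

    docs: list of (text, label) pairs.
    """
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)      # word_counts[c][w] = Count(w, c)
    vocab = set()
    for text, label in docs:
        for w in text.lower().split():      # naive tokenizer (assumption)
            word_counts[label][w] += 1
            vocab.add(w)
    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        denom = total + len(vocab) + 1      # |V| + 1: vocabulary plus w_u
        log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) / denom)
                             for w in vocab}
        log_likelihood[c]["<UNK>"] = math.log(1 / denom)  # Count(w_u, c) = 0
    return log_prior, log_likelihood

def classify(text, log_prior, log_likelihood):
    """c_MAP = argmax_c [ log P(c) + sum_i log P(w_i | c) ]."""
    best, best_score = None, float("-inf")
    for c, prior in log_prior.items():
        score = prior + sum(log_likelihood[c].get(w, log_likelihood[c]["<UNK>"])
                            for w in text.lower().split())
        if score > best_score:
            best, best_score = c, score
    return best

# Toy training set (illustrative only)
train = [("fun and whimsical", "pos"), ("boring and predictable", "neg")]
priors, likelihoods = train_nb(train)
print(classify("whimsical fun film", priors, likelihoods))  # -> pos
```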
Rocchio Classification

Basic idea: Rocchio forms a simple representative for each class by using centroids.

The centroid of class c is computed over D_c, the set of training documents labeled c, with v(d) the vector representation of d:

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

A test document is assigned to the class of the nearest centroid. Rocchio does not work well for classes that cannot be accurately represented by a single "center".
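A minimal Rocchio sketch under these definitions, assuming documents have already been mapped to vectors (e.g. tf-idf) and using Euclidean distance to the centroids; the toy vectors and class names are illustrative:

```python
import numpy as np

def train_rocchio(vectors, labels):
    """mu(c) = (1/|D_c|) * sum of v(d) over training docs d in class c."""
    centroids = {}
    for c in set(labels):
        class_vecs = np.array([v for v, l in zip(vectors, labels) if l == c])
        centroids[c] = class_vecs.mean(axis=0)
    return centroids

def classify_rocchio(v, centroids):
    """Assign v to the class whose centroid is closest (Euclidean distance here)."""
    return min(centroids, key=lambda c: np.linalg.norm(np.asarray(v) - centroids[c]))

# Toy 2-D document vectors (illustrative only)
centroids = train_rocchio([[0.0, 1.0], [0.2, 0.8], [1.0, 0.1]],
                          ["china", "china", "uk"])
print(classify_rocchio([0.1, 0.9], centroids))  # -> china
```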
kNN Classification

For kNN, we assign each document to the majority class of its k closest neighbors, where k is a parameter.

The decision boundary for 1NN (the double line in the original figure) is defined along the regions of the Voronoi cells of the objects in each class; it shows the non-linearity of kNN.
kNN: Probabilistic version

$$Score(c, d) = \sum_{d' \in S_k(d)} I_c(d') \cos(\vec{v}(d'), \vec{v}(d))$$

where $S_k(d)$ is the set of $d$'s $k$ nearest neighbors and $I_c(d') = 1$ iff $d'$ belongs to class $c$ (0 otherwise). The test document is assigned to the class with the highest score.
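A sketch of this scoring rule, assuming document vectors as numpy arrays and finding S_k(d) by ranking training documents by cosine similarity:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_classify(d_vec, train_vecs, train_labels, k=3):
    """Return argmax_c Score(c, d), with
    Score(c, d) = sum over the k nearest d' of I_c(d') * cos(v(d'), v(d))."""
    sims = sorted(((cosine(v, d_vec), label)
                   for v, label in zip(train_vecs, train_labels)), reverse=True)
    scores = {}
    for sim, label in sims[:k]:                        # S_k(d): top-k by cosine
        scores[label] = scores.get(label, 0.0) + sim   # I_c(d') selects class members
    return max(scores, key=scores.get)
```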
Properties of kNN:
No training necessary
Scales well with a large number of classes: there is no need to train n classifiers for n classes
May be expensive at test time
In most cases it is more accurate than NB or Rocchio
Some Recent Applications of Text Classification

Social networks are a rich source of text data, e.g. Twitter.
Many new applications are emerging based on these sources.
We will discuss three recent works based on Twitter data:
Damiano Spina et al., "Learning Similarity Functions for Topic Detection in Online Reputation Monitoring", SIGIR 2014
Hadi Amiri et al., "Target-dependent Churn Classification in Microblogs", AAAI 2015
Chenliang Li et al., "Fine-Grained Location Extraction from Tweets with Temporal Awareness", SIGIR 2014
Learning Similarity Functions for Topic Detection in Online Reputation Monitoring

What are people saying about a given entity (company, brand, organization, personality, …)?
Are there any issues that may damage the entity's reputation?
To answer such questions, reputation experts have to monitor social networks such as Twitter daily.
The paper aims to solve this problem automatically, as a topic detection task.
Reputation alerts must be detected early, before they explode, so only a small number of related tweets is available.
Probabilistic generative approaches are less appropriate here because of this data sparsity.
Learning Similarity Functions…

The similarity between tweets is learned with an SVM over several feature groups:

Semantic features:
Useful for tweets that do not have words in common.
Wikipedia is used as a knowledge base to find semantically related words, e.g. mexico, mexicanas.
Metadata features:
Such as author, URLs, hashtags, …
Time-aware features:
Time stamps of tweets
Fine-Grained Location Extraction from Tweets with Temporal Awareness

Through tweets, users often casually or implicitly reveal their current locations and short-term plans about where to visit next, at a fine granularity.
…
THANK YOU
References

https://web.stanford.edu/class/cs124/lec/sentiment.pptx
https://web.stanford.edu/class/cs124/lec/naivebayes.pdf
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2009
Damiano Spina et al., "Learning Similarity Functions for Topic Detection in Online Reputation Monitoring", SIGIR 2014
Hadi Amiri et al., "Target-dependent Churn Classification in Microblogs", AAAI 2015
Chenliang Li et al., "Fine-Grained Location Extraction from Tweets with Temporal Awareness", SIGIR 2014