Lecture 2 - MS CLASS: Words and Text Classification
Natural Language Processing
Dr. Sajid Mahmood
Prerequisites
• No course prerequisites, but I will assume:
– some programming experience (no specific
language required)
– familiarity with basics of calculus, linear algebra,
and probability
– will be helpful to have taken a machine learning
course, but not strictly required
Grading
• 3 assignments (10%)
• midterm exam (20%)
• course project (20%):
– project proposal (5%)
– final report (15%)
• class participation, including quizzes (15%)
• Final (35%)
Assignments
• mixture of formal exercises, implementation,
experimentation, analysis
• first assignment will be posted this week so
that you can have a look at it, due 2 weeks
from Monday
Project
• Replicate [part of] a published NLP paper, or
define your own project
• The project must be done in a group of two
• Each group member will receive the same grade
• More details to come
Collaboration Policy
• You are welcome to discuss assignments with
others in the course, but solutions and code
must be written individually
Optional Textbooks (1/2)
• Jurafsky & Martin. Speech and Language Processing, 2nd Ed. & 3rd Ed.
• Many chapters of 3rd edition are online
• Copies of 2nd edition available in library
Optional Textbooks (2/2)
• Goldberg. Neural Network Methods for Natural Language Processing.
• Earlier draft (from 2015) available online
What is natural language processing?
an experimental computer science research area
that includes problems and solutions pertaining to
the understanding of human language
Text Classification
Sentiment Analysis
Machine Translation
Question Answering
Dialog Systems
figure credit: Phani Marupaka
Summarization
Part-of-Speech Tagging
Some/determiner  questioned/verb (past)  if/prep.  Tim/proper noun  Cook/proper noun  's/poss.  first/adj.  product/noun
would/modal  be/verb  a/det.  breakaway/adjective  hit/noun  for/prep.  Apple/proper noun  ./punc.
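To try this kind of tagging yourself, here is a minimal sketch using NLTK (an assumption: NLTK is not a course requirement, it just ships a ready-made tagger). Note that NLTK uses Penn Treebank tag names (DT, VBD, NNP, ...) rather than the human-readable labels above.

```python
# Minimal sketch: POS-tagging the slide's sentence with NLTK
# (assumes: pip install nltk, plus the tagger model downloaded below;
# the resource name may differ across NLTK versions, e.g.
# "averaged_perceptron_tagger_eng" in newer releases).
import nltk

nltk.download("averaged_perceptron_tagger")  # one-time model download

# the sentence is already tokenized, so a plain split is enough here
tokens = "Some questioned if Tim Cook 's first product would be a breakaway hit for Apple .".split()
print(nltk.pos_tag(tokens))
# e.g. [('Some', 'DT'), ('questioned', 'VBD'), ('if', 'IN'), ('Tim', 'NNP'), ...]
```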
Word Prediction
Other language technologies
(not typically considered core NLP):
• speech processing
• information retrieval / web search
• knowledge representation / reasoning
Why is NLP hard?
• ambiguity and variability of linguistic expression:
– variability: many forms can mean the same thing
– ambiguity: one form can mean many things
Example: Hyperlinks in Wikipedia
• the anchor text "bar" links to many different Wikipedia articles:
– bar (law), bar (establishment), bar association, bar (unit), medal bar, bar (music), …
Ambiguity vs. Variability
Word Sense Ambiguity
credit: A. Zwicky
Meaning Ambiguity
Words
• what is a word?
• tokenization
• morphology
• lexical semantics
What is a word?
Tokenization
• tokenization: convert a character stream into
words by adding spaces
• for certain languages, highly nontrivial
• e.g., Chinese word segmentation is a widely-studied NLP task
Tokenization
• for other languages (e.g., English), tokenization is easier but still not always obvious
• the data for your homework has been
tokenized:
– punctuation has been split off from
words
– contractions have been split
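As a rough illustration of the kind of preprocessing applied to the homework data, here is a minimal regex-based tokenizer sketch; real tokenizers handle far more cases (abbreviations, URLs, numbers, ...).

```python
import re

def tokenize(text):
    """Very rough English tokenizer: split punctuation and contractions off words."""
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)                 # split punctuation
    text = re.sub(r"(n't|'s|'re|'ve|'ll|'d|'m)\b", r" \1", text)   # split contractions
    return text.split()

print(tokenize("Don't they know it's Tim Cook's first product?"))
# ['Do', "n't", 'they', 'know', 'it', "'s", 'Tim', 'Cook', "'s", 'first', 'product', '?']
```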
Intricacies of Tokenization
• Chinese and Japanese: no spaces between words:
– 莎拉波娃现在居住在美国东南部的佛罗里达。
– 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
– word-by-word gloss: Sharapova / now / lives / in / US / southeastern / 's / Florida
• Further complicated in Japanese, with multiple alphabets intermingled
– Dates/amounts in multiple formats
J&M/SLP3
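For Chinese, off-the-shelf segmenters exist; below is a sketch using the third-party jieba package (an assumption, not something the slides prescribe; any segmenter would do).

```python
# Sketch: Chinese word segmentation with the third-party jieba package
# (pip install jieba). Segmentation quality varies; this is only a demo.
import jieba

sent = "莎拉波娃现在居住在美国东南部的佛罗里达。"
print(" ".join(jieba.lcut(sent)))
# expected: roughly 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达 。
# (rare proper names like 莎拉波娃 may or may not come out as one token)
```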
Removing Spaces?
• tokenization is usually about adding spaces
• but might we also want to remove spaces?
• what are some English examples?
– names?
  • New York → NewYork
– non-compositional compounds?
  • hot dog → hotdog
– other artifacts of our spacing conventions?
  • New York-Long Island Railway?
Types and Tokens
• once text has been tokenized, let’s count the words
• types: entries in the vocabulary
• tokens: instances of types in a corpus
• example sentence: If they want to go , they should go .
– how many types? 8
– how many tokens? 10
• type/token ratio: useful statistic of a corpus (here, 0.8)
• as we add data, what happens to the type/token ratio?
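Counting types and tokens for the example sentence takes only a few lines; a sketch:

```python
# Types vs. tokens on the example sentence (already tokenized).
from collections import Counter

tokens = "If they want to go , they should go .".split()
types = Counter(tokens)

print(len(tokens))               # 10 tokens
print(len(types))                # 8 types: If, they, want, to, go, ',', should, '.'
print(len(types) / len(tokens))  # type/token ratio: 0.8
```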
“really” on Twitter
224571 really 50 reallllllly 15 reallllyy
1189 rly 48 reeeeeally 15 reallllllllly
1119 realy 41 reeally 15 reaallly
731 rlly 38 really2 14 reeeeeeally
590 reallly 37 reaaaaally 14 reallllyyyy
234 realllly 35 reallyyyyy 13 reeeaaally
216 reallyy 31 reely 12 rreally
156 relly 30 realllyyy 12 reaaaaaally
146 reallllly 27 realllyy 11 reeeeallly
132 rily 27 reaaly 11 reeeallly
104 reallyyy 26 realllyyyy 11 realllllyyy
89 reeeally 25 realllllllly 11 reaallyy
89 realllllly 22 reaaallly 10 reallyreallyreally
84 reaaally 21 really- 10 reaaaly
82 reaally 19 reeaally 9 reeeeeeeally
72 reeeeally 18 reallllyyy 9 reallys
65 reaaaally 16 reaaaallly 9 really-really
57 reallyyyy 15 realyy 9 r)eally
53 rilly 15 reallyreally 8 reeeaally
“really” on Twitter
8 reallyyyyyyy 6 realllllllllly 4 realllllllyyyy
8 reallyyyyyy 6 reaaaaaallly 4 reaalllyyy
8 realky 5 rrrreally 4 reaalllly
7 relaly 5 rrly 4 reaaalllyy
7 reeeeeeeeeally 5 rellly 4 reaaalllly
7 reeeealy 5 reeeeeeeeally 4 reaaaaly
7 reeeeaaally 5 reeeeaally 3 reeeeealllly
7 reallllllyyy 5 reeeeaaallly 3 reeeealllly
7 realllllllllllly 5 reeallyyy 3 reeeeaaaaally
7 reaaaaaaally 5 reallllllllllly 3 reeeaallly
7 raelly 5 reallllllllllllly 3 reeeaaallllyyy
7 r3ally 5 reaalllyy 3 reealy
6 r-really 5 reaaaalllly 3 reeallly
6 reeeaaalllyyy 5 reaaaaallly 3 reeaaly
6 reeeaaallly 4 rllly 3 reeaalllyyy
6 reeeaaaally 4 reeeeeeeeeeally 3 reeaalllly
6 realyl 4 reeealy 3 reeaaallly
6 r-e-a-l-l-y 4 reeaaaally 3 reallyyyyyyyyy
6 realllyyyyy 4 realllllyyyy 3 reallyl
“really” on Twitter
3 really) 2 rlyyyy 2 reeaallyy
3 r]eally 2 rlyyy 2 reeaalllyy
3 realluy 2 reqally 2 reeaallly
3 reallllyyyyy 2 rellyy 2 reeaaally
3 reallllllyyyyyyy 2 rellys 2 reaqlly
3 reallllllyyyy 2 reeely 2 realyyy
3 reallllllyy 2 reeeeeealy 2 reallyyyyyyyyyyyy
3 realllllllllllllllly 2 reeeeeallly 2 reallyyyyyyyy
3 realiy 2 reeeeeaally 2 really*
3 reaallyyyy 2 reeeeeaaally 2 really/
3 reaallllly 2 reeeeeaaallllly 2 realllyyyyyy
3 reaaallyy 2 reeeeallyyy 2 reallllyyyyyy
3 reaaaallyy 2 reeeeallllyyy 2 realllllyyyyyy
3 reaaaallllly 2 reeeeaaallllyyyy 2 realllllyy
3 reaaaaaly 2 reeeeaaalllly 2 reallllllyyyyy
3 reaaaaaaaally 2 reeeeaaaally 2 realllllllyyyyy
3 r34lly 2 reeeeaaaalllyyy 2 realllllllyy
2 rrreally 2 reeeallyy 2 reallllllllllllllly
2 rreeaallyy 2 reeallyy 2 reallllllllllllllllly
1 rrrrrrrrrrrrrrrreeeeeeeeeeeaaaaaaalllllllyyyyyy
1 rrrrrrrrrreally
1 rrrrrrreeeeeeaaaalllllyyyyyyy
1 rrrrrrealy
1 rrrrrreally
…
1 re-he-he-heeeeally
1 re-he-he-he-ealy
1 reheheally
1 reelllyy
1 reellly
1 ree-hee-heally
…
1 reeeeeeeeeaally
1 reeeeeeeeeaaally
1 reeeeeeeeeaaaaaalllyyy
1 reeeeeeeeeaaaaaaallllllllyyyyyyyy
1 reeeeeeeeeaaaaaaallllllllyyyyyyyy
1 reeeeeeeeeaaaaaaaaalllllllllyyyyyyyy
1 reeeeeeeeaaaaaaaalllllyyyyyy
1 reallyreallyreallyreallyreallyreallyreallyreallyreallyreally
reallyreallyreallyreallyreallyreallyreally
1 reallyreallyreallyreallyreallyr33lly
1 really/really/really
1 really(really
…
1 reallllllllyyyy
1 realllllllllyyyyyy
1 realllllllllyyyyy
1 realllllllllyyyy
1 realllllllllyyy
1 reallllllllllyyyyy
1 reallllllllllllyyyyyy
1 reallllllllllllllllllly
1 reallllllllllllllllllllly
1 reallllllllllllllllllllllyyyyy
1 reallllllllllllllllllllllllllly
1 realllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
lllllllly
How many words are there?
• how many English words exist?
• when we increase the size of our corpus, what
happens to the number of types?
– a bit surprising: vocabulary continues to grow in
any actual dataset
– you’ll just never see all the words
– in 1 million tweets, 15M tokens, 600k types
– in 56 million tweets, 847M tokens, 11M types
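A sketch of how you could watch this effect on your own data; the toy corpus below is invented, so unlike real text its vocabulary saturates almost immediately:

```python
# Sketch: watch the vocabulary (type count) grow as the corpus grows.
# 'corpus' is a made-up stand-in; on real data (e.g. tweets) the type
# count keeps climbing and the type/token ratio keeps falling.
corpus = ("the cat sat on the mat " * 3 + "a dog ran by the red mat").split()

seen = set()
for i, tok in enumerate(corpus, start=1):
    seen.add(tok)
    if i % 6 == 0:
        print(f"{i} tokens -> {len(seen)} types (ratio {len(seen)/i:.2f})")
```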
Classification of Language Data
Document --> label (very partial list):
• language classification
• topic classification
• author classification
• sentiment classification
• interestingness
• relevance
– are each of these binary / multi-class / multi-label?
– classification vs. ranking: in which of these may ranking be better?
[ranking: assign a score to each label or to each item]
Sentence --> label
• mostly the same as in "document --> label", but at a more granular level
• Others?
Representing text as Features
• Indicator features over events in the data: counts of words, characters, ngrams, lemmas, stems, ...
• Pre-processing: tokenization
• Example: "the special onion soup was not very bad."
  → (0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,...,0,1,1,0,0,1,0)
[figure: each vector position is an indicator for one vocabulary entry, e.g. soup, special, dog, salad, a, lamp, the, good, not, bad, onion, was, ...]
• lemmatizing: create, created, creating, creator, creativity → create, create, create, creator, creativity
• stemming: create, created, creating, creator, creativity → creat, creat, creat, creat, creat
Note: the structure of words is called "morphology". It will be covered in more depth next week.
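These transformations can be reproduced with NLTK's stemmer and lemmatizer (again an assumption; any morphology tool would do):

```python
# Sketch: reproducing the lemma/stem columns with NLTK
# (assumes: pip install nltk; the lemmatizer also needs the WordNet data).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download for the lemmatizer

words = ["create", "created", "creating", "creator", "creativity"]

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# crude suffix chopping: the verb forms collapse to 'creat';
# exact stems for 'creator'/'creativity' depend on the stemmer's rules

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# with pos="v", the verb forms map to 'create'; nouns like
# 'creator' and 'creativity' are left unchanged
```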
ML over text
Machine learning
• "Learn from data"
• Supervised: labeled examples
• We train a function f : x ∈ X → y ∈ Y
• Usually, the data-point x is represented as "features": f : φ(x) ∈ ℝ^m → y ∈ Y
Classification: the ML view
(assuming you already know ML from a different course)
• We train a function f : x ∈ X → y ∈ Y
• Usually, the data-point x is represented as "features": f : φ(x) ∈ ℝ^m → y ∈ Y
Feature Function
• how do we represent an object? f( object ) = ?
• perform measurements and obtain features.
• Binary: y ∈ {−1, 1}
• Multi-class: y ∈ {1, 2, ..., k}
• Multi-label: y ∈ 2^{1, 2, ..., k}
• Regression*: y ∈ ℝ  (*not really a "classification" problem)
• (Structured)
[class: provide examples of each]
Types of classifiers
• Generative vs Discriminative
  – generative: model P(x, y)
  – discriminative: model P(y|x), score(x, y), or f(x) = y
• Probabilistic vs Non-probabilistic
  – probabilistic: P(x, y) or P(y|x)
  – non-probabilistic: score(x, y) or f(x) = y
• Linear vs Non-linear
Popular Classifiers
• kNN (k nearest neighbors)
• Decision trees
  – decision forests
  – gradient-boosted trees
• Logistic regression
• SVM
• "Neural networks"
• ...
In Python: scikit-learn (sklearn) is a popular and good package.
Concepts you should know
• Training set, development set, test set.
• Loss function.
• Overfitting. Regularization.
• Evaluation metric.
Evaluation Metrics
• Accuracy: |ŷ = y| / (|ŷ = y| + |ŷ ≠ y|)
  – is this a good metric? when? when not?
• Majority-class baseline
• True Positive, True Negative, False Positive, False Negative
[figure: positive points (x) and negative points (o) scattered around a decision boundary; the ŷ = 1 region holds the predicted positives, the ŷ = −1 region the predicted negatives]
• "accuracy on positive class"? "accuracy on negative class"?
• "precision"? "recall"?