Natural Language Processing
Lecture 2: Language Classification. Probability Review.
Machine Learning Background. Naive Bayes’ Classifier.

10/29/2020

COMS W4705
Yassine Benajiba
Text Classification
• Given a representation of some document d, identify which
  class the document belongs to.

  “How long does it take a smoker's lungs to clear of the tar after quitting?
  Does your chances of getting lung cancer decrease quickly or does it take
  a considerable amount of time for that to happen?”

  Candidate classes: computers, politics, religion, medicine, science,
  for-sale, autos, sports

  From the 20-Newsgroups data set:
  http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Text Classification
• Applications:

• Spam detection.

• Mood / Sentiment detection.

• Author identification.

• Identifying political affiliation.

• Word Sense Disambiguation.

• …
Text Classification
• This is a machine learning problem.

• How do we represent each document? (feature representation)

• Can use different ML techniques.

• Supervised ML: Fixed set of classes C.
  Train a classifier from a set of labeled <document, class> pairs.

  • Discriminative vs. Generative models.

• Unsupervised ML: Unknown set of classes C.
  Topic modeling.
Types of Feedback
• Supervised learning: Given a set of input-output pairs, learn a
  function that maps inputs to outputs.

• Unsupervised learning: Learn patterns in the input without any
  explicit feedback.
  One typical approach: clustering, identify clusters of input examples.

• Semi-supervised learning: Start with a few labeled
  input/output pairs, then use a lot of unlabeled data to improve.

• Reinforcement learning: Start with a policy determining the
  agent’s actions. Feedback in the form of reward or punishment.
Supervised Learning
• Given: Training data consisting of training examples
  (x1, y1), …, (xn, yn), where xi is an input example (a d-dimensional vector of
  attribute values) and yi is the label.

  example    attribute values       label
  1          x11  x12  …  x1d       y1
  …          …    …       …         …
  i          xi1  xi2  …  xid       yi
  …          …    …       …         …
  n          xn1  xn2  …  xnd       yn

• Goal: learn a hypothesis function h(x) that approximates the true
  relationship between x and y. This function should: 1) ideally be consistent
  with the training data; 2) generalize to unseen examples.

• In NLP the labels yi typically form a finite, discrete set.
Running Machine Learning Experiments
• When running machine learning experiments we typically split the
  labeled data into three sections:

  Training | Validation | Test

• For example: 80% Training, 10% Validation (development), 10% Test,
  or 90/5/5. (A minimal splitting sketch follows below.)

• The validation set is used to tune model parameters (for example
  smoothing parameters), but cannot be used for training. This can
  help with overfitting.

• The test set is used to assess the performance of the final model and
  provide an estimate of the test error.
Note: Never train or tune parameters on the test set!
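The following is a minimal sketch (in Python, with made-up names) of an 80/10/10 split of labeled data into training, validation, and test sections:

    # Shuffle labeled examples and split them into train/validation/test.
    import random

    def split_data(examples, train_frac=0.8, val_frac=0.1, seed=0):
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        n_train = int(len(examples) * train_frac)
        n_val = int(len(examples) * val_frac)
        train = examples[:n_train]
        val = examples[n_train:n_train + n_val]
        test = examples[n_train + n_val:]         # remaining ~10% held out
        return train, val, test

    # Example: 100 dummy (document, label) pairs.
    data = [("doc %d" % i, i % 2) for i in range(100)]
    train, val, test = split_data(data)
    print(len(train), len(val), len(test))        # 80 10 10

The model is trained on train, tuned on val, and evaluated once on test.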


Representing Documents
  “to be, or not to be”

• Set-of-words representation: {to, be, or, not}

• Bag-of-words representation (multi-set): {to, to, be, be, or, not}

• Vector-space model: Each word corresponds to one dimension in
  vector space. Entries are either:

  • Binary (word appears / does not appear).

  • Raw or normalized frequency counts.

  • Weighted frequency counts.

  • Probabilities.

  e.g. raw-count vector: be 2, …, not 1, …, or 1, …, to 2, …
  (A small bag-of-words sketch follows below.)
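A minimal bag-of-words sketch in Python (illustrative only, not from the lecture):

    # Build a bag-of-words (multiset) and count / binary vectors for the example.
    from collections import Counter

    tokens = "to be , or not to be".split()
    tokens = [t for t in tokens if t.isalpha()]     # drop the punctuation token

    bag = Counter(tokens)                           # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
    vocabulary = sorted(bag)                        # one dimension per word type

    count_vector = [bag[w] for w in vocabulary]                  # raw frequency counts
    binary_vector = [1 if bag[w] > 0 else 0 for w in vocabulary]

    print(vocabulary)        # ['be', 'not', 'or', 'to']
    print(count_vector)      # [2, 1, 1, 2]

In a real vector-space model the vector has one dimension for every word in the vocabulary, so most entries are 0.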
What is a Word?
• e.g., are “Cat”, “cat” and “cats” the same word?

• “September” and “Sept”?

• “zero” and “oh”?

• Is “_” a word? “.”? “*”? “(”?

• How many words are there in “don’t” ? “Gonna” ? “I.B.M.”?

• In Japanese and Chinese text -- how do we identify a word?

• …
Text Normalization

• Every NLP task needs to do some text normalization.

• Segmenting / tokenizing words in running text.

• Normalizing word forms (lemmatization or stemming,
  possibly replacing named-entities).

• Sentence splitting.
Linguistic Terminology
• Sentence: Unit of written language.

• Utterance: Unit of spoken language.

• Word Form: the inflected form as it actually appears in the corpus. “produced”

• Word Stem: The part of the word that never changes between morphological
variations. “produc”

• Lemma: an abstract base form, shared by word forms having the same stem,
  part of speech, and word sense; stands for the class of words with this stem.
  “produce”

• Type: number of distinct words in a corpus (vocabulary size).

• Token: Total number of word occurrences.


Tokenization
• Tokenization: The process of segmenting text (a sequence
  of characters) into a sequence of tokens (words).

  “Mr. O'Neill thinks that the boys' stories about Chile's capital aren't
  amusing.”

  mr. o'neill thinks that the boys' stories about chile's capital are n't
  amusing .

• Simple (but weak) approach: Separate off punctuation. Then
  split on whitespace.

• Typical implementations use regular expressions (finite state
  automata). (A rough sketch follows below.)
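A rough regex-based tokenizer in Python, in the spirit of “separate off punctuation, then split on whitespace” (a sketch, not the exact tokenizer behind the lecture’s example output):

    import re

    # Words may keep internal apostrophes, hyphens, or periods ("o'neill",
    # "pick-me-up", "ph.d"); any other non-whitespace character is its own token.
    TOKEN_RE = re.compile(r"\w+(?:['\-.]\w+)*|\S")

    def tokenize(text):
        return TOKEN_RE.findall(text.lower())

    print(tokenize("Mr. O'Neill thinks that the boys' stories aren't amusing."))
    # ['mr', '.', "o'neill", 'thinks', 'that', 'the', 'boys', "'", 'stories',
    #  "aren't", 'amusing', '.']

Note that this sketch keeps “aren't” as one token instead of splitting it into “are” and “n't” as in the example output above.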
Tokenization Issues
• Dealing with punctuation (some may be part of a word):
  “Ph.D.”, “O’Reilly”, “pick-me-up”

• Which tokens to include (punctuation might be useful for
  parsing, but not for text classification)?

• Language dependent: Some languages don’t separate
  words with whitespace.

  de: “Lebensversicherungsgesellschaftsangestellter”
  (a single compound word: “life insurance company employee”)

  zh: 日文章鱼怎么说? - Japanese / octopus / how / say? (intended segmentation:
      “How do you say octopus in Japanese?”)
      日文章鱼怎么说? - sun / article / fish / how / say? (a wrong segmentation)

  Chinese example from Sproat (1996)


Lemmatization
• Converting word forms into their base form (lemma).
  (A small sketch follows below.)

  “Mr. O'Neill thinks that the boys' stories about Chile's capital aren't
  amusing.”

  mr. o'neill think that the boy story about chile's capital are n't
  amusing .

  With named entities replaced:
  PER PER think that the boy story about LOC ’s capital are n't
  amusing .
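A minimal lemmatization sketch using NLTK’s WordNet lemmatizer (assuming nltk is installed and the “wordnet” data has been downloaded; this is not the pipeline used for the example above):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    # The part-of-speech tag matters: "n" = noun, "v" = verb.
    print(lemmatizer.lemmatize("stories", pos="n"))   # story
    print(lemmatizer.lemmatize("thinks", pos="v"))    # think
    print(lemmatizer.lemmatize("boys", pos="n"))      # boy

A stemmer (e.g. the Porter stemmer) would instead truncate to a stem such as “stori” or “produc”, which need not be a real word.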
Probabilities in NLP
• Ambiguity is everywhere in NLP. There is often uncertainty about the
  “correct” interpretation. Which is more likely:

• Speech recognition: “recognize speech” vs. “wreck a nice beach”

• Machine translation: “l’avocat general”: “the attorney general” vs.
  “the general avocado”

• Text classification: is a document that contains the word “rice”
  more likely to be about politics or about agriculture?
  What if it also includes several occurrences of the word “stir”?

• Probabilities make it possible to combine evidence from multiple
  sources systematically (using Bayesian statistics).
Bayesian Statistics
• Typically, we observe some evidence (for example, words
  in a document) and the goal is to infer the “correct”
  interpretation (for example, the topic of a text).

• Probabilities express the degree of belief we have in the
  possible interpretations.

• Prior probabilities: Probability of an interpretation prior
  to seeing any evidence.

• Conditional (Posterior) probability: Probability of an
  interpretation after taking evidence into account.
Probability Basics
• Begin with a sample space Ω.

• Each ω ∈ Ω is a possible basic outcome / “possible
  world” (e.g. the 6 possible rolls of a die).

• A probability distribution assigns a probability P(ω) to each
  basic outcome, with 0 ≤ P(ω) ≤ 1 and Σ_{ω ∈ Ω} P(ω) = 1.

• E.g. six-sided die: Ω = {1, 2, 3, 4, 5, 6} and P(ω) = 1/6 for each ω ∈ Ω.
Events
• An event A is any subset of Ω, with P(A) = Σ_{ω ∈ A} P(ω).

• Example: A = “the roll is even” = {2, 4, 6}, so P(A) = 3 · 1/6 = 1/2.
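A tiny sketch of the die example in Python (illustrative only):

    from fractions import Fraction

    # Distribution over the six basic outcomes; each gets probability 1/6.
    P = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

    def prob(event):
        """Probability of an event = sum of the probabilities of its outcomes."""
        return sum(P[w] for w in event)

    A = {2, 4, 6}                  # the event "the roll is even"
    print(prob(A))                 # 1/2
    print(sum(P.values()))         # 1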
Random Variables
• A random variable X is a function from basic outcomes to
  some range, e.g. real numbers or booleans.

• A distribution P over Ω induces a probability distribution for any
  random variable: P(X = x) = Σ_{ω : X(ω) = x} P(ω).

• E.g. for the die, let X(ω) = 1 if ω is even and 0 otherwise; then
  P(X = 1) = P({2, 4, 6}) = 1/2.
Joint and Conditional Probability
• Joint probability: P(A ∩ B), also written as P(A, B).

  [Venn diagram of two overlapping events A and B]

• Conditional probability: P(A | B) = P(A ∩ B) / P(B)   (for P(B) > 0).
Rules for Conditional Probability
• Product rule: P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)

• Chain rule (generalization of the product rule):
  P(A1 ∩ A2 ∩ … ∩ An) = P(A1) P(A2 | A1) P(A3 | A1, A2) … P(An | A1, …, An-1)

• Bayes’ Rule: P(A | B) = P(B | A) P(A) / P(B)
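A small numeric check of these rules on the die example (a sketch; the two events are chosen only for illustration):

    from fractions import Fraction

    P = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

    def prob(event):
        return sum(P[w] for w in event)

    A = {2, 4, 6}                 # "the roll is even"
    B = {4, 5, 6}                 # "the roll is at least 4"

    P_A_given_B = prob(A & B) / prob(B)            # conditional probability: 2/3
    P_B_given_A = prob(A & B) / prob(A)            # 2/3

    print(prob(A & B) == P_A_given_B * prob(B))               # product rule: True
    print(P_A_given_B == P_B_given_A * prob(A) / prob(B))     # Bayes' rule: True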
Independence
• Two events A and B are independent if P(A ∩ B) = P(A) P(B),
  or equivalently (if P(B) > 0) P(A | B) = P(A).

• Two events A and B are conditionally independent given C if:
  P(A ∩ B | C) = P(A | C) P(B | C),
  or equivalently P(A | B, C) = P(A | C)
  and P(B | A, C) = P(B | C).
Probabilities and Supervised Learning
• Given: Training data consisting of training examples
  data = (x1, y1), …, (xn, yn).
  Goal: Learn a mapping h from x to y.

• We would like to learn this mapping using P(y | x).

• Two approaches:

  • Discriminative algorithms learn P(y | x) directly.

  • Generative algorithms model P(x | y) and P(y) and use Bayes’ rule:
    P(y | x) = P(x | y) P(y) / P(x)
Discriminative Algorithms
• Model the conditional distribution of the label given the data: P(y | x).

• Learn decision boundaries that separate instances of the
  different classes.

• To predict a new example, check on which side of the
  decision boundary it falls.

• Examples:
  support vector machines (SVMs), decision trees, random
  forests, neural networks, log-linear models.
Generative Algorithms
• Assume the observed data is being “generated” by a
  “hidden” class label.

• Build a different model for each class.

• To predict a new example, check it under each of the
  models and see which one matches best.

• Estimate P(x | y) and P(y). Then use Bayes’ rule:
  P(y | x) = P(x | y) P(y) / P(x)

• Examples:
  Naive Bayes, Hidden Markov Models, Gaussian Mixture Models, PCFGs.
Naive Bayes
  [Graphical model: a Label node with an arrow to each attribute node X1, X2, …, Xd]

  Naive Bayes assumption: the attributes are conditionally independent given the label:
  P(X1, …, Xd | Label) = Π_{i=1..d} P(Xi | Label)
Naive Bayes Classifier
  P(Label | X1, …, Xd) = α P(Label) Π_{i=1..d} P(Xi | Label)

  prediction = argmax_{label} P(label) Π_{i=1..d} P(Xi | label)

Note that the normalizer α no longer matters for the argmax
because α is independent of the class label.
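A minimal prediction sketch in Python (the probability tables below are hypothetical placeholders, and sums of log probabilities are used to avoid numerical underflow):

    import math

    prior = {"medicine": 0.5, "autos": 0.5}
    likelihood = {
        "medicine": {"lungs": 0.01, "cancer": 0.02, "engine": 0.001},
        "autos":    {"lungs": 0.001, "cancer": 0.001, "engine": 0.02},
    }

    def predict(tokens):
        # Score each label by log P(label) + sum_i log P(token_i | label).
        # Words missing from the likelihood table are not handled here (see smoothing later).
        scores = {}
        for label in prior:
            score = math.log(prior[label])
            for t in tokens:
                score += math.log(likelihood[label][t])
            scores[label] = score
        return max(scores, key=scores.get)    # the normalizer α plays no role in the argmax

    print(predict(["lungs", "cancer"]))       # medicine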
Training the Naive Bayes’ Classifier
• Goal: Use the training data to estimate P(Label) and P(Xi | Label).

• Estimate the prior and the conditional probabilities using Maximum
  Likelihood Estimates (MLE):

  P(Label = c) = count(documents with label c) / count(all documents)

  P(Xi = w | Label = c) = count(occurrences of w in documents with label c)
                          / count(all tokens in documents with label c)

• I.e. we just count how often each token in the document appears
  together with each class label.
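A minimal MLE training sketch in Python on a toy labeled corpus (the data and names are made up for illustration):

    from collections import Counter, defaultdict

    training_data = [
        (["lungs", "cancer", "smoker"], "medicine"),
        (["engine", "car", "tires"], "autos"),
        (["cancer", "treatment"], "medicine"),
    ]

    label_counts = Counter()
    word_counts = defaultdict(Counter)        # word_counts[label][word]

    for tokens, label in training_data:
        label_counts[label] += 1
        word_counts[label].update(tokens)

    n_docs = sum(label_counts.values())
    prior = {c: label_counts[c] / n_docs for c in label_counts}
    likelihood = {
        c: {w: word_counts[c][w] / sum(word_counts[c].values())
            for w in word_counts[c]}
        for c in word_counts
    }

    print(prior["medicine"])                  # 0.666... (2 of the 3 documents)
    print(likelihood["medicine"]["cancer"])   # 0.4 (2 of the 5 tokens seen with medicine)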
Why the Independence Assumption Matters
• Without the independence assumption we would have to
  estimate P(X1, …, Xd | Label) jointly.

• There would be many combinations of x1, …, xd that are
  never seen (sparse data).

• The independence assumption allows us to estimate each
  P(Xi | Label) independently.

Is this a safe assumption for documents?
Are the words really independent of each other?
Training the Naive Bayes’ Classifier
• Ways to improve this model?

• Some issues to consider...

• What if there are words that do not appear in the
  training set? What if a word appears only once?
  (A smoothing sketch follows below.)

• What if the plural of a word never appears in the training
  set?

• How are extremely common words (e.g., “the”, “a”)
  handled?
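One common answer to the unseen-word problem is add-one (Laplace) smoothing: pretend every vocabulary word was seen k extra times with each label. A hedged sketch (word_counts is assumed to hold the training counts from the sketch above):

    from collections import Counter, defaultdict

    word_counts = defaultdict(Counter)
    word_counts["medicine"].update(["lungs", "cancer", "smoker", "cancer", "treatment"])
    vocabulary = {"lungs", "cancer", "smoker", "engine", "car", "tires", "treatment"}

    def smoothed_likelihood(word, label, k=1.0):
        # Add k to every count; the denominator grows by k * |vocabulary| so the
        # smoothed probabilities still sum to one over the vocabulary.
        counts = word_counts[label]
        return (counts[word] + k) / (sum(counts.values()) + k * len(vocabulary))

    print(smoothed_likelihood("cancer", "medicine"))   # (2 + 1) / (5 + 7) = 0.25
    print(smoothed_likelihood("engine", "medicine"))   # (0 + 1) / (5 + 7) ≈ 0.083, no longer zero

The smoothing parameter k is exactly the kind of value one would tune on the validation set rather than the test set.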
Acknowledgments

• Some slides and examples from:

• Kathy McKeown, Dragomir Radev
