Lecture 02
Lecture 2: Language Classification. Probability Review.
Machine Learning Background. Naive Bayes’ Classifier.
10/29/2020
COMS W4705: Natural Language Processing
Yassine Benajiba
Text Classification
• Given a representation of some document d, identify which
class the document belongs to.
• Spam detection.
• Author identification.
• …
Text Classification
• This is a machine learning problem.
• The test set is used to assess the performance of the final model and to
provide an estimate of the test error.
• Set-of-words representation, e.g. the set { to, be, or, not } (see the sketch
below).
• …
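As a minimal illustration of the set-of-words idea (the example sentence and the
function name below are made up for this sketch, not taken from the lecture):

```python
# Set-of-words representation: a document is reduced to the set of word
# types it contains, ignoring word order and word counts.

def set_of_words(document: str) -> set[str]:
    """Lowercase the document, split on whitespace, return the set of types."""
    return set(document.lower().split())

print(set_of_words("to be or not to be"))   # {'to', 'be', 'or', 'not'}
```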
Text Normalization
• Sentence splitting.
Linguistic Terminology
• Sentence: Unit of written language.
• Word Form: the inflected form as it actually appears in the corpus. “produced”
• Word Stem: The part of the word that never changes between morphological
variations. “produc”
• Lemma: an abstract base form shared by word forms that have the same stem,
part of speech, and word sense; it stands for the class of words with that stem.
“produce”
“Mr. O'Neill thinks that the boys' stories about Chile's capital aren't
amusing.”
Tokenized and lowercased:
mr. o'neill thinks that the boys' stories about chile's capital are n't
amusing .
Lemmatized:
mr. o'neill think that the boy story about chile's capital are n't
amusing .
With named entities normalized:
PER PER think that the boy story about LOC ’s capital are n't
amusing .
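A rough sketch of the normalization steps illustrated above (tokenization,
lowercasing, and stemming vs. lemmatization). NLTK is used here only as an
example toolkit; the lecture does not prescribe a library, and exact token
boundaries differ between tokenizers.

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)     # tokenizer model
nltk.download("wordnet", quiet=True)   # lemmatizer dictionary

sentence = ("Mr. O'Neill thinks that the boys' stories "
            "about Chile's capital aren't amusing.")

# Tokenize and lowercase; Treebank-style tokenization splits "aren't"
# into "are" + "n't".
tokens = [t.lower() for t in word_tokenize(sentence)]
print(tokens)

# Stem vs. lemma for a single word form (cf. the "produced" example above).
print(PorterStemmer().stem("produced"))                    # 'produc'  (stem)
print(WordNetLemmatizer().lemmatize("produced", pos="v"))  # 'produce' (lemma)

# Lemmatization generally needs the part of speech of each token,
# e.g. "thinks" (verb) -> "think", "boys" (noun) -> "boy".
print(WordNetLemmatizer().lemmatize("thinks", pos="v"))    # 'think'
print(WordNetLemmatizer().lemmatize("boys", pos="n"))      # 'boy'
```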
Probabilities in NLP
• Ambiguity is everywhere in NLP. There is often uncertainty about the
“correct” interpretation. Which is more likely:
• Example:
Random Variables
• A random variable is a function from basic outcomes to
some range, e.g. real numbers or booleans.
• E.g., the number of heads in a sequence of coin flips (a number), or whether
a word is capitalized (a boolean).
Joint and Conditional Probability
• Joint probability: P(A, B), also written as P(A ∩ B).
• Conditional probability: P(A | B) = P(A, B) / P(B), defined when P(B) > 0.
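A small numerical sketch of these definitions; the event names and counts below
are invented for illustration.

```python
# Estimate a joint and a conditional probability from (made-up) counts.
# A = "email contains the word 'free'", B = "email is spam".
from fractions import Fraction

n_total   = 100   # total number of emails
n_A_and_B = 15    # emails that contain "free" and are spam
n_B       = 25    # emails that are spam

p_joint     = Fraction(n_A_and_B, n_total)   # P(A, B) = P(A ∩ B)
p_B         = Fraction(n_B, n_total)         # P(B)
p_A_given_B = p_joint / p_B                  # P(A | B) = P(A, B) / P(B)

print(p_joint, p_B, p_A_given_B)             # 3/20 1/4 3/5
```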
Rules for Conditional Probability
• Product rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Bayes’ Rule: P(A | B) = P(B | A) P(A) / P(B)
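A worked sketch of Bayes’ Rule with invented numbers, continuing the spam
example:

```python
# Bayes' rule: P(B | A) = P(A | B) P(B) / P(A), with
# B = "email is spam" and A = "email contains the word 'free'".
p_B            = 0.25   # prior P(spam)
p_A_given_B    = 0.60   # P("free" | spam)
p_A_given_notB = 0.05   # P("free" | not spam)

# Marginal P(A) by total probability: P(A|B) P(B) + P(A|not B) P(not B).
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

p_B_given_A = p_A_given_B * p_B / p_A    # posterior P(spam | "free")
print(round(p_B_given_A, 3))             # 0.8
```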
Independence
• Two events A and B are independent if P(A, B) = P(A) P(B),
or equivalently (if P(B) > 0) P(A | B) = P(A),
or equivalently (if P(A) > 0) P(B | A) = P(B).
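A quick numerical check of independence on a small joint distribution (the
table values are invented):

```python
# A and B are independent iff P(a, b) = P(a) * P(b) for all value pairs,
# or equivalently P(a | b) = P(a) whenever P(b) > 0.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.4, (1, 1): 0.1}   # P(A = a, B = b)

# Marginals by summing out the other variable.
p_A = {a: sum(p for (x, _), p in joint.items() if x == a) for a in (0, 1)}
p_B = {b: sum(p for (_, y), p in joint.items() if y == b) for b in (0, 1)}

independent = all(abs(joint[(a, b)] - p_A[a] * p_B[b]) < 1e-12
                  for a in (0, 1) for b in (0, 1))
print(p_A, p_B, independent)   # marginals {0: 0.5, 1: 0.5}, {0: 0.8, 1: 0.2}; True
```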
Probabilities and Supervised Learning
• Given: Training data consisting of training examples
data = (x1, y1), …, (xn, yn).
• Goal: Learn a mapping h from x to y.
• Two approaches: discriminative algorithms, which model P(y | x) (or learn the
mapping h directly), and generative algorithms, which model the joint P(x, y).
• Examples of discriminative algorithms:
support vector machines (SVMs), decision trees, random
forests, neural networks, log-linear models.
Generative Algorithms
• Assume the observed data is being “generated” by a
“hidden” class label.
• Examples:
Naive Bayes, Hidden Markov Models, Gaussian Mixture Models, PC
Naive Bayes
[Figure: the Naive Bayes graphical model, with a class node Label and arrows to
the attribute nodes X1, X2, …, Xd.]
Naive Bayes Classifier
P(Label | X1, …, Xd) = α · P(Label) · P(X1 | Label) · … · P(Xd | Label)
Prediction: choose the Label that maximizes P(Label) · P(X1 | Label) · … · P(Xd | Label).
Note that the normalizer α no longer matters for the argmax,
because α is independent of the class label.
Training the Naive Bayes’ Classifier
• Goal: Use the training data to estimate P(Label) and P(Xi | Label).
• Both are estimated from relative frequencies, i.e. we just count how often each
token in a document appears together with each class label (and how often each
class label appears), as sketched below.
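A minimal count-based sketch of this training procedure together with the
argmax prediction from the previous slide. The toy corpus is invented, and the
add-one smoothing is an extra assumption beyond the slide, added only to avoid
zero probabilities for unseen token/label pairs.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_tokens, label) pairs. Training = counting."""
    label_counts = Counter()               # how often each label appears
    token_counts = defaultdict(Counter)    # token_counts[label][token]
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab

def predict(tokens, label_counts, token_counts, vocab):
    """Return argmax_label P(label) * prod_i P(token_i | label), in log space."""
    n_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, c in label_counts.items():
        logp = math.log(c / n_docs)                   # log P(label)
        total = sum(token_counts[label].values())
        for t in tokens:
            count = token_counts[label][t]
            logp += math.log((count + 1) / (total + len(vocab)))  # add-one smoothing
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Tiny made-up training set for illustration.
docs = [("buy cheap pills now".split(), "spam"),
        ("cheap cheap offer".split(), "spam"),
        ("meeting schedule for monday".split(), "ham"),
        ("project meeting notes".split(), "ham")]
params = train(docs)
print(predict("cheap pills offer".split(), *params))   # -> 'spam'
```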
Why the Independence Assumption Matters
• Without the independence assumption we would have to estimate the full joint
P(X1, …, Xd | Label), which has a parameter for every possible combination of
attribute values: exponentially many in the number of attributes d.
• With the independence assumption we only need the d individual distributions
P(Xi | Label).
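A back-of-the-envelope comparison of the number of parameters involved; the
vocabulary size V and the number of attributes d below are made-up illustrative
numbers.

```python
import math

V, d = 20_000, 100   # illustrative vocabulary size and number of attributes

# Full joint P(X1, ..., Xd | Label): one parameter per combination of values.
print(f"full joint : ~10^{d * math.log10(V):.0f} parameters per label")

# Naive Bayes: d independent distributions P(Xi | Label).
print(f"naive Bayes: {d * V:,} parameters per label")
```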