NLP4Web Lecture 2: Text Classification
Lecture 2
Foundations of Text Classification
Nr. Lecture
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation, Data Collection
05 IR – Re-Ranking Methods
06 IR – Language Domain Shifts, Dense / Sparse Retrieval
07 LLM – Language Modeling Foundations
08 LLM – Neural LLM, Tokenization
09 LLM – Transformers, Self-Attention
10 LLM – Adaptation, LoRA, Prompting
11 LLM – Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam
Segmentation
Morphology
Syntax
Semantics
Algorithms
Naive Bayes
Hidden Markov Models
[Diagram: Input Text → Classification Model → Output tags / classes]
▪ This is it. This is the one. This is the worst movie ever made. Ever. It
beats everything. I have never seen worse.
▪ Expertly scripted and perfectly delivered, this searing parody leaves you
literally rolling with laughter.
▪ While watching this film I started to come up with things I would rather be
doing, including drinking bleach, rubbing sand in my eyes, and tax
returns.
▪ Just finished watching this movie for maybe the 7th or 8th time
IF "basketball" THEN
return top_sports
ELSEIF…
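The IF/ELSEIF sketch above can be written out as a minimal rule-based classifier. The keywords and output labels ("sports", "politics", "other") below are illustrative assumptions, not rules from the slides:

```python
# A minimal sketch of the rule-based approach above; the keyword rules
# and label names are illustrative assumptions.
def rule_based_topic(text: str) -> str:
    text = text.lower()
    if "basketball" in text:      # IF "basketball" THEN return the sports label
        return "sports"
    elif "election" in text:      # ELSEIF further keyword rules ...
        return "politics"
    return "other"                # fallback when no rule fires
```

Such rules are easy to write but brittle: every new topic or phrasing needs another hand-crafted rule, which is exactly what the trained models in the next slides avoid.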
Step 1: Training
[Diagram: Input Text → Feature Extractor → Features; Features + Tags → train the Classification Model]
Step 2: Prediction
[Diagram: Input Text → Feature Extractor → Features → Classification Model → Tags]
Kuzman T, Mozetič I, Ljubešić N. Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of
Classification Methods in the Era of Large Language Models. Machine Learning and Knowledge Extraction. 2023; 5(3):1149-1175
Algorithms
Naive Bayes
Hidden Markov Models
Bayes' Rule:
P(O|E) = P(E|O) × P(O) / P(E)

With multiple (conditionally independent) pieces of evidence:
P(O|E1, …, En) = P(E1|O) × P(E2|O) × … × P(En|O) × P(O) / P(E1, E2, …, En)
▪ Notes:
• If P(evidence|outcome) is 1, then we are just multiplying by 1.
• If P(some particular evidence|outcome) is 0, then the whole probability becomes 0.
• Since we divide everything by P(evidence), which is the same for every outcome, we can even get away without calculating it.
• The intuition behind multiplying by the prior is to give high probability to common outcomes and low probability to unlikely outcomes.
• Priors are also called base rates; they are a way to scale our predicted probabilities.
▪ Let's say that we have data on 1000 pieces of fruit: Banana, Orange or
some Other Fruit
▪ We know 3 characteristics about each fruit
▪ Whether it is Long
▪ Whether it is Sweet and
▪ If its color is Yellow
▪ Our training set
Type | Long | Not Long | Sweet | Not Sweet | Yellow | Not Yellow | Total
▪ "Prior" probabilities
▪ P(Banana) = 500/1000 = 0.5, P(Orange) = 0.3, P(Other Fruit) = 0.2
▪ Probability of "Evidence"
▪ P(Long) = 500/1000 = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8
▪ Probability of "Likelihood"
▪ P(Long|Banana) = 0.8, P(Long|Orange) = 0
▪ P(Yellow|Other Fruit) = 50/200 = 0.25, P(Not Yellow|Other Fruit) = 0.75
Given an unknown fruit which is long, sweet and yellow, is it Banana, Orange or Other Fruit?
0.252 >> 0.01875 => the unknown fruit is most likely a banana
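The comparison above can be reproduced in a few lines of Python. The slides do not list all per-class counts, so the count table below is an assumed reconstruction consistent with every stated probability (e.g. P(Long|Banana) = 400/500 = 0.8, P(Sweet) = 650/1000 = 0.65):

```python
# Assumed training counts, reconstructed to match the stated probabilities.
counts = {
    "Banana": {"Long": 400, "Sweet": 350, "Yellow": 450, "Total": 500},
    "Orange": {"Long": 0,   "Sweet": 150, "Yellow": 300, "Total": 300},
    "Other":  {"Long": 100, "Sweet": 150, "Yellow": 50,  "Total": 200},
}
N = 1000  # total number of fruits

def naive_bayes_score(fruit, evidence):
    # Unnormalised posterior: prior * product of per-feature likelihoods.
    # We skip dividing by P(evidence), since it is identical for every class.
    score = counts[fruit]["Total"] / N
    for feature in evidence:
        score *= counts[fruit][feature] / counts[fruit]["Total"]
    return score

scores = {f: naive_bayes_score(f, ["Long", "Sweet", "Yellow"]) for f in counts}
# Banana: 0.5 * 0.8 * 0.7 * 0.9 = 0.252; Orange: 0 (P(Long|Orange) = 0);
# Other: 0.2 * 0.5 * 0.75 * 0.25 = 0.01875
```

Note how Orange drops out entirely: a single zero likelihood, P(Long|Orange) = 0, zeroes the whole product, which is the second note above in action.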
▪ The best class c for a document d is found by selecting the class for which the maximum a posteriori (MAP) probability is maximal:
c_map = argmax_{c ∈ C} P(c|d) = argmax_{c ∈ C} P(d|c) × P(c)
Algorithms
Naive Bayes
Hidden Markov Models
Possible Answers
▪ Sequence labeling as classification:
▪ Pointwise prediction: predict each word individually with a classifier
▪ Generative sequence models: e.g. Hidden Markov Model (HMM)
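Pointwise prediction can be sketched as below: one classifier call per token, optionally fed the previously predicted tag as a feature. The feature set and the toy lexicon "classifier" are assumptions for illustration; a real system would train a classifier over such features:

```python
def pointwise_tag(tokens, classify):
    # Tag each token individually, passing the previous prediction as a feature.
    tags = []
    for token in tokens:
        features = {"word": token.lower(),
                    "prev_tag": tags[-1] if tags else "<s>"}
        tags.append(classify(features))
    return tags

# Hypothetical stand-in classifier: most-frequent tag per word form.
# (It ignores prev_tag; a trained classifier would use it.)
lexicon = {"john": "PN", "saw": "V", "the": "Det", "table": "N"}
predicted = pointwise_tag("John saw the table".split(),
                          lambda f: lexicon.get(f["word"], "N"))
```

The weakness is visible already in this sketch: a lexicon lookup must pick one tag per word form, so an ambiguous word like "saw" (verb or noun) cannot be tagged differently depending on context.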
[Diagram: pointwise prediction — a classifier tags each token of "John saw the saw and decided to take it to the table." in turn, either left-to-right (John/PN, saw/V, the/Det, saw/N, and/Conj, decided/V, …) or right-to-left (table/N, the/Det, to/Prep, …)]
Problems
▪ Not easy to integrate information from the categories of tokens on both sides.
▪ Difficult to propagate uncertainty between decisions and to "collectively" determine the most likely joint assignment of categories to all of the tokens in a sequence.
▪ Choose the tag sequence t1 … tn that is most probable given the observation sequence of n words w1 … wn:
t̂1…n = argmax_{t1…tn} P(t1 … tn | w1 … wn)
▪ Bayes' Rule:
t̂1…n = argmax_{t1…tn} P(w1 … wn | t1 … tn) × P(t1 … tn) / P(w1 … wn)
Task:
For an observed output sequence, what is the (hidden) state sequence that has the highest probability to produce this output?
WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 72
Hidden Markov Model - Example
Every day, Darth Vader is in one of three moods: Good, Neutral or Bad
But, because he wears his mask, we cannot observe it!
Somehow, you know the odds of how his mood changes from day to day:
[Diagram: transition probabilities between the Good, Neutral, and Bad moods — values 0.8, 0.4, 0.5, 0.2, …]
What we CAN observe is whether Darth Vader destroys a planet or not, which depends on his mood!
[Diagram: per-mood emission probabilities for destroying / not destroying a planet — values 0.3, 0.4, 0.2, 0.5, …]
We observe that he does not destroy a planet on the first day, but he destroys a planet on each of the second and third days:
Question:
What is the most probable sequence of his mood on these three days?
[Worked frames for days one to three; e.g. probability for the partial sequence: 0.03 (day one) × 0.35 (day two) = 0.0105]
In our POS tagging example, we know the sequence of words, and we want to
know the sequence of POS tags!
▪ (hidden) States: POS tags
▪ (observable) Outputs: Tokens
[Diagram: HMM trellis for POS tagging — hidden states t1 t2 t3 t4 with transition probabilities among the tags N, V, DT, and emission probabilities to the observed tokens]
▪ The emission probabilities: the number of times the word was associated with the tag in the labeled corpus, divided by the number of times the tag was seen in the labeled corpus.
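These count-based estimates (emission probabilities for P(word|tag), and transition probabilities analogously for P(tag|previous tag)) can be sketched as follows; the function and variable names are my own:

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sents):
    """tagged_sents: list of sentences, each a list of (word, tag) pairs."""
    emit_counts = defaultdict(Counter)   # tag -> word counts
    trans_counts = defaultdict(Counter)  # previous tag -> tag counts
    tag_counts = Counter()
    for sentence in tagged_sents:
        prev = "<s>"                     # artificial sentence-start state
        for word, tag in sentence:
            emit_counts[tag][word] += 1
            trans_counts[prev][tag] += 1
            tag_counts[tag] += 1
            prev = tag
    # Emission: count(tag emits word) / count(tag); transition analogously.
    emit_p = {t: {w: c / tag_counts[t] for w, c in ws.items()}
              for t, ws in emit_counts.items()}
    trans_p = {p: {t: c / sum(ts.values()) for t, c in ts.items()}
               for p, ts in trans_counts.items()}
    return trans_p, emit_p
```

For example, on the two tiny sentences "the/DT man/N" and "the/DT saw/N", this yields P(man|N) = 1/2 and P(N|DT) = 1.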
Question: What is the most likely state sequence given an output sequence?
▪ Naïve solution:
▪ brute force search by enumerating all possible sequences of states
▪ Complexity O(s^m)
▪ where m is the length of the input and s is the number of states in the model.
▪ Better solution: Dynamic Programming!
▪ Standard procedure is called the Viterbi algorithm
▪ Running time is O(m·s²),
▪ where m is the length of the input and s is the number of states in the model.
Let us say we have only two possible states, A and B, and some observation o.
What is the best possible state sequence of length 5 for this observation o?
It is either:
▪ the best possible sequence of length 4 that ends with A, followed by one more state, or
▪ the best possible sequence of length 4 that ends with B, followed by one more state.
This is only true because the next state only depends on the state directly before!
So, what is the best possible sequence of length 4 that ends with A? It is either:
▪ the best possible sequence of length 3 that ends with A, followed by A
▪ the best possible sequence of length 3 that ends with B, followed by A
▪ …
Viterbi algorithm: Example
[Trellis for "The man", first column: after "The", DT = 0.6 × 0.5 = 0.3, V = 0, N = 0]
[Trellis step for "man", state V:]
V = max {
P(man|V) × P(V|DT) × 0.3,
P(man|V) × P(V|V) × 0,
P(man|V) × P(V|N) × 0
} = max { 0.1 × 0.3 × 0.3, 0.1 × 0.2 × 0, 0.1 × 0.6 × 0 } = 0.009 (backpointer: DT)
[Trellis after "The man": DT = 0, V = 0.009 (from DT), N = 0.054 (from DT)]
[Trellis step for "saw", state V:]
V = max {
P(saw|V) × P(V|DT) × 0,
P(saw|V) × P(V|V) × 0.009,
P(saw|V) × P(V|N) × 0.054
} = max { 0.2 × 0.3 × 0 (= 0), 0.2 × 0.2 × 0.009 (= 0.00036), 0.2 × 0.6 × 0.054 (= 0.00648) } = 0.00648 = 6.48 × 10⁻³ (backpointer: N)
[Remaining trellis columns for "the" and the final "saw" ($$ marks the sentence end), e.g. a best score of 1.62 × 10⁻³; following the backpointers decodes the/DT and saw/N]
Now we see why: for every token (m) we have to evaluate every POS tag (s) in combination with every possible predecessor POS tag (s), so we have m × s × s operations = m·s².
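The trellis computation above fits in a few lines of code. The mood model below is an assumed toy instantiation in the spirit of the Darth Vader example (the slide's exact probabilities are not fully recoverable here), used only to exercise the algorithm:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: probability of the best state sequence ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor: max over previous states (m * s * s overall)
            prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # Follow backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

# Assumed toy mood model: observations are "yes"/"no" for planet destruction.
states = ["Good", "Bad"]
start_p = {"Good": 0.6, "Bad": 0.4}
trans_p = {"Good": {"Good": 0.7, "Bad": 0.3}, "Bad": {"Good": 0.4, "Bad": 0.6}}
emit_p = {"Good": {"no": 0.8, "yes": 0.2}, "Bad": {"no": 0.3, "yes": 0.7}}
path, prob = viterbi(["no", "yes", "yes"], states, start_p, trans_p, emit_p)
# path is the most probable mood sequence for the three observed days
```

Each column of `V` corresponds to one trellis column in the worked example, and `back` stores the backpointers (DT, N, …) shown in the frames above.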
▪ Sequence Labeling:
▪ Input and output are signal sequences
▪ No individual classification per signal, but joint classification that minimizes
some cost
▪ Hidden Markov Models
▪ Emissions can be observed
▪ States are hidden
▪ Goal: Find most probable state sequence for a given emission sequence
▪ Solve via Viterbi (dynamic programming)
Next Lecture
Information Retrieval
Introduction