
NLP and the Web – WS 2024/2025

Lecture 2
Foundations of Text Classification

Dr. Thomas Arnold


Hovhannes Tamoyan
Kexin Wang

Ubiquitous Knowledge Processing Lab


Technische Universität Darmstadt
Syllabus (tentative)

Nr. Lecture
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation, Data Collection
05 IR – Re-Ranking Methods
06 IR – Language Domain Shifts, Dense / Sparse Retrieval
07 LLM – Language Modeling Foundations
08 LLM – Neural LLM, Tokenization
09 LLM – Transformers, Self-Attention
10 LLM – Adaptation, LoRA, Prompting
11 LLM – Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 2


Last Lecture – Linguistic Analysis Levels

Phonetics and Phonology

Segmentation

Morphology

Syntax

Semantics

Pragmatics and Discourse


WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 3
Today's lecture

Text Classification: Introduction

Algorithms
Naive Bayes
Hidden Markov Models

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 4


What is Text Classification?

Input Text → Classification Model → Output tags / classes

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 5


Examples – Spam Detection

(The slide shows an example spam e-mail, From: Danke ([email protected]).)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 6


Examples – Sentiment Analysis

Positive or negative movie review?

▪ This is it. This is the one. This is the worst movie ever made. Ever. It
beats everything. I have never seen worse.

▪ Expertly scripted and perfectly delivered, this searing parody leaves you
literally rolling with laughter.

▪ While watching this film I started to come up with things I would rather be
doing, including drinking bleach, rubbing sand in my eyes, and tax
returns.

▪ Just finished watching this movie for maybe the 7th or 8th time

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 7


More Examples

Topic Labeling
Age/Gender Identification
Authorship Identification
Language Identification

Example text: "I've seen George Foreman shadow boxing and the shadow won."


WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 8
Approaches
Rule-based

Handcrafted linguistic rules


Human comprehensible

Pro: Precision can be high


Con: Very expensive to build and maintain

IF "basketball" THEN
return top_sports
ELSEIF…

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 9
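The IF/ELSEIF pseudocode above can be made concrete in a few lines of Python. This is only an illustrative sketch; the keyword lists and labels (top_sports, top_politics, other) are assumptions, not rules from the lecture.

```python
# Minimal sketch of a handcrafted rule-based classifier (illustrative only).
RULES = [
    ({"basketball", "football", "goal"}, "top_sports"),        # assumed keyword list
    ({"election", "parliament", "minister"}, "top_politics"),  # assumed keyword list
]

def rule_classify(text: str, default: str = "other") -> str:
    tokens = set(text.lower().split())
    for keywords, label in RULES:
        if tokens & keywords:   # rule fires if any keyword occurs in the text
            return label
    return default

print(rule_classify("The basketball game went into overtime"))  # -> top_sports
```

Such rules can be very precise, but every new topic or phrasing requires another handcrafted rule, which is what makes the approach expensive to build and maintain.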


Approaches
Supervised Machine Learning

Step 1: Training

Input Text → Feature Extractor → Features
Features + Tags → Classification Model

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 10


Approaches
Supervised Machine Learning

Step 2: Prediction

Input Text → Feature Extractor → Features
Features → Classification Model → Tags

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 11


Approaches
Supervised Machine Learning

Classification model based on Training Data


Human comprehensible? Dependent on the model!

Pro: Easier to maintain, usually more accurate


Con: Needs training data

Classification Model

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 12
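As a sketch of the two steps above (training, then prediction), here is a minimal supervised pipeline. scikit-learn is used as one possible toolkit and the four training texts are made up for illustration; the slides do not prescribe any particular library or model.

```python
# Minimal sketch: feature extractor + classification model (training, then prediction).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["great movie, loved it", "worst film ever",
               "wonderful acting", "boring and bad"]           # made-up examples
train_tags = ["pos", "neg", "pos", "neg"]

# Step 1: Training
vectorizer = CountVectorizer()                   # feature extractor (bag of words)
features = vectorizer.fit_transform(train_texts)
model = MultinomialNB().fit(features, train_tags)

# Step 2: Prediction (same feature extractor, then the trained model)
new_features = vectorizer.transform(["loved the wonderful acting"])
print(model.predict(new_features))               # -> ['pos']
```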


Comparison to Neural Models

Kuzman T, Mozetič I, Ljubešić N. Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of
Classification Methods in the Era of Large Language Models. Machine Learning and Knowledge Extraction. 2023; 5(3):1149-1175

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 13


Today's lecture

Text Classification: Introduction

Algorithms
Naive Bayes
Hidden Markov Models

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 14


Naïve Bayes Classifier: Background

▪ Before starting let’s review related concepts:


▪ Conditional Probability
▪ Bayes' Rule.

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 15


Conditional Probability

▪ Conditional Probability in plain English:


▪ What is the probability that something will happen, given that something
else has already happened.
▪ Assume we have some Outcome O and some Evidence E.
▪ 𝑃(𝑂, 𝐸): the Probability of having both the Outcome O and Evidence E is the
multiplication of two probabilities:
▪ 𝑃(𝑂): Probability of O occurring
▪ 𝑃(𝐸|𝑂): Probability of E given that O happened

𝑃(𝑂, 𝐸) = 𝑃(𝑂) × 𝑃(𝐸|𝑂)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 16


Conditional Probability: Example

▪ Let's say we have a group of students


▪ Students could be either awake or asleep
▪ Students are also either bachelor or master students
▪ If we select one student randomly: what is the probability that this person
is a sleeping bachelor student?
▪ Conditional Probability can help us answer that:
▪ P(Asleep, Bachelor) = P(Asleep) × P(Bachelor | Asleep)
▪ We could compute the exact same thing, the reverse way
▪ P(Asleep, Bachelor) = P(Bachelor) × P(Asleep | Bachelor)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 17


Bayes Rule

▪ Conceptually, this is a way to go from


P(Evidence | Outcome) to P(Outcome | Evidence)
▪ Let's see how to do that
▪ We have:
▪ P(O, E) = P(O) × P(E|O) and P(O, E) = P(E) × P(O|E)
▪ → P(O) × P(E|O) = P(E) × P(O|E)

P(O | E) = P(E|O) × P(O) / P(E)

Bayes Rule

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 18


Getting to Naïve Bayes'

▪ So far, we have talked only about one piece of evidence.


▪ In reality, we have to predict an outcome given multiple pieces of evidence
▪ the math gets very complicated :(
▪ Naive Bayes' solution:
▪ treat each piece of evidence as independent → simpler math :)
▪ This approach is why this is called naive Bayes
▪ Suppose we have multiple pieces of evidence E1, …, En and an outcome O

P(O | E1, …, En) = P(E1|O) × P(E2|O) × … × P(En|O) × P(O) / P(E1, E2, …, En)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 19


Getting to Naïve Bayes'

P(O | E1, …, En) = P(E1|O) × P(E2|O) × … × P(En|O) × P(O) / P(E1, E2, …, En)

Many people choose to remember this as:

P(outcome | evidence) = P(Likelihood of Evidence) × Prior prob. of outcome / P(Evidence)

▪ Notes:
• If P(evidence|outcome) is 1, then we are just multiplying by 1.
• If P(some particular evidence|outcome) is 0, then the whole probability becomes 0.
• Since we divide everything by P(Evidence), we can even get away without calculating it.
• The intuition behind multiplying by the prior is to give high probability to more common outcomes, and low probabilities to unlikely outcomes.
• These are also called base rates and they are a way to scale our predicted probabilities.

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 20


Naïve Bayes: Example

▪ Let's say that we have data on 1000 pieces of fruit: Banana, Orange or
some Other Fruit
▪ We know 3 characteristics about each fruit
▪ Whether it is Long
▪ Whether it is Sweet and
▪ If its color is Yellow
▪ Our training set
Type Long Not Long Sweet Not Sweet Yellow Not Yellow Total

Banana 400 100 350 150 450 50 500

Orange 0 300 150 150 300 0 300

Other Fruit 100 100 150 50 50 150 200

Total 500 500 650 350 800 200 1000

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 21


Naïve Bayes: Example cont.

Type Long Not Long Sweet Not Sweet Yellow Not Yellow Total

Banana 400 100 350 150 450 50 500

Orange 0 300 150 150 300 0 300

Other Fruit 100 100 150 50 50 150 200

Total 500 500 650 350 800 200 1000

▪ “Prior” probabilities
▪ P(Banana) = 500/1000 = 0.5, P(Orange) = 0.3, P(Other Fruit) = 0.2
▪ Probability of "Evidence"
▪ P(Long) = 500/1000 = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8
▪ Probability of "Likelihood"
▪ P(Long|Banana) = 0.8, P(Long|Orange) = 0
▪ P(Yellow|Other Fruit) = 50/200 = 0.25, P(Not Yellow|Other Fruit) = 0.75

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 22


Naïve Bayes: Example cont.

▪ “Prior” probabilities
▪ P(Banana) = 0.5 (500/1000), P(Orange) = 0.3, P(Other Fruit) = 0.2
▪ Probability of "Evidence"
▪ P(Long) = 500/1000 = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8
▪ Probability of "Likelihood"
▪ P(Long|Banana) = 0.8, P(Long|Orange) = 0
▪ P(Yellow|Other Fruit) = 50/200 = 0.25, P(Not Yellow|Other Fruit) = 0.75

Given an unknown fruit which is long, sweet and yellow, is it Banana, Orange or Other Fruit?

▪ P(Banana | Long, Sweet and Yellow)
= P(Long|Banana) ∗ P(Sweet|Banana) ∗ P(Yellow|Banana) ∗ P(Banana) / (P(Long) ∗ P(Sweet) ∗ P(Yellow))
= 0.8 ∗ 0.7 ∗ 0.9 ∗ 0.5 / P(evidence) = 0.252 / P(evidence)

▪ P(Orange | Long, Sweet and Yellow) = 0. Why? (Because P(Long|Orange) = 0.)

▪ P(Other Fruit | Long, Sweet and Yellow)
= P(Long|Other Fruit) ∗ P(Sweet|Other Fruit) ∗ P(Yellow|Other Fruit) ∗ P(Other Fruit) / (P(Long) ∗ P(Sweet) ∗ P(Yellow))
= (100/200 ∗ 150/200 ∗ 50/200 ∗ 200/1000) / P(evidence) = 0.01875 / P(evidence)

0.252 >> 0.01875 => the unknown fruit is most likely a banana

Is it clear now why we don't need to calculate P(evidence)?


WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 23
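The arithmetic above can be reproduced directly from the count table. A small Python sketch (it drops P(evidence), since that factor is the same for every class):

```python
# Naive Bayes scores for the fruit example, computed from the count table.
counts = {  # long, sweet, yellow and total counts per class, from the slide
    "Banana":      dict(long=400, sweet=350, yellow=450, total=500),
    "Orange":      dict(long=0,   sweet=150, yellow=300, total=300),
    "Other Fruit": dict(long=100, sweet=150, yellow=50,  total=200),
}
N = 1000  # total number of fruits

def score(fruit):
    c = counts[fruit]
    prior = c["total"] / N
    likelihood = (c["long"] / c["total"]) * (c["sweet"] / c["total"]) * (c["yellow"] / c["total"])
    return likelihood * prior   # proportional to P(class | long, sweet, yellow)

for fruit in counts:
    print(fruit, round(score(fruit), 5))  # Banana 0.252, Orange 0.0, Other Fruit 0.01875
```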
Naïve Bayes for Text Classification based on
Words as Features

▪ Multinomial Naïve Bayes Model: The probability of a document d belonging to a class c is proportional to the product of the probabilities of the terms t in d given the class c, and to the class prior P(c):

▪ The best class c for a document d is found by selecting the class for which the maximum a posteriori (MAP) probability is maximal:

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 24
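The two formulas referred to on this slide are embedded as images in the original deck; a standard way to write the multinomial model and the resulting decision rule (with t_1, …, t_{n_d} the tokens of document d) is:

```latex
P(c \mid d) \;\propto\; P(c) \prod_{1 \le k \le n_d} P(t_k \mid c),
\qquad
c_{\text{map}} \;=\; \operatorname*{arg\,max}_{c \in C} \; \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c)
```

In practice the product is computed as a sum of log probabilities to avoid floating-point underflow.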


Summary on Naïve Bayes

▪ Bayesian methods provide the basis for probabilistic learning methods


▪ Bayesian methods can be used to determine the most probable
hypothesis given the data
▪ No training, just probability calculation
▪ Binary, numeric and nominal features can be mixed
▪ Naïve Bayes fails if the independence assumption is violated too much
▪ Especially identical or highly overlapping features pose a problem that has to be
addressed with proper feature selection

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 25



WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 26


Today's lecture

Text Classification: Introduction

Algorithms
Naive Bayes
Hidden Markov Models

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 27


Limitations of Standard Classification

▪ The standard classification problem assumes


▪ individual cases are disconnected and independent
▪ Many NLP problems do not satisfy this assumption
▪ involve making many connected decisions, each resolving a different ambiguity,
but which are mutually dependent
▪ More sophisticated learning and inference techniques are needed

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 28


Sequence Labeling

▪ Many NLP problems can be viewed as sequence labeling


▪ Each token in a sequence is assigned a label
▪ Labels of tokens are dependent on the labels of other tokens in the
sequence, particularly their neighbors
▪ Examples:
▪ Part Of Speech Tagging
In: John saw the saw and decided to take it to the table.
Out: PN V Det N Con V Part V Pro Prep Det N

▪ Named entity recognition: people organizations places


Michael Dell is the CEO of Dell Computer Corporation and lives in Austin Texas.

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 29


Use case: Part of Speech (POS) Tagging

▪ Given a sentence X, predict its part of speech sequence Y

▪ A type of “structured” prediction


▪ How can we do this?

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 30


Use case: Part of Speech (POS) Tagging

Possible Answers
▪ Sequence labeling as classification:
▪ Pointwise prediction: predict each word individually with a classifier
▪ Generative sequence models: e.g. Hidden Markov Model (HMM)

▪ Later in the lecture: Neural Sequence Models (RNN / LSTM)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 31


Sequence Labeling as Classification

▪ Classify each token independently, but use information about the surrounding tokens (sliding window) as input features.
▪ The classifier is applied to one token at a time; sliding the window over the sentence produces one tag per token:

John saw the saw and decided to take it to the table.
PN V Det N Conj V Part V Pro Prep Det N

Slides from: Raymond J. Mooney

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 32–43


Sequence Labeling as Classification

Using Outputs as Inputs


▪ Better input features are usually the categories of the surrounding tokens,
but these are not available yet.
▪ We can use the categories of either the preceding or the succeeding tokens by going
forward or backward and using the previous outputs.

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 44
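A sketch of these two ideas in Python: sliding-window features for one token, plus the previously predicted tag as an extra feature for forward classification. The feature names and the `classify` callback are illustrative assumptions; any trained classifier could be plugged in.

```python
# Sliding-window features: the token itself, its neighbours, and (optionally)
# the tag predicted for the previous token.
def window_features(tokens, i, prev_tag=None):
    feats = {
        "word": tokens[i].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }
    if prev_tag is not None:
        feats["prev_tag"] = prev_tag
    return feats

def forward_tag(tokens, classify):
    """Forward classification: tag left to right, feeding each prediction back in."""
    tags, prev = [], None
    for i in range(len(tokens)):
        prev = classify(window_features(tokens, i, prev))
        tags.append(prev)
    return tags

sentence = "John saw the saw and decided to take it to the table .".split()
print(window_features(sentence, 3, prev_tag="Det"))
# {'word': 'saw', 'prev_word': 'the', 'next_word': 'and', 'prev_tag': 'Det'}
```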


Forward Classification

▪ Tag the sentence left to right; at each step the previously predicted tags are available as features.

John saw the saw and decided to take it to the table.

▪ The predictions accumulate token by token:
PN → PN V → PN V Det → PN V Det N → … → PN V Det N Conj V Part V Pro Prep Det N

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 45–56


Backward Classification

▪ Disambiguating "to" in this case would be even easier backward.
▪ Preposition? ("I am heading to the store")
▪ Particle? (in front of infinitive verbs, like "I want to eat. I am going to leave.")

John saw the saw and decided to take it to the table.

▪ Tagging right to left, the predictions accumulate as:
N → Det N → Prep Det N → Pro Prep Det N → V Pro Prep Det N → Part V Pro Prep Det N → … → PN V Det V Conj V Part V Pro Prep Det N

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 57–68


Sequence Labeling as Classification

Problems
▪ Not easy to integrate information from category of tokens on both sides
▪ Difficult to propagate uncertainty between decisions and “collectively”
determine the most likely joint assignment of categories to all of the
tokens in a sequence.

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 69


Probabilistic Sequence Models

▪ Probabilistic sequence models allow


▪ integrating uncertainty over multiple, interdependent classifications
▪ and collectively determine the most likely global assignment.

▪ Generative sequence models: e.g. Hidden Markov Model (HMM)

▪ Later in the lecture: Neural Sequence Models (RNN / LSTM)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 70


Intuition of HMM decoding: PoS Tagging

▪ Choose the tag sequence that is most probable given the observation sequence of n words

▪ Bayes' Rule

▪ Drop the denominator

▪ Assumption 1: the probability of a word appearing depends only on its own tag

▪ Assumption 2: the probability of a tag is dependent only on the previous tag

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 71
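The formulas referred to in the bullets above appear as images in the original slides; written out (with w_{1:n} the word sequence and t_{1:n} a candidate tag sequence), the derivation is:

```latex
\hat{t}_{1:n}
= \operatorname*{arg\,max}_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
= \operatorname*{arg\,max}_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
\approx \operatorname*{arg\,max}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The last step uses exactly the two assumptions listed above: emissions depend only on the current tag, and each tag depends only on its predecessor.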


Hidden Markov Model

A Hidden Markov Model is a statistical model of hidden, stochastic state transitions


with observable, stochastic output. Key features:

▪ A fixed set of states


▪ At each "time", the hidden markov model is in exactly one of these states
▪ State transition probabilities
▪ The starting state can be fixed or probabilistic
▪ A fixed set of possible outputs
▪ For each state: a distribution of probabilities for every possible output
▪ Also called emission probabilities

Task:

For an observed output sequence, what is the (hidden) state sequence that has the
highest probability of producing this output?
WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 72
Hidden Markov Model - Example

Every day, Darth Vader is in one of three moods: Good, Neutral or Bad
But, because he wears his mask, we cannot observe it!

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 73


Hidden Markov Model - Example

Every day, Darth Vader is in one of three moods: Good, Neutral or Bad
But, because he wears his mask, we cannot observe it!

(image from pixabay.com, CC0 Creative Commons licence)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 74


Hidden Markov Model - Example

Somehow, you know the odds of how his mood changes from day to day:

(Mood transition diagram; transition probabilities between the three moods: 0.8, 0.4, 0.4, 0.5, 0.5, 0.2, 0.2.)
WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 75


Hidden Markov Model - Example

You also know the chances for his mood on day 1:


(Same diagram, extended with a start state S and day-1 probabilities 0.3, 0.3, 0.4.)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 76


Hidden Markov Model - Example

What we CAN observe is if Darth Vader destroys a planet or not, which depends on
his mood!
(Same diagram; below each mood, the probabilities of destroying vs. not destroying a planet: 0.5 0.5, 0.3 0.7, 0.1 0.9.)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 77


Hidden Markov Model - Example

We observe that he does not destroy a planet on the first day, but destroys a planet
on each of the second and third days:

Question:

What is the most probable sequence of his mood on these three days?

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 78


Hidden Markov Model - Example

What is the probability for this mood


sequence and this observation:

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 79


Hidden Markov Model - Example

What is the probability for this mood


sequence and this observation:

First day:

Transition probability (from the start state S): 0.3


Emission probability: 0.1
Joint probability: 0.3 * 0.1 = 0.03

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 80


Hidden Markov Model - Example

What is the probability for this mood


sequence and this observation:

Second day:

Transition probability: 0.5


Emission probability: 0.7
Joint probability: 0.5 * 0.7 = 0.35

Probability for sequence: 0.03 (day one) * 0.35 (day two) = 0.0105

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 81


Hidden Markov Model - Example

What is the probability for this mood


sequence and this observation:

Third day:

Transition probability: 0.4


Emission probability: 0.9
Joint probability: 0.4 * 0.9 = 0.36

Probability for sequence: 0.03 * 0.35 * 0.36 = 0.00378

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 82
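The three-day computation can be checked with a few lines of Python. The per-day transition and emission probabilities are taken from the slides; the concrete mood labels behind them are shown only in the slide figures, so the sketch stays generic.

```python
# Joint probability of one hidden state sequence and one observation sequence:
# per day, multiply the transition probability into that day's mood with the
# emission probability of the observation given that mood.
def sequence_probability(days):
    p = 1.0
    for transition_p, emission_p in days:
        p *= transition_p * emission_p
    return p

# Day 1: 0.3 * 0.1, day 2: 0.5 * 0.7, day 3: 0.4 * 0.9 (values from the slides)
print(sequence_probability([(0.3, 0.1), (0.5, 0.7), (0.4, 0.9)]))  # ≈ 0.00378
```

The Viterbi algorithm introduced later does the same kind of computation, but searches for the best state sequence rather than scoring one fixed sequence.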


Hidden Markov Model – Application to NLP

In our POS tagging example, we know the sequence of words, and we want to
know the sequence of POS tags!
▪ (hidden) States: POS tags (t1, t2, t3, t4 in the diagram)
▪ (observable) Outputs: tokens (w1, w2, w3, w4 in the diagram)

▪ We also need
▪ Transition probabilities between states
▪ Also: Initial probabilities = probabilities for the first token of a sentence
▪ Emission probabilities for every state

What would our graph from the example look like?

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 83


Hidden Markov Model – Application to NLP

(Diagram analogous to the mood example: a start state S and the same transition probabilities, but with hidden states N, V, DT; below each state, emission probabilities over the tokens "Tom", "saw", "the".)

How do we get to the transition and emission probabilities?

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 84


Hidden Markov Model – Application to NLP:
Estimating the Probabilities

▪ The probabilities are estimated just by counting on a tagged training


corpus
▪ Transition probability: how often the first tag is followed by the second divided
by the number of the times the first tag was seen in a labeled corpus

▪ The emission probability: the number of times the word was associated with the tag
in the labeled corpus, divided by the number of times that tag was seen in the
labeled corpus

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 85
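In symbols, this is the usual maximum-likelihood estimate from counts C(·) in the tagged corpus (a standard formulation of what the two bullets above describe):

```latex
P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}
\qquad\qquad
P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}
```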


Example

Input sentence: Janet will back the bill


Correct PoS tags: Janet/NNP will/MD back/VB the/DT bill/NN

Transition probabilities based on WSJ corpus


Rows are labeled with the conditioning event; thus P(VB|MD) is 0.7968.

Given the observation (output) likelihoods

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 86


Hidden Markov Model – Complexity

Question: What is the most likely state sequence given an output sequence?

▪ Naïve solution:
▪ brute force search by enumerating all possible sequences of states
▪ Complexity O(s^m)
▪ where m is the length of the input and s is the number of states in the model.
▪ Better solution: Dynamic Programming!
▪ Standard procedure is called the Viterbi algorithm
▪ Running time is O(m · s²),
▪ where m is the length of the input and s is the number of states in the model.

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 87


Viterbi algorithm: Motivation

▪ Let o_1, o_2, …, o_T be some input sentence of length T

▪ Let s_1, s_2, …, s_T be a sequence of tags
▪ Goal: best sequence of tags given the input sequence
▪ argmax_{s_1, s_2, …, s_T} P(o_1, …, o_T, s_1, s_2, …, s_T)
▪ Example
▪ T = The, man, saw, the, saw
▪ S = {Det, N, V}
▪ Possible tag sequences:
Det Det Det Det Det -> 0.000003
Det Det Det Det N -> 0.000008
Det Det Det Det V -> 0.000009
…
Det N V Det N -> 0.012

we have 3^5 = 243 sequences: sentence length = 5, #tags = 3

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 88


Viterbi algorithm: Basic idea

Let us say we have only two possible states, A and B, and some observation o
What is the best possible state sequence of length 5 for this observation o?
It is either:

▪ the best possible sequence of length 4 that ends with A, followed by A


▪ the best possible sequence of length 4 that ends with A, followed by B
▪ the best possible sequence of length 4 that ends with B, followed by A
▪ the best possible sequence of length 4 that ends with B, followed by B

This is only true because the next state only depends on the state directly before!

So, what is the best possible sequence of length 4 that ends with A? It is either:
▪ the best possible sequence of length 3 that ends with A, followed by A
▪ the best possible sequence of length 3 that ends with B, followed by A
▪ …
WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 89
Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man saw the saw $$

DT

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 90


Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man saw the saw $$

DT 0.6 * 0.5 = 0.3

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 91


Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man saw the saw $$

DT 0.3

V 0

N 0

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 92


Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man

DT 0.3

V 0 ? = max {
P (man | V) * P (V | DT) * 0.3,
P (man | V) * P (V | V) * 0,
P (man | V) * P (V | N) * 0
}

N 0

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 93
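The cell-by-cell maximisation on this and the following slides is the Viterbi recursion. With v_t(s) denoting the probability of the best path that ends in state s at position t, it can be written as:

```latex
v_1(s) = P(s \mid q_0)\, P(o_1 \mid s),
\qquad
v_t(s) = \max_{s'} \; v_{t-1}(s')\, P(s \mid s')\, P(o_t \mid s)
```

together with a backpointer to the maximising predecessor s', which is later used to read off the best sequence.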




Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man

DT 0.3

V 0 ? = max {
0.1 * 0.3 * 0.3,
0.1 * 0.2 * 0,
0.1 * 0.6 * 0
} = 0.009

N 0

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 97


Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man

DT 0.3

V 0 → 0.009 (from DT) = max {
0.1 * 0.3 * 0.3,
0.1 * 0.2 * 0,
0.1 * 0.6 * 0
} = 0.009

N 0

DT = predecessor of the most probable path

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 98


Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man

DT: 0.3, 0
V: 0, 0.009 (from DT)
N: 0, 0.054 (from DT)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 99


Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man saw

DT: 0.3, 0
V: 0, 0.009 (from DT), ? = max {
P (saw | V) * P (V | DT) * 0,
P (saw | V) * P (V | V) * 0.009,
P (saw | V) * P (V | N) * 0.054
}
N: 0, 0.054 (from DT)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 100
Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man saw

DT: 0.3, 0
V: 0, 0.009 (from DT), 6.48 × 10^-3 (from N) = max {
0.2 * 0.3 * 0, ( = 0)
0.2 * 0.2 * 0.009, ( = 0.00036)
0.2 * 0.6 * 0.054 ( = 0.00648)
} = 0.00648 = 6.48 × 10^-3
N: 0, 0.054 (from DT)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 101
Viterbi algorithm: Example

Emission probabilities Transition probabilities


P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1

The man saw the saw $$

DT: 0.3, 0, 0, 1.62 × 10^-3 (from V), 0
V: 0, 0.009 (from DT), 6.48 × 10^-3 (from N), 0, 9.72 × 10^-5 (from DT)
N: 0, 0.054 (from DT), 2.16 × 10^-3 (from N), 0, 1.94 × 10^-4 (from DT)

Best complete path: probability 1.94 × 10^-4, ending in state N.

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 102
Viterbi algorithm: Example

Backtracking: starting from the most probable final state and following the stored
predecessors right to left recovers the best tag sequence:

The man saw the saw
DT N V DT N

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 103–106
Viterbi algorithm: Example

Remember the complexity of the Viterbi algorithm?


Running time is O(m · s²),
▪ where m is the length of the input and s is the number of states in the model.

Now we see why: for every token (m) we have to evaluate every POS tag (s) in combination with every
possible predecessor POS tag (s), so we have m * s * s operations = m · s².

The man saw the saw
DT N V DT N
WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 107
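As a compact reference, here is a Viterbi decoder in Python using the emission and transition probabilities of this example; it reproduces the trellis values above and recovers the same best sequence DT N V DT N. The dictionary-based representation is an implementation choice, not something prescribed by the slides.

```python
# Viterbi decoding for the running example; q0 is the start state,
# missing probability entries count as 0.
states = ["DT", "V", "N"]
emission = {("the", "DT"): 0.5, ("man", "V"): 0.1, ("man", "N"): 0.3,
            ("saw", "V"): 0.2, ("saw", "N"): 0.2}
transition = {("q0", "DT"): 0.6, ("q0", "V"): 0.3, ("q0", "N"): 0.1,
              ("DT", "DT"): 0.1, ("DT", "V"): 0.3, ("DT", "N"): 0.6,
              ("V", "DT"): 0.5, ("V", "V"): 0.2, ("V", "N"): 0.3,
              ("N", "DT"): 0.2, ("N", "V"): 0.6, ("N", "N"): 0.2}

def viterbi(words):
    # trellis[i][s] = (probability of best path ending in state s at word i, backpointer)
    trellis = [{s: (transition.get(("q0", s), 0.0) * emission.get((words[0], s), 0.0), None)
                for s in states}]
    for i in range(1, len(words)):
        column = {}
        for s in states:
            best_prev, best_p = None, 0.0
            for prev in states:
                p = (trellis[i - 1][prev][0] * transition.get((prev, s), 0.0)
                     * emission.get((words[i], s), 0.0))
                if p > best_p:
                    best_prev, best_p = prev, p
            column[s] = (best_p, best_prev)
        trellis.append(column)
    # Backtrack from the most probable final state.
    tag = max(states, key=lambda s: trellis[-1][s][0])
    tags = [tag]
    for i in range(len(words) - 1, 0, -1):
        tags.append(trellis[i][tags[-1]][1])
    return list(reversed(tags))

print(viterbi("the man saw the saw".split()))  # ['DT', 'N', 'V', 'DT', 'N']
```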
Summary

▪ Sequence Labeling:
▪ Input and output are signal sequences
▪ No individual classification per signal, but joint classification that minimizes
some cost
▪ Hidden Markov Models
▪ Emissions can be observed
▪ States are hidden
▪ Goal: Find most probable state sequence for a given emission sequence
▪ Solve via Viterbi (dynamic programming)

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 108
Next Lecture

Information Retrieval
Introduction

WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 109
