Nlp4web Lecture 1 Intro


NLP and the Web – WS 2024/2025

Lecture 1
Introduction

Dr. Thomas Arnold


Hovhannes Tamoyan
Kexin Wang

Ubiquitous Knowledge Processing Lab


Technische Universität Darmstadt
Introduction: Teaching Staff

Dr. Thomas Arnold: Lectures
Hovhannes Tamoyan: Practice Class
Kexin Wang: Practice Class

WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 2


Outline

UKP Lab: profile and projects

Administrative course issues

NLP 4 Web Introduction

NLP Basics / Linguistic Analysis



Who Are We?

▪ 1 Professor, ~5 Postdocs, ~35 Doctoral Researchers


▪ We mainly work in natural language processing (NLP)
▪ Research areas (growing every day!)

▪ Deep Learning for NLP
▪ Knowledge Graphs
▪ Argument Mining
▪ Interactive AI and NLP
▪ Content Analytics for the Social Good
▪ Writing Assistance and Language Learning



Teaching Concept – UKP (Lectures)

Winter Term / Summer Term

Introductory: Information Management
Application Oriented: NLP and the Web, Ethics in NLP
Advanced: Deep Learning for NLP



Teaching Concept – UKP (Seminars & Projects)

Data Analysis Software Project for Natural Language
Software Project (irregular schedule)
Winter 2023/24: Various Projects
Winter 2024/25: Various Projects

Regular Seminar: Text Analytics / Large Language Models
Winter 2023/24: Generative AI
Summer 2024: LLMs for Mental Health
Winter 2024/25: Understanding LLMs



Complementary Lectures and Seminars

▪ Machine Learning
▪ Einführung in die künstliche Intelligenz (Kersting)
▪ Data Mining und maschinelles Lernen (Kersting)
▪ Deep Learning (Kersting)

▪ Computer Vision
▪ Computer Vision 1 and 2 (Roth)

▪ Natural Language Processing


▪ Deep Learning for NLP
▪ Ethics in NLP



Teaching Concept – UKP (PhD)

▪ Get involved early (HiWi, B.Sc. thesis, M.Sc. thesis)



More information

• Website:
www.ukp.tu-darmstadt.de

• GitHub:
www.github.com/UKPLab

• Social Media:

@UKPLab



Course Goals

▪ Learn the basic principles underlying NLP systems

▪ Two big NLP topics:


▪ Information Retrieval (IR)
▪ Large Language Model (LLM) Applications

▪ Gain insight into open research problems in natural language processing



Why Care?

Information Overload

Business Intelligence

Need for Robust, Intelligent Systems



Textbook

Constantly updated:

▪ Speech and Language Processing. An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition. Daniel
Jurafsky and James H. Martin. 3rd edition, 2023 (draft).
▪ https://web.stanford.edu/~jurafsky/slp3/



General Information

▪ All lectures and practice classes will be in person


Lectures: Tuesdays 13:30 – 15:10, S306 / 051
Practice Class: Thursdays 16:15 – 17:55, S103 / 221

▪ All slides, handouts, readings etc. can be found on the Moodle e-Learning platform

▪ We also use Moodle as a central point for announcements and questions

▪ Please use the Moodle forum!



General Information – Practice Class

▪ In the practice classes, you will work on programming exercises


▪ Programming language is Python
▪ First practice session will include a brief introduction to Python
▪ This will give you some practical experience in NLP
▪ Practice class topics are relevant for the exam! (including Python)

▪ In addition, there are homework assignments for an exam bonus:


▪ Assignments will be bi-weekly – 6 exercises in total
▪ Each assignment is worth a maximum of 20 points
▪ If you get >= 75% of the points (>= 90 points), you get a bonus
▪ You can improve your grade by 0.3/0.4, but only if you pass the exam without the bonus



General Information – Practice Class

▪ First class: October 24th (no practice class this week)

▪ Details will be announced in Moodle


▪ If you need additional help regarding the practice class, use the Moodle forum

The assignments will require a significant amount of time, so start earlier than the day before submission.



Final exam

Tuesday, 25.02.2025, 15:00


More info will be announced in Moodle
▪ Allowed: Non-programmable calculator, no other material
▪ Content: lecture, readings, practice class



Syllabus (tentative)

Nr. Lecture
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation, Data Collection
05 IR – Re-Ranking Methods
06 IR – Language Domain Shifts, Dense / Sparse Retrieval
07 LLM – Language Modeling Foundations
08 LLM – Neural LLM, Tokenization
09 LLM – Transformers, Self-Attention
10 LLM – Adaptation, LoRA, Prompting
11 LLM – Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam



Warm up

Now it is your turn:

Which degree programme are you studying?

▪ Computer Science?
▪ Bachelor?
▪ Master?
▪ Other disciplines?



Warm up

Now it is your turn:

Which other UKP courses did you already attend?

▪ FoLT
▪ Ethics in Natural Language Processing
▪ Deep Learning for NLP
▪ Data Analysis Software Project
▪ Text Analytics / LLM Seminar



NLP in the Web – Search Engines

NLP in the Web – Spelling Correction

NLP in the Web – Question Answering

NLP in the Web – Machine Translation

NLP in the Web – Speech Recognition

NLP in the Web – Plagiarism Detection
http://de.guttenplag.wikia.com/

NLP in the Web – Summarization

NLP in the Web – Diachronic Analysis

NLP in the Web – Text Generators


Natural Language Processing and the Web

▪ The web is an application area for NLP, e.g.:


▪ Information retrieval:
• Search engines
• Question answering
• News aggregation
• Recommender Systems
• Chatbots…
▪ Web is a resource to improve the quality of NLP, e.g.:
▪ Web as a corpus
▪ Analyzing web-based knowledge repositories
• Wikipedia
• Wiktionary
▪ Recognizing synonyms, paraphrases and the like



Challenges for NLP

• How to remove noise, e.g. duplicates?

• How to assess the quality of content?

• How to integrate heterogeneous and scattered content?

• How to deal with errors, e.g. spelling or grammar errors?

• How to „clean“ the data?



Data Cleansing is Necessary

▪ User-generated content contains errors, smileys, abbreviations, etc.

Hi Micheal, have u seen my posting,last week u said that u will look in to my problem thsi week.can i ask u now?

Data import Data cleansing
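
As a rough illustration of what a cleansing step might do with the message above, the following sketch restores missing spaces after punctuation and expands two of the informal spellings. The replacement table `REPLACEMENTS` and the function `clean_message` are made up for this example and are not part of the course material; a real system would need much larger resources and a proper spelling corrector.

```python
import re

# Hypothetical, hand-made replacement table for this one example.
REPLACEMENTS = {"u": "you", "thsi": "this"}

def clean_message(text: str) -> str:
    # Collapse line breaks and repeated whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Insert a missing space after ',' or '.' when a letter follows directly.
    text = re.sub(r"([,.])(?=[A-Za-z])", r"\1 ", text)
    # Replace whole words found in the table; leave everything else untouched.
    return re.sub(
        r"[A-Za-z]+",
        lambda m: REPLACEMENTS.get(m.group(0), m.group(0)),
        text,
    )

raw = ("Hi Micheal, have u seen my posting,last week u said that u "
       "will look in to my problem thsi week.can i ask u now?")
print(clean_message(raw))
```

Note that the punctuation rule is deliberately naive: it would also split a web address such as www.apple.com, which is exactly the kind of ambiguity a tokenizer has to deal with.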



Analysis Levels in Language Understanding

Phonetics and Phonology

Segmentation

Morphology

Syntax

Semantics

Pragmatics and Discourse


Phonetics and Phonology

(c) David Groome, 2006

Homophones: night and knight are both pronounced /naɪt/

Segmentation

(c) David Groome, 2006



Tokenization

▪ Segmenting an input stream into an ordered sequence of units is called tokenization.
▪ A token can correspond to an inflected word form or sub-word units,
and may be subject to a subsequent morphological analysis.
▪ Tokens include punctuation!

▪ A system which splits texts into tokens is called a tokenizer

A very simple example:


▪ Input text:
John likes Mary and Mary likes John.
▪ Tokens:
{"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}
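
The simple example above can be reproduced with a few lines of Python. This is a minimal sketch using a regular expression; the function name `simple_tokenize` is chosen for illustration only.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("John likes Mary and Mary likes John.")
print(tokens)
# ['John', 'likes', 'Mary', 'and', 'Mary', 'likes', 'John', '.']
```

The sentence-final period comes out as its own token, matching the point that tokens include punctuation.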



Tokenization

English Example
▪ Mr. Sherwood said, reaction to Sea Containers‘ proposal has been „very
positive.“ In New York Stock Exchange composite trading yesterday, Sea
Containers closed at $62.625, up 62.5 cents.

Where could problems arise for a tokenizer?

▪ Split at whitespace characters? Punctuation then stays attached to the tokens:
cents. said, positive.” $62.625,
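
A short sketch of why whitespace splitting is not enough: it splits the example sentence with `str.split()` and lists every token that still starts or ends with a non-alphanumeric character. The quotation marks are written as plain ASCII quotes here to keep the string simple.

```python
text = ("Mr. Sherwood said, reaction to Sea Containers' proposal has been "
        "\"very positive.\" In New York Stock Exchange composite trading "
        "yesterday, Sea Containers closed at $62.625, up 62.5 cents.")

whitespace_tokens = text.split()

# Tokens that still carry leading or trailing punctuation.
suspicious = [
    t for t in whitespace_tokens
    if not t[0].isalnum() or not t[-1].isalnum()
]
print(suspicious)
```

The list contains items such as `said,` and `cents.`, but also `Mr.`, where the period belongs to the token. A purely punctuation-stripping fix would therefore be wrong as well.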



Tokenization Ambiguities

Period
▪ In most cases: sentence-final punctuation mark
▪ Part of an abbreviation, e.g. F.D.P.
▪ Numbers: ordinal numbers, e.g. 21., and numbers with fractions, e.g. 1.543
▪ Part of resource locators, e.g. www.apple.com
▪ To complicate things, if a sentence ends with an abbreviation which
ends with a period, only one period is written. “I go to Apple, Inc.”
▪…

Whitespace character
▪ Part of numbers, e.g. “1 543”
▪ No segmentation character in multi-word expressions
▪ “New York”
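
Some of the period cases listed above can be handled by ordering the alternatives of a regular expression so that the more specific patterns are tried first. The pattern below is only a sketch under simplifying assumptions: the abbreviation list (`Mr`, `Mrs`, `Dr`, `Inc`) is hand-picked for this example, and multi-word expressions such as “New York” are not handled at all.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}            # letter-period sequences such as F.D.P.
  | (?:Mr|Mrs|Dr|Inc)\.           # a tiny, hand-picked abbreviation list
  | www\.[\w-]+(?:\.[\w-]+)+      # simple web addresses
  | \$?\d+(?:[.,]\d+)*            # numbers, optionally with a dollar sign
  | \w+(?:['’]\w+)?               # words, with an optional internal apostrophe
  | [^\w\s]                       # any other single non-space character
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    # All groups are non-capturing, so findall returns the full matches.
    return TOKEN_PATTERN.findall(text)

print(tokenize("Mr. Sherwood closed at $62.625, up 62.5 cents."))
print(tokenize("See www.apple.com or the F.D.P."))
```

In the second call the final period is absorbed into `F.D.P.`, so the sentence boundary is lost. This mirrors the case described above where a sentence ends with an abbreviation and only one period is written.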
Ambiguities

Comma
▪ Part of numbers, e.g. 1,543

Single quote
▪ Within tokens to mark contractions and elisions, e.g. English: don’t,
won’t, you’ve, James’ new hat; German: Ich hab’s!
▪ Part of a token in French, e.g. aujourd’hui
▪ But in most cases: Enclosing quoted groups of words

Dash
▪ A delimiter, if it connects strings of digits, e.g. “see pages 100-101”
▪ In French: Signal a close connection between two tokens, e.g. verb and
personal pronoun: donne-le
▪ In most cases, however, it is part of the token, e.g. multi-word
Tokenization in Other Languages

Chinese
爱国人
▪ No spaces
▪ Two possible segmentations, both of them are syntactically and
semantically correct
▪ Disambiguation can only be done with contextual information

爱国 / 人
country-loving person

爱 / 国人
love country-person
Bird et al., NLP with Python, p.113
German Compounds

German
STAUBECKEN
▪ No spaces within noun compounds
▪ Two possible segmentations, both of them are syntactically and
semantically correct
▪ Disambiguation can only be done with contextual information

STAU BECKEN
water reservoir

STAUB ECKEN
dusty corners
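
Both the Chinese and the German example can be reproduced with a small dictionary-based segmenter that enumerates every way of splitting a string into known words. This is a sketch only: the two lexicons contain just the units from the slides, and the function name `segmentations` is an illustrative choice.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words from `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Combine this prefix with every segmentation of the remainder.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results

chinese_lexicon = {"爱国", "人", "爱", "国人"}
german_lexicon = {"STAU", "BECKEN", "STAUB", "ECKEN"}

print(segmentations("爱国人", chinese_lexicon))
print(segmentations("STAUBECKEN", german_lexicon))
```

Each call returns two segmentations. The lexicon alone cannot decide between them, which is why contextual information is needed for disambiguation.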

Morphology

• Morphology is the branch of linguistics that studies word forms and word
formation
• Words are composed of morphemes
• Morphemes are the smallest meaning-bearing units

(c) David Groome, 2006



Morphology

Words can be further decomposed into smaller units:

“pneumonoultramicroscopicsilicovolcanoconiosis”

lung disease caused by the inhalation of very fine silica dust found in volcanoes



Bases and Affixes

• Remember: Morphemes are the smallest meaning-bearing units


• Examples:
▪ cats → cat (noun) + s (plural)
▪ unknowingly → un + know + ing + ly
▪ bedenken → be + denk + en
▪ Both cat and cats can be uttered in isolation but s cannot:
-s is a bound morpheme

▪ Minimal free morphemes = stems


▪ cat is a free morpheme
▪ Stems carry the main meaning of the word
▪ Affixes are bound morphemes



Types of Affixes

Suffixes: appear after the base


▪ cat + s, nice + ly

Prefixes: appear before the base


▪ un + true

Infixes: appear inside the base


▪ fan + bloody + tastic

Circumfixes: appear on both sides of the base


▪ ge + sag + t
Morphological Normalization

▪ Morphological normalization consists in identifying a single canonical
representative for morphologically related word forms

Methods
▪ Stemming
▪ Lemmatization
Stemming

Stemming is an algorithmic approach to strip off the endings of words


sitting → sitt
anarchism, anarchy, anarchistic → anarchi

Objective: group words belonging to the same morphological family by
transforming them into the same stemmed representation

▪ stemming does not distinguish between inflection and derivation


▪ the stems obtained do not necessarily correspond to a real word form

Well-known stemming algorithms for English have been developed by Lovins and Porter



Algorithmic Stemming Method

Stemming is rule-based. Example rules from Porter:

*ATIONAL -> *ATE (relational -> relate)

*[> 0 vowels] + ING -> * (monitoring -> monitor)

*SSES -> *SS (grasses -> grass)

Rule-based stemming methods are hard to create, often yield arbitrary
distinctions, but can be executed very quickly at runtime.
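
The three example rules above can be written down directly as suffix checks. This is a simplified sketch, not the Porter stemmer itself: the real algorithm applies its rules in several ordered steps with additional conditions, and the vowel test here only looks for the letters a, e, i, o, u. The function name `apply_example_rules` is chosen for illustration.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def apply_example_rules(word: str) -> str:
    if word.endswith("sses"):
        return word[:-2]              # *SSES -> *SS
    if word.endswith("ational"):
        return word[:-7] + "ate"      # *ATIONAL -> *ATE
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]              # *[> 0 vowels] + ING -> *
    return word

for w in ["relational", "monitoring", "grasses", "sing"]:
    print(w, "->", apply_example_rules(w))
```

The word "sing" is left unchanged because the part before "ing" contains no vowel, which is the purpose of the vowel condition in the second rule.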



Porter's Stemmer

Original Word    Stemmed Word
vision           vision
visible          visibl
visibility       visibl
visionary        visionari
visioner         vision
visual           visual



Stemming Errors

Under-stemming: remove too little


▪ adhere → adher
▪ adhesion → adhes

Over-stemming: remove too much


▪ appendicitis → append
▪ append → append



Problem with Stemming: Syntactic Ambiguity

Homographs: words which have the same spelling but different meanings

I saw the saw

▪ First “saw”: past form of the verb SEE
▪ Second “saw”: singular form of the noun SAW

Such cases cannot be handled properly by stemming alone; the word's
grammatical category also has to be identified.



Lemmatization

▪ “Undo” the inflectional changes to recover the base form
▪ Usually needs lexical resources and part-of-speech tagging
▪ cats (NOUN) → cat
▪ left (VERB) → leave
▪ left (ADJ) → left

▪ Has to deal with irregularities
▪ sing, sang, sung → sing
▪ indices → index
▪ Bäume → Baum
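
The role of the lexical resource and the part-of-speech tag can be sketched with a small lookup table keyed by word form and tag. The table `LEMMA_LEXICON` and the fallback plural rule are assumptions made for this illustration; real lemmatizers rely on far larger lexicons and morphological rules.

```python
# Hypothetical mini-lexicon: (word form, part of speech) -> lemma.
LEMMA_LEXICON = {
    ("left", "VERB"): "leave",
    ("left", "ADJ"): "left",
    ("sang", "VERB"): "sing",
    ("sung", "VERB"): "sing",
    ("indices", "NOUN"): "index",
    ("Bäume", "NOUN"): "Baum",
}

def lemmatize(word: str, pos: str) -> str:
    # Irregular and ambiguous forms are looked up first.
    if (word, pos) in LEMMA_LEXICON:
        return LEMMA_LEXICON[(word, pos)]
    # Deliberately naive fallback for regular English plurals.
    if pos == "NOUN" and word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

print(lemmatize("left", "VERB"), lemmatize("left", "ADJ"), lemmatize("cats", "NOUN"))
```

The same surface form "left" yields two different lemmas, which is only possible because the part-of-speech tag is part of the lookup key.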



Stemming vs. Lemmatization

Original        Stemmed    Lemmatized
visibilities    visibl     visibility
adhere          adher      adhere
adhesion        adhes      adhesion
appendicitis    append     appendicitis
oxen            oxen       ox
indices         indic      index
swum            swum       swim

Syntax

▪ Syntax refers to the way words are arranged together

▪ "Syntax is the study of the regularities and constraints of word order
and phrase structure"
(Manning & Schütze, 2003, p. 93)

▪ There is an infinite number of ways in which words can be arranged
together to form sentences

▪ Yet, we can understand sentences we have never heard or read before



POS Tagging

▪ The process of assigning a part of speech or lexical class marker to each word in a corpus
▪ The input to a tagging algorithm is a sequence of words and a tagset, and
the output is a sequence of tags, a single best tag for each word

Determiner Noun Verb Pronoun Adjective

(c) David Groome, 2006



Parts of Speech

▪ In English we traditionally have 8 parts of speech

▪ N     Noun          chair, bandwidth, pacing
▪ V     Verb          study, debate, munch
▪ ADJ   Adjective     purple, tall, ridiculous
▪ ADV   Adverb        unfortunately, slowly
▪ P     Preposition   of, by, to
▪ PRO   Pronoun       I, me, mine
▪ DET   Determiner    the, a, that, those
▪ INTJ  Interjection  oh!, m-hm, huh?



Penn Treebank Tagset

1. CC Coord. conjunc.
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Prep./subord. conj.
7. JJ Adject.
8. JJR Adject., comp.
9. JJS Adject., superl.
10. LS List item marker
11. MD Modal
12. NN Noun, sing. or mass
13. NNS Noun, plural
14. NNP Proper noun, sing.
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Poss. pronoun
20. RB Adverb
21. RBR Adverb, comp.
22. RBS Adverb, superl.
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB V, base form
28. VBD V, past tense
29. VBG V, gerund/pres. part.
30. VBN V, past part.
31. VBP V, non-3rd ps. sing. pres.
32. VBZ V, 3rd ps. sing. pres.
33. WDT wh-det.
34. WP wh-pronoun
35. WP$ Poss. wh-pronoun
36. WRB wh-adverb
37. # Pound sign
38. $ Dollar sign
39. . Sent.-final punct.
40. , Comma
41. : Colon, semi-colon
42. ( L. bracket char.
43. ) R. bracket char.
44. " Straight dbl. quote
45. ‘ L. open sngl. quote
46. “ L. open dbl. quote
47. ’ R. close sngl. quote
48. ” R. close dbl. quote

Tagset sizes by language (Hajič, 2000):

Language     Tagset Size
English      139
Czech        970
Estonian     476
Hungarian    401
Romanian     486
Slovene      1033
An Example

WORD LEMMA TAG

the the +DET


host host +NOUN
kissed kiss +VPAST
the the +DET
friend friend +NOUN
on on +PREP
the the +DET
cheek cheek +NOUN



Ambiguities

▪ POS Tagging is a disambiguation task


▪ Words are ambiguous—have more than one possible part-of-speech
▪ The word “book”:
▪ book that flight: verb
▪ hand me that book: noun
▪ The word “that”:
▪ Does that flight serve dinner? : determiner
▪ I thought that your flight was earlier: complementizer

▪ POS Tagging: resolves ambiguities, choosing the proper tag for the context
▪ Baseline: Most Frequent Class (accuracy 92.34% [Jurafsky & Martin])
▪ Outdated: Rule-based tagging, probabilistic tagging
▪ State of the art: Neural approaches, accuracy ~ 98%
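
The most-frequent-class baseline mentioned above can be sketched in a few lines: for every word, remember the tag it received most often in the training data and always predict that tag. The toy training sentences, the tag names and the default tag `NOUN` for unknown words are assumptions made for this example and say nothing about the accuracy figures quoted above.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # Keep only the single most frequent tag per word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NOUN"):
    # Unknown words receive the default tag.
    return [(w, model.get(w.lower(), default)) for w in words]

# Made-up toy training data.
train = [
    [("book", "VERB"), ("that", "DET"), ("flight", "NOUN")],
    [("hand", "VERB"), ("me", "PRON"), ("that", "DET"), ("book", "NOUN")],
    [("I", "PRON"), ("read", "VERB"), ("that", "DET"), ("book", "NOUN")],
]
model = train_most_frequent_tag(train)
print(tag(["book", "that", "flight"], model))
```

Because "book" occurs more often as a noun in the toy data, the baseline tags it as a noun even in "book that flight". Context-free lookup cannot resolve this kind of ambiguity, which is what context-aware taggers improve on.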



Parsing

▪ The process of determining the grammatical structure with respect to a given grammar.

(c) David Groome, 2006



Alternative representations

▪ Bracketed notation:
[S [NP [Det the] [N dog] ] [VP [V ate] [NP [Det a] [N cookie] ] ] ]
▪ Parenthesized notation:
(S
  (NP
    (Det the)
    (N dog))
  (VP
    (V ate)
    (NP
      (Det a)
      (N cookie))))
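
Both notations describe the same tree, which can be checked by reading the parenthesized form into a nested list and printing it back with square brackets. This is a minimal sketch; the function names `parse_tree` and `to_brackets` are chosen for illustration, and the input is assumed to be well-formed.

```python
import re

def parse_tree(text):
    """Turn '(S (NP ...) ...)' into nested lists like ['S', ['NP', ...], ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token            # a leaf, i.e. a word
        node = [tokens[pos]]        # the label follows the opening parenthesis
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                    # skip the closing parenthesis
        return node

    return read()

def to_brackets(node):
    if isinstance(node, str):
        return node
    return "[" + " ".join([node[0]] + [to_brackets(c) for c in node[1:]]) + "]"

tree = parse_tree("(S (NP (Det the) (N dog)) (VP (V ate) (NP (Det a) (N cookie))))")
print(to_brackets(tree))
# [S [NP [Det the] [N dog]] [VP [V ate] [NP [Det a] [N cookie]]]]
```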



Syntactic Ambiguity

▪ If you love money problems show up
  ▪ If you love, money problems show up.
  ▪ If you love money, problems show up.
  ▪ If you love money problems, show up.
▪ “I made her duck.”
▪ “We're eating grandpa!” vs. “We're eating, grandpa!”
▪ “Weil er drei Monate verfallene Medikamente nahm, ...” (German: either “Because he took expired medication for three months, ...” or “Because he took medication that had expired three months earlier, ...”)

▪ Different interpretations are mainly caused by syntactic ambiguity.
Syntactic Ambiguities:
Two Possible Parses

“I saw the man with a telescope.”
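
That this sentence has exactly two parses can be checked with a small chart parser that counts derivations. The grammar below is a toy grammar in Chomsky normal form, written for this one sentence; it is an assumption of this sketch and not taken from the lecture. The prepositional phrase “with a telescope” can attach either to the verb phrase (seeing with a telescope) or to the noun phrase (the man who has a telescope).

```python
from collections import defaultdict

# Toy grammar (assumed for illustration): parent -> left child, right child.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"], "man": ["N"],
    "with": ["P"], "a": ["Det"], "telescope": ["N"],
}

def count_parses(words, start="S"):
    n = len(words)
    chart = defaultdict(int)  # (start index, end index, label) -> number of parses
    for i, word in enumerate(words):
        for label in LEXICON.get(word, []):
            chart[(i, i + 1, label)] += 1
    # CKY: build longer spans from pairs of shorter, adjacent spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += chart[(i, k, left)] * chart[(k, j, right)]
    return chart[(0, n, start)]

print(count_parses("I saw the man with a telescope".split()))  # 2
```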



Definition

Semantics:
▪ Study of the meaning of words, phrases, sentences, or documents

Lexical Semantics
▪ Study of the meaning of lexical units, i.e. words.



Lexical Ambiguity

He hit the ball with the bat.


Chuck Norris can hit a bat with a ball.

▪ Different interpretations are caused by lexical ambiguity.

Pragmatics

What is the purpose of an utterance?

“I never said she stole my money”: the meaning depends on which word is stressed.

▪ Stress on “never”: I simply didn't ever say it.
▪ Stress on “I”: Someone else said it, but I didn't.
▪ Stress on “said”: I might have implied it in some way, but I never explicitly said it.
▪ Stress on “she”: I said someone took it; I didn't say it was she.
▪ Stress on “stole”: I just said she probably borrowed it.
▪ Stress on “my”: I said she stole someone else's money.
▪ Stress on “money”: I said she stole something of mine, but not my money.

Example from Wikipedia



Utterance: “Is it cold in here or is it just me?”


Intended meaning: “Please close the window!”

Utterance: “Oh, great! Another meeting.”


Intended meaning: The speaker likely means the opposite of what they are
literally saying—meetings might be something they dislike, despite the
positive tone.



Summary – Linguistic Analysis Levels
Elementary, my dear Watson

Phonetics and Phonology: [ɛlɪˈmɛntəri, maɪ dɪə ˈwɒtsən]

Segmentation: ["Elementary", ",", "my", "dear", "Watson"]

Morphology: Base: Element, Suffix: -ary

Syntax: ADJ, PRP$ ADJ NNP

Semantics: Watson: Dr. John H. Watson (not IBM)

Pragmatics: "You are so stupid…"


Take-Home-Messages

▪ Natural language processing is an interesting topic ☺


▪ There are a lot of challenges

▪ Typical preprocessing steps:


▪ Tokenization for splitting texts into tokens
▪ Stemming / Lemmatization to normalize tokens
▪ PoS-Tagging and parsing analyze syntactic features
▪ PoS-tags roughly represent word classes
▪ Phrases group words to function as a single unit

▪ Ambiguity in language makes analysis a hard problem



Next Lecture

Text Classification
