Nlp4web Lecture 1 Intro
Lecture 1
Introduction
Information
Introductory
Management
▪ Machine Learning
▪ Einführung in die künstliche Intelligenz (Kersting)
▪ Data Mining und maschinelles Lernen (Kersting)
▪ Deep Learning (Kersting)
▪ Computer Vision
▪ Computer Vision 1 and 2 (Roth)
• Website:
www.ukp.tu-darmstadt.de
• GitHub:
www.github.com/UKPLab
• Social Media:
@UKPLab
Information Overload
Business Intelligence
Constantly updated:
Nr. Lecture
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation, Data Collection
05 IR – Re-Ranking Methods
06 IR – Language Domain Shifts, Dense / Sparse Retrieval
07 LLM – Language Modeling Foundations
08 LLM – Neural LLM, Tokenization
09 LLM – Transformers, Self-Attention
10 LLM – Adaptation, LoRA, Prompting
11 LLM – Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam
▪ Computer Science?
▪ Bachelor?
▪ Master?
▪ Other disciplines?
▪ FoLT
▪ Ethics in Natural Language Processing
▪ Deep Learning for NLP
▪ Data Analysis Software Project
▪ Text Analytics / LLM Seminar
http://de.guttenplag.wikia.com/
Hi
Micheal,
have u seen my
posting,last week u said that u
will look in to my problem thsi week.can i ask u
now?
Segmentation
Morphology
Syntax
Semantics
night
Homophones /naɪt/
knight
English Example
▪ Mr. Sherwood said, reaction to Sea Containers’ proposal has been “very
positive.” In New York Stock Exchange composite trading yesterday, Sea
Containers closed at $62.625, up 62.5 cents.
Period
▪ In most cases: sentence-final punctuation mark
▪ Part of an abbreviation, e.g. F.D.P.
▪ Part of numbers: ordinal numbers, e.g. 21., and numbers with fractions, e.g. 1.543
▪ Part of resource locators, e.g. www.apple.com
▪ To complicate things: if a sentence ends with an abbreviation that
ends with a period, only one period is written, e.g. “I go to Apple, Inc.”
▪…
Whitespace character
▪ Part of numbers, e.g. “1 543”
▪ No segmentation character in multi-word expressions
▪ “New York”
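The period ambiguity above can be made concrete with a minimal sketch. It compares a naive splitter (every period followed by whitespace ends a sentence) with a splitter that consults an abbreviation list. The `ABBREVIATIONS` set and both function names are assumptions made for this illustration, not part of the lecture or of any library.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (assumption for this sketch).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "e.g.", "f.d.p."}

def naive_split(text):
    """Treat every period that is followed by whitespace as a sentence end."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]

def abbreviation_aware_split(text):
    """End a sentence at a token-final period unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush the rest, e.g. a sentence that ends with "Inc."
        sentences.append(" ".join(current))
    return sentences

text = ("Mr. Sherwood said the reaction was positive. "
        "Sea Containers closed at $62.625, up 62.5 cents.")

print(naive_split(text))               # splits wrongly after "Mr."
print(abbreviation_aware_split(text))  # keeps "Mr. Sherwood ..." together
```

Periods inside numbers such as 62.625 are not followed by whitespace, so neither splitter breaks there. A sentence that ends with an abbreviation in the middle of a text is still missed by this sketch, which is exactly the hard case named in the last bullet.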
WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 42
Ambiguities
Comma
▪ Part of numbers, e.g. 1,543
Single quote
▪ Within tokens to mark contractions and elisions, e.g. English: don’t,
won’t, you’ve, James’ new hat; German: Ich hab’s!
▪ Part of a token in French, e.g. aujourd’hui
▪ But in most cases: Enclosing quoted groups of words
Dash
▪ A delimiter, if it connects strings of digits, e.g. “see pages 100-101”
▪ In French: Signal a close connection between two tokens, e.g. verb and
personal pronoun: donne-le
▪ In most cases, however, it is part of the token, e.g. multi-word
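Several of the comma, quote, and dash cases above can be captured with one regular expression. The following is only a sketch under simplifying assumptions: the pattern and the name `tokenize` are made up for this example, and the rules cover just the cases listed on the slide.

```python
import re

# Alternatives are tried left to right, so the number rule wins over the word rule.
TOKEN_PATTERN = re.compile(r"""
    \d+(?:[.,]\d+)*        # numbers with internal comma or period: 1,543 or 62.5
  | \w+(?:[-'’]\w+)*       # words with internal apostrophe or dash: don't, donne-le
  | [^\w\s]                # any other single non-space character (punctuation)
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("You've read pages 100-101, don't you?"))
print(tokenize("It costs 1,543 dollars."))
```

Because the number rule stops at a dash, `100-101` is split into three tokens, while `donne-le` stays whole. Token-final apostrophes as in `James’` are split off by this pattern, so possessives would need an extra rule.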
Tokenization in Other Languages
Chinese
爱国人
▪ No spaces
▪ Two possible segmentations, both syntactically and semantically
correct
▪ Disambiguation can only be done with contextual information
爱国 / 人
country-loving person
爱 / 国人
love country-person
Bird et al., NLP with Python, p.113
German Compounds
German
STAUBECKEN
▪ No spaces within noun compounds
▪ Two possible segmentations, both syntactically and semantically
correct
▪ Disambiguation can only be done with contextual information
STAU BECKEN
water reservoir
STAUB ECKEN
dusty corners
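Both 爱国人 and STAUBECKEN show that a lexicon alone cannot pick the right segmentation. As a rough sketch, the function below enumerates every way to cover a string with lexicon entries. The two lexicons are toy assumptions chosen so that exactly the readings from the slides come out; a realistic lexicon would yield more candidates.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into a sequence of lexicon entries."""
    if not text:
        return [[]]  # one way to segment the empty string: no tokens
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Combine this prefix with every segmentation of the remainder.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results

# Toy lexicons, assumed for illustration only.
chinese_lexicon = {"爱", "爱国", "国人", "人"}
german_lexicon = {"stau", "staub", "becken", "ecken"}

print(segmentations("爱国人", chinese_lexicon))
print(segmentations("staubecken", german_lexicon))
```

Each call returns two candidate segmentations. Choosing between them requires context, for example a language model that scores the surrounding sentence.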
• Morphology is the branch of linguistics that studies word forms and word
formation
• Words are composed of morphemes
• Morphemes are the smallest meaning-bearing units
“pneumonoultramicroscopicsilicovolcanoconiosis”
Methods
▪ Stemming
▪ Lemmatization
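The difference between the two methods can be sketched in a few lines. Stemming chops off suffixes by rule and may produce non-words, while lemmatization maps a word form to its dictionary base form. The suffix list and the lemma table below are hypothetical toy data, not a real stemmer such as Porter's and not a real lexicon.

```python
# Hypothetical suffix list and lemma table, chosen only for this illustration.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"went": "go", "mice": "mouse", "better": "good", "studies": "study"}

def stem(word):
    """Crude stemming: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Dictionary lemmatization: look up the base form, fall back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["walking", "walked", "studies", "went", "mice"]:
    print(word, stem(word), lemmatize(word))
```

The stemmer handles regular forms like "walking" without any lexicon but turns "studies" into the non-word "studi" and leaves "went" unchanged. The lemmatizer gets the irregular forms right but only for words in its table.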
Stemming
Homographs: words which have the same spelling but different meanings
▪ POS Tagging: resolves ambiguities, choosing the proper tag for the context
▪ Baseline: Most Frequent Class (accuracy 92.34% [Jurafsky & Martin])
▪ Outdated: Rule-based tagging, probabilistic tagging
▪ State of the art: Neural approaches, accuracy ~ 98%
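The Most Frequent Class baseline is simple enough to sketch directly: every word receives the tag it carried most often in the training data, regardless of context. The training sentences, the tag names, and the function names below are assumptions for illustration; the 92.34% figure above comes from real corpora, not from this toy data.

```python
from collections import Counter, defaultdict

# Tiny hand-made training corpus (an assumption for illustration, not real data).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("can", "AUX"), ("run", "VERB")],
    [("a", "DET"), ("can", "NOUN"), ("fell", "VERB")],
    [("she", "PRON"), ("can", "AUX"), ("swim", "VERB")],
]

def train_most_frequent_tag(tagged_sentences):
    """Learn the most frequent tag per word and a fallback tag for unknown words."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    all_tags = Counter(tag for s in tagged_sentences for _, tag in s)
    default_tag = all_tags.most_common(1)[0][0]
    return word_to_tag, default_tag

def tag(words, word_to_tag, default_tag):
    """Tag each word independently of its context."""
    return [(w, word_to_tag.get(w.lower(), default_tag)) for w in words]

word_to_tag, default_tag = train_most_frequent_tag(TRAIN)
print(tag(["the", "can", "fell"], word_to_tag, default_tag))
```

In "the can fell" the word "can" is a noun, but the baseline tags it AUX because that was its majority tag in training. Resolving such cases is what context-sensitive taggers add on top of the baseline.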
▪ Bracketed notation:
[S [NP [Det the] [N dog] ] [VP [V ate] [NP [Det a] [N cookie] ] ] ]
▪ Parenthesized notation:
(S
  (NP
    (Det the)
    (N dog))
  (VP
    (V ate)
    (NP
      (Det a)
      (N cookie))))
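The parenthesized notation is easy to read by machine. As a minimal sketch, the following recursive-descent reader turns such a string into nested `(label, children)` tuples and recovers the sentence from the leaves. The function names and the tuple representation are choices made for this example, not a standard API.

```python
import re

def parse(text):
    """Parse '(S (NP (Det the) (N dog)) ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        pos += 1
        label = tokens[pos]  # category label such as S, NP, Det
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return read_node()

def leaves(node):
    """Collect the words of a tree from left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]

tree = parse("(S (NP (Det the) (N dog)) (VP (V ate) (NP (Det a) (N cookie))))")
print(tree[0])
print(leaves(tree))
```

The same reader works for the multi-line layout, since whitespace and line breaks are ignored during tokenization.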
Semantics:
▪ Study of the meaning of words, phrases, sentences, or documents
Lexical Semantics
▪ Study of the meaning of lexical units, i.e. words.
“I never said she stole my money”
The meaning changes with the stressed word:
▪ Stress on “never”: I simply didn't ever say it.
▪ Stress on “I”: Someone else said it, but I didn't.
▪ Stress on “said”: I might have implied it in some way, but I never
explicitly said it.
▪ Stress on “she”: I said someone took it; I didn't say it was she.
▪ Stress on “stole”: I just said she probably borrowed it.
▪ Stress on “my”: I said she stole someone else's money.
▪ Stress on “money”: I said she stole something of mine, but not my
money.
Summary – Linguistic Analysis Levels
Elementary, my dear Watson
Segmentation
Morphology
Syntax
Semantics
Pragmatics
Text Classification