Lecture 01 (2020): The NLP Pipeline

The document discusses various tasks and challenges in Natural Language Processing (NLP), including tokenization, part-of-speech tagging, syntactic parsing, and semantic analysis. It highlights the importance of statistical models to address ambiguity and the need for explicit representations in the NLP pipeline. Additionally, it contrasts traditional NLP approaches with modern neural methods, emphasizing their strengths and limitations.


CS447 Lecture 1: Building computers that understand text: the ‘traditional’ NLP pipeline

CS447 Natural Language Processing (J. Hockenmaier) https://courses.grainger.illinois.edu/cs447/


What does it take to understand text?
死亡⾕测得54.4摄⽒度⾼温 美国加州名胜或破世界纪录 (Chinese)

Çavuşoğlu'ndan Atina'ya uyarı: Bazı ülkelerin dolduruşuna gelip, kendinizi riske atmayın (Turkish)

รอยัลลิสต์มาร์เก็ตเพลส: เฟซบุ๊ก เตรียมดำเนินทางการกฎหมายกับรัฐบาลไทย หลังบังคับบล็อกการเข้าถึงกลุ่มปิดที่พูดคุยเกี่ยวกับราชวงศ์ (Thai)

ኣብ ሳዋ ዝወሃብ መበል 12 ክፍሊ ትምህርቲ ክቋረጽ ጎስጓስ ይካየድ ኣሎ (Tigrinya)

Qabiyyeen xalayaa dhimma Obbo Lidatu Ayyaaloorratti MM Abiyyiif barraa'e maali? (Oromo)

'Dim angen cau tafarndai a bwytai i ailagor ysgolion' (Welsh)



Task: Tokenization/segmentation
死亡⾕测得54.4摄⽒度⾼温 美国加州名胜或破世界纪录

รอยัลลิสต์มาร์เก็ตเพลส: เฟซบุ๊ก เตรียมดำเนินทางการกฎหมายกับรัฐบาลไทย หลังบังคับบล็อกการเข้าถึงกลุ่มปิดที่พูดคุยเกี่ยวกับราชวงศ์

We need to split text into words and sentences.

Languages like Chinese or Thai don’t have spaces between words.
Even in English, this cannot be done deterministically:
“There was an earthquake near D.C. You could even feel it in Philadelphia, New York, etc.”
(The periods in “D.C.” and “etc.” mark abbreviations, yet here they also end sentences; no fixed rule can tell these cases apart.)

NLP task:
What is the most likely segmentation/tokenization?
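The English example can be made concrete. Below is a minimal sketch of why a deterministic rule fails: a naive “split after period + space” rule treats the string “D.C. ” identically in two sentences where it plays different roles. (The second example sentence is our own, not from the slides.)

```python
import re

# Naive deterministic rule: a sentence ends at every ". " (period + space).
def split_after_period(s):
    return re.split(r"(?<=\.)\s+", s)

# Here the period after "D.C." really does end the sentence:
s1 = "There was an earthquake near D.C. You could even feel it in Philadelphia."
# Here it does not (hypothetical extra example, not from the slides):
s2 = "I live in Washington D.C. near the Capitol."

print(split_after_period(s1))  # correct: two sentences
print(split_after_period(s2))  # wrong: splits one sentence in two
```

Since the same surface string gets different analyses depending on context, segmentation has to be treated as a prediction problem (“what is the most likely segmentation?”) rather than a rule lookup.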



Task: Part-of-speech tagging
Open the pod door, Hal.

Verb Det Noun Noun , Name .
Open the pod door , Hal .

open: verb, adjective, or noun?
– Verb: open the door
– Adjective: the open door
– Noun: in the open
How do we decide?
We want to know the most likely tags T for the sentence S:

argmaxT P(T | S)

We need to define a statistical model of P(T | S), e.g.:

argmaxT P(T | S) = argmaxT P(T) P(S | T)

P(T) =def ∏i P(ti | ti-1)

P(S | T) =def ∏i P(wi | ti)
We need to estimate the parameters of P(T |S), e.g.:
P( ti =V | ti-1 =N ) = 0.312
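As a minimal sketch of how such a factored model is used, the code below decodes with the Viterbi algorithm. All probabilities here are made up for illustration; they are not estimates from data or from the course.

```python
trans = {  # P(t_i | t_i-1); "<s>" marks the sentence start (toy numbers)
    ("<s>", "Det"): 0.5, ("<s>", "Noun"): 0.3, ("<s>", "Verb"): 0.2,
    ("Det", "Noun"): 0.9, ("Det", "Verb"): 0.1,
    ("Noun", "Noun"): 0.5, ("Noun", "Verb"): 0.5,
    ("Verb", "Det"): 0.6, ("Verb", "Noun"): 0.4,
}
emit = {  # P(w_i | t_i) (toy numbers)
    ("Det", "the"): 1.0,
    ("Noun", "pod"): 0.3, ("Noun", "door"): 0.6, ("Noun", "open"): 0.1,
    ("Verb", "open"): 0.8, ("Verb", "door"): 0.2,
}
tags = ["Det", "Noun", "Verb"]

def viterbi(words):
    # best[i][t]: probability of the best tag sequence for words[:i+1]
    # ending in tag t; back[i][t]: the tag it came from.
    best = [{t: trans.get(("<s>", t), 0.0) * emit.get((t, words[0]), 0.0)
             for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        best.append({}); back.append({})
        for t in tags:
            p, prev = max((best[i - 1][s] * trans.get((s, t), 0.0) *
                           emit.get((t, w), 0.0), s) for s in tags)
            best[i][t], back[i][t] = p, prev
    # Read off the best final tag, then follow the backpointers.
    t = max(tags, key=lambda tag: best[-1][tag])
    seq = [t]
    for i in range(len(words) - 1, 0, -1):
        t = back[i][t]
        seq.append(t)
    return seq[::-1]

print(viterbi(["open", "the", "pod", "door"]))  # ['Verb', 'Det', 'Noun', 'Noun']
```

Note how the ambiguous word “open” is resolved to Verb here only because the sentence-initial transition and the emission probabilities jointly favor that reading.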



Disambiguation requires statistical models

Ambiguity is a core problem for any NLP task.

Statistical models* are one of the main tools to deal with ambiguity.

*More generally: many of the models (classifiers, structured prediction models) you learn about in CS446 (Machine Learning) can be used for this purpose. You can learn more about the connection to machine learning in CS546 (Machine Learning in Natural Language).

These models need to be trained (estimated, learned) before they can be used (tested, evaluated).

We will see lots of examples in this class.
(CS446 is NOT a prerequisite for CS447.)



“I made her duck”
What does this sentence mean?
“I made her crouch”, “I cooked duck for her”, “I cooked her [pet] duck (perhaps just for myself)”, …

– “duck”: noun or verb?
– “make”: “cook X” or “cause X to do Y”?
– “her”: “for her” or “belonging to her”?

Language has different kinds of ambiguity, e.g.:

Structural ambiguity
– “I eat sushi with tuna” vs. “I eat sushi with chopsticks”
– “I saw the man with the telescope on the hill”

Lexical (word sense) ambiguity
– “I went to the bank”: financial institution or river bank?

Referential ambiguity
– “John saw Jim. He was drinking coffee.” Who was drinking coffee?



“I made her duck cassoulet”
(Cassoulet = a French bean casserole)

The second major problem in NLP is coverage:
We will always encounter unfamiliar words and constructions.
Our models need to be able to deal with this.

This means that our models need to be able to generalize from what they have been trained on to what they will be used on.



Task: Syntactic parsing

(S (VP (Verb Open) (NP (Det the) (NOUN (Noun pod) (Noun door)))) (, ,) (Name Hal) (. .))

Verb Det Noun Noun , Name .
Open the pod door , Hal .
Observation: Structure corresponds to meaning

Correct analysis
(VP (V eat) (NP (NP sushi) (PP (P with) (NP tuna)))) – “with tuna” modifies “sushi”
(VP (VP (V eat) (NP sushi)) (PP (P with) (NP chopsticks))) – “with chopsticks” modifies “eat”

Incorrect analysis
(VP (VP (V eat) (NP sushi)) (PP (P with) (NP tuna)))
(VP (V eat) (NP (NP sushi) (PP (P with) (NP chopsticks))))
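This attachment ambiguity can be made concrete with a tiny chart parser. The grammar below is our own toy grammar in Chomsky normal form, not one from the course; CKY counts how many analyses each span admits, and finds exactly two VP analyses for “eat sushi with tuna”, one per attachment site.

```python
from collections import defaultdict

# Toy CNF grammar (ours, for illustration).
binary_rules = [("VP", "V", "NP"), ("VP", "VP", "PP"),
                ("NP", "NP", "PP"), ("PP", "P", "NP")]
lexicon = {"eat": {"V"}, "with": {"P"},
           "sushi": {"NP"}, "tuna": {"NP"}, "chopsticks": {"NP"}}

def cky_counts(words):
    """chart[(i, j)][X] = number of distinct parses of words[i:j] as X."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for cat in lexicon[w]:
            chart[(i, i + 1)][cat] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for parent, left, right in binary_rules:
                    chart[(i, j)][parent] += (chart[(i, k)][left] *
                                              chart[(k, j)][right])
    return {cat: cnt for cat, cnt in chart[(0, n)].items() if cnt}

print(cky_counts("eat sushi with tuna".split()))  # {'VP': 2}
```

Two analyses for a four-word phrase; the number of parses grows quickly with sentence length, which is why parsers need a statistical model to rank them.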



Question: What is grammar?

Grammar formalisms (= linguists’ programming languages)
A precise way to define and describe the structure of sentences.

Specific grammars (= linguists’ programs)
Implementations (in a particular formalism) for a particular language (English, Chinese, ...).


Overgeneration (the grammar accepts strings that are not English):
– John Mary saw.
– with tuna sushi ate I.
– Did you went there?

English sentences the grammar correctly accepts:
– Did you go there?
– I want you to go there.
– John saw Mary.
– I ate sushi with tuna.

Undergeneration (English sentences the grammar fails to accept):
– I ate the cake that John had made for me yesterday.
– John and Mary eat sushi for dinner.
NLP and automata theory

What kind of grammar/automaton is required to analyze natural language?
What class of languages does natural language fall into?

Chomsky’s (1956) hierarchy of formal languages was originally developed to answer (some of) these questions.



Task: Semantic analysis

∃x∃y(pod_door(x) & Hal(y) & request(open(x, y)))

(S (VP (Verb Open) (NP (Det the) (NOUN (Noun pod) (Noun door)))) (, ,) (Name Hal) (. .))

Verb Det Noun Noun , Name .
Open the pod door , Hal .
Representing meaning
We need a meaning representation language.

“Shallow” semantic analysis: Template-filling (Information Extraction)
– Named-Entity Extraction: Organizations, Locations, Dates, ...
– Event Extraction

“Deep” semantic analysis: (variants of) formal logic
∃x∃y(pod_door(x) & Hal(y) & request(open(x, y)))

We also distinguish between
– Lexical semantics (the meaning of words) and
– Compositional semantics (the meaning of sentences).
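As a minimal sketch of what a machine can do with such a logical form, the existential part of the formula can be checked against a tiny hand-built model by brute-force search over the domain. The model, the entity names, and the simplification of dropping the embedded request(open(x, y)) term are all ours, for illustration only.

```python
from itertools import product

# A toy model: a domain of entities plus the atomic facts that hold of them.
entities = {"d1", "hal"}
facts = {("pod_door", ("d1",)), ("Hal", ("hal",))}

def holds(pred, args):
    return (pred, args) in facts

# Check the simplified formula  ∃x∃y( pod_door(x) & Hal(y) )
# by trying every assignment of entities to x and y.
def check_exists():
    return any(holds("pod_door", (x,)) and holds("Hal", (y,))
               for x, y in product(entities, repeat=2))

print(check_exists())  # True: x = "d1", y = "hal" satisfies the formula
```

This kind of brute-force model checking only scales to toy domains, but it shows why an explicit meaning representation is useful: once a sentence is mapped to logic, answering questions about it becomes a well-defined computation.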
Understanding texts
More than a decade ago, Carl Lewis stood on the threshold of what was to
become the greatest athletics career in history. He had just broken two of
the legendary Jesse Owens' college records, but never believed he would
become a corporate icon, the focus of hundreds of millions of dollars in
advertising. His sport was still nominally amateur.
Eighteen Olympic and World Championship gold medals and 21 world
records later, Lewis has become the richest man in the history of track and
field – a multi-millionaire.

Who is Carl Lewis?
Did Carl Lewis break any world records? (And how do you know that?)
Is Carl Lewis wealthy? What about Jesse Owens?



Summary: The NLP Pipeline
An NLP system may use some or all of the following steps:

Tokenizer/Segmenter
– to identify words and sentences
Morphological analyzer/POS-tagger
– to identify the part of speech and structure of words
Word sense disambiguation
– to identify the meaning of words
Syntactic/semantic Parser
– to obtain the structure and meaning of sentences
Coreference resolution/discourse model
– to keep track of the various entities and events mentioned
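The steps above can be sketched as a chain of functions. Everything below is a stub of our own making (a real system would use a learned model at each stage), but it shows the pipeline shape: each step consumes the previous step’s output and adds explicit structure.

```python
def tokenize(text):                    # Tokenizer/Segmenter (stub)
    return text.replace(",", " ,").replace(".", " .").split()

def pos_tag(tokens):                   # POS tagger (stub: a fixed lookup table)
    tag_of = {"Open": "Verb", "the": "Det", "pod": "Noun",
              "door": "Noun", ",": ",", "Hal": "Name", ".": "."}
    return [(t, tag_of.get(t, "Noun")) for t in tokens]

def pipeline(text):
    # Later stages (WSD, parsing, coreference) would consume this output.
    return pos_tag(tokenize(text))

print(pipeline("Open the pod door, Hal."))
```

The design point this illustrates is the one made on the next slides: because each stage's output is the next stage's input, every stage needs an explicit output representation, and errors made early propagate downstream.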
NLP Pipeline: Assumptions
Each step in the NLP pipeline embellishes the input with explicit information about its linguistic structure:
– POS tagging: parts of speech of words,
– Syntactic parsing: grammatical structure of sentences, ...

Each step in the NLP pipeline requires its own explicit (“symbolic”) output representation:
– POS tagging requires a POS tag set
(e.g. NN = common noun singular, NNS = common noun plural, ...)
– Syntactic parsing requires constituent or dependency labels
(e.g. NP = noun phrase, or nsubj = nominal subject)

These representations should capture linguistically appropriate generalizations/abstractions.
– Designing these representations requires linguistic expertise.
NLP Pipeline: Shortcomings
Each step in the pipeline relies on a learned model that will return the most likely representations.
– This requires a lot of annotated training data for each step.
– Annotation is expensive and sometimes difficult (people are not 100% accurate).
– These models are never 100% accurate.
– Models make more mistakes if their input contains mistakes.

How do we know that we have captured the “right” generalizations when designing representations?
– Some representations are easier to predict than others.
– Some representations are more useful for the next steps in the pipeline than others.
– But we won’t know how easy/useful a representation is until we have a model that we can plug into a particular pipeline.



Sidestepping the NLU pipeline
Many current neural approaches for natural language understanding and
generation go directly from the raw input to the desired final output.

With large amounts of training data, this often works better than the traditional approach.
— We will soon discuss why this may be the case.

But these models don’t solve everything:
— How do we incorporate knowledge, reasoning, etc. into these models?
— What do we do when we don’t have much training data
(e.g. when we work with a low-resource language)?

