NLP Cmu
Language Processing
Text Data is Superficial
§ Semantic structures
§ References and entities
§ Discourse-level connectives
§ Meanings and implicatures
§ Contextual factors
§ Perceptual grounding
§ …
Syntactic Analysis
§ SOTA: ~90% accurate for many languages when given many training examples, some progress in analyzing languages given few or no examples
Corpora
§ A corpus is a collection of text
  § Often annotated in some way
  § Sometimes just lots of text
  § Balanced vs. uniform corpora
§ Examples
  § Newswire collections: 500M+ words
  § Brown corpus: 1M words of tagged “balanced” text
  § Penn Treebank: 1M words of parsed WSJ
  § Canadian Hansards: 10M+ words of aligned French / English sentences
  § The Web: billions of words of who knows what
Corpus-Based Methods
§ A corpus like a treebank gives us three important tools:
§ It gives us broad coverage
ROOT → S
S → NP VP .
NP → PRP
VP → VBD ADJ
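
One way to see how a treebank yields rules like those above is to read the productions off a bracketed parse. The sketch below does this for a single tree. The example sentence ("He was right.") and the helper names `parse_tree` and `productions` are assumptions made for illustration; only the four rules themselves come from the slide.

```python
def parse_tree(text):
    """Parse a bracketed tree such as "(S (NP (PRP He)) ...)" into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word under a preterminal
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return read()


def productions(tree):
    """List the phrase-level rules (lhs, rhs) used in a tree, top-down."""
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    rules = []
    if kids:  # skip preterminals, whose only children are words
        rules.append((label, tuple(k[0] for k in kids)))
        for k in kids:
            rules.extend(productions(k))
    return rules


# Hypothetical treebank entry, chosen to match the rules on the slide.
TREE = "(ROOT (S (NP (PRP He)) (VP (VBD was) (ADJ right)) (. .)))"
for lhs, rhs in productions(parse_tree(TREE)):
    print(lhs, "→", " ".join(rhs))
```

Running the same extraction over every tree in a treebank gives the broad-coverage grammar the slide refers to.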
Corpus-Based Methods
§ It gives us statistical information
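
One kind of statistical information a treebank can provide is how often each rule is used. A minimal sketch, assuming rules are given as `(lhs, rhs)` pairs and that probabilities are estimated by relative frequency (the slide does not say which statistic it plots); the function name and the toy counts are hypothetical.

```python
from collections import Counter


def rule_probabilities(rules):
    """Estimate P(rhs | lhs) as count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# Toy rule occurrences, invented for illustration.
observed = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("PRP",)),
    ("VP", ("VBD", "NP")),
]
for (lhs, rhs), p in rule_probabilities(observed).items():
    print(f"{lhs} → {' '.join(rhs)}: {p:.2f}")
```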
§ Morphology
sarà andata
be+fut+3sg go+ppt+fem
“she will have gone”
§ Discourse: how do sentences relate to each other?
§ Pragmatics: what intent is expressed by the literal meaning, how to react to an utterance?
§ Phonetics: acoustics and physical production of sounds
§ Phonology: how sounds pattern in a language
Question Answering
§ Question Answering:
  § More than search
  § Ask general comprehension questions of a document collection
  § Can be really easy: “What’s the capital of Wyoming?”
  § Can be harder: “How many US states’ capitals are also their largest cities?”
  § Can be open ended: “What are the main issues in the global warming debate?”
§ SOTA: Can do factoids, even when text isn’t a perfect match
Example: Watson
Summarization
§ Condensing documents
§ An example of analysis with generation
Extractive Summaries
Lindsay Lohan pleaded not guilty Wednesday to felony grand theft of a
$2,500 necklace, a case that could return the troubled starlet to jail rather
than the big screen. Saying it appeared that Lohan had violated her
probation in a 2007 drunken driving case, the judge set bail at $40,000 and
warned that if Lohan was accused of breaking the law while free he would
have her held without bail. The Mean Girls star is due back in court on Feb.
23, an important hearing in which Lohan could opt to end the case early.
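
An extractive summary is built by selecting sentences from the source rather than writing new ones. The sketch below is one very simple baseline, not the system behind the example above: it scores each sentence by the average corpus frequency of its content words and keeps the top `k` in their original order. The stopword list, the scoring rule, and the function name are all assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "that", "is", "it"}


def extractive_summary(sentences, k=1):
    """Return the k highest-scoring sentences, kept in document order."""
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    freq = Counter(w for sent in tokenized for w in sent)

    def score(idx):
        words = tokenized[idx]
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = sorted(range(len(sentences)), key=score, reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


docs = ["The cat sat on the mat.", "The cat chased the cat toy.", "Dogs bark."]
print(extractive_summary(docs, k=1))
```

Averaging rather than summing keeps long sentences from winning on length alone.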
Machine Translation
§ Answers:
  § 1980: write it all down
  § 2000: get by without it
  § 2020: learn it from data
Deeper Understanding: Reference
Names vs. Entities
Example Errors
Discovering Knowledge
Grounded Language
Grounding with Natural Data
… on the beige loveseat.
What is Nearby NLP?
§ Computational Linguistics
  § Using computational methods to learn more about how language works
  § We end up doing this and using it
§ Cognitive Science
  § Figuring out how the human brain works
  § Includes the bits that do language
  § Humans: the only working NLP prototype!
§ Speech Processing
  § Mapping audio signals to text
  § Traditionally separate from NLP, converging?
  § Two components: acoustic models and language models
  § Language models in the domain of stat NLP
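
The two speech components are commonly combined through a noisy-channel decomposition. The notation here is an assumption introduced for illustration: $W$ is a candidate word sequence and $A$ is the observed audio signal.

```latex
\hat{W}
  = \arg\max_{W} P(W \mid A)
  = \arg\max_{W}
    \underbrace{P(A \mid W)}_{\text{acoustic model}}
    \;\underbrace{P(W)}_{\text{language model}}
```

The second equality follows from Bayes' rule, since $P(A)$ does not depend on $W$ and can be dropped from the maximization.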
Example: NLP Meets CL
§ Headlines:
  § Enraged Cow Injures Farmer with Ax
  § Teacher Strikes Idle Kids
  § Hospitals Are Sued by 7 Foot Doctors
  § Ban on Nude Dancing on Governor’s Desk
  § Iraqi Head Seeks Arms
  § Stolen Painting Found by Tree
  § Kids Make Nutritious Snacks
  § Local HS Dropouts Cut in Half
[Figure labels: ADJ; DET NOUN; DET NOUN; PLURAL NOUN; NP PP; NP NP; CONJ]
Classical NLP: Parsing
§ Write symbolic or logical rules:
Grammar (CFG)                        Lexicon
ROOT → S        NP → NP PP           NN → interest
S → NP VP       VP → VBP NP          NNS → raises
NP → DT NN      VP → VBP NP PP       VBP → interest
NP → NN NNS     PP → IN NP           VBZ → raises
…
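
To make the rule-writing approach concrete, the sketch below counts how many parses a hand-written grammar assigns to a sentence. The grammar and lexicon are a hypothetical toy written for the headline "Teacher Strikes Idle Kids" from the earlier slide, not the grammar shown above, and the chart-style counting routine is just one straightforward way to do it.

```python
from functools import lru_cache

# Toy grammar and lexicon, invented for this one headline.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("NN",), ("NNS",), ("NN", "NNS"), ("JJ", "NNS")],
    "VP": [("VBZ", "NP"), ("VBP", "NP")],
}
LEXICON = {
    "teacher": {"NN"},
    "strikes": {"NNS", "VBZ"},
    "idle": {"JJ", "VBP"},
    "kids": {"NNS"},
}


def count_parses(words, root="S"):
    """Count the distinct parse trees for `words` rooted in `root`."""
    words = tuple(w.lower() for w in words)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of ways `symbol` can cover words[i:j].
        total = 0
        if j - i == 1 and symbol in LEXICON.get(words[i], ()):
            total += 1
        for rhs in GRAMMAR.get(symbol, ()):
            total += count_seq(rhs, i, j)
        return total

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can cover words[i:j].
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        total = 0
        # Leave at least one word for each remaining symbol.
        for k in range(i + 1, j - len(rhs) + 2):
            left = count(rhs[0], i, k)
            if left:
                total += left * count_seq(rhs[1:], k, j)
        return total

    return count(root, 0, len(words))


print(count_parses("Teacher strikes idle kids".split()))
```

The grammar licenses two readings: a teacher who hits idle kids, and teacher strikes that leave kids idle. This routine assumes the grammar has no cycles of unary rules; with such cycles the recursion would not terminate.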
[Figure: fraction of unigrams and bigrams seen (y-axis, 0 to 1) vs. number of words (x-axis, 0 to 1,000,000)]
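
A curve like this can be approximated by measuring what fraction of the n-grams in held-out text already appeared in the training text. The sketch below is one way to compute a single point on such a curve. It assumes the plot counts n-gram tokens rather than types, which the slide does not state, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def fraction_seen(train_tokens, test_tokens, n):
    """Fraction of test n-gram tokens that also occur in the training text."""
    seen = set(ngrams(train_tokens, n))
    test = ngrams(test_tokens, n)
    if not test:
        return 0.0
    return sum(g in seen for g in test) / len(test)


train = "the cat sat on the mat".split()
test = "the cat sat on the rug".split()
print(fraction_seen(train, test, 1), fraction_seen(train, test, 2))
```

Repeating the measurement with growing prefixes of the training text traces out one line per n-gram order.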
The (Effective) NLP Cycle
§ Pick a problem (usually some disambiguation)
§ Get a lot of data (usually a labeled corpus)
§ Build the simplest thing that could possibly work
§ Repeat:
  § Examine the most common errors
  § Figure out what information a human might use to avoid them
  § Modify the system to exploit that information
    § Feature engineering
    § Representation redesign
    § Different machine learning methods
§ We do this over and over again
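
The "examine the most common errors" step can be as simple as tallying (gold, predicted) pairs where the system was wrong. A minimal sketch, assuming a labeling task with one gold label and one prediction per item; the function name and the toy tag sequences are hypothetical.

```python
from collections import Counter


def most_common_errors(gold, predicted, top=5):
    """Return the most frequent (gold, predicted) confusions, largest first."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return errors.most_common(top)


gold = ["NN", "VB", "NN", "JJ", "NN"]
pred = ["NN", "NN", "VB", "NN", "VB"]
for (g, p), n in most_common_errors(gold, pred):
    print(f"gold={g} predicted={p}: {n}")
```

Looking at actual examples behind the largest confusion is usually what suggests the next feature or representation change.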