
Natural Language Processing

Taylor Berg-Kirkpatrick – CMU

Slides: Dan Klein – UC Berkeley
Language Technologies

Goal: Deep Understanding
§ Requires context, linguistic structure, meanings…

Reality: Shallow Matching
§ Requires robustness and scale
§ Amazing successes, but fundamental limitations
Speech Systems
§ Automatic Speech Recognition (ASR)
§ Audio in, text out
§ SOTA: 0.3% error for digit strings, 5% dictation, 50%+ TV

"Speech Lab"

§ Text to Speech (TTS)
§ Text in, audio out
§ SOTA: totally intelligible (if sometimes unnatural)
 
Example: Siri
§ Siri contains
§ Speech recognition
§ Language analysis
§ Dialog processing
§ Text to speech

Image: Wikipedia
Text Data is Superficial

An iceberg is a large piece of freshwater ice that has


broken off from a snow-formed glacier or ice shelf and
is floating in open water.
… But Language is Complex

An iceberg is a large piece of


freshwater ice that has broken off
from a snow-formed glacier or ice
shelf and is floating in open water.

§ Semantic structures
§ References and entities
§ Discourse-level connectives
§ Meanings and implicatures
§ Contextual factors
§ Perceptual grounding
§ …
Syntactic Analysis

Hurricane Emily howled toward Mexico 's Caribbean coast on Sunday


packing 135 mph winds and torrential rain and causing panic in Cancun ,
where frightened tourists squeezed into musty shelters .

§ SOTA: ~90% accurate for many languages when given many training examples, some progress in analyzing languages given few or no examples
Corpora
§ A corpus is a collection of text
§ Often annotated in some way
§ Sometimes just lots of text
§ Balanced vs. uniform corpora

§ Examples
§ Newswire collections: 500M+ words
§ Brown corpus: 1M words of tagged "balanced" text
§ Penn Treebank: 1M words of parsed WSJ
§ Canadian Hansards: 10M+ words of aligned French / English sentences
§ The Web: billions of words of who knows what
Corpus-Based Methods
§ A corpus like a treebank gives us three important tools:
§ It gives us broad coverage

ROOT → S
S → NP VP .
NP → PRP
VP → VBD ADJ
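As a small illustration of where such rules come from (a sketch using NLTK's Tree class; the bracketed sentence below is a made-up example, not taken from the slide's corpus), the phrasal rules above are read off annotated trees like this one:

```python
from nltk.tree import Tree

# A made-up treebank-style bracketing; reading the productions off
# annotated trees like this is where a treebank grammar's broad
# coverage comes from.
t = Tree.fromstring("(ROOT (S (NP (PRP She)) (VP (VBD was) (ADJ right)) (. .)))")
for production in t.productions():
    # includes ROOT -> S, S -> NP VP ., NP -> PRP, VP -> VBD ADJ,
    # plus the lexical rules such as PRP -> 'She'
    print(production)
```
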
Corpus-Based Methods
§ It gives us statistical information

[Chart: the distribution of NP expansions (NP PP, DT NN, PRP, …) computed three ways: over all NPs, over NPs under S, and over NPs under VP; the three distributions differ sharply by context]
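A rough sketch of how such statistics might be collected, using the 10% Penn Treebank sample that ships with NLTK (an assumption about what data is handy; the exact numbers will differ from the chart):

```python
from collections import Counter
from nltk.corpus import treebank        # requires nltk.download('treebank')
from nltk.tree import ParentedTree

def base(label):
    # strip Penn Treebank function tags, e.g. "NP-SBJ" -> "NP"
    return label.split("-")[0].split("=")[0]

dists = {"all NPs": Counter(), "NPs under S": Counter(), "NPs under VP": Counter()}
for sent in treebank.parsed_sents():
    tree = ParentedTree.convert(sent)
    for np in tree.subtrees(lambda t: base(t.label()) == "NP"):
        expansion = " ".join(base(c.label()) for c in np if not isinstance(c, str))
        dists["all NPs"][expansion] += 1
        parent = np.parent()
        if parent is not None and base(parent.label()) == "S":
            dists["NPs under S"][expansion] += 1
        elif parent is not None and base(parent.label()) == "VP":
            dists["NPs under VP"][expansion] += 1

for name, counter in dists.items():
    print(name, counter.most_common(5))
```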


Corpus-Based Methods
§ It lets us check our answers
Semantic Ambiguity
§ NLP is much more than syntax!
§ Even correct tree-structured syntactic analyses don't fully nail down the meaning

I haven't slept for ten days

John's boss said he was doing better

§ In general, every level of linguistic structure comes with its own ambiguities…
Other Levels of Language
§ Tokenization/morphology:
§ What are the words, and what is the sub-word structure?
§ Often simple rules work (a period after "Mr." isn't a sentence break); a toy splitter is sketched after this list
§ Relatively easy in English; other languages are harder:
§ Segmentation
§ Morphology

sarà andata
be+fut+3sg go+ppt+fem
"she will have gone"

§ Discourse: how do sentences relate to each other?
§ Pragmatics: what intent is expressed by the literal meaning, and how should one react to an utterance?
§ Phonetics: acoustics and the physical production of sounds
§ Phonology: how sounds pattern in a language
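A toy illustration of the "simple rules often work" point (a hypothetical splitter with a made-up abbreviation list, not a production tokenizer):

```python
import re

# A toy rule-based sentence splitter: break after ., ! or ? followed by
# whitespace, unless the preceding token is a known abbreviation.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st"}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        previous = text[:match.start()].split()[-1].rstrip(".").lower()
        if previous in ABBREVIATIONS:
            continue  # a period after "Mr." etc. does not end a sentence
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Mr. Smith met Dr. Jones. They talked for an hour."))
# ['Mr. Smith met Dr. Jones.', 'They talked for an hour.']
```
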
Question Answering
§ Question Answering:
§ More than search
§ Ask general comprehension questions of a document collection
§ Can be really easy: "What's the capital of Wyoming?"
§ Can be harder: "How many US states' capitals are also their largest cities?"
§ Can be open ended: "What are the main issues in the global warming debate?"

§ SOTA: can do factoids, even when the text isn't a perfect match
Example: Watson

Summarization
§ Condensing documents
§ An example of analysis with generation

Extractive Summaries
Lindsay Lohan pleaded not guilty Wednesday to felony grand theft of a
$2,500 necklace, a case that could return the troubled starlet to jail rather
than the big screen. Saying it appeared that Lohan had violated her
probation in a 2007 drunken driving case, the judge set bail at $40,000 and
warned that if Lohan was accused of breaking the law while free he would
have her held without bail. The Mean Girls star is due back in court on Feb.
23, an important hearing in which Lohan could opt to end the case early.
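For flavor, a toy frequency-based extractive summarizer (an illustration of the extract-and-assemble idea only; it is not the system behind the example above, and the truncated document string is just a placeholder):

```python
import re
from collections import Counter

# A toy extractive summarizer: score each sentence by the document-level
# frequency of its words and keep the top-scoring sentence(s).
def extractive_summary(text, n_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    frequencies = Counter(re.findall(r"\w+", text.lower()))
    ranked = sorted(
        sentences,
        key=lambda s: sum(frequencies[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )
    return ranked[:n_sentences]

document = "Lindsay Lohan pleaded not guilty Wednesday to felony grand theft ..."
print(extractive_summary(document))
```
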
Machine Translation

§ Translate text from one language to another

§ Recombines fragments of example translations
§ Challenges:
§ What fragments? [learning to translate]
§ How to make efficient? [fast translation search]
§ Fluency (next class) vs. fidelity (later)
More Data: Machine Translation

SOURCE: Cela constituerait une solution transitoire qui permettrait de conduire à terme à une charte à valeur contraignante.

HUMAN: That would be an interim solution which would make it possible to work towards a binding charter in the long term.

1x DATA: [this] [constituerait] [assistance] [transitoire] [who] [permettrait] [licences] [to] [terme] [to] [a] [charter] [to] [value] [contraignante] [.]

10x DATA: [it] [would] [a solution] [transitional] [which] [would] [of] [lead] [to] [term] [to a] [charter] [to] [value] [binding] [.]

100x DATA: [this] [would be] [a transitional solution] [which would] [lead to] [a charter] [legally binding] [.]

1000x DATA: [that would be] [a transitional solution] [which would] [eventually lead to] [a binding charter] [.]
Data and Knowledge
§ Classic knowledge representation worry: how will a machine ever know that…
§ Ice is frozen water?
§ Beige looks like this:
§ Chairs are solid?

§ Answers:
§ 1980: write it all down
§ 2000: get by without it
§ 2020: learn it from data
Deeper Understanding: Reference
Names vs. Entities
Example Errors
Discovering Knowledge
Grounded Language
Grounding with Natural Data
… on the beige loveseat.
What is Nearby NLP?
§ Computational Linguistics
§ Using computational methods to learn more about how language works
§ We end up doing this and using it

§ Cognitive Science
§ Figuring out how the human brain works
§ Includes the bits that do language
§ Humans: the only working NLP prototype!

§ Speech Processing
§ Mapping audio signals to text
§ Traditionally separate from NLP, converging?
§ Two components: acoustic models and language models
§ Language models fall in the domain of statistical NLP (a tiny sketch follows this list)
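For concreteness, a minimal sketch of what a bigram language model is, using unsmoothed relative frequencies over a made-up toy corpus (real ASR language models are vastly larger and smoothed):

```python
from collections import Counter

# Maximum-likelihood bigram estimates over a made-up toy corpus.
tokens = "the dog barks . the dog runs . the cat runs .".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def p(word, previous):
    # P(word | previous) by relative frequency (no smoothing)
    return bigram_counts[(previous, word)] / unigram_counts[previous]

print(p("dog", "the"))   # 2/3
print(p("runs", "dog"))  # 1/2
```
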
Example: NLP Meets CL

§ Example: language change, reconstructing ancient forms, phylogenies

… just one example of the kinds of linguistic models we can build
What is NLP research?
§ Three aspects we often investigate:
§ Linguistic Issues
§ What is the range of language phenomena?
§ What are the knowledge sources that let us disambiguate?
§ What representations are appropriate?
§ How do you know what to model and what not to model?
§ Statistical Modeling Methods
§ Increasingly complex model structures
§ Learning and parameter estimation
§ Efficient inference: dynamic programming, search, sampling
§ Engineering Methods
§ Issues of scale
§ Where the theory breaks down (and what to do about it)
Some Early NLP History
§ 1950s:
§ Foundational work: automata, information theory, etc.
§ First speech systems
§ Machine translation (MT) hugely funded by the military
§ Toy models: MT using basically word substitution
§ Optimism!
§ 1960s and 1970s: NLP Winter
§ The Bar-Hillel (FAHQT) and ALPAC reports kill MT
§ Work shifts to deeper models, syntax
§ … but toy domains / grammars (SHRDLU, LUNAR)
§ 1980s and 1990s: The Empirical Revolution
§ Expectations get reset
§ Corpus-based methods become central
§ Deep analysis often traded for robust and simple approximations
§ Evaluate everything
§ 2000+: Richer Statistical Methods
§ Models increasingly merge linguistically sophisticated representations with statistical methods; confluence and clean-up
§ Begin to get both breadth and depth
Problem: Structure

§ Headlines:
§ Enraged Cow Injures Farmer with Ax
§ Teacher Strikes Idle Kids
§ Hospitals Are Sued by 7 Foot Doctors
§ Ban on Nude Dancing on Governor's Desk
§ Iraqi Head Seeks Arms
§ Stolen Painting Found by Tree
§ Kids Make Nutritious Snacks
§ Local HS Dropouts Cut in Half

§ Why are these funny?
Problem: Scale
§ People did know that language was ambiguous!
§ … but they hoped that all interpretations would be "good" ones (or ruled out pragmatically)
§ … they didn't realize how bad it would be

[Figure: competing analyses of an ambiguous phrase, with alternative tags (ADJ, DET, NOUN, PLURAL NOUN, CONJ) and alternative attachments (NP, PP)]
Classical NLP: Parsing
§ Write symbolic or logical rules:

Grammar (CFG)                     Lexicon
ROOT → S        NP → NP PP        NN → interest
S → NP VP       VP → VBP NP       NNS → raises
NP → DT NN      VP → VBP NP PP    VBP → interest
NP → NN NNS     PP → IN NP        VBZ → raises

§ Use deduction systems to prove parses from words

§ Minimal grammar on the "Fed raises" sentence: 36 parses
§ Simple 10-rule grammar: 592 parses
§ Real-size grammar: many millions of parses

§ This scaled very badly and didn't yield broad-coverage tools
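A runnable toy version of this style of parsing (using NLTK's chart parser; the grammar below loosely follows the slide's rules, and the extra lexicon entries for 'Fed' and 'rates' are illustrative additions so the classic sentence parses):

```python
import nltk

# A toy CFG in the spirit of the slide's grammar; even this tiny grammar
# licenses more than one tree for "Fed raises interest rates".
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> DT NN | NN NNS | NNP | NNS | NP PP
  VP  -> VBZ NP | VBP NP | VBP NP PP
  PP  -> IN NP
  DT  -> 'the'
  IN  -> 'in'
  NNP -> 'Fed'
  NN  -> 'Fed' | 'interest'
  NNS -> 'raises' | 'rates'
  VBZ -> 'raises'
  VBP -> 'interest'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Fed raises interest rates".split()):
    print(tree)
```

The two trees read "raises" as the main verb versus "interest" as the main verb; with a realistic grammar this kind of ambiguity multiplies into the parse counts listed above.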


Problem: Sparsity
§ However: sparsity is always a problem
§ New unigram (word), bigram (word pair), and rule rates in newswire

[Figure: fraction of types already seen vs. number of words read (0 to 1,000,000), with separate curves for unigrams and bigrams; the unigram curve climbs toward 1 quickly while the bigram curve rises much more slowly]
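A rough sketch that reproduces the shape of this curve, using NLTK's Brown corpus as a stand-in for newswire (an assumption; the exact fractions will differ):

```python
from nltk.corpus import brown   # requires nltk.download('brown')

# For each successive 50k-word chunk, what fraction of its unigram and
# bigram types have already been seen in the text read so far?
words = [w.lower() for w in brown.words()]
seen_unigrams, seen_bigrams = set(), set()
chunk = 50_000
for start in range(0, len(words) - chunk, chunk):
    piece = words[start:start + chunk]
    unigrams = set(piece)
    bigrams = set(zip(piece, piece[1:]))
    if start:
        print(f"{start:>7} words read: "
              f"{len(unigrams & seen_unigrams) / len(unigrams):.2f} of unigrams seen, "
              f"{len(bigrams & seen_bigrams) / len(bigrams):.2f} of bigrams seen")
    seen_unigrams |= unigrams
    seen_bigrams |= bigrams
```
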
The (Effective) NLP Cycle
§ Pick a problem (usually some disambiguation)
§ Get a lot of data (usually a labeled corpus)
§ Build the simplest thing that could possibly work
§ Repeat:
§ Examine what the most common errors are
§ Figure out what information a human might use to avoid them
§ Modify the system to exploit that information
§ Feature engineering
§ Representation redesign
§ Different machine learning methods
§ We do this over and over again
