
Lecture 4: Semantic Parsing and Machine Translation

Kyle Richardson

[email protected]

April 27, 2016


Lecture Plan

- paper: Wong and Mooney (2006)

- general topics: synchronous CFGs, decoding by parsing, word alignment and rule extraction.

2
The Big Picture (reminder)
- Standard processing pipeline:

    input --(Semantic Parsing)--> sem (Knowledge Representation) --(Interpretation)--> world

    input: "List samples that contain every major element"
    sem:   (FOR EVERY X / MAJORELT : T;
             (FOR EVERY Y / SAMPLE : (CONTAINS Y X); (PRINTOUT Y)))
    world: ⟦sem⟧ = {S10019, S10059, ...}

  Lunar QA system (Woods (1973))

3
Semantic Parsing: Generating formal representations
- Data-driven: Given data, learn a function that can map any given input (x) to a meaning representation (z).
- What kind of data do we learn from?

    input x:   What state has the largest population?
    sem z:     (argmax (λx. (state x) (population x)))
    world ⟦z⟧: California

  Geoquery Corpus (Zelle and Mooney (1996))


4
Previously: Learning from meaning representations (again)
data: (x = two times two plus three, y = (plus (mult 2 2) 3))

- Compositional model: a semantic context-free grammar.
- Learning model: greedy string → tree rule induction (SILT)
- Other topics:
  - Non-greedy parsing using (P)CFGs and dynamic programming, the CKY algorithm.
  - Maximum-likelihood estimation, Expectation Maximization and latent variables, inside-outside probabilities.

5
Previous Session: Transformation rules
- Decompose translation into a set of local transformations.

data: (x = two multiplied by two plus three, y = (plus (mult 2 2) 3))

  [tree diagram: the string "two * two + three" is built up bottom-up into the
   tree (plus (mult 2 2) 3) by local rules, e.g. r1: 'two' → N:2, r2: '*' → mult]

6
Bottom-up, String → Tree Rule Matching

MR Grammar:
  RULE      → CONDITION DIRECTIVE
  CONDITION → bowner TEAM UNUM
  DIRECTIVE → do TEAM UNUM ACTION
  TEAM      → our
  UNUM      → 4
  ACTION    → shoot

Transformation: If TEAM player 4 has the ball, TEAM player 4 should shoot.
Input: If our player 4 has the ball, our player 4 should shoot.

7
Semantic Parsing and Machine Translation

- Conceptually: the problem is treated as a kind of machine translation problem.
- Dataset: D = {(x_i, y_i)}_{i=1}^{n}, with x_i a sentence and y_i its (semantic) translation.
- Technically: transformation rules, common in MT.
- Idea: Recast the problem as a statistical MT task.
- Components:
  - Synchronous grammar model
  - Alignment-based rule extraction
  - Probabilistic decoding and ranking model (more next lecture)

8
Context-Free Grammars (again)

- context-free grammar (CFG): G = (Σ, N, S, R)
  - N: set of non-terminal symbols.
  - Σ: set of terminal symbols.
  - R: set of rules = {N → α | α ∈ (N ∪ Σ)*}
  - S: start symbol
- Context-free language: defines a set of strings.
- Derivation: a tree representation of rule applications to an input.
- Semantic parsing: semantic representations and composition rules take the form of non-terminal rules in derivations.

9
Previous examples
- Derivation trees encode the semantic rules.
- example: u = two times two plus three

  [derivation tree:
    N:(plus (mult 2 2) 3)
    ├─ N:(mult 2 2) ── (N:2 "two") (R:mult "times") (N:2 "two")
    ├─ R:plus "plus"
    └─ N:3 "three"]

language of G = {two times two, two times two plus three, ...}

10
Synchronous Context-Free Grammars (extension)
- synchronous context-free grammar (SCFG):

    G_Syn = (Σ_e, Σ_f, N, S, R)

  - N: (shared) set of non-terminal symbols (as before).
  - Σ_e: English terminal symbols.
  - Σ_f: foreign (or semantic) terminal symbols.
  - R: set of rules of the form N → ⟨α, β⟩, with α ∈ (N ∪ Σ_e)*, β ∈ (N ∪ Σ_f)*
  - S: start symbol: ⟨S_1, S_2⟩
- SCF language: defines a set of string pairs.
- Allows us to more explicitly relate input and output.

11
Machine Translation Example

- Example: English → Japanese synchronous grammar.
- Notation: subscripts on the non-terminals link the two sides of a rule; each indexed non-terminal must appear in both halves.¹

    S  → ⟨NP_1 VP_2 , NP_1 VP_2⟩
    VP → ⟨V_1 NP_2 , NP_2 V_1⟩
    NP → ⟨I , watashi wa⟩
    NP → ⟨the box , hako wo⟩
    V  → ⟨open , akemasu⟩

¹ example from Chiang and Knight (2006)
12
Machine Translation: Example Derivation

Grammar:
S  → ⟨NP_1 VP_2 , NP_1 VP_2⟩
VP → ⟨V_1 NP_2 , NP_2 V_1⟩
NP → ⟨I , watashi wa⟩
NP → ⟨the box , hako wo⟩
V  → ⟨open , akemasu⟩

Derivation
S ⇒ ⟨NP_1 VP_2 , NP_1 VP_2⟩
  ⇒ ⟨NP_1 V_3 NP_4 , NP_1 NP_4 V_3⟩
  ⇒ ⟨I V_3 NP_4 , watashi wa NP_4 V_3⟩
  ⇒ ⟨I open NP_4 , watashi wa NP_4 akemasu⟩
  ⇒ ⟨I open the box , watashi wa hako wo akemasu⟩

13
SCFGs

- SCFG language: defines a set of sentence pairs

    G_Syn = {(I open the box, watashi wa hako wo akemasu), ...}

- derivation: a pair of trees.

    (S (NP I) (VP (V open) (NP the box)))
    (S (NP watashi wa) (VP (NP hako wo) (V akemasu)))

14
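
To make the synchronous rewriting concrete, here is a minimal Python sketch (not part of the original slides) of how linked non-terminals are expanded in lockstep on both sides. The rule encoding and the "#index" linking convention are assumptions made for this illustration.

```python
# A minimal sketch of synchronous rewriting: linked non-terminals such as "NP#1"
# are expanded once and the result is substituted on both sides.

RULES = {
    "S":  [(["NP#1", "VP#2"], ["NP#1", "VP#2"])],
    "VP": [(["V#1", "NP#2"], ["NP#2", "V#1"])],
    "NP": [(["I"], ["watashi", "wa"]),
           (["the", "box"], ["hako", "wo"])],
    "V":  [(["open"], ["akemasu"])],
}

def derive(symbol, picks):
    """Synchronously derive an (english, foreign) token-list pair from `symbol`.
    `picks` yields which rule to use each time a non-terminal is expanded."""
    eng_rhs, for_rhs = RULES[symbol][next(picks)]
    expansions = {}                        # each linked non-terminal is expanded exactly once
    for tok in eng_rhs:
        if "#" in tok:
            expansions[tok] = derive(tok.split("#")[0], picks)
    def realize(rhs, side):                # substitute expansions into one side of the rule
        out = []
        for tok in rhs:
            out.extend(expansions[tok][side] if tok in expansions else [tok])
        return out
    return realize(eng_rhs, 0), realize(for_rhs, 1)

e, f = derive("S", iter([0, 0, 0, 0, 1]))  # rule choices, in expansion order
print(" ".join(e))                         # I open the box
print(" ".join(f))                         # watashi wa hako wo akemasu
```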
Two Variants of Parsing

- Parsing pairs: Given an English text and a foreign text, generate a synchronous derivation using a grammar G_Syn (bitext parsing)

    (I open the box, watashi wa hako wo akemasu) → derivation

- Translation or decoding: Given an English text, translate it into a foreign text using a grammar G_Syn

    I open the box → watashi wa hako wo akemasu

- Surprisingly: the first problem is much harder than the second (despite having more information available). We will only consider the second.

15
Decoding by parsing (i.e., Translation)

- Assuming we have binary rules, we can use the CKY algorithm (last lecture) for parsing.
- Idea: Parse the English side of the grammar in the normal way, then apply (or project) the foreign side of the rules.
- Why does this work? Synchronous rules share the same left-hand sides.

16
Decoding by Parsing: Parse English Side

Grammar:
S  → ⟨NP_1 VP_2 , NP_1 VP_2⟩
VP → ⟨V_1 NP_2 , NP_2 V_1⟩
NP → ⟨I , watashi wa⟩
NP → ⟨the box , hako wo⟩
V  → ⟨open , akemasu⟩

Input: 0 I 1 open 2 the box 3

CKY chart (cell [i,j] covers the words between positions i and j):
  [0,1] NP → I              [0,3] S → NP VP
  [1,2] V → open            [1,3] VP → V NP
  [2,3] NP → the box

22
Decoding by Parsing: Projection

Grammar:
S  → ⟨NP_1 VP_2 , NP_1 VP_2⟩
VP → ⟨V_1 NP_2 , NP_2 V_1⟩
NP → ⟨I , watashi wa⟩
NP → ⟨the box , hako wo⟩
V  → ⟨open , akemasu⟩

Input: 0 I 1 open 2 the box 3

Projected chart (each cell now carries the foreign side of its rule):
  [0,1] NP → I, watashi wa        [0,3] S → NP VP, NP VP
  [1,2] V → open, akemasu         [1,3] VP → V NP, NP V
  [2,3] NP → the box, hako wo

23
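
As a concrete companion to the chart walkthrough above, here is a small decode-by-parsing sketch in Python. It is an illustration of the idea, not the actual system from the paper; the rule tables and the simplification of keying child cells by their non-terminal label are assumptions.

```python
# Decode-by-parsing sketch: parse the English side with a CKY-style chart,
# remember which synchronous rule built each cell, then read the translation
# off the foreign sides bottom-up.

LEX = [("NP", ["I"], ["watashi", "wa"]),          # lexical synchronous rules
       ("NP", ["the", "box"], ["hako", "wo"]),
       ("V",  ["open"], ["akemasu"])]
BIN = [("S",  ["NP", "VP"], ["NP", "VP"]),        # binary synchronous rules
       ("VP", ["V", "NP"],  ["NP", "V"])]

def decode(words):
    n = len(words)
    chart = {}                                    # (i, j, lhs) -> backpointer
    for i in range(n):                            # lexical rules may cover multi-word spans
        for j in range(i + 1, n + 1):
            for lhs, e_rhs, f_rhs in LEX:
                if words[i:j] == e_rhs:
                    chart[(i, j, lhs)] = ("lex", f_rhs)
    for span in range(2, n + 1):                  # combine adjacent cells, smallest spans first
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, e_rhs, f_rhs in BIN:
                    b, c = e_rhs
                    if (i, k, b) in chart and (k, j, c) in chart:
                        chart[(i, j, lhs)] = ("bin", f_rhs, (i, k, b), (k, j, c))

    def project(cell):                            # replace foreign-side non-terminals
        entry = chart[cell]
        if entry[0] == "lex":
            return list(entry[1])
        _, f_rhs, left, right = entry
        parts = {left[2]: project(left), right[2]: project(right)}  # keyed by label
        return [w for sym in f_rhs                                  # (assumes distinct child labels)
                for w in (parts[sym] if sym in parts else [sym])]

    return " ".join(project((0, n, "S")))

print(decode("I open the box".split()))           # watashi wa hako wo akemasu
```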
Binarization (brief reminder/review)
- The CKY algorithm (last week) assumes the input grammar is in Chomsky normal form (binary rules and unary pre-terminal rules only).
- Why? input: w1 w2 w3 w4

    binary (one split)      binary + ternary (two splits)
    (w1, w2)                (w1, w2)
    (w2, w3)                (w2, w3)
    (w3, w4)                (w3, w4)
    (w1, w2 w3)             (w1, w2 w3)
    (w1 w2, w3)             (w1 w2, w3)
    (w2 w3, w4)             (w2 w3, w4)
    ...                     ...
                            (w1, w2, w3)
                            (w1 w2, w3, w4)
                            ...

- Problem: Unlike ordinary CFGs, SCFGs cannot be binarized in the general case.
24
History: Syntax-Directed Translation

- First developed as a method for programming-language compilation (i.e., translating high-level languages into lower-level languages):

    ⟨ for i in range(10):        move ax, 1
          n += i           ,     loop: add bx, ax
                                 cmp ax, 10
                                 jle loop ⟩

- Analogy: we can think of semantic parsing as a form of language compilation.

25
Big Idea: Wong and Mooney (2006)

- Transformation rules: recast the string-to-tree rewrite rules (last class, Kate et al. (2005)) as synchronous grammar rules.
- Rule extraction: SCFGs are extracted using a word alignment model (as done in other approaches to MT).

26
Semantic Parsing and Syntax-driven Translation
Grammar:
RULE      → ⟨if CONDITION_1 DIRECTIVE_2 , (CONDITION_1 DIRECTIVE_2)⟩
CONDITION → ⟨TEAM_1 player UNUM_2 has the ball , (bowner TEAM_1 {UNUM_2})⟩
TEAM      → ⟨our , our⟩
UNUM      → ⟨four , 4⟩

Deriv.
RULE ⇒ ⟨if CONDITION_1 DIRECTIVE_2 , (CONDITION_1 DIRECTIVE_2)⟩
     ⇒ ⟨if TEAM_1 player UNUM_2 has the ball DIR._2 , ((bowner TEAM_1 {UNUM_2}) DIR._2)⟩
     ⇒ ⟨if our player UNUM_2 has the ball DIR._2 , ((bowner our {UNUM_2}) DIR._2)⟩
     ...
     ⇒ ⟨If our player four has the ball, then our player six ... ,
        ((bowner our {4}) (do our {6} (pos (left (half our)))))⟩

- Is this grammar in CNF?

27
Rule Extraction and Alignment

- Lexical acquisition: find optimal word alignments between NL sentences and meaning representation (MR) fragments.
- Assumes (as in Kate et al. (2005)) a deterministic MR grammar.
- For alignment, the MR is represented as a sequence of productions.

28
Word-based alignment models (basics)

- Basic idea: Treat translation as a process of translating individual words²

    Das Haus ist klein
    the house is  small

- Alignment function: a : i → j (English word i to foreign word j)

    a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}

² Examples from Koehn (2009) and some of his slides.
29
Word-based alignment models (basics)

- Basic idea: Treat translation as a process of translating individual words³

    Das Haus ist klitzeklein
    the house is  very small

- Alignment function: a : i → j (English word i to foreign word j)

    a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}

- One-to-many: a foreign word might translate to multiple English words.

³ Examples from Koehn (2009)
30
Word-based alignment models (basics)

- Basic idea: Treat translation as a process of translating individual words⁴

    NULL Das Haus ist klein
    the  house is just small

- Alignment function: a : i → j (English word i to foreign word j)

    a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}

- NULL translation: English words might not have foreign translations.

⁴ Examples from Koehn (2009)
31
Word-based alignment models (basics)

- Translation probability: defined as t(e_i | f_j), the probability of English word e_i given foreign word f_j, s.t.

    ∑_e t(e | f_j) = 1.0

    t(· | klein) =  0.5   e = small
                    0.2   e = tiny
                    0.2   e = little
                    0.05  e = the
                    0.05  e = house

32
IBM Model 1

- IBM Model 1: based entirely on translation (or lexical) probabilities (Brown et al. (1993)).
  - English sentence: e_1, ..., e_{l_e}
  - foreign sentence: f_1, ..., f_{l_f}
- Translation probability with alignment:

    p(e, a | f) = 1/(l_f + 1)^{l_e} · ∏_{j=1}^{l_e} t(e_j | f_{a(j)})

- (l_f + 1)^{l_e} is the total number of alignments (assuming a NULL word).

33
IBM Model 1

- Translation probability with alignment:

    p(e, a | f) = 1/(l_f + 1)^{l_e} · ∏_{j=1}^{l_e} t(e_j | f_{a(j)})

    Das Haus ist klein
    the house is  small

- a : {1 → 1 (t(the|Das) = 0.7), 2 → 2 (t(house|Haus) = 0.8), 3 → 3 (0.8), 4 → 4 (0.4)}

    p(e, a | f) = 1/5⁴ × 0.7 × 0.8 × 0.8 × 0.4 ≈ 0.00029

34
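
A quick numeric check of the example above, as a small Python sketch that just multiplies out the slide's numbers. The tiny lexicon is an assumption covering exactly these four word pairs, with the two unlabeled probabilities attached to the 3 → 3 and 4 → 4 links as on the slide.

```python
t = {("the", "Das"): 0.7, ("house", "Haus"): 0.8,     # values read off the slide's alignment
     ("is", "ist"): 0.8, ("small", "klein"): 0.4}

def p_e_a_given_f(english, foreign, alignment):
    """p(e, a | f) for IBM Model 1: uniform alignment prior times lexical probabilities."""
    l_e, l_f = len(english), len(foreign)
    prob = 1.0 / (l_f + 1) ** l_e                     # (l_f + 1)^l_e possible alignments (incl. NULL)
    for j, e_word in enumerate(english):
        prob *= t[(e_word, foreign[alignment[j]])]    # alignment: English position -> foreign position
    return prob

e = "the house is small".split()
f = "Das Haus ist klein".split()
print(p_e_a_given_f(e, f, {0: 0, 1: 1, 2: 2, 3: 3}))  # ≈ 0.00029
```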
IBM Model 1

- Translation probability with alignment:

    p(e, a | f) = 1/(l_f + 1)^{l_e} · ∏_{j=1}^{l_e} t(e_j | f_{a(j)})

- (Overall) translation probability:

    p(e | f) = ∑_a p(e, a | f)

- Problem: requires summing over all alignments
  - e.g., for l_e = l_f = 10 this is (10 + 1)^10 = 25,937,424,601 alignments (Penn Treebank sentences average somewhere near 27 words).

35
IBM Model 1
- Luckily, we can get around this (using some basic math).
- (Overall) translation probability:

    p(e | f) = ∑_a p(e, a | f)
             = ∑_{a(1)=0}^{l_f} ... ∑_{a(l_e)=0}^{l_f} p(e, a | f)
             = ∑_{a(1)=0}^{l_f} ... ∑_{a(l_e)=0}^{l_f} 1/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} t(e_j | f_{a(j)})
             = 1/(l_f + 1)^{l_e} ∑_{a(1)=0}^{l_f} ... ∑_{a(l_e)=0}^{l_f} ∏_{j=1}^{l_e} t(e_j | f_{a(j)})
             = 1/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} ∑_{i=0}^{l_f} t(e_j | f_i)

36
IBM Model 1

- Luckily, we can get around this (using some basic math).
- (Overall) translation probability:

    p(e | f) = ∑_a p(e, a | f)
             = 1/(l_f + 1)^{l_e} ∏_{j=1}^{l_e} ∑_{i=0}^{l_f} t(e_j | f_i)

- e = my friend, f = mein freund (without NULL):

    p(my friend | mein freund) = ((t(my | mein) + t(my | freund)) × (t(friend | mein) + t(friend | freund))) / 2²

37
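
A small sketch of the rearranged formula above: the exponential sum over alignments collapses into a product of per-word sums. The probabilities below are made-up illustrative values, not estimates from data.

```python
t = {("my", "mein"): 0.8, ("my", "freund"): 0.05,         # illustrative values
     ("friend", "mein"): 0.05, ("friend", "freund"): 0.8}

def p_e_given_f(english, foreign, use_null=False):
    """p(e | f) for IBM Model 1 via the product-of-sums rearrangement."""
    l_e, l_f = len(english), len(foreign)
    prob = 1.0 / ((l_f + 1) ** l_e if use_null else l_f ** l_e)
    for e_word in english:
        prob *= sum(t.get((e_word, f_word), 0.0) for f_word in foreign)
    return prob

# ((0.8 + 0.05) * (0.05 + 0.8)) / 2^2, matching the slide's expansion
print(p_e_given_f("my friend".split(), "mein freund".split()))
```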
Learning a Model1 aligner

- Requires learning the translation probabilities t(e_i | f_j)
- Maximum Likelihood Estimation (MLE) (with full information):

    t(e_i | f_j) = count(e_i, f_j) / ∑_e count(e, f_j)

- Problem: we don't have full information (i.e., the target alignments)
- Expectation Maximization (EM):
  - Initialize parameters randomly (or uniformly)
  - E-step: run the current model on the data and collect (expected) counts.
  - M-step: update the parameters based on the previous step.
  - Repeat the last two steps until convergence.

38
EM for IBM Model1

[EM pseudocode figure from Koehn (2009)]

39
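
As a rough stand-in for the figure, here is a compact sketch of EM training for Model 1, following the standard recipe described on the previous slide (uniform initialization, fractional counts in the E-step, re-normalization in the M-step). It is an illustration, not the toolkit used in the paper; the toy corpus is assumed.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (english_tokens, foreign_tokens) pairs. Returns t(e | f)."""
    e_vocab = {e for e_sent, _ in corpus for e in e_sent}
    t = defaultdict(lambda: 1.0 / len(e_vocab))          # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                       # expected counts c(e, f)
        total = defaultdict(float)                       # expected counts c(f)
        for e_sent, f_sent in corpus:                    # E-step: collect fractional counts
            for e in e_sent:
                norm = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        for (e, f) in count:                             # M-step: re-estimate t(e | f)
            t[(e, f)] = count[(e, f)] / total[f]
    return t

corpus = [("the house".split(), "das haus".split()),     # toy corpus (assumed)
          ("the book".split(), "das buch".split()),
          ("a book".split(), "ein buch".split())]
t = train_model1(corpus)
print(round(t[("the", "das")], 2))                       # rises toward 1.0 with more iterations
```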
Model1 as a Translation Model

- Word decoding: Model 1 can be used as a translation model.

    p(e | f) = ∑_a p(e, a | f)

- Nowadays, such models are used for extracting alignments, which are the basis of more complex translation models (e.g., our syntax-based model).
- Viterbi alignment: find the most likely alignment for a given pair (easy: for each English word e_i, pick the most likely f_j)

    a_i = argmax_{j ∈ {0, ..., l_f}} t(e_i | f_j)

- K-best alignments: can be extended to extract the top k alignments.

40
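
And a one-function sketch of the Viterbi alignment just described: under Model 1 each English word can be aligned independently, so the argmax is taken per word. The NULL word is ignored here for brevity, and the translation-table values are assumed.

```python
t = {("the", "das"): 0.9, ("the", "haus"): 0.1,       # assumed translation table, e.g. the
     ("house", "das"): 0.2, ("house", "haus"): 0.8}   # output of an EM run like the sketch above

def viterbi_alignment(e_sent, f_sent, t):
    """For each English position i, pick the foreign position j maximizing t(e_i | f_j)."""
    return {i: max(range(len(f_sent)), key=lambda j: t.get((e, f_sent[j]), 0.0))
            for i, e in enumerate(e_sent)}

print(viterbi_alignment("the house".split(), "das haus".split(), t))   # {0: 0, 1: 1}
```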
Other IBM Models
- IBM Models 2-5: go beyond using only the lexical translation probabilities.

    the₁ man wearing the₂ coat
    the  person with  the  jacket

- IBM Model 2: adds an alignment probability distribution a(i | j, l_e, l_f), which considers relative word position and sentence length:

    p(e, a | f) = ∏_{j=1}^{l_e} t(e_j | f_{a(j)}) · a(a(j) | j, l_e, l_f)

- Model 1: t(e_4 | f_1) = t(e_4 | f_4)
- Model 2: t(e_4 | f_1) · a(1 | 4, 5, 5) < t(e_4 | f_4) · a(4 | 4, 5, 5)

41
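
A tiny numeric illustration of the Model 2 comparison above: with identical lexical probabilities for the two occurrences of "the", the distortion term a(i | j, l_e, l_f) prefers the positionally closer alignment. All numbers are made up for illustration.

```python
t_the = 0.4                                    # assumed t(the | the), identical for f_1 and f_4
a = {(1, 4, 5, 5): 0.05, (4, 4, 5, 5): 0.6}    # assumed distortion probabilities a(i | j, l_e, l_f)

score_far  = t_the * a[(1, 4, 5, 5)]           # align e_4 ("the") to f_1 (the distant "the")
score_near = t_the * a[(4, 4, 5, 5)]           # align e_4 ("the") to f_4 (the nearby "the")
print(score_far < score_near)                  # True: Model 2 prefers the nearby alignment
```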
Other IBM Models

- Model 1: lexical translation probabilities, bag-of-words.
- Model 2: alignment probability distribution a(i | j, l_e, l_f)
- Model 3: fertility distribution n(φ | f), a distribution over the number of words each f_j typically translates to.

    n(1 | haus) = 1.0, n(2 | klitzeklein) = 1.0, ...

- Model 4: relative distortion, word classes.
- Model 5: fixes the deficiency problem.

42
Back to Semantic Parsing: Rule extraction (Wong and Mooney (2006))

- Extraction: train IBM Model 5 over English sentences and sequences of MR productions, and extract rules from the 10-best alignments.
- Important: productions are used instead of MR tokens, which allows skipping pieces without meaning.

43
Rule extraction (Wong and Mooney (2006))

- Extraction: bottom-up (as done last week), starting from alignments with terminal symbols, then working up to more complex rules.
- Alignments where the RHS of the production rule is an MR terminal:
  - TEAM → ⟨our, our⟩, UNUM → ⟨4, 4⟩, ...
- Move on to more complex rules (adjusted to account for sub-patterns; skipped words are written as (num)):
  - COND. → ⟨TEAM_1 player UNUM_2 has (1) ball , (bowner TEAM_1 {UNUM_2})⟩

44
Similar methods: Hiero rule extraction

- A specialized version of methods used for other types of syntax-based decoding, e.g., hierarchical phrase-based translation (Chiang (2005)).
- Does not require syntactic rules or analyses; it learns them from scratch.

    30 duonianlai de youhao hezuo
    friendly cooperation over the last 30 years

    X1 → ⟨30 , 30⟩
    X2 → ⟨friendly cooperation , youhao hezuo⟩
    X3 → ⟨over the last X1 years , X1 duonianlai⟩
    X4 → ⟨X2 X3 , X3 X2⟩

45
Extension to logical variables

- So far, the approach has been used on functional representations.
- λ-WASP (Wong and Mooney (2007)) extends rule extraction to handle logical and lambda variables, with rules of the type:

    A → ⟨α , λx_1, ..., λx_n. β⟩

  form → smallest(x_2, (form, form))    form → state(x_1)    form → area(x_1, x_2)

    smallest state by area

  form → ⟨state , λx_1. state(x)⟩
  form → ⟨by area , λx_1.λy_2. area(x, y)⟩
  form → ⟨smallest form_1 form_2 , λx_1. smallest(x_2, (form_1(x_1), form_2(x_1, x_2)))⟩

46
Probabilistic Model

- Lexical/rule induction: over-generates, leading to many derivations.
- Extend the SCFG to a weighted SCFG (the synchronous analogue of the PCFG), which defines a probability distribution over derivations.
- Goal: discriminate between different derivations, and find an output translation f* where

    f* = m(argmax_{d ∈ D(G|e)} Pr_λ(d | e))

- D(G | e): the set of derivations given an English input e.
  - Computed using dynamic programming and something close to the inside-outside algorithm (last week).
- Pr_λ(d | e): a log-linear model trained on example derivations (more on this next week).

47
Overview and Take-aways

- Recasting the semantic parsing problem as an MT task.
  - Synchronous grammars: modeling NL-MR transformations and decoding by parsing.
  - Word-alignment models: basics, extracting semantic grammar transformation rules.
  - Decoding and ranking models: skipped over important details, more about this next week.
- Further directions
  - Different tree-based translation models (Ehsen), more powerful translation models (Mariia)
  - Different rule extraction techniques: Li et al. (2013)

48
Roadmap

- Lecture 3 (today): rule extraction, decoding (MT perspective)
- Lecture 4: structure prediction and classification (missing today).

49
References I
Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The
mathematics of statistical machine translation: Parameter estimation.
Computational linguistics, 19(2):263–311.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine
translation. In Proceedings of the 43rd Annual Meeting on Association for
Computational Linguistics, pages 263–270. Association for Computational
Linguistics.
Chiang, D. and Knight, K. (2006). An introduction to synchronous grammars.
Tutorial available at http://www.isi.edu/chiang/papers/synchtut.pdf.
Kate, R. J., Wong, Y. W., and Mooney, R. J. (2005). Learning to transform natural
to formal languages. In Proceedings of AAAI-2005.
Koehn, P. (2009). Statistical machine translation. Cambridge University Press.
Li, P., Liu, Y., and Sun, M. (2013). An extended ghkm algorithm for inducing
lambda-scfg. In AAAI.
http://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/view/6189.
Wong, Y. W. and Mooney, R. J. (2006). Learning for semantic parsing with statistical
machine translation. In Proceedings of HLT-NAACL-2006, pages 439–446.
http://dl.acm.org/citation.cfm?id=1220891.
Wong, Y. W. and Mooney, R. J. (2007). Learning synchronous grammars for semantic
parsing with lambda calculus. In Proceedings of ACL-2007, Prague, Czech
Republic. http://anthology.aclweb.org/P/P07/P07-1121.pdf.

50
References II

Woods, W. A. (1973). Progress in natural language understanding: an application to


lunar geology. In Proceedings of the June 4-8, 1973, National Computer
Conference and Exposition, pages 441–450.
Zelle, J. M. and Mooney, R. J. (1996). Learning to parse database queries using
inductive logic programming. In Proceedings of AAAI-1996, pages 1050–1055.

51
