Inside-Outside Algorithm
Michael Collins
Introduction
This note describes the inside-outside algorithm. The inside-outside algorithm has
very important applications to statistical models based on context-free grammars.
In particular, it is used in EM estimation of probabilistic context-free grammars,
and it is used in estimation of discriminative models for context-free parsing.
As we will see, the inside-outside algorithm has many similarities to the forward-backward algorithm for hidden Markov models. It computes analogous quantities
to the forward and backward terms, for context-free trees.
Basic Definitions
This section gives some basic definitions. We first give definitions for context-free grammars, and for a representation of parse trees. We then describe potential
functions over parse trees. The next section describes the quantities computed by
the inside-outside algorithm, and the algorithm itself.
The previous class note on PCFGs (posted on the webpage) has full details of
context-free grammars. For the input sentence x1 . . . xn , the CFG defines a set of
possible parse trees, which we will denote as T .
Any parse tree t ∈ T can be represented as a set of rule productions. Each
rule production can take one of two forms:

• ⟨A → B C, i, k, j⟩ where A → B C is a rule in the grammar, and i, k, j
are indices such that 1 ≤ i ≤ k < j ≤ n. A rule production of this form
specifies that the rule A → B C is seen with non-terminal A spanning words
xi . . . xj in the input string; non-terminal B spanning words xi . . . xk in the
input string; and non-terminal C spanning words xk+1 . . . xj in the input
string.

• ⟨A, i⟩ where A is a non-terminal, and i is an index with i ∈ {1, 2, . . . , n}. A
rule production of this form specifies that the rule A → xi is seen in a parse
tree, with A above the ith word in the input string.
As an example, consider the following parse tree:
[Parse tree figure: a tree over a four-word sentence x1 . . . x4 with rules S → NP VP, NP → D N, VP → V P, and with D, N, V, P above the words x1, x2, x3, x4 respectively. Its rule productions are ⟨S → NP VP, 1, 2, 4⟩, ⟨NP → D N, 1, 1, 2⟩, ⟨VP → V P, 3, 3, 4⟩, ⟨D, 1⟩, ⟨N, 2⟩, ⟨V, 3⟩, ⟨P, 4⟩.]
For each rule production r in a parse tree t we assume a potential ψ(r) ≥ 0: we write
ψ(A → B C, i, k, j) for the potential of a rule production ⟨A → B C, i, k, j⟩ ∈ t, and
ψ(A, i) for the potential of a rule production ⟨A, i⟩ ∈ t. As a first example, for a PCFG
with rule probabilities q we can define

ψ(A → B C, i, k, j) = q(A → B C)        ψ(A, i) = q(A → xi)

Under these definitions, for any tree t, the potential

ψ(t) = ∏_{r∈t} ψ(r)

is simply the probability for that parse tree under the PCFG.
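As a concrete (if toy) illustration, the following Python sketch stores a parse tree as a set of rule productions and computes its potential under a PCFG. The grammar, rule probabilities, and sentence here are made up for illustration; only the representation and the product ∏_{r∈t} ψ(r) follow the definitions above.

```python
# A toy sketch: a parse tree as a set of rule productions, and its potential under a PCFG.
# The rule probabilities q and the sentence are invented for illustration.

q_binary = {("S", "NP", "VP"): 1.0, ("NP", "D", "N"): 1.0, ("VP", "V", "P"): 1.0}
q_unary = {("D", "the"): 1.0, ("N", "dog"): 1.0, ("V", "ran"): 0.5, ("P", "off"): 1.0}

words = ["the", "dog", "ran", "off"]          # x_1 ... x_4 (made-up sentence)

# Rule productions for the tree S -> NP VP, NP -> D N, VP -> V P over x_1 ... x_4.
binary_productions = [("S", "NP", "VP", 1, 2, 4),
                      ("NP", "D", "N", 1, 1, 2),
                      ("VP", "V", "P", 3, 3, 4)]
unary_productions = [("D", 1), ("N", 2), ("V", 3), ("P", 4)]

def tree_potential():
    """psi(t) = product over rule productions r in t of psi(r).
    Under a PCFG, psi(<A -> B C, i, k, j>) = q(A -> B C) and psi(<A, i>) = q(A -> x_i),
    so psi(t) is exactly the probability of the tree."""
    psi = 1.0
    for (A, B, C, i, k, j) in binary_productions:
        psi *= q_binary[(A, B, C)]
    for (A, i) in unary_productions:
        psi *= q_unary[(A, words[i - 1])]     # the note indexes words from 1
    return psi

print(tree_potential())                        # 0.5 with the made-up values above
```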
As a second example, consider a conditional random field (CRF) style model
for parsing with CFGs (see the lecture slides from earlier in the course). In this
case each rule production r has a feature vector φ(r) ∈ R^d, and in addition we
assume a parameter vector v ∈ R^d. We can then define the potential functions as

ψ(r) = exp{v · φ(r)}
The potential function for an entire tree is then
ψ(t) = ∏_{r∈t} ψ(r) = ∏_{r∈t} exp{v · φ(r)} = exp{ Σ_{r∈t} v · φ(r) }
Note that this is closely related to the distribution defined by a CRF-style model:
in particular, under the CRF we have for any tree t
p(t | x1 . . . xn) = ψ(t) / Σ_{t′∈T} ψ(t′)
where T again denotes the set of all parse trees for x1 . . . xn under the CFG.
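The following small sketch shows one way the CRF-style potentials might be implemented; the feature map φ and the parameter values here are invented for illustration and are not part of the note.

```python
import numpy as np

v = np.array([0.5, -1.0, 0.2])          # parameter vector v in R^d (made-up values, d = 3)

def phi(r):
    """A made-up feature vector phi(r) in R^d for a rule production r
    (r is a tuple whose first element is the non-terminal on the left-hand side)."""
    A = r[0]
    return np.array([1.0,                        # bias feature
                     1.0 if A == "NP" else 0.0,  # indicator of the left-hand-side label
                     float(len(r))])             # crude "rule arity" feature

def psi(r):
    """psi(r) = exp{ v . phi(r) }; the potential of a tree is the product of these terms."""
    return float(np.exp(v @ phi(r)))

# For example, psi(("NP", "D", "N", 1, 1, 2)) and psi(("D", 1)) are both well defined.
```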
The inside-outside algorithm computes the following quantities for the input
sentence x1 . . . xn:

1. Z = Σ_{t∈T} ψ(t)

2. μ(r) = Σ_{t∈T : r∈t} ψ(t) for each rule production r

3. μ(A, i, j) = Σ_{t∈T : (A,i,j)∈t} ψ(t) for each non-terminal A and each pair of indices
i, j with 1 ≤ i ≤ j ≤ n
Here we write (A, i, j) ∈ t if the parse tree t contains the non-terminal A spanning words xi . . . xj in the input. For example, in the example parse tree
given before, the following (A, i, j) triples are seen in the tree: ⟨S, 1, 4⟩;
⟨NP, 1, 2⟩; ⟨VP, 3, 4⟩; ⟨D, 1, 1⟩; ⟨N, 2, 2⟩; ⟨V, 3, 3⟩; ⟨P, 4, 4⟩.
Note that there is a close correspondence between these terms, and the terms
computed by the forward-backward algorithm (see the previous notes).
In words, the quantity Z is the sum of potentials for all possible parse trees
for the input x1 . . . xn. The quantity μ(r) for any rule production r is the sum of
potentials for all parse trees that contain the rule production r. Finally, the quantity
μ(A, i, j) is the sum of potentials for all parse trees containing non-terminal A
spanning words xi . . . xj in the input.
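It can be useful to see these three quantities written down as brute-force computations before looking at the dynamic-programming algorithm. The sketch below (the function names, the grammar encoding, and the `psi` argument are assumptions of this sketch, not notation from the note) enumerates every parse tree explicitly, and therefore takes exponential time; it serves only as a reference specification for Z, μ(r), and μ(A, i, j).

```python
import math

def all_trees(A, i, j, binary_rules, lexicon, words):
    """Enumerate every tree rooted in non-terminal A spanning words i..j (1-indexed).
    Each tree is a frozenset of rule productions, written as in the note:
    (A, B, C, i, k, j) for <A -> B C, i, k, j> and (A, i) for <A, i>.
    binary_rules maps A to a list of (B, C) pairs; lexicon is a set of (A, word) pairs."""
    if i == j:
        if (A, words[i - 1]) in lexicon:
            yield frozenset({(A, i)})
        return
    for (B, C) in binary_rules.get(A, []):
        for k in range(i, j):
            for left in all_trees(B, i, k, binary_rules, lexicon, words):
                for right in all_trees(C, k + 1, j, binary_rules, lexicon, words):
                    yield left | right | {(A, B, C, i, k, j)}

def brute_force_quantities(S, words, binary_rules, lexicon, psi):
    """Z, mu(r) and mu(A, i, j), by explicit enumeration over all trees in T."""
    n = len(words)
    trees = list(all_trees(S, 1, n, binary_rules, lexicon, words))
    tree_psi = {t: math.prod(psi(r) for r in t) for t in trees}
    Z = sum(tree_psi.values())
    mu_rule, mu_span = {}, {}
    for t, value in tree_psi.items():
        for r in t:                                      # mu(r): trees containing production r
            mu_rule[r] = mu_rule.get(r, 0.0) + value
        spans = set()
        for r in t:                                      # (A, i, j) triples contained in t
            if len(r) == 6:
                spans.add((r[0], r[3], r[5]))
            else:
                spans.add((r[0], r[1], r[1]))
        for key in spans:                                # mu(A, i, j): trees containing (A, i, j)
            mu_span[key] = mu_span.get(key, 0.0) + value
    return Z, mu_rule, mu_span
```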
We will soon see how these calculations can be applied within a particular
context, namely EM-based estimation of the parameters of a PCFG. First, however,
we give the algorithm.
Inputs: a sentence x1 . . . xn, a CFG (N, Σ, R, S), and potential functions ψ(A → B C, i, k, j)
and ψ(A, i).

Data structures: inside terms α(A, i, j) and outside terms β(A, i, j), for all A ∈ N and all
(i, j) with 1 ≤ i ≤ j ≤ n.

Inside terms, base case: for all A ∈ N, for all i ∈ {1 . . . n},

α(A, i, i) = ψ(A, i)

Inside terms, recursive case: for all A ∈ N, for all (i, j) with 1 ≤ i < j ≤ n,

α(A, i, j) = Σ_{A→B C ∈ R} Σ_{k=i}^{j-1} ψ(A → B C, i, k, j) × α(B, i, k) × α(C, k + 1, j)

Outside terms, base case: β(S, 1, n) = 1, and β(A, 1, n) = 0 for all A ≠ S.

Outside terms, recursive case: for all A ∈ N, for all (i, j) with 1 ≤ i ≤ j ≤ n and (i, j) ≠ (1, n),

β(A, i, j) = Σ_{B→C A ∈ R} Σ_{k=1}^{i-1} ψ(B → C A, k, i - 1, j) × β(B, k, j) × α(C, k, i - 1)
           + Σ_{B→A C ∈ R} Σ_{k=j+1}^{n} ψ(B → A C, i, j, k) × β(B, i, k) × α(C, j + 1, k)

Outputs: Return

Z = α(S, 1, n)

μ(A, i, j) = α(A, i, j) × β(A, i, j)

μ(A, i) = μ(A, i, i)

μ(A → B C, i, k, j) = β(A, i, j) × ψ(A → B C, i, k, j) × α(B, i, k) × α(C, k + 1, j)
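The recursions above translate fairly directly into code. The sketch below is one possible Python implementation; the argument names (`binary_rules` as a list of (A, B, C) triples, `psi_rule` and `psi_word` for the two potential functions, `nonterminals` for N) are assumptions of this sketch rather than notation from the note.

```python
from collections import defaultdict

def inside_outside(n, S, nonterminals, binary_rules, psi_rule, psi_word):
    """Compute inside terms alpha(A, i, j), outside terms beta(A, i, j), and Z,
    for a sentence of length n (words indexed 1..n)."""
    alpha = defaultdict(float)
    beta = defaultdict(float)

    # Inside, base case: alpha(A, i, i) = psi(A, i).
    for i in range(1, n + 1):
        for A in nonterminals:
            alpha[(A, i, i)] = psi_word(A, i)

    # Inside, recursive case: filled in by increasing span width.
    for width in range(1, n):
        for i in range(1, n - width + 1):
            j = i + width
            for (A, B, C) in binary_rules:
                for k in range(i, j):
                    alpha[(A, i, j)] += (psi_rule(A, B, C, i, k, j)
                                         * alpha[(B, i, k)] * alpha[(C, k + 1, j)])

    # Outside, base case: beta(S, 1, n) = 1; beta(A, 1, n) = 0 for A != S.
    beta[(S, 1, n)] = 1.0

    # Outside, recursive case: filled in by decreasing span width.
    for width in range(n - 2, -1, -1):
        for i in range(1, n - width + 1):
            j = i + width
            for A in nonterminals:
                total = 0.0
                for (B, C, D) in binary_rules:
                    # A as the right child: B -> C A, with B spanning (k, j).
                    if D == A:
                        for k in range(1, i):
                            total += (psi_rule(B, C, A, k, i - 1, j)
                                      * beta[(B, k, j)] * alpha[(C, k, i - 1)])
                    # A as the left child: B -> A C, with B spanning (i, k).
                    if C == A:
                        for k in range(j + 1, n + 1):
                            total += (psi_rule(B, A, D, i, j, k)
                                      * beta[(B, i, k)] * alpha[(D, j + 1, k)])
                beta[(A, i, j)] = total

    # Z, and the mu terms in the outputs above, follow directly from alpha and beta:
    # Z = alpha[(S, 1, n)], mu(A, i, j) = alpha[(A, i, j)] * beta[(A, i, j)], and so on.
    return alpha, beta, alpha[(S, 1, n)]
```

Filling the inside table by increasing span width and the outside table by decreasing span width ensures that every term on the right-hand side of each recursion has already been computed.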
3.3.1 The inside terms

Take x1 . . . xn to be the input to the inside-outside algorithm. For any non-terminal
A ∈ N, for any (i, j) such that 1 ≤ i ≤ j ≤ n, define

T(A, i, j)

to be the set of all possible trees rooted in non-terminal A, and spanning words
xi . . . xj in the sentence. Note that under this definition, T = T(S, 1, n) (the full
set of parse trees for the input sentence is equal to the full set of trees rooted in the
symbol S, spanning words x1 . . . xn).
As an example, for the input sentence the dog saw the man in the park, under
an appropriate CFG, one member of T (NP, 4, 8) would be
[Parse tree figure: NP → NP PP, where the first NP → D N spans the man, and PP → IN NP, with IN → in and the final NP → D N spanning the park.]
The set T (NP, 4, 8) would be the set of all possible parse trees rooted in NP,
spanning words x4 . . . x8 = the man in the park.
Each t ∈ T(A, i, j) has an associated potential, defined in the same way as
before as

ψ(t) = ∏_{r∈t} ψ(r)
We now claim the following: consider the α(A, i, j) terms calculated in the
inside-outside algorithm. Then

α(A, i, j) = Σ_{t∈T(A,i,j)} ψ(t)

Thus the inside term α(A, i, j) is simply the sum of potentials for all trees spanning
words xi . . . xj, rooted in the symbol A.
3.3.2 The outside terms
Again, take x1 . . . xn to be the input to the inside-outside algorithm. Now, for any
non-terminal A, for any (i, j) such that 1 ≤ i ≤ j ≤ n, define

O(A, i, j)

to be the set of all outside trees with non-terminal A, and span xi . . . xj.
To illustrate the idea of an outside tree, again consider an example where the
input sentence is the dog saw the man in the park. Under an appropriate CFG, one
member of O(NP, 4, 5) would be
[Outside tree figure: S → NP VP; the left NP → D N spans the dog; VP → V NP with V → saw; the object NP → NP PP, where the first NP is left as a leaf (this is the non-terminal NP with span x4 x5), and PP → IN NP with IN → in and NP → D N spanning the park.]
This tree is rooted in the symbol S. The leaves of the tree form the sequence
x1 . . . x3 NP x6 . . . xn.
More generally, an outside tree for non-terminal A, with span xi . . . xj, is a tree
with the following properties:

• The tree is rooted in the symbol S.

• Each rule in the tree is a valid rule in the underlying CFG (e.g., S → NP VP,
NP → D N, D → the, etc.)

• The leaves of the tree form the sequence x1 . . . xi-1 A xj+1 . . . xn.
Each outside tree t again has an associated potential, equal to

ψ(t) = ∏_{r∈t} ψ(r)
We simply read off the rule productions in the outside tree, and take their product.
Again, recall that we defined O(A, i, j) to be the set of all possible outside
trees with non-terminal A and span xi . . . xj . We now make the following claim.
Consider the β(A, i, j) terms calculated by the inside-outside algorithm. Then

β(A, i, j) = Σ_{t∈O(A,i,j)} ψ(t)

In words, the outside term β(A, i, j) is the sum of potentials for all outside trees
in the set O(A, i, j).
3.3.3 Justifying the Z and μ terms
We now give justification for the Z and μ terms calculated by the algorithm. First,
consider Z. Recall that we would like to compute

Z = Σ_{t∈T} ψ(t)

But T = T(S, 1, n), and we have just seen that α(S, 1, n) = Σ_{t∈T(S,1,n)} ψ(t); hence
Z = α(S, 1, n), which is exactly the value returned by the algorithm.

Next, consider the μ(A, i, j) terms. Recall that we would like to compute

μ(A, i, j) = Σ_{t∈T : (A,i,j)∈t} ψ(t)
[Figure: a parse tree for the dog saw the man in the park that contains the non-terminal NP spanning words x4 x5 = the man, shown together with its decomposition into an outside tree in O(NP, 4, 5) (the tree above and around the NP node) and an inside tree in T(NP, 4, 5) (the subtree NP → D N spanning the man).]
It follows that if we denote the outside tree by t1 , the inside tree by t2 , and the
full tree by t, we have
ψ(t) = ψ(t1) × ψ(t2)
More generally, we have

μ(A, i, j) = Σ_{t∈T : (A,i,j)∈t} ψ(t)                                        (1)

           = Σ_{t1∈O(A,i,j)} Σ_{t2∈T(A,i,j)} ψ(t1) × ψ(t2)                   (2)

           = ( Σ_{t1∈O(A,i,j)} ψ(t1) ) × ( Σ_{t2∈T(A,i,j)} ψ(t2) )           (3)

           = β(A, i, j) × α(A, i, j)                                          (4)
Eq. 1 follows by definition. Eq. 2 follows because any tree t with non-terminal A
spanning xi . . . xj can be decomposed into a pair (t1, t2) where t1 ∈ O(A, i, j),
and t2 ∈ T(A, i, j). Eq. 3 follows by simple algebra. Finally, Eq. 4 follows by the
definitions of α(A, i, j) and β(A, i, j).
A similar argument can be used to justify computing
μ(r) = Σ_{t∈T : r∈t} ψ(t)

as

μ(A, i) = μ(A, i, i)

μ(A → B C, i, k, j) = β(A, i, j) × ψ(A → B C, i, k, j) × α(B, i, k) × α(C, k + 1, j)
For brevity the details are omitted.
The EM Algorithm for PCFGs

We now describe how the inside-outside algorithm is used within EM-based
estimation of the parameters of a PCFG. The input to the algorithm is a set of
training examples, where each training example is a sentence and each xj within a
sentence is a word. The output from the algorithm is a parameter
q(r) for each rule r in the CFG.
The algorithm starts with initial parameters q^0(r) for each rule r (for example
these parameters could be chosen to be random values). As is usual in EM-based
algorithms, the algorithm defines a sequence of parameter settings q^1, q^2, . . . , q^T,
where T is the number of iterations of the algorithm.
The parameters q^t at the tth iteration are calculated as follows. In a first step,
the inside-outside algorithm is used to calculate expected counts f(r) for each rule
r in the PCFG, under the parameter values q^{t-1}. Once the expected counts are
calculated, the new estimates are

q^t(A → γ) = f(A → γ) / Σ_{A→γ′ ∈ R} f(A → γ′)
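As a sketch in code, the re-estimation step is a single normalization over rules with the same left-hand side; here `f` is assumed to be a dictionary mapping each rule (a tuple whose first element is its left-hand-side non-terminal) to its expected count.

```python
from collections import defaultdict

def reestimate(f):
    """M-step: q^t(A -> gamma) = f(A -> gamma) / sum over A -> gamma' in R of f(A -> gamma').
    f maps each rule, e.g. ("S", "NP", "VP") or ("D", "the"), to its expected count."""
    totals = defaultdict(float)
    for rule, count in f.items():
        totals[rule[0]] += count                       # total expected count for left-hand side A
    return {rule: count / totals[rule[0]]
            for rule, count in f.items() if totals[rule[0]] > 0.0}
```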
Under a PCFG with parameters θ (the rule probabilities q(r)), the joint probability
of a sentence x and a parse tree t is

p(x, t; θ) = ∏_{r∈R} q(r)^{count(t,r)}

where count(t, r) is the number of times rule r is seen in the tree t.
Given a PCFG, and the ith training sentence x^(i), we can also calculate the conditional
probability

p(t | x^(i); θ) = p(x^(i), t; θ) / Σ_{t′∈Ti} p(x^(i), t′; θ)

of any t ∈ Ti, where Ti is the set of all parse trees for x^(i) under the CFG.
Given these definitions, we will show that the expected count f^{t-1}(r) for any
rule r, as calculated in the tth iteration of the EM algorithm, is

f^{t-1}(r) = Σ_{i=1}^{n} Σ_{t∈Ti} p(t | x^(i); θ^{t-1}) count(t, r)

Thus we sum over all training examples (i = 1 . . . n), and for each training example,
we sum over all parse trees t ∈ Ti for that training example. For each parse tree
t, we multiply the conditional probability p(t | x^(i); θ^{t-1}) by the count count(t, r),
which is the number of times rule r is seen in the tree t.
Consider calculating the expected count of any rule on a single training example;
that is, calculating

count(r) = Σ_{t∈Ti} p(t | x^(i); θ^{t-1}) count(t, r)                         (5)
Clearly, calculating this quantity by brute force (by explicitly enumerating all trees
t ∈ Ti) is not tractable. However, the count(r) quantities can be calculated efficiently,
using the inside-outside algorithm. Figure 3 shows the algorithm. The
algorithm takes as input a sentence x1 . . . xn, a CFG, and a parameter q^{t-1}(r) for
each rule r in the grammar. In a first step the μ and Z terms are calculated using
the inside-outside algorithm. In a second step the counts are calculated based on
the μ and Z terms. For example, for any rule of the form A → B C, we have
count(A → B C) = Σ_{i,k,j} μ(A → B C, i, k, j) / Z                            (6)

where μ and Z are terms calculated by the inside-outside algorithm, and the sum
is over all i, k, j such that 1 ≤ i ≤ k < j ≤ n.
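In code, once the inside-outside algorithm has produced the μ and Z terms, the expected counts of Eq. 6 (and the analogous counts for rules of the form A → x, given at the end of the note) are simple sums. The dictionary names below, `mu_rule` keyed by (A, B, C, i, k, j) and `mu_word` keyed by (A, i), are assumptions of this sketch.

```python
from collections import defaultdict

def expected_counts(mu_rule, mu_word, Z, words):
    """count(A -> B C) = sum over i, k, j of mu(A -> B C, i, k, j) / Z, and
    count(A -> x)     = sum over i with x_i = x of mu(A, i) / Z."""
    counts = defaultdict(float)
    for (A, B, C, i, k, j), value in mu_rule.items():
        counts[(A, B, C)] += value / Z
    for (A, i), value in mu_word.items():
        counts[(A, words[i - 1])] += value / Z         # words are indexed from 1 in the note
    return counts
```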
The equivalence between the definitions in Eqs. 5 and 6 can be justified as
follows. First, note that
count(t, A → B C) = Σ_{i,k,j} [[⟨A → B C, i, k, j⟩ ∈ t]]

where [[π]] is 1 if the statement π is true, and 0 otherwise.
Hence

Σ_{t∈Ti} p(t | x^(i); θ^{t-1}) count(t, A → B C)
    = Σ_{t∈Ti} p(t | x^(i); θ^{t-1}) Σ_{i,k,j} [[⟨A → B C, i, k, j⟩ ∈ t]]
    = Σ_{i,k,j} Σ_{t∈Ti} p(t | x^(i); θ^{t-1}) [[⟨A → B C, i, k, j⟩ ∈ t]]
    = Σ_{i,k,j} μ(A → B C, i, k, j) / Z
The final equality follows because if we define the potential functions in the inside-outside
algorithm as

ψ(A → B C, i, k, j) = q^{t-1}(A → B C)
ψ(A, i) = q^{t-1}(A → xi)

then it can be verified that

Σ_{t∈Ti} p(t | x^(i); θ^{t-1}) [[⟨A → B C, i, k, j⟩ ∈ t]] = μ(A → B C, i, k, j) / Z
Inputs: Training examples x^(i) for i = 1 . . . n, where each x^(i) is a sentence with words xj for
j ∈ {1 . . . li} (li is the length of the ith sentence). A CFG (N, Σ, R, S).

Initialization: Choose some initial PCFG parameters q^0(r) for each r ∈ R (e.g., initialize the
parameters to random values). The initial parameters must satisfy the usual constraints that q(r) ≥ 0
and, for any A ∈ N, Σ_{A→γ ∈ R} q(A → γ) = 1.

Algorithm:

For t = 1 . . . T
    For all r ∈ R, set f^{t-1}(r) = 0
    For i = 1 . . . n
        Use the algorithm in figure 3 with inputs equal to the sentence x^(i), the CFG
        (N, Σ, R, S), and parameters q^{t-1}, to calculate count(r) for each r ∈ R. Set
            f^{t-1}(r) = f^{t-1}(r) + count(r)
        for all r ∈ R.
    Re-estimate the parameters as
        q^t(A → γ) = f^{t-1}(A → γ) / Σ_{A→γ′ ∈ R} f^{t-1}(A → γ′)
    for each rule A → γ ∈ R.
For each rule of the form A → B C,

count(A → B C) = Σ_{i,k,j} μ(A → B C, i, k, j) / Z

where μ and Z are terms calculated by the inside-outside algorithm, and the sum is over all
i, k, j such that 1 ≤ i ≤ k < j ≤ n.

For each rule of the form A → x,

count(A → x) = Σ_{i : xi = x} μ(A, i) / Z

Figure 3: Calculation of the expected counts count(r) for a single sentence, using the μ and Z
terms computed by the inside-outside algorithm.
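Putting the pieces together, here is a sketch of the full EM loop, reusing the hypothetical helpers sketched earlier (`inside_outside`, `expected_counts`, `reestimate`); as before, the data-structure choices (rules as tuples, parameters in a single dictionary `q`) are assumptions of this sketch rather than anything prescribed by the note.

```python
from collections import defaultdict

def em_pcfg(sentences, nonterminals, binary_rules, S, q0, T):
    """EM estimation of PCFG parameters, following the algorithm above.
    sentences: list of word lists; binary_rules: list of (A, B, C) triples;
    q0: initial probabilities keyed by (A, B, C) or (A, word); T: number of iterations."""
    q = dict(q0)
    for t in range(1, T + 1):
        f = defaultdict(float)                                    # expected counts f^{t-1}(r)
        for words in sentences:
            n = len(words)
            # Potentials are set from the current parameters q^{t-1}.
            psi_rule = lambda A, B, C, i, k, j: q.get((A, B, C), 0.0)
            psi_word = lambda A, i: q.get((A, words[i - 1]), 0.0)
            alpha, beta, Z = inside_outside(n, S, nonterminals, binary_rules,
                                            psi_rule, psi_word)
            if Z == 0.0:                                          # no parse under current grammar
                continue
            # mu terms, as in the outputs of the inside-outside algorithm.
            mu_rule = {}
            for (A, B, C) in binary_rules:
                for i in range(1, n + 1):
                    for j in range(i + 1, n + 1):
                        for k in range(i, j):
                            mu_rule[(A, B, C, i, k, j)] = (beta[(A, i, j)]
                                                           * psi_rule(A, B, C, i, k, j)
                                                           * alpha[(B, i, k)]
                                                           * alpha[(C, k + 1, j)])
            mu_word = {(A, i): alpha[(A, i, i)] * beta[(A, i, i)]
                       for A in nonterminals for i in range(1, n + 1)}
            for rule, c in expected_counts(mu_rule, mu_word, Z, words).items():
                f[rule] += c
        q = reestimate(f)                                         # M-step
    return q
```

Each iteration runs the inside-outside algorithm once per training sentence, accumulates the expected counts, and then re-normalizes them per left-hand side, exactly as in the algorithm box above.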