Qualitative and Quantitative Models of Speech Translation

Hiyan Alshawi
AT&T Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974, USA
hiyan@research.att.com

Abstract

This paper compares a qualitative reasoning model of translation with a quantitative statistical model. We consider these models within the context of two hypothetical speech translation systems, starting with a logic-based design and pointing out which of its characteristics are best preserved or eliminated in moving to the second, quantitative design. The quantitative language and translation models are based on relations between lexical heads of phrases. Statistical parameters for structural dependency, lexical transfer, and linear order are used to select a set of implicit relations between words in a source utterance, a corresponding set of relations between target language words, and the most likely translation of the original utterance.

1. Introduction

In recent years there has been a resurgence of interest in statistical approaches to natural language processing. Such approaches are not new, witness the statistical approach to machine translation suggested by Weaver (1955), but the current level of interest is largely due to the success of applying hidden Markov models and N-gram language models in speech recognition. This success was directly measurable in terms of word recognition error rates, prompting language processing researchers to seek corresponding improvements in performance and robustness. A speech translation system, which by necessity combines speech and language technology, is a natural place to consider combining the statistical and conventional approaches, and much of this paper describes probabilistic models of structural language analysis and translation. Our aim will be to provide an overall model for translation with the best of both worlds. Various factors will lead us to conclude that a lexicalist statistical model with dependency relations is well suited to this goal.

As well as this quantitative approach, we will consider a constraint/logic based approach and try to distinguish characteristics that we wish to preserve from those that are best replaced by statistical models. Although perhaps implicit in many conventional approaches to translation, a characterization in logical terms of what is being done is rarely given, so we will attempt to make that explicit here, more or less from first principles. Before proceeding, I will first examine some fashionable distinctions in section 2 in order to clarify the issues involved in comparing these approaches. I will attempt to argue that the important distinction is not so much a rational-empirical or symbolic-statistical distinction but rather a qualitative-quantitative one. This is followed by discussion of the logic-based model in section 3, the overall quantitative model in section 4, monolingual models in section 5, translation models in section 6, and some conclusions in section 7. We concentrate throughout on what information about language and translation is coded and how it is expressed as logical constraints or statistical parameters. Although important, we will say little about search algorithms, rule acquisition, or parameter estimation.

2. Qualitative and Quantitative Models

One contrast often taken for granted is the identification of a 'statistical-symbolic' distinction in language processing as an instance of the empirical vs. rational debate. I believe this contrast has been exaggerated, though historically it has had some validity in terms of accepted practice. Rule based approaches have become more empirical in a number of ways. First, a more empirical approach is being adopted to grammar development whereby the rule set is modified according to its performance against corpora of natural text (e.g. Taylor, Grover and Briscoe 1989). Second, there is a class of techniques for learning rules from text, a recent example being Brill 1993. Conversely, it is possible to imagine building a language model in which all probabilities are estimated according to intuition without reference to any real data, giving a probabilistic model that is not empirical.
Most language processing labeled as statistical involves associating real-number valued parameters with configurations of symbols. This is not surprising given that natural language, at least in written form, is explicitly symbolic. Presumably, classifying a system as symbolic must refer to a different set of (internal) symbols, but even this does not rule out many statistical systems modeling events involving nonterminal categories and word senses. Given that the notion of a symbol, let alone an 'internal symbol', is itself a slippery one, it may be unwise to build our theories of language, or even the way we classify different theories, on this notion.

Instead, it would seem that the real contrast driving the shift towards statistics in language processing is a contrast between qualitative systems dealing exclusively with combinatoric constraints, and quantitative systems that involve computing numerical functions. This bears directly on the problems of brittleness and complexity that discrete approaches to language processing share with, for example, reasoning systems based on traditional logical inference. It relates to the inadequacy of the dominant theories in linguistics to capture 'shades' of meaning or degrees of acceptability which are often recognized by people outside the field as important inherent properties of natural language. The qualitative-quantitative distinction can also be seen as underlying the difference between classification systems based on feature specifications, as used in unification formalisms (Shieber 1986), and clustering based on a variable degree of granularity (e.g. Pereira, Tishby and Lee 1993). It seems unlikely that these continuously variable aspects of fluent natural language can be captured by a purely combinatoric model. This naturally leads to the question of how best to introduce quantitative modeling into language processing. It is not, of course, necessary for the quantities of a quantitative model to be probabilities. For example, we may wish to define real-valued functions on parse trees that reflect the extent to which the trees conform to, say, minimal attachment and parallelism between conjuncts. Such functions have been used in tandem with statistical functions in experiments on disambiguation (for instance Alshawi and Carter 1994). Another example is connection strengths in neural network approaches to language processing, though it has been shown that certain networks are effectively computing probabilities (Richard and Lippmann 1991).

Nevertheless, probability theory does offer a coherent and relatively well understood framework for selecting between uncertain alternatives, making it a natural choice for quantitative language processing. The case for probability theory is strengthened by a well developed empirical methodology in the form of statistical parameter estimation. There is also the strong connection between probability theory and the formal theory of information and communication, a connection that has been exploited in speech recognition, for example using the concept of entropy to provide a motivated way of measuring the complexity of a recognition problem (Jelinek et al. 1992).

Even if probability theory remains, as it currently is, the method of choice in making language processing quantitative, this still leaves the field wide open in terms of carving up language processing into an appropriate set of events for probability theory to work with. For translation, a very direct approach using parameters based on surface positions of words in source and target sentences was adopted in the Candide system (Brown et al. 1990). However, this does not capture important structural properties of natural language. Nor does it take into account generalizations about translation that are independent of the exact word order in source and target sentences. Such generalizations are, of course, central to qualitative structural approaches to translation (e.g. Isabelle and Macklovitch 1986, Alshawi et al. 1992).

The aim of the quantitative language and translation models presented in sections 5 and 6 is to employ probabilistic parameters that reflect linguistic structure without discarding rich lexical information or making the models too complex to train automatically. In terms of a traditional classification, this would be seen as a 'hybrid symbolic-statistical' system because it deals with linguistic structure. From our perspective, it can be seen as a quantitative version of the logic-based model because both models attempt to capture similar information (about the organization of words into phrases and relations holding between these phrases or their referents), though the tools of modeling are substantially different.

3. Dissecting a Logic-Based System

We now consider a hypothetical speech translation system in which the language processing components follow a conventional qualitative transfer design. Although hypothetical, this design and its components are similar to those used in existing database query (Rayner and Alshawi 1992) and translation systems (Alshawi et al. 1992). More recent versions of these systems have been gradually taking on a more quantitative flavor, particularly with respect to choosing between alternative analyses, but our hypothetical system will be more purist in its qualitative approach.

The overall design is as follows. We assume that a speech recognition subsystem delivers a list of text strings corresponding to transcriptions of an input utterance. These recognition hypotheses are passed to a parser which applies a logic-based grammar and lexicon to produce a set of logical forms, specifically formulas in first order logic corresponding to possible interpretations of the utterance. The logical forms are filtered by contextual and word-sense constraints, and one of them is passed to the translation component. The translation relation is expressed by a set of first order axioms which are used by a theorem prover to derive a target language logical form that is equivalent (in some context) to the source logical form. A grammar for the target language is then applied to the target form, generating a syntax tree whose fringe is passed to a speech synthesizer.

Taking the various components in turn, we make a note of undesirable properties that might be improved by quantitative modeling.
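Schematically, the control flow of such a qualitative pipeline can be sketched as follows. This is only a minimal illustration under the assumptions of the design above, not the actual system; every component function here is an invented placeholder.

```python
# A schematic sketch of the qualitative transfer pipeline described above:
# recognition hypotheses are tried one at a time, and the first hypothesis
# that yields a logical form passing all constraints is translated and
# generated.  All component functions are placeholders.

def parse(text):             return [("logical_form", text)]   # grammar + lexicon
def passes_constraints(lf):  return True                       # contextual / word-sense filters
def transfer(lf):            return lf                         # theorem prover over bilingual axioms
def generate(target_lf):     return "target word string"       # target grammar

def translate_utterance(recognition_hypotheses):
    # Hypotheses arrive ranked by acoustic score, but the magnitude of the
    # score is ignored: the first hypothesis with an acceptable interpretation
    # wins and the remaining hypotheses are discarded.
    for text in recognition_hypotheses:
        for logical_form in parse(text):
            if passes_constraints(logical_form):
                return generate(transfer(logical_form))
    return None

print(translate_utterance(["recognizer hypothesis 1", "recognizer hypothesis 2"]))
```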
Analysis and Generation

A grammar, expressed as a set of syntactic rules (axioms) Gsyn and a set of semantic rules (axioms) Gsem, is used to support a relation form holding between strings s and logical forms φ expressed in first order logic:

    Gsyn ∪ Gsem ⊨ form(s, φ).

The relation form is many-to-many, associating a string with linguistically possible logical form interpretations. In the analysis direction, we are given s and search for logical forms φ, while in generation we search for strings s given φ.

For analysis and generation, we are treating strings s and logical forms φ as object level entities. In interpretation and translation, we will move down from this meta-level reasoning to reasoning with the logical forms as propositions.

The list of text strings handed by the recognizer to the parser can be assumed to be ordered in accordance with some acoustic scoring scheme internal to the recognizer. The magnitude of the scores is ignored by our qualitative language processor; it simply processes the hypotheses one at a time until it finds one for which it can produce a complete logical form interpretation that passes grammatical and interpretation constraints, at which point it discards the remaining hypotheses. Clearly, discarding the acoustic score and taking the first hypothesis that satisfies the constraints may lead to an interpretation that is less plausible than one derivable from a hypothesis further down in the recognition list. But there is no point in processing these later hypotheses since we will be forced to select one interpretation essentially at random.

Syntax

The syntactic rules in Gsyn relate 'category' predicates c0, c1, c2 holding of a string and two spanning substrings (we limit the rules here to two daughters for simplicity):

    c0(s0) ← daughters(s0, s1, s2)
             ∧ c1(s1) ∧ c2(s2) ∧ (s0 = concat(s1, s2))

(Here, and subsequently, variables like s0 and s1 are implicitly universally quantified.) Gsyn also includes lexical axioms for particular strings w consisting of single words:

    c1(w), ...

For a feature-based grammar, these rules can include conjuncts constraining the values, a1, a2, ..., of discrete-valued functions f on the strings:

    f(w) = a1,   f(s0) = f(s1).

The main problem here is that such grammars have no notion of a degree of grammatical acceptability: a sentence is either grammatical or ungrammatical. For small grammars this means that perfectly acceptable strings are often rejected; for large grammars we get a vast number of alternative trees, so the chance of selecting the correct tree for simple sentences can get worse as the grammar coverage increases. There is also the problem of requiring increasingly complex feature sets to describe idiosyncrasies in the lexicon.

Semantics

Semantic grammar axioms belonging to Gsem specify a 'composition' function g for deriving a logical form for a phrase from those for its subphrases:

    form(s0, g(φ1, φ2)) ←
        daughters(s0, s1, s2) ∧ c1(s1) ∧ c2(s2) ∧ c0(s0)
        ∧ form(s1, φ1) ∧ form(s2, φ2)

The interpretation rules for strings bottom out in a set of lexical semantic rules associating words with predicates (p1, p2, ...) corresponding to 'word senses'. For a particular word and syntactic category, there will be a (small, possibly empty) finite set of such word sense predicates:

    ci(w) → form(w, pi1)
    ...
    ci(w) → form(w, pim).

First order logic was assumed as the semantic representation language because it comes with well understood, if not very practical, inferential machinery for constraint solving. However, applying this machinery requires making logical forms fine grained to a degree often not warranted by the information the speaker of an utterance intended to convey. An example of this is explicit scoping, which leads (again) to large numbers of alternatives which the qualitative model has difficulty choosing between. Also, many natural language sentences cannot be expressed in first order logic without resort to elaborate formulas requiring complex semantic composition rules. These rules can be simplified by using a higher order logic, but at the expense of even less practical inferential machinery.

In applying the grammar in generation we are faced with the problem of balancing over- and under-generation by tweaking grammatical constraints, there being no way to prefer fully grammatical target sentences over more marginal ones. Qualitative approaches to grammar tend to emphasize the ability to capture generalizations as the main measure of success in linguistic modeling. This might explain why producing appropriate lexical collocations is rarely addressed seriously in these models, even though lexical collocations are important for fluent generation. The study of collocations for generation fits in more naturally with statistical techniques, as illustrated by Smajda and McKeown (1990).
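To make the analysis direction concrete, here is a minimal sketch of how binary rules and composition functions of this kind pair a word string with candidate logical forms. The toy lexicon, categories and sense predicates are invented for illustration and merely stand in for the logical axioms above.

```python
# A minimal sketch (not the paper's implementation) of how binary syntactic
# rules and semantic composition functions pair a word string with candidate
# logical forms.  The toy lexicon, categories and sense predicates are
# invented for illustration.

from itertools import product

# Lexical axioms: word -> list of (category, sense predicate) pairs.
LEXICON = {
    "john":  [("np", "john1")],
    "flies": [("vp", "fly1"), ("np", "fly2")],   # two candidate word senses
}

# Binary rules: (mother, daughter1, daughter2, composition function g).
RULES = [
    ("s", "np", "vp", lambda f1, f2: ("apply", f2, f1)),
]

def analyses(words):
    """All (category, logical form) pairs derivable for a list of words."""
    if len(words) == 1:                          # lexical axioms: c_i(w), form(w, p)
        return list(LEXICON.get(words[0], []))
    results = []
    for split in range(1, len(words)):           # daughters(s0, s1, s2)
        for (c1, f1), (c2, f2) in product(analyses(words[:split]),
                                          analyses(words[split:])):
            for mother, d1, d2, g in RULES:
                if (c1, c2) == (d1, d2):
                    results.append((mother, g(f1, f2)))
    return results

# The toy grammar licenses one sentence analysis; nothing in the qualitative
# model itself ranks it against analyses of other recognition hypotheses.
print(analyses(["john", "flies"]))
```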
Interpretation

In the logic-based model, interpretation is the process of identifying, from the possible interpretations φ of s for which form(s, φ) holds, ones that are consistent with the context of interpretation. We can state this as follows:

    R ∪ S ∪ A ⊨ φ.

Here, we have separated the context into a contingent set of contextual propositions S and a set R of (monolingual) 'meaning postulates', or selectional restrictions, that constrain the word sense predicates in all contexts. A is a set of assumptions sufficient to support the interpretation φ given S and R. In other words, this is 'interpretation as abduction' (Hobbs et al. 1988), since abduction, not deduction, is needed to arrive at the assumptions A.

The most common types of meaning postulates in R are those for restriction, hyponymy, and disjointness, expressed as follows:

    p1(x1, x2) → p2(x1)     restriction;
    p2(x) → p3(x)           hyponymy;
    ¬(p3(x) ∧ p4(x))        disjointness.

Although there are compilation techniques (e.g. Mellish 1988) which allow selectional constraints stated in this fashion to be implemented efficiently, the scheme is problematic in other respects. To start with, the assumption of a small set of senses for a word is at best awkward because it is difficult to arrive at an optimal granularity for sense distinctions. Disambiguation with selectional restrictions expressed as meaning postulates is also problematic because it is virtually impossible to devise a set of postulates that will always filter all but one alternative. We are thus forced to under-filter and make an arbitrary choice between remaining alternatives.

Logic based translation

In both the quantitative and qualitative models we take a transfer approach to translation. We do not depend on interlingual symbols, but instead map a representation with constants associated with the source language into a corresponding expression with constants from the target language. For the qualitative model, the operable notion of correspondence is based on logical equivalence, and the constants are source word sense predicates p1, p2, ... and target sense predicates q1, q2, ....

More specifically, we will say the translation relation between a source logical form φs and a target logical form φt holds if we have

    R ∪ S ∪ A' ⊨ (φs ↔ φt)

where R is a set of monolingual and bilingual meaning postulates, and S is a set of formulas characterizing the current context. A' is a set of assumptions that includes the assumptions A which supported φs. Here bilingual meaning postulates are first order axioms relating source and target sense predicates. A typical bilingual postulate for translating between p1 and q1 might be of the form:

    p5(x1) → (p1(x1, x2) ↔ q1(x1, x2)).

The need for the assumptions A' arises when a source language word is vaguer than its possible translations in the target language, so different choices of target words will correspond to translations under different assumptions. For example, the condition p5(x1) above might be proved from the input logical form, or it might need to be assumed.

In the general case, finding solutions (i.e. A', φt pairs) for the abductive schema is an undecidable theorem proving problem. This can be alleviated by placing restrictions on the form of meaning postulates and input formulas and using heuristic search methods. Although such an approach was applied with some success in a limited-domain system translating logical forms into database queries (Rayner and Alshawi 1992), it is likely to be impractical for language translation with tens of thousands of sense predicates and related axioms.

Setting aside the intractability issue, this approach does not offer a principled way of choosing between alternative solutions proposed by the prover. One would like to prefer solutions with 'minimal' sets of assumptions, but it is difficult to find motivated definitions for this minimization in a purely qualitative framework.

4. Quantitative Model Components

Moving to a Quantitative Model

In moving to a quantitative architecture, we propose to retain many of the basic characteristics of the qualitative model:

• A transfer organization with analysis, transfer, and generation components.
• Monolingual models that can be used for both analysis and generation.
• Translation models that exclusively code contrastive (cross-linguistic) information.
• Hierarchical phrases capturing recursive linguistic structure.

Instead of feature based syntax trees and first-order logical forms we will adopt a simpler, monostratal representation that is more closely related to those found in dependency grammars (e.g. Hudson 1984). Dependency representations have been used in large scale qualitative machine translation systems, notably by McCord (1988). The notion of a lexical 'head' of a phrase is central to these representations because they concentrate on relations between such lexical heads. In our case, the dependency representation is monostratal in that the relations may include ones normally classified as belonging to syntax, semantics or pragmatics.

One salient property of our language model is that it is strongly lexical: it consists of statistical parameters associated with relations between lexical items and the number and ordering of dependents of lexical heads. This lexical anchoring facilitates statistical training and sensitivity to lexical variation and collocations. In order to gain the benefits of probabilistic modeling, we replace the task of developing large rule sets with the task of estimating large numbers of statistical parameters for the monolingual and translation models. This gives rise to a new cost trade-off in human annotation/judgement versus barely tractable fully automatic training. It also necessitates further research on lexical similarity and clustering (e.g. Pereira, Tishby and Lee 1993, Dagan, Marcus and Markovitch 1993) to improve parameter estimation from sparse data.

Translation via Lexical Relation Graphs

The model associates phrases with relation graphs. A relation graph is a directed labeled graph consisting of a set of relation edges. Each edge has the form of an atomic proposition

    r(wi, wj)

where r is a relation symbol, wi is the lexical head of a phrase and wj is the lexical head of another phrase (typically a subphrase of the phrase headed by wi). The nodes wi and wj are word occurrences representable by a word and an index, the indices uniquely identifying particular occurrences of the words in a discourse or corpus. The set of relation symbols is open ended, but the first argument of the relation is always interpreted as the head and the second as the dependent with respect to this relation. The relations in the models for the source and target languages need not be the same, or even overlap. To keep the language models simple, we will mainly restrict ourselves here to dependency graphs that are trees with unordered siblings. In particular, phrases will always be contiguous strings of words and dependents will always be heads of subphrases.
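As a concrete, if simplified, picture of this representation, the sketch below encodes word occurrences and relation edges directly. The relation labels and the example sentence are invented for illustration.

```python
# A small sketch of the relation-graph representation described above: nodes
# are word occurrences (a word paired with an index) and each edge is an
# atomic proposition r(w_i, w_j) with the head as first argument.

from collections import namedtuple

WordOcc = namedtuple("WordOcc", ["word", "index"])
Edge = namedtuple("Edge", ["rel", "head", "dep"])

class RelationGraph:
    def __init__(self, edges):
        self.edges = set(edges)

    def dependents(self, head):
        """All (relation, dependent) pairs headed by the given occurrence."""
        return [(e.rel, e.dep) for e in self.edges if e.head == head]

    def heads(self):
        """All nodes that head at least one edge."""
        return {e.head for e in self.edges}

# "the engineer booked a flight": a dependency tree with unordered siblings.
booked = WordOcc("booked", 3)
engineer = WordOcc("engineer", 2)
flight = WordOcc("flight", 5)

graph = RelationGraph([
    Edge("subj", booked, engineer),
    Edge("obj",  booked, flight),
    Edge("det",  engineer, WordOcc("the", 1)),
    Edge("det",  flight, WordOcc("a", 4)),
])

print(graph.dependents(booked))   # [('subj', engineer), ('obj', flight)] in some order
```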
Ignoring algorithmic issues relating to compactly representing and efficiently searching the space of alternative hypotheses, the overall design of the quantitative system is as follows. The speech recognizer produces a set of word-position hypotheses (perhaps in the form of a word lattice) corresponding to a set of string hypotheses for the input. The source language model is used to compute a set of possible relation graphs, with associated probabilities, for each string hypothesis. A probabilistic graph translation model then provides, for each source relation graph, the probabilities of deriving corresponding graphs with word occurrences from the target language. These target graphs include all the words of possible translations of the utterance hypotheses but do not specify the surface order of these words. Probabilities for different possible word orderings are computed according to ordering parameters which form part of the target language model.

In the following section we explain how the probabilities for these various processing stages are combined to select the most likely target word sequence. This word sequence can then be handed to the speech synthesizer. For tighter integration between generation and synthesis, information about the derivation of the target utterance can also be passed to the synthesizer.

Integrated Statistical Model

The probabilities associated with phrases in the above description are computed according to the statistical models for analysis, translation, and generation. In this section we show the relationship between these models to arrive at an overall statistical model of speech translation. We are not considering training issues in this paper, though a number of now familiar techniques ranging from methods for maximum likelihood estimation to direct estimation using fully annotated data are applicable.

The objects involved in the overall model are as follows (we omit target speech synthesis under the assumption that it proceeds deterministically from a target language word string):

• As: (acoustic evidence for) source language speech
• Ws: source language word string
• Wt: target language word string
• Cs: source language relation graph
• Ct: target language relation graph

Given a spoken input in the source language, we wish to find a target language string that is the most likely translation of the input. We are thus interested in the conditional probability of Wt given As. This conditional probability can be expressed as follows (cf. Chang and Su 1993):

    P(Wt|As) =
        Σ_{Ws,Cs,Ct} P(Ws|As) P(Cs|Ws, As)
                     P(Ct|Cs, Ws, As) P(Wt|Ct, Cs, Ws, As).

We now apply some simplifying independence assumptions concerning relation graphs: specifically, that their derivation from word strings is independent of acoustic information; that their translation is independent of the original words and acoustics involved; and that target word string generation from target relation edges is independent of the source language representations. The extent to which these (Markovian) assumptions hold depends on the extent to which relation edges represent all the relevant information for translation. In particular it means they should express aspects of surface form relevant to meaning, such as topicalization, as well as predicate argument structure. In any case, the simplifying assumptions give the following:

    P(Wt|As) ≈
        Σ_{Ws,Cs,Ct} P(Ws|As) P(Cs|Ws) P(Ct|Cs) P(Wt|Ct).

This can be rewritten with two applications of Bayes' rule:

    P(Wt|As) ≈
        Σ_{Ws,Cs,Ct} P(As|Ws) (1/P(As)) P(Ws|Cs)
                     P(Cs) P(Ct|Cs) P(Wt|Ct).

Since As is given, 1/P(As) is a constant which can be ignored in finding the maximum of P(Wt|As). Determining the Wt that maximizes P(Wt|As) therefore involves the following factors:

• P(As|Ws): source language acoustics
• P(Ws|Cs): source language generation
• P(Cs): source content relations
• P(Ct|Cs): source to target transfer
• P(Wt|Ct): target language generation

We assume that the speech recognizer provides acoustic scores proportional to P(As|Ws) (or logs thereof). Such scores are normally computed by speech recognition systems, although they are usually also multiplied by word-based language model probabilities P(Ws), which we do not require in this application context. Our approach to language modeling, which covers the content analysis and language generation factors, is presented in section 5, and the transfer probabilities fall under the translation model of section 6.

Finally, note that by another application of Bayes' rule we can replace the two factors P(Cs)P(Ct|Cs) by P(Ct)P(Cs|Ct) without changing other parts of the model. This latter formulation allows us to apply constraints imposed by the target language model to filter inappropriate possibilities suggested by analysis and transfer. In some respects this is similar to Dagan and Itai's (1994) approach to word sense disambiguation using statistical associations in a second language.
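The way these factors combine to rank candidate target strings can be sketched as follows. This is a toy illustration: the hypothesis tuples and all probability values are invented, and each component model is reduced to a supplied number.

```python
# A hedged sketch of how the factors listed above combine to rank candidate
# target strings.  Each hypothesis corresponds to one (Ws, Cs, Ct, Wt)
# combination, with its five component probabilities supplied by the
# acoustic, monolingual and transfer models.

from collections import defaultdict

# (acoustic P(As|Ws), source generation P(Ws|Cs), content P(Cs),
#  transfer P(Ct|Cs), target generation P(Wt|Ct), target string Wt)
hypotheses = [
    (0.6, 0.5, 0.010, 0.4, 0.3, "der flug wurde gebucht"),
    (0.6, 0.5, 0.010, 0.2, 0.4, "den flug buchte man"),
    (0.2, 0.4, 0.002, 0.5, 0.3, "der zug wurde gebucht"),
]

def best_translation(hyps):
    # P(Wt|As) is proportional to the sum, over hypotheses sharing Wt, of the
    # product of the factors; the constant 1/P(As) can be ignored.
    score = defaultdict(float)
    for p_acoustic, p_src_gen, p_content, p_transfer, p_tgt_gen, wt in hyps:
        score[wt] += p_acoustic * p_src_gen * p_content * p_transfer * p_tgt_gen
    return max(score.items(), key=lambda kv: kv[1])

print(best_translation(hypotheses))
```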
5. Language Models

Language Production Model

Our language model can be viewed in terms of a probabilistic generative process based on the choice of lexical "heads" of phrases and the recursive generation of subphrases and their ordering. For this purpose, we can define the head word of a phrase to be the word that most strongly influences the way the phrase may be combined with other phrases. This notion has been central to a number of approaches to grammar for some time, including theories like dependency grammar (Hudson 1976, 1990) and HPSG (Pollard and Sag 1987). More recently, the statistical properties of associations between words, and more particularly heads of phrases, have been an active area of research (e.g. Chang, Luo, and Su 1992; Hindle and Rooth 1993).

The language model factors the statistical derivation of a sentence with word string W as follows:

    P(W) = Σ_C P(C) P(W|C)

where C ranges over relation graphs. The content model, P(C), and generation model, P(W|C), are components of the overall statistical model for spoken language translation given earlier. This decomposition of P(W) can be viewed as first deciding on the content of a sentence, formulated as a set of relation edges according to a statistical model for P(C), and then deciding on word order according to P(W|C).

Of course, this decomposition simplifies the realities of language production in that real language is always generated in the context of some situation S (real or imaginary), so a more comprehensive model would be concerned with P(C|S), i.e. language production in context. This is less important, however, in the translation setting since we produce Ct in the context of a source relation graph Cs and we assume the availability of a model for P(Ct|Cs).

Content Derivation Model

The model for deriving the relation graph of a phrase is taken to consist of choosing a lexical head h0 for the phrase (what the phrase is 'about') followed by a series of 'node expansion' steps. An expansion step takes a node and chooses a possibly empty set of edges (relation labels and ending nodes) starting from that node. Here we consider only the case of relation graphs that are trees with unordered siblings.

To start with, let us take the simplified case where a head word h has no optional or duplicated dependents (i.e. exactly one for each relation). There will be a set of edges

    E(h) = {r1(h, w1), r2(h, w2), ..., rk(h, wk)}

corresponding to the local tree rooted at h with dependent nodes w1 ... wk. The set of relation edges for the entire derivation is the union of these local edge sets.

To determine the probability of deriving a relation graph C for a phrase headed by h0 we make use of parameters ('dependency parameters')

    P(r(h, w)|h, r)

for the probability, given a node h and a relation r, that w is an r-dependent of h. Under the assumption that the dependents of a head are chosen independently from each other, the probability of deriving C is:

    P(C) = P(Top(h0)) Π_{r(h,w)∈C} P(r(h, w)|h, r)

where P(Top(h0)) is the probability of choosing h0 to start the derivation.

If we now remove the assumption made earlier that there is exactly one r-dependent of a head, we need to elaborate the derivation model to include choosing the number of such dependents. We model this by parameters

    P(N(r, n)|h)

that is, the probability that head h has n r-dependents. We will refer to this probability as a 'detail parameter'. Our previous assumption amounted to stating that this was always 1 for n = 1 or for n = 0. Detail parameters allow us to model, for example, the number of adjectival modifiers of a noun or the 'degree' to which a particular argument of a verb is optional. The probability of an expansion of h giving rise to local edges E(h) is now:

    P(E(h)|h) =
        Π_r P(N(r, nr)|h) k(nr) Π_{i=1..nr} P(r(h, wi)|h, r)

where r ranges over the set of relation labels and h has nr r-dependents w1 ... wnr. k(nr) is a combinatoric constant for taking account of the fact that we are not distinguishing permutations of the dependents (e.g. there are nr! permutations of the r-dependents of h if these dependents are all distinct).

So if h0 is the root of a tree C, we have

    P(C) = P(Top(h0)) Π_{h∈heads(C)} P(EC(h)|h)

where heads(C) is the set of nodes in C and EC(h) is the set of edges headed by h in C.

The above formulation is only an approximation for relation graphs that are not trees, because the independence assumptions which allow the dependency parameters to be simply multiplied together no longer hold for the general case. Dependency graphs with cycles do arise as the most natural analyses of certain linguistic constructions, but calculating their probabilities on a node by node basis as above may still provide probability estimates that are accurate enough for practical purposes.
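A minimal sketch of this content model is given below, using the dependency tree from the earlier example. All parameter values are invented; a real system would estimate the topmost-head, detail and dependency parameters from data.

```python
# A minimal sketch of the content-derivation probability under the tree
# assumption: P(C) = P(Top(h0)) * product over heads of P(E(h)|h), where each
# local expansion combines detail parameters P(N(r,n)|h), the combinatoric
# constant k(nr), and dependency parameters P(r(h,w)|h,r).

from math import factorial

p_top = {"booked": 0.02}
# Detail parameters P(N(r, n) | h): number of r-dependents of head h.
p_detail = {("booked", "subj", 1): 0.9, ("booked", "obj", 1): 0.7,
            ("flight", "det", 1): 0.8}
# Dependency parameters P(r(h, w) | h, r).
p_dep = {("booked", "subj", "engineer"): 0.05,
         ("booked", "obj", "flight"): 0.10,
         ("flight", "det", "a"): 0.40}

# A tree as {head: [(relation, dependent), ...]}; empty expansions of leaf
# nodes such as "engineer" and "a" are omitted for brevity.
tree = {"booked": [("subj", "engineer"), ("obj", "flight")],
        "flight": [("det", "a")]}

def local_expansion_prob(head, edges):
    """P(E(h)|h): detail parameters times k(nr) times dependency parameters."""
    prob = 1.0
    by_rel = {}
    for rel, dep in edges:
        by_rel.setdefault(rel, []).append(dep)
    for rel, deps in by_rel.items():
        prob *= p_detail[(head, rel, len(deps))]
        prob *= factorial(len(deps))          # k(nr) for distinct dependents
        for dep in deps:
            prob *= p_dep[(head, rel, dep)]
    return prob

def content_prob(top, tree):
    prob = p_top[top]
    for head, edges in tree.items():
        prob *= local_expansion_prob(head, edges)
    return prob

print(content_prob("booked", tree))
```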
Generation Model

We now return to the generation model P(W|C). As mentioned earlier, since C includes the words in W and a set of relations between them, the generation model is concerned only with surface order. One possibility is to use 'bi-relation' parameters for the probability that an ri-dependent immediately follows an rj-dependent. This approach is problematic for our overall statistical model because such parameters are not independent from the 'detail' parameters specifying the number of r-dependents of a head.

We therefore adopt the use of 'sequencing' parameters, these being probabilities of particular orderings of dependents given that the multiset of dependency relations is known. We let the identity relation e stand for the head itself. Specifically, we have parameters

    P(s|M(s))

where s is a sequence of relation labels including an occurrence of e and M(s) is the multiset for this sequence. For a head h in a relation graph C, let s(W,C,h) be the sequence of dependent relations induced by a particular word string W generated from C. We now have

    P(W|C) = Π_{h∈W} (1 / Π_r nr!) P(s(W,C,h) | M(s(W,C,h)))

where h ranges over all the heads in C, and nr is the number of occurrences of r in s(W,C,h), assuming that all orderings of the nr r-dependents are equally likely. We can thus use these sequencing parameters directly in our overall model.

To summarize, our monolingual models are specified by:

• topmost head parameters P(Top(h))
• dependency parameters P(r(h, w)|h, r)
• detail parameters P(N(r, n)|h)
• sequencing parameters P(s|M(s))

The overall model splits the contributions of content P(C) and ordering P(W|C). However, we may also want a model for P(W), for example for pruning speech recognition hypotheses. Combining our content and ordering models we get:

    P(W) = Σ_C P(C) P(W|C)
         = Σ_C P(Top(hC)) Π_{h∈W} [ P(s(W,C,h)|h)
               Π_{r(h,w)∈EC(h)} P(r(h, w)|h, r) ]

The parameters P(s|h) can be derived by combining sequencing parameters with the detail parameters for h.
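The sketch below illustrates the ordering component with invented sequencing parameters. Following the formula above, orderings of dependents that share a relation label are treated as equally likely, hence the division by nr! for each relation.

```python
# A rough sketch of the ordering model: sequencing parameters give the
# probability of a particular left-to-right arrangement of a head's dependent
# relations (with e standing for the head itself), conditioned on the
# multiset of those relations.  The parameter values are invented.

from math import factorial
from collections import Counter

def multiset(labels):
    """A hashable multiset of relation labels."""
    return tuple(sorted(Counter(labels).items()))

# Sequencing parameters P(s | M(s)).
p_seq = {
    (("det", "e"), multiset(["det", "e"])): 0.95,                # determiner precedes its head
    (("subj", "e", "obj"), multiset(["subj", "e", "obj"])): 0.8,
}

def ordering_prob(heads_with_sequences):
    """Product over heads of P(s|M(s)), divided by nr! for each relation r
    (all orderings of dependents sharing a relation are equally likely)."""
    prob = 1.0
    for seq in heads_with_sequences:
        prob *= p_seq[(tuple(seq), multiset(seq))]
        for label, count in Counter(seq).items():
            if label != "e":
                prob /= factorial(count)
    return prob

# One label sequence per head, e.g. for the heads "booked" and "flight"
# of the earlier example tree.
print(ordering_prob([("subj", "e", "obj"), ("det", "e")]))
```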
6. Translation Model

Mapping Relation Graphs

As already mentioned, the translation model defines mappings between relation graphs Cs for the source language and Ct for the target language. A direct (though incomplete) justification of translation via relation graphs may be based on a simple referential view of natural language semantics. Thus nominals and their modifiers pick out entities in a (real or imaginary) world; verbs and their modifiers refer to actions or events in which the entities participate in roles indicated by the edge relations. Under this view, the purpose of the translation mapping is to determine a target language relation graph that provides the best approximation to the referential function induced by the source relation graph. We call this approximating referential equivalence.

This referential view of semantics is not adequate for taking account of much of the complexity of natural language, including many aspects of quantification, distributivity and modality. This means it cannot capture some of the subtleties that a theory based on logical equivalence might be expected to. On the other hand, when we proposed a logic based approach as our qualitative model, we had to restrict it to a simple first order logic anyway for computational reasons, and even then it did not appear to be practical. Thus using the more impoverished lexical relations representation may not be costing us much in practice.

One aspect of the representation that is particularly useful in the translation application is its convenience for partial and/or incremental representation of content: we can refine the representation by the addition of further edges. A fully specified denotation of the meaning of a sentence is rarely required for translation, and as we pointed out when discussing logic representations, a complete specification may not have been intended by the speaker. Although we have not provided a denotational semantics for sets of relation edges, we anticipate that this will be possible along the lines developed in monotonic semantics (Alshawi and Crouch 1992).

Translation Parameters

To be practical, a model for P(Ct|Cs) needs to decompose the source and target graphs Cs and Ct into subgraphs small enough that subgraph translation parameters can be estimated. We do this with the help of 'node alignment relations' between the nodes of these graphs. These alignment relations are similar in some respects to the alignments used by Brown et al. (1990) in their surface translation model. The translation probability is then the sum of probabilities over different alignments f:

    P(Ct|Cs) = Σ_f P(Ct, f|Cs).

There are different ways to model P(Ct, f|Cs) corresponding to different kinds of alignment relations and different independence assumptions about the translation mapping.

For our quantitative design, we adopt a simple model in which lexical and relation (structural) probabilities are assumed to be independent. In this model the alignment relations are functions from the word occurrence nodes of Ct to the word occurrences of Cs. The idea is that f(vj) = wi means that the source word occurrence wi 'gave rise' to the target word occurrence vj. The inverse relation f⁻¹ need not be a function, allowing different numbers of words in the source and target sentences.

We decompose P(Ct, f|Cs) into 'lexical' and 'structural' probabilities as follows:

    P(Ct, f|Cs) = P(Nt, f|Ns) P(Et|Nt, f, Cs)

where Nt and Ns are the node sets for Ct and Cs respectively, and Et is the set of edges for the target graph. The first factor P(Nt, f|Ns) is the lexical component in that it does not take into account any of the relations in the source graph Cs. This lexical component is the product of alignment probabilities for each node of Ns:

    P(Nt, f|Ns) = Π_{wi∈Ns} P({v ∈ Nt : f(v) = wi} | wi)

that is, the probability that f maps exactly the (possibly empty) subset {v1 ... vk} of Nt to wi. These sets are assumed to be disjoint for different source graph nodes, so we can replace the factors in the above product with parameters

    P(M|w)

where w is a source language word and M is a multiset of target language words.

We will derive a target set of edges Et of Ct by k derivation steps which partition the set of source edges Es into subgraphs S1 ... Sk. These subgraphs give rise to disjoint sets of relation edges T1 ... Tk which together form Et. The structural component of our translation model will be the sum of derivation probabilities for such an edge set Et.

For simplicity, we assume here that the source graph Cs is a tree. This is consistent with our earlier assumptions about the source language model. We take our partitions of the source graph to be the edge sets for local trees. This ensures that the partitioning is deterministic, so the probability of a derivation is the product of the probabilities of derivation steps. More complex models with larger partitions rooted at a node are possible, but these require additional parameters for partitioning. For the simple model it remains to specify derivation step probabilities.

The probability of a derivation step is given by parameters of the form:

    P(Ti'|Si', fi)

where Si' and Ti' are unlabeled graphs and fi is a node alignment function from Ti' to Si'. Unlabeled graphs are just like our relation edge graphs except that the nodes are not labeled with words (the edges still have relation labels). To apply a derivation step we need a notion of graph matching that respects edge labels: g is an isomorphism (modulo node labels) from a graph G to a graph H if g is a one-one and onto function from the nodes of G to the nodes of H such that

    r(a, b) ∈ G iff r(g(a), g(b)) ∈ H.

The derivation step with parameter P(Ti'|Si', fi) is applicable to the source edges Si, under the alignment f, giving rise to the target edges Ti if (i) there is an isomorphism hi from Si' to Si, (ii) there is an isomorphism gi from Ti to Ti', and (iii) for any node v of Ti it must be the case that

    hi(fi(gi(v))) = f(v).

This last condition ensures that the target graph partitions join up in a way that is compatible with the node alignment f.

The factoring of the translation model into these lexical and structural components means that it will overgenerate, because these aspects are not independent in translation between real natural languages. It is therefore appropriate to filter translation hypotheses by rescoring according to the version of the overall statistical model that included the factors P(Ct)P(Cs|Ct), so that the target language model constrains the output of the translation model. Of course, in this case we need to model the translation relation in the 'reverse' direction. This can be done in a parallel fashion to the forward direction described above.
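A much-simplified sketch of this factoring is given below. All words, relations and parameter values are invented, and the structural step is keyed directly by relation-label shapes rather than by the full unlabeled-graph isomorphism conditions above.

```python
# A simplified sketch of the transfer model: the probability of one candidate
# target graph plus alignment is factored into lexical parameters P(M|w)
# (the multiset of target words a source word gives rise to) and one
# structural derivation-step parameter per source local tree.

# Lexical parameters P(M | w): multiset of target words per source word.
p_lex = {
    ("booked", ("gebucht", "wurde")): 0.2,
    ("booked", ("buchte",)): 0.3,
    ("flight", ("flug",)): 0.6,
    ("the", ("der",)): 0.4,
}

# Structural step parameters: source local-tree relation labels -> target ones.
p_struct = {
    (("subj", "obj"), ("subj", "obj")): 0.5,   # keep both roles
    (("det",), ("det",)): 0.9,
}

def transfer_prob(source_local_trees, lexical_choices, structural_choices):
    """Product of lexical parameters for the listed source nodes and one
    structural derivation-step parameter per source local tree."""
    prob = 1.0
    for source_word, target_multiset in lexical_choices:
        prob *= p_lex[(source_word, target_multiset)]
    for source_rels, target_rels in structural_choices:
        assert source_rels in source_local_trees
        prob *= p_struct[(source_rels, target_rels)]
    return prob

prob = transfer_prob(
    source_local_trees={("subj", "obj"), ("det",)},
    lexical_choices=[("booked", ("gebucht", "wurde")),
                     ("flight", ("flug",)), ("the", ("der",))],
    structural_choices=[(("subj", "obj"), ("subj", "obj")),
                        (("det",), ("det",))],
)
print(prob)
```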
7. Conclusions

Our qualitative and quantitative models have a similar overall structure and there are clear parallels between the factoring of logical constraints and statistical parameters, for example monolingual postulates and dependency parameters, bilingual postulates and translation parameters. The parallelism would have been closer if we had adopted ID/LP style rules (Gazdar et al. 1985) in the qualitative model. However, we argued in section 3 that our qualitative model suffered from lack of robustness, from having only the crudest means for choosing between competing hypotheses, and from being computationally intractable for large vocabularies.

The quantitative model is in a much better position to cope with these problems. It is less brittle because statistical associations have replaced constraints (featural, selectional, etc.) that must be satisfied exactly. The probabilistic models give us a systematic and well motivated way of ranking alternative hypotheses. Computationally, the quantitative model lets us escape from the undecidability of logic-based reasoning. Because this model is highly lexical, we can hope that the input words will allow effective pruning by limiting the number of search paths having significantly high probabilities.

We retained some of the basic assumptions about the structure of language when moving to the quantitative model. In particular, we preserved the notion of hierarchical phrase structure. Relations motivated by dependency grammar made it possible to do this without giving up sensitivity to lexical collocations which underpin simple statistical models like N-grams. The quantitative model also reduced overall complexity in terms of the sets of symbols used. In addition to words, it only required symbols for dependency relations, whereas the qualitative model required symbol sets for linguistic categories and features, and a set of word sense symbols. Despite their apparent importance to translation, the quantitative system can avoid the use of word sense symbols (and the problems of granularity they give rise to) by exploiting statistical associations between words in the target language to filter implicit sense choices.

Finally, here is a summary of our reasons for combining statistical methods with dependency representations in our language and translation models:

• inherent lexical sensitivity of dependency representations, facilitating parameter estimation;
• quantitative preference based on probabilistic derivation and translation;
• incremental and/or partial specification of the content of utterances, particularly useful in translation;
• decomposition of complex utterances through recursive linguistic structure.

These factors suggest that dependency grammar will play an increasingly important role as language processing systems seek to combine both structural and collocational information.

Acknowledgements

I am grateful to Fernando Pereira, Mike Riley, and Ido Dagan for valuable discussions on the issues addressed in this paper. Fernando Pereira and Ido Dagan also provided helpful comments on a draft of the paper.

References

Alshawi, H., D. Carter, B. Gamback and M. Rayner. 1992. "Swedish-English QLF Translation". In H. Alshawi (ed.), The Core Language Engine, Cambridge, Mass.: MIT Press.

Alshawi, H. and R. Crouch. 1992. "Monotonic Semantic Interpretation". Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, Delaware.

Alshawi, H. and D. Carter. 1994. "Training and Scaling Preference Functions for Disambiguation". To appear in Computational Linguistics.

Brill, E. 1993. "Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 259-265.

Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer and P. Rossin. 1990. "A Statistical Approach to Machine Translation". Computational Linguistics 16:79-85.

Chang, J., Y. Luo, and K. Su. 1992. "GPSM: A Generalized Probabilistic Semantic Model for Ambiguity Resolution". Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, 177-192.

Chang, J. and K. Su. 1993. "A Corpus-Based Statistics-Oriented Transfer and Generation Model for Machine Translation". Proceedings of the 5th International Conference on Theoretical and Methodological Issues in Machine Translation.

Dagan, I. and A. Itai. 1994. "Word Sense Disambiguation Using a Second Language Monolingual Corpus". To appear in Computational Linguistics.

Dagan, I., S. Marcus and S. Markovitch. 1993. "Contextual Word Similarity and Estimation from Sparse Data". Proceedings of the 31st Meeting of the Association for Computational Linguistics, ACL, 164-171.

Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag. 1985. Generalized Phrase Structure Grammar. Oxford: Blackwell.

Hindle, D. and M. Rooth. 1993. "Structural Ambiguity and Lexical Relations". Computational Linguistics 19:103-120.

Hobbs, J.R., M. Stickel, P. Martin and D. Edwards. 1988. "Interpretation as Abduction". Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, Buffalo, New York, 95-103.

Hudson, R.A. 1984. Word Grammar. Oxford: Blackwell.

Isabelle, P. and E. Macklovitch. 1986. "Transfer and MT Modularity". Eleventh International Conference on Computational Linguistics, Bonn, 115-117.

Jelinek, F., R.L. Mercer and S. Roukos. 1992. "Principles of Lexical Language Modeling for Speech Recognition". In S. Furui and M.M. Sondhi (eds.), Advances in Speech Signal Processing, New York: Marcel Dekker Inc.

Mellish, C.S. 1988. "Implementing Systemic Classification by Unification". Computational Linguistics 14:40-51.

McCord, M. 1988. "A Multi-Target Machine Translation System". Proceedings of the International Conference on Fifth Generation Computer Systems, Tokyo, Japan, 1141-1149.

Pereira, F., N. Tishby and L. Lee. 1993. "Distributional Clustering of English Words". Proceedings of the 31st Meeting of the Association for Computational Linguistics, ACL, 183-190.

Pollard, C.J. and I.A. Sag. 1987. Information Based Syntax and Semantics: Volume 1, Fundamentals. CSLI Lecture Notes, Number 13. Center for the Study of Language and Information, Stanford, California.

Rayner, M. and H. Alshawi. 1992. "Deriving Database Queries from Logical Forms by Abductive Definition Expansion". Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy.

Richard, M.D. and R.P. Lippmann. 1991. "Neural Network Classifiers Estimate Bayesian a posteriori Probabilities". Neural Computation 3:461-483.

Shieber, S.M. 1986. An Introduction to Unification-Based Approaches to Grammar. CSLI Lecture Notes, Number 4. Center for the Study of Language and Information, Stanford, California.

Smajda, F. and K. McKeown. 1990. "Automatically Extracting and Representing Collocations for Language Generation". Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, Pittsburgh.

Taylor, L., C. Grover, and E.J. Briscoe. 1989. "The Syntactic Regularity of English Noun Phrases". Proceedings of the 4th European ACL Conference, 256-263.

Weaver, W. 1955. "Translation". In W. Locke and A. Booth (eds.), Machine Translation of Languages, Cambridge, Mass.: MIT Press.