Qualitative and Quantitative Models of Speech Translation

Hiyan Alshawi
AT&T Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974, USA
hiyan@research.att.com
Analysis and Generation

A grammar, expressed as a set of syntactic rules (axioms) G_syn and a set of semantic rules (axioms) G_sem, is used to support a relation form holding between strings s and logical forms φ expressed in first order logic:

    G_syn ∪ G_sem ⊢ form(s, φ).

The relation form is many-to-many, associating a string with linguistically possible logical form interpretations. In the analysis direction, we are given s and search for logical forms φ, while in generation we search for strings s given φ.

For analysis and generation, we are treating strings s and logical forms φ as object level entities. In interpretation and translation, we will move down from this meta-level reasoning to reasoning with the logical forms as propositions.

The list of text strings handed by the recognizer to the parser can be assumed to be ordered in accordance with some acoustic scoring scheme internal to the recognizer. The magnitude of the scores is ignored by our qualitative language processor; it simply processes the hypotheses one at a time until it finds one for which it can produce a complete logical form interpretation that passes grammatical and interpretation constraints, at which point it discards the remaining hypotheses. Clearly, discarding the acoustic score and taking the first hypothesis that satisfies the constraints may lead to an interpretation that is less plausible than one derivable from a hypothesis further down in the recognition list. But there is no point in processing these later hypotheses since we will be forced to select one interpretation essentially at random.
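This first-hypothesis strategy can be made concrete with a small sketch (hypothetical Python; the interpret function stands in for the grammar and interpretation constraints described below):

    def first_interpretable(nbest_hypotheses, interpret):
        """Return the first recognition hypothesis with a complete logical
        form interpretation passing grammatical and interpretation
        constraints, discarding all later hypotheses.

        nbest_hypotheses: word strings ordered by the recognizer's internal
        acoustic score (the score magnitudes themselves are never used).
        interpret: maps a word string to a list of logical forms, empty if
        no complete interpretation passes the constraints.
        """
        for hypothesis in nbest_hypotheses:
            logical_forms = interpret(hypothesis)
            if logical_forms:
                # Remaining hypotheses are discarded here, even though one
                # of them might support a more plausible interpretation.
                return hypothesis, logical_forms
        return None, []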
Syntax  The syntactic rules in G_syn relate 'category' predicates c_0, c_1, c_2 holding of a string and two spanning substrings (we limit the rules here to two daughters for simplicity):

    c_0(s_0) ∧ daughters(s_0, s_1, s_2) ←
        c_1(s_1) ∧ c_2(s_2) ∧ (s_0 = concat(s_1, s_2)).

(Here, and subsequently, variables like s_0 and s_1 are implicitly universally quantified.) G_syn also includes lexical axioms for particular strings w consisting of single words:

    c_1(w), ...

For a feature-based grammar, these rules can include conjuncts constraining the values, a_1, a_2, ..., of discrete-valued functions f on the strings:

    f(w) = a_1,    f(s_0) = f(s_1).

The main problem here is that such grammars have no notion of a degree of grammatical acceptability: a sentence is either grammatical or ungrammatical. For small grammars this means that perfectly acceptable strings are often rejected; for large grammars we get a vast number of alternative trees, so the chance of selecting the correct tree for simple sentences can get worse as the grammar coverage increases. There is also the problem of requiring increasingly complex feature sets to describe idiosyncrasies in the lexicon.

Semantics  Semantic grammar axioms belonging to G_sem specify a 'composition' function g for deriving a logical form for a phrase from those for its subphrases:

    form(s_0, g(φ_1, φ_2)) ←
        daughters(s_0, s_1, s_2) ∧ c_1(s_1) ∧ c_2(s_2) ∧ c_0(s_0)
        ∧ form(s_1, φ_1) ∧ form(s_2, φ_2).

The interpretation rules for strings bottom out in a set of lexical semantic rules associating words with predicates (p_1, p_2, ...) corresponding to 'word senses'. For a particular word and syntactic category, there will be a (small, possibly empty) finite set of such word sense predicates:

    c_i(w) → form(w, p_i1)
    ...
    c_i(w) → form(w, p_im).
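The following toy fragment (a hypothetical Python sketch; the two-word grammar, the categories, and the sense predicates john1, sleep1, sleep2 are invented for illustration) mirrors these axioms: a binary syntactic rule licenses a mother category over two concatenated daughters, and the associated composition function g assembles the mother's logical form from the daughters' forms, enumerating one form(s, φ) pair per sense choice.

    # Lexical semantic axioms: category(word) -> form(word, sense_predicate).
    LEXICON = {
        ("john", "np"): ["john1"],
        ("sleeps", "vp"): ["sleep1", "sleep2"],   # two senses for illustration
    }

    # One binary axiom: s(s0) & daughters(s0,s1,s2) <- np(s1) & vp(s2) & s0 = concat(s1,s2),
    # with composition function g(phi1, phi2) = (phi2, phi1).
    RULES = [("s", "np", "vp", lambda phi1, phi2: (phi2, phi1))]

    def forms(string, category):
        """Enumerate logical forms phi such that form(string, phi) is
        derivable for the given category (the analysis direction)."""
        words = string.split()
        if len(words) == 1:
            return list(LEXICON.get((words[0], category), []))
        results = []
        for mother, c1, c2, g in RULES:
            if mother != category:
                continue
            for split in range(1, len(words)):
                left, right = " ".join(words[:split]), " ".join(words[split:])
                for phi1 in forms(left, c1):
                    for phi2 in forms(right, c2):
                        results.append(g(phi1, phi2))
        return results

    print(forms("john sleeps", "s"))   # [('sleep1', 'john1'), ('sleep2', 'john1')]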
First order logic was assumed as the semantic representation language because it comes with well understood, if not very practical, inferential machinery for constraint solving. However, applying this machinery requires making logical forms fine grained to a degree often not warranted by the information the speaker of an utterance intended to convey. An example of this is explicit scoping, which leads (again) to large numbers of alternatives which the qualitative model has difficulty choosing between. Also, many natural language sentences cannot be expressed in first order logic without resort to elaborate formulas requiring complex semantic composition rules. These rules can be simplified by using a higher order logic, but at the expense of even less practical inferential machinery.

In applying the grammar in generation we are faced with the problem of balancing over- and under-generation by tweaking grammatical constraints, there being no way to prefer fully grammatical target sentences over more marginal ones. Qualitative approaches to grammar tend to emphasize the ability to capture generalizations as the main measure of success in linguistic modeling. This might explain why producing appropriate lexical collocations is rarely addressed seriously in these models, even though lexical collocations are important for fluent generation. The study of collocations for generation fits in more naturally with statistical techniques, as illustrated by Smajda and McKeown (1990).
Interpretation

In the logic-based model, interpretation is the process of identifying, from the possible interpretations φ of s for which form(s, φ) holds, ones that are consistent with the context of interpretation. We can state this as follows:

    R ∪ S ∪ A ⊢ φ.

Here, we have separated the context into a contingent set of contextual propositions S and a set R of (monolingual) 'meaning postulates', or selectional restrictions, that constrain the word sense predicates in all contexts. A is a set of assumptions sufficient to support the interpretation φ given S and R. In other words, this is 'interpretation as abduction' (Hobbs et al. 1988), since abduction, not deduction, is needed to arrive at the assumptions A.

The most common types of meaning postulates in R are those for restriction, hyponymy, and disjointness, expressed as follows:

    p_1(x_1, x_2) → p_2(x_1)    restriction;
    p_5(x) → p_3(x)             hyponymy;
    ¬(p_3(x) ∧ p_4(x))          disjointness.
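As a minimal sketch of how such postulates can act as a filter on candidate sense assignments (hypothetical Python; the predicates and postulates are invented, and forward chaining over ground atoms stands in for a first order prover):

    # Toy meaning postulates over ground atoms such as ("p1", "x1", "x2").
    RESTRICTIONS = {"p1": "p2"}          # p1(x1,x2) -> p2(x1)      (restriction)
    HYPONYMS     = {"p5": "p3"}          # p5(x)     -> p3(x)       (hyponymy)
    DISJOINT     = [("p3", "p4")]        # not (p3(x) and p4(x))    (disjointness)

    def closure(atoms):
        """Forward-chain the restriction and hyponymy postulates to a fixpoint."""
        atoms = set(atoms)
        changed = True
        while changed:
            changed = False
            for atom in list(atoms):
                pred, args = atom[0], atom[1:]
                new = None
                if pred in RESTRICTIONS and len(args) == 2:
                    new = (RESTRICTIONS[pred], args[0])
                elif pred in HYPONYMS and len(args) == 1:
                    new = (HYPONYMS[pred], args[0])
                if new and new not in atoms:
                    atoms.add(new)
                    changed = True
        return atoms

    def consistent(atoms):
        """Reject sense assignments whose closure violates disjointness."""
        atoms = closure(atoms)
        for p, q in DISJOINT:
            for atom in atoms:
                if atom[0] == p and (q,) + atom[1:] in atoms:
                    return False
        return True

    # Candidate sense assignments for one sentence; only consistent ones survive.
    candidates = [
        {("p1", "x1", "x2"), ("p5", "x1")},                  # consistent
        {("p1", "x1", "x2"), ("p5", "x1"), ("p4", "x1")},    # p3/p4 clash after closure
    ]
    print([consistent(c) for c in candidates])   # [True, False]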
Although there are compilation techniques (e.g. Mellish 1988) which allow selectional constraints stated in this fashion to be implemented efficiently, the scheme is problematic in other respects. To start with, the assumption of a small set of senses for a word is at best awkward because it is difficult to arrive at an optimal granularity for sense distinctions. Disambiguation with selectional restrictions expressed as meaning postulates is also problematic because it is virtually impossible to devise a set of postulates that will always filter all but one alternative. We are thus forced to under-filter and make an arbitrary choice between remaining alternatives.

Logic based translation

In both the quantitative and qualitative models we take a transfer approach to translation. We do not depend on interlingual symbols, but instead map a representation with constants associated with the source language into a corresponding expression with constants from the target language. For the qualitative model, the operable notion of correspondence is based on logical equivalence, and the constants are source word sense predicates p_1, p_2, ... and target sense predicates q_1, q_2, ....

More specifically, we will say the translation relation between a source logical form φ_s and a target logical form φ_t holds if we have

    R ∪ S ∪ A' ⊢ (φ_s ↔ φ_t)

where R is a set of monolingual and bilingual meaning postulates, and S is a set of formulas characterizing the current context. A' is a set of assumptions that includes the assumptions A which supported φ_s. Here bilingual meaning postulates are first order axioms relating source and target sense predicates. A typical bilingual postulate for translating between p_1 and q_1 might be of the form:

    p_5(x_1) → (p_1(x_1, x_2) ↔ q_1(x_1, x_2)).

The need for the assumptions A' arises when a source language word is vaguer than its possible translations in the target language, so different choices of target words will correspond to translations under different assumptions. For example, the condition p_5(x_1) above might be proved from the input logical form, or it might need to be assumed.
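The following sketch (hypothetical Python with invented predicates) illustrates this: each bilingual postulate licenses a source-to-target predicate correspondence under a condition, and any condition that cannot be proved from the input is added to the assumption set, so a vague source predicate yields several target choices under different assumptions.

    # Bilingual postulates: condition -> (source predicate <-> target predicate),
    # e.g. p5(x1) -> (p1(x1,x2) <-> q1(x1,x2)) is encoded as ("p5", "p1", "q1").
    BILINGUAL = [("p5", "p1", "q1"), ("p6", "p1", "q2")]

    def transfers(source_atom, proved):
        """Enumerate (assumptions A', target atom) pairs for one source atom,
        assuming any postulate condition that cannot be proved."""
        solutions = []
        for cond, src, tgt in BILINGUAL:
            if source_atom[0] == src:
                condition = (cond, source_atom[1])          # e.g. ("p5", "x1")
                assumptions = set() if condition in proved else {condition}
                solutions.append((assumptions, (tgt,) + source_atom[1:]))
        return solutions

    # The vague source predicate p1 translates as q1 or q2, each under a
    # different assumption about its first argument.
    for assumptions, target in transfers(("p1", "x1", "x2"), proved=set()):
        print(assumptions, target)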
In the general case, finding solutions (i.e. A', φ_t pairs) for the abductive schema is an undecidable theorem proving problem. This can be alleviated by placing restrictions on the form of meaning postulates and input formulas and by using heuristic search methods. Although such an approach was applied with some success in a limited-domain system translating logical forms into database queries (Rayner and Alshawi 1992), it is likely to be impractical for language translation with tens of thousands of sense predicates and related axioms.

Setting aside the intractability issue, this approach does not offer a principled way of choosing between alternative solutions proposed by the prover. One would like to prefer solutions with 'minimal' sets of assumptions, but it is difficult to find motivated definitions for this minimization in a purely qualitative framework.

4. Quantitative Model Components

Moving to a Quantitative Model

In moving to a quantitative architecture, we propose to retain many of the basic characteristics of the qualitative model:

• A transfer organization with analysis, transfer, and generation components.
• Monolingual models that can be used for both analysis and generation.
• Translation models that exclusively code contrastive (cross-linguistic) information.
• Hierarchical phrases capturing recursive linguistic structure.

Instead of feature based syntax trees and first-order logical forms, we will adopt a simpler, monostratal representation that is more closely related to those found in dependency grammars (e.g. Hudson 1984). Dependency representations have been used in large scale qualitative machine translation systems, notably by McCord (1988). The notion of a lexical 'head' of a phrase is central to these representations because they concentrate on relations between such lexical heads. In our case, the dependency representation is monostratal in that the relations may include ones normally classified as belonging to syntax, semantics or pragmatics.

One salient property of our language model is that it is strongly lexical: it consists of statistical parameters associated with relations between lexical items and with the number and ordering of dependents of lexical heads. This lexical anchoring facilitates statistical training and sensitivity to lexical variation and collocations.
In order to gain the benefits of probabilistic modeling, we replace the task of developing large rule sets with the task of estimating large numbers of statistical parameters for the monolingual and translation models. This gives rise to a new cost trade-off in human annotation/judgement versus barely tractable fully automatic training. It also necessitates further research on lexical similarity and clustering (e.g. Pereira, Tishby and Lee 1993; Dagan, Marcus and Markovitch 1993) to improve parameter estimation from sparse data.

Translation via Lexical Relation Graphs

The model associates phrases with relation graphs. A relation graph is a directed labeled graph consisting of a set of relation edges. Each edge has the form of an atomic proposition

    r(w_i, w_j)

where r is a relation symbol, w_i is the lexical head of a phrase and w_j is the lexical head of another phrase (typically a subphrase of the phrase headed by w_i). The nodes w_i and w_j are word occurrences representable by a word and an index, the indices uniquely identifying particular occurrences of the words in a discourse or corpus. The set of relation symbols is open ended, but the first argument of the relation is always interpreted as the head and the second as the dependent with respect to this relation. The relations in the models for the source and target languages need not be the same, or even overlap. To keep the language models simple, we will mainly restrict ourselves here to dependency graphs that are trees with unordered siblings. In particular, phrases will always be contiguous strings of words and dependents will always be heads of subphrases.
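A relation graph has a very direct representation; the sketch below (hypothetical Python, with invented relation names and an invented example phrase) encodes word occurrences as (word, index) pairs and a graph as a set of labeled edges whose first node is the head and whose second is the dependent.

    from typing import NamedTuple, Set, Tuple

    WordOccurrence = Tuple[str, int]           # a word plus an occurrence index

    class Edge(NamedTuple):
        relation: str                          # open-ended relation symbol r
        head: WordOccurrence                   # w_i, lexical head of the phrase
        dependent: WordOccurrence              # w_j, head of a subphrase

    RelationGraph = Set[Edge]

    # Relation graph for the (invented) phrase "booked a flight to Boston",
    # headed by the occurrence ("booked", 1); a tree with unordered siblings.
    graph: RelationGraph = {
        Edge("object",      ("booked", 1), ("flight", 3)),
        Edge("determiner",  ("flight", 3), ("a", 2)),
        Edge("destination", ("flight", 3), ("Boston", 5)),
    }

    def dependents(graph: RelationGraph, head: WordOccurrence) -> Set[Edge]:
        """Edges of the local tree rooted at the given head occurrence."""
        return {edge for edge in graph if edge.head == head}

    print(dependents(graph, ("flight", 3)))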
Ignoring algorithmic issues relating to compactly representing and efficiently searching the space of alternative hypotheses, the overall design of the quantitative system is as follows. The speech recognizer produces a set of word-position hypotheses (perhaps in the form of a word lattice) corresponding to a set of string hypotheses for the input. The source language model is used to compute a set of possible relation graphs, with associated probabilities, for each string hypothesis. A probabilistic graph translation model then provides, for each source relation graph, the probabilities of deriving corresponding graphs with word occurrences from the target language. These target graphs include all the words of possible translations of the utterance hypotheses but do not specify the surface order of these words. Probabilities for different possible word orderings are computed according to ordering parameters which form part of the target language model.
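Schematically, these stages might be composed as follows (a hypothetical Python sketch; each stage is passed in as a function returning (candidate, probability) pairs, and the probability bookkeeping here simply anticipates the combination worked out in the next section):

    from collections import defaultdict

    def translate(speech, recognize, analyze, transfer, order):
        """Schematic pipeline: speech -> string hypotheses -> source relation
        graphs -> target relation graphs -> ordered target word strings.
        Path probabilities are summed per target string."""
        scores = defaultdict(float)
        for w_s, p_ws in recognize(speech):          # string hypotheses
            for c_s, p_cs in analyze(w_s):           # source relation graphs
                for c_t, p_ct in transfer(c_s):      # target relation graphs
                    for w_t, p_wt in order(c_t):     # target word orderings
                        scores[w_t] += p_ws * p_cs * p_ct * p_wt
        return max(scores, key=scores.get) if scores else None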
Ignoring algorithmic issues relating to compactly rep- P(CdCo, W,, A°) PCWd(:,, C,, W.,, ,4, ).
resenting and efficiently searching the space of alterna-
tive hypotheses, the overall design of the quantitative
system is as follows. The speech recognizer produces We now apply some simplifying independence .s-
a set of word-position hypotheses (perhaps in the form sumptions concerning relation graphs. Specifically. that
of a word lattice) corresponding to a set of string hy- their derivation from word strings is independent of
potheses for the input. The source language model is acoustic information; that their translation is indepen-
used to compute a set of possible relation graphs, with dent of the original words and acoustics involved; and
associated probabilities, for each string hypothesis. A that target word string generation from target relation
probabilistic graph translation model then provides, for edges is independent of the source language represent, a-
each source relation graph, the probabilities of deriving tions. The extent to which these (Markovian) assump-
corresponding graphs with word occurrences from the tions hold depend on the extent to which relation edges
target language. These target graphs include all the represent all the relevant information for translation.
words of possible translations of the utterance hypothe- In particular it means they should express aspects of
ses but do not specify the surface order of these words. surface relevant to meaning, such as topicalization, as
Probabilities for different possible word orderings are well as predicate argument structure. In any case, the
computed according to ordering parameters which form simplifying assumptions give the following:
part of the target language model.
In the following section we explain how the probabil- P(W~IA, ) _~
ities for these various processing stages are combined to ~w.,c.,c, P( W, IA, ) P(C01W,)P( Ct lCo ) P( Wt I£:, ).
select the most likely target word sequence. This word
sequence can then be handed to the speech synthesizer. This can be rewritten with two applications of Bay,,~
This can be rewritten with two applications of Bayes' rule:

    P(W_t | A_s) ≈
        Σ_{W_s, C_s, C_t} P(A_s | W_s) (1 / P(A_s)) P(W_s | C_s)
                          P(C_s) P(C_t | C_s) P(W_t | C_t).
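Spelling out the algebra (a sketch in LaTeX, using the same symbols as above): the two Bayes rule applications are to P(W_s | A_s) and to P(C_s | W_s), and the P(W_s) terms cancel when the two factors are multiplied.

    P(W_s \mid A_s) = \frac{P(A_s \mid W_s)\, P(W_s)}{P(A_s)},
    \qquad
    P(C_s \mid W_s) = \frac{P(W_s \mid C_s)\, P(C_s)}{P(W_s)},

    P(W_s \mid A_s)\, P(C_s \mid W_s)
        = P(A_s \mid W_s)\, \frac{1}{P(A_s)}\, P(W_s \mid C_s)\, P(C_s).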
Since A_s is given, 1/P(A_s) is a constant which can be ignored in finding the maximum of P(W_t | A_s). Determining the W_t that maximizes P(W_t | A_s) therefore involves the following factors:

• P(A_s | W_s): source language acoustics
• P(W_s | C_s): source language generation
• P(C_s): source content relations
• P(C_t | C_s): source to target transfer
• P(W_t | C_t): target language generation
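In practice, each factor would be supplied by a separate model and combined in the log domain, roughly as in this sketch (hypothetical Python; the acoustic term is taken to be proportional to log P(A_s | W_s), and paths sharing the same W_t would additionally have their probabilities summed as in the formula above):

    import math

    def path_log_score(acoustic, p_ws_given_cs, p_cs, p_ct_given_cs, p_wt_given_ct):
        """Log score of one (W_s, C_s, C_t, W_t) path under the factored model.
        The constant 1/P(A_s) is dropped, since it does not affect which
        target string W_t maximizes the total."""
        return (acoustic                        # ~ log P(A_s | W_s), from the recognizer
                + math.log(p_ws_given_cs)       # source language generation
                + math.log(p_cs)                # source content relations
                + math.log(p_ct_given_cs)       # source to target transfer
                + math.log(p_wt_given_ct))      # target language generation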
We assume that the speech recognizer provides acoustic scores proportional to P(A_s | W_s) (or logs thereof). Such scores are normally computed by speech recognition systems, although they are usually also multiplied by word-based language model probabilities P(W_s), which we do not require in this application context. Our approach to language modeling, which covers the content analysis and language generation factors, is presented in section 5, and the transfer probabilities fall under the translation model of section 6.

Finally, note that by another application of Bayes' rule we can replace the two factors P(C_s) P(C_t | C_s) by P(C_t) P(C_s | C_t) without changing other parts of the model. This latter formulation allows us to apply constraints imposed by the target language model to filter inappropriate possibilities suggested by analysis and transfer. In some respects this is similar to Dagan and Itai's (1994) approach to word sense disambiguation using statistical associations in a second language.

5. Language Models

Language Production Model

Our language model can be viewed in terms of a probabilistic generative process based on the choice of lexical 'heads' of phrases and the recursive generation of subphrases and their ordering. For this purpose, we can define the head word of a phrase to be the word that most strongly influences the way the phrase may be combined with other phrases. This notion has been central to a number of approaches to grammar for some time, including theories like dependency grammar (Hudson 1976, 1990) and HPSG (Pollard and Sag 1987). More recently, the statistical properties of associations between words, and more particularly heads of phrases, have become an active area of research (e.g. Chang, Luo, and Su 1992; Hindle and Rooth 1993).

The language model factors the statistical derivation of a sentence with word string W as follows:

    P(W) = Σ_C P(C) P(W | C)

where C ranges over relation graphs. The content model, P(C), and generation model, P(W | C), are components of the overall statistical model for spoken language translation given earlier. This decomposition of P(W) can be viewed as first deciding on the content of a sentence, formulated as a set of relation edges according to a statistical model for P(C), and then deciding on word order according to P(W | C).

Of course, this decomposition simplifies the realities of language production in that real language is always generated in the context of some situation S (real or imaginary), so a more comprehensive model would be concerned with P(C | S), i.e. language production in context. This is less important, however, in the translation setting, since we produce C_t in the context of a source relation graph C_s and we assume the availability of a model for P(C_t | C_s).

Content Derivation Model

The model for deriving the relation graph of a phrase is taken to consist of choosing a lexical head h_0 for the phrase (what the phrase is 'about') followed by a series of 'node expansion' steps. An expansion step takes a node and chooses a possibly empty set of edges (relation labels and ending nodes) starting from that node. Here we consider only the case of relation graphs that are trees with unordered siblings.

To start with, let us take the simplified case where a head word h has no optional or duplicated dependents (i.e. exactly one for each relation). There will be a set of edges

    E(h) = {r_1(h, w_1), r_2(h, w_2), ..., r_k(h, w_k)}

corresponding to the local tree rooted at h with dependent nodes w_1 ... w_k. The set of relation edges for the entire derivation is the union of these local edge sets.

To determine the probability of deriving a relation graph C for a phrase headed by h_0, we make use of parameters ('dependency parameters')

    P(r(h, w) | h, r)

for the probability, given a node h and a relation r, that w is an r-dependent of h. Under the assumption that the dependents of a head are chosen independently of each other, the probability of deriving C is:

    P(C) = P(Top(h_0)) Π_{r(h,w) ∈ C} P(r(h, w) | h, r)

where P(Top(h_0)) is the probability of choosing h_0 to start the derivation.
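A direct rendering of this product (hypothetical Python; the dependency parameters, the P(Top) value, and the example edges are invented, and word occurrences are abbreviated to bare words):

    # Dependency parameters P(r(h,w) | h, r), indexed by (head word, relation).
    DEP_PARAMS = {
        ("booked", "subject"): {"passenger": 0.6, "agent": 0.4},
        ("booked", "object"):  {"flight": 0.7, "seat": 0.3},
    }
    TOP = {"booked": 0.05}                      # P(Top(h_0))

    def graph_probability(h0, edges):
        """P(C) = P(Top(h0)) * product over edges r(h,w) in C of P(r(h,w) | h, r),
        assuming the dependents of a head are chosen independently."""
        p = TOP[h0]
        for relation, head, dependent in edges:
            p *= DEP_PARAMS[(head, relation)][dependent]
        return p

    edges = [("subject", "booked", "passenger"), ("object", "booked", "flight")]
    print(graph_probability("booked", edges))   # 0.05 * 0.6 * 0.7 = 0.021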
If we now remove the assumption made earlier that there is exactly one r-dependent of a head, we need to elaborate the derivation model to include choosing the number of such dependents. We model this by parameters
References

Dagan, I., S. Marcus and S. Markovitch. 1993. "Contextual Word Similarity and Estimation from Sparse Data". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, ACL, 164-171.

Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag. 1985. Generalized Phrase Structure Grammar. Oxford: Blackwell.

Hindle, D. and M. Rooth. 1993. "Structural Ambiguity and Lexical Relations". Computational Linguistics 19:103-120.

Hobbs, J.R., M. Stickel, P. Martin and D. Edwards. 1988. "Interpretation as Abduction". Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, Buffalo, New York, 95-103.

Hudson, R.A. 1984. Word Grammar. Oxford: Blackwell.

Isabelle, P. and E. Macklovitch. 1986. "Transfer and MT Modularity". Eleventh International Conference on Computational Linguistics, Bonn, 115-117.

Jelinek, F., R.L. Mercer and S. Roukos. 1992. "Principles of Lexical Language Modeling for Speech Recognition". In S. Furui and M.M. Sondhi (eds.), Advances in Speech Signal Processing, New York: Marcel Dekker Inc.

Mellish, C.S. 1988. "Implementing Systemic Classification by Unification". Computational Linguistics 14:40-51.

McCord, M. 1988. "A Multi-Target Machine Translation System". Proceedings of the International Conference on Fifth Generation Computer Systems, Tokyo, Japan, 1141-1149.

Pereira, F., N. Tishby and L. Lee. 1993. "Distributional Clustering of English Words". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, ACL, 183-190.

Pollard, C.J. and I.A. Sag. 1987. Information Based Syntax and Semantics: Volume 1, Fundamentals. CSLI Lecture Notes, Number 13. Center for the Study of Language and Information, Stanford, California.

Rayner, M. and H. Alshawi. 1992. "Deriving Database Queries from Logical Forms by Abductive Definition Expansion". Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy.

Richard, M.D. and R.P. Lippmann. 1991. "Neural Network Classifiers Estimate Bayesian a posteriori Probabilities". Neural Computation 3:461-483.

Shieber, S.M. 1986. An Introduction to Unification-Based Approaches to Grammar. CSLI Lecture Notes, Number 4. Center for the Study of Language and Information, Stanford, California.

Smajda, F. and K. McKeown. 1990. "Automatically Extracting and Representing Collocations for Language Generation". Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, Pittsburgh.

Taylor, L., C. Grover, and E.J. Briscoe. 1989. "The Syntactic Regularity of English Noun Phrases". Proceedings of the 4th European ACL Conference, 256-263.

Weaver, W. 1955. "Translation". In W. Locke and A. Booth (eds.), Machine Translation of Languages, Cambridge, Mass.: MIT Press.