Review of Hypothesis Alignment Algorithms For MT System Combination Via Confusion Network Decoding
J=0 S=0 E=1 SC=(1,1,0) W=twelve
J=1 S=0 E=1 SC=(0,0,1) W=dozen
J=2 S=1 E=2 SC=(1,0,0) W=big
J=3 S=1 E=2 SC=(0,1,1) W=NULL
J=4 S=2 E=3 SC=(1,0,1) W=blue
J=5 S=2 E=3 SC=(0,1,0) W=NULL
J=6 S=3 E=4 SC=(1,1,1) W=cars

Figure 2: A lattice in text format representing the confusion network in Figure 1. J is the arc index, S and E are the start and end node indexes, SC is a vector of arc scores, and W is the word label.

system was aligned to a given link². These may be viewed as system specific word confidences, which are binary when aligning 1-best system outputs. If no word from a hypothesis is aligned to a given link, a NULL word token is generated provided one does not already exist, and the corresponding element in the NULL word token is set to one. The system specific word scores are kept separate in order to exploit system weights in decoding. Given system weights $w_n$, which sum to one, and system specific word scores $s_{nj}$ for each arc $j$ (the SC elements), the weighted word scores are defined as:

$$s_j = \sum_{n=1}^{N_s} w_n s_{nj} \tag{1}$$

where $N_s$ is the number of input systems. The hypothesis score is defined as the sum of the log-word-scores along the path, which is linearly interpolated with a logarithm of the language model (LM) score and a non-NULL word count:

$$S(E|F) = \sum_{j \in \mathcal{J}(E)} \log s_j + \gamma S_{LM}(E) + \delta N_w(E) \tag{2}$$

where $\mathcal{J}(E)$ is the sequence of arcs generating the hypothesis $E$ for the source sentence $F$, $S_{LM}(E)$ is the LM score, and $N_w(E)$ is the number of non-NULL words. The set of weights $\theta = \{w_1, \ldots, w_{N_s}, \gamma, \delta\}$ can be tuned so as to optimize an evaluation metric on a development set.

² A link is used as a synonym for the set of arcs between two consecutive nodes. The name refers to the confusion network structure’s resemblance to a sausage.

Decoding with an n-gram language model requires expanding the lattice to distinguish paths with unique n-gram contexts before LM scores can be assigned to the arcs. Using long n-gram context may require pruning to reduce memory usage. Given uniform initial system weights, pruning may remove desirable paths. In this work, the lattices were expanded to bi-gram context and no pruning was performed. A set of bi-gram decoding weights was tuned directly on the expanded lattices using a distributed optimizer (Rosti et al., 2010). Since the score in Equation 2 is not a simple log-linear interpolation, the standard minimum error rate training (Och, 2003) with exact line search cannot be used. Instead, downhill simplex (Press et al., 2007) was used in the optimizer client. After bi-gram decoding weight optimization, another set of 5-gram re-scoring weights was tuned on 300-best lists generated from the bi-gram expanded lattices.

3 Hypothesis Alignment Algorithms

Two different methods have been proposed for building confusion networks: pairwise and incremental alignment. In pairwise alignment, each hypothesis corresponding to a source sentence is aligned independently with the skeleton hypothesis. This set of alignments is consolidated using the skeleton words as anchors to form the confusion network (Matusov et al., 2006; Sim et al., 2007). The same word in two hypotheses may be aligned with a different word in the skeleton, resulting in repetition in the network. A two-pass alignment algorithm to improve pairwise TER alignments was introduced by Ayan et al. (2008). In incremental alignment (Rosti et al., 2008), the confusion network is initialized by forming a simple graph with one word per link from the skeleton hypothesis. Each remaining hypothesis is aligned with the partial confusion network, which allows words from all previous hypotheses to be considered as matches. The order in which the hypotheses are aligned may influence the alignment quality. Rosti et al. (2009) proposed a sentence specific alignment order by choosing the unaligned hypothesis closest to the partial confusion network according to TER. The following five alignment algorithms were used in this study.
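To make the incremental procedure concrete, the following is a minimal sketch rather than any of the aligners evaluated in this paper. It assumes plain word-level edit distance without block shifts (TER allows shifts), a fixed skeleton (the first hypothesis), and a fixed alignment order instead of the sentence specific order. The three hypotheses are read off the SC vectors in Figure 2, and all function names are hypothetical.

```python
NULL = "NULL"

def skip_cost(link):
    # Skipping a link is free if an earlier hypothesis already put NULL there.
    return 0 if NULL in link else 1

def align(network, hyp):
    """Edit-distance alignment (no shifts) of hypothesis words to network links.

    Returns (op, link_index, word_index) triples in left-to-right order, where
    op is "match" (exact match or substitution), "del" (link skipped) or
    "ins" (word needs a new link).
    """
    n, m = len(network), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = cost[i - 1][0] + skip_cost(network[i - 1]), "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = cost[0][j - 1] + 1, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            link, word = network[i - 1], hyp[j - 1]
            options = [
                (cost[i - 1][j - 1] + (0 if word in link else 1), "match"),
                (cost[i - 1][j] + skip_cost(link), "del"),
                (cost[i][j - 1] + 1, "ins"),
            ]
            cost[i][j], back[i][j] = min(options, key=lambda o: o[0])
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace the cheapest path back to the origin
        op = back[i][j]
        if op == "match":
            i, j = i - 1, j - 1
            ops.append((op, i, j))
        elif op == "del":
            i -= 1
            ops.append((op, i, None))
        else:
            j -= 1
            ops.append((op, None, j))
    return ops[::-1]

def build_network(hyps, skeleton=0):
    """Incrementally build a confusion network; each link maps word -> votes."""
    n_sys = len(hyps)

    def vote(systems):
        return [1 if s in systems else 0 for s in range(n_sys)]

    network = [{word: vote({skeleton})} for word in hyps[skeleton]]
    done = {skeleton}
    for s in range(n_sys):
        if s == skeleton:
            continue
        new_network = []
        for op, i, j in align(network, hyps[s]):
            if op == "ins":
                # New link: all previously aligned systems contribute NULL.
                new_network.append({hyps[s][j]: vote({s}), NULL: vote(done)})
                continue
            link = network[i]
            word = hyps[s][j] if op == "match" else NULL
            link.setdefault(word, vote(set()))[s] = 1
            new_network.append(link)
        network, done = new_network, done | {s}
    return network

def to_lattice(network):
    """Return arcs as (J, S, E, SC, W) tuples, mirroring the format of Figure 2."""
    arcs = []
    for node, link in enumerate(network):
        for word, sc in link.items():
            arcs.append((len(arcs), node, node + 1, tuple(sc), word))
    return arcs

hyps = [h.split() for h in ("twelve big blue cars", "twelve cars", "dozen blue cars")]
for j, s, e, sc, w in to_lattice(build_network(hyps)):
    print(f"J={j} S={s} E={e} SC=({','.join(map(str, sc))}) W={w}")
```

With these three hypotheses the sketch prints the seven arcs listed in Figure 2. Because each new hypothesis is aligned against the whole partial network, "dozen" in the third hypothesis can be placed against the link that already holds "twelve", and skipping a link that already carries a NULL arc costs nothing.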
3.1 Pairwise GIZA++ Enhanced Hypothesis Alignment

Matusov et al. (2006) proposed using the GIZA++ Toolkit (Och and Ney, 2003) to align a set of target language translations. A parallel corpus is created in which each system output, acting as a skeleton, appears as a translation of all system outputs corresponding to the same source sentence. The IBM Model 1 (Brown et al., 1993) and hidden Markov model (HMM) (Vogel et al., 1996) are used to estimate the alignment. Alignments from both “translation” directions are used to obtain symmetrized alignments by interpolating the HMM occupation statistics (Matusov et al., 2004). The algorithm may benefit from the fact that it considers the entire test set when estimating the alignment model parameters; i.e., word alignment links from all output sentences influence the estimation, whereas other alignment algorithms only consider words within a pair of sentences (pairwise alignment) or all outputs corresponding to a single source sentence (incremental alignment). However, it does not naturally extend to incremental alignment. The monotone one-to-one alignments are then transformed into a confusion network. This aligner is referred to as GIZA later in this paper.

3.2 Incremental Indirect Hidden Markov Model Alignment

He et al. (2008) proposed using an indirect hidden Markov model (IHMM) for pairwise alignment of system outputs. The parameters of the IHMM are estimated indirectly from a variety of sources including semantic word similarity, surface word similarity, and a distance-based distortion penalty. The alignments between two target language outputs are treated as the hidden states. A standard Viterbi algorithm is used to infer the alignment. The pairwise IHMM was extended to operate incrementally by Li et al. (2009). Sentence specific alignment order is not used by this aligner, which is referred to as iIHMM later in this paper.

3.3 Incremental Inversion Transduction Grammar Alignment with Flexible Matching

Karakos et al. (2008) proposed using inversion transduction grammars (ITG) (Wu, 1997) for pairwise alignment of system outputs. ITGs form an edit distance, invWER (Leusch et al., 2003), that permits properly nested block movements of substrings. For well-formed sentences, this may be more natural than allowing arbitrary shifts. The ITG algorithm is very expensive due to its $O(n^6)$ complexity. The search algorithm for the best ITG alignment, a best-first chart parsing (Charniak et al., 1998), was augmented with an A* search heuristic of quadratic complexity (Klein and Manning, 2003), resulting in a significant reduction in computational complexity. The finite state-machine heuristic computes a lower bound to the alignment cost of two strings by allowing arbitrary word re-orderings. The ITG hypothesis alignment algorithm was extended to operate incrementally by Karakos et al. (2010), and a novel version where the cost function is computed based on the stem/synonym similarity of Snover et al. (2009) was used in this work. Also, a sentence specific alignment order was used. This aligner is referred to as iITGp later in this paper.

3.4 Incremental Translation Edit Rate Alignment with Flexible Matching

Sim et al. (2007) proposed using the translation edit rate (TER) scorer³ to obtain pairwise alignment of system outputs. The TER scorer tries to find shifts of blocks of words that minimize the edit distance between the shifted reference and a hypothesis. Due to the computational complexity, a set of heuristics is used to reduce the run time (Snover et al., 2006). The pairwise TER hypothesis alignment algorithm was extended to operate incrementally by Rosti et al. (2008) and also extended to consider synonym and stem matches by Rosti et al. (2009). The shift heuristics were relaxed for flexible matching to allow shifts of blocks of words as long as the edit distance is decreased even if there is no exact match in the new position. A sentence specific alignment order was used by this aligner, which is referred to as iTER later in this paper.

³ http://www.cs.umd.edu/~snover/tercom/

3.5 Incremental Translation Edit Rate Plus Alignment

Snover et al. (2009) extended TER scoring to consider synonyms and paraphrase matches, called
TER-plus (TERp). The shift heuristics in TERp were also relaxed relative to TER. Shifts are allowed if the words being shifted are: (i) exactly the same, (ii) synonyms, stems or paraphrases of the corresponding reference words, or (iii) any such combination. Xu et al. (2011) proposed using an incremental version of TERp for building consensus networks. A sentence specific alignment order was used by this aligner, which is referred to as iTERp later in this paper.

4 Experimental Evaluation

Combination experiments were performed on (i) Arabic-English, from the informal system combination track of the 2009 NIST Open MT Evaluation⁴; (ii) German-English, from the system combination evaluation of the 2011 Workshop on Statistical Machine Translation (Callison-Burch et al., 2011) (WMT11); and (iii) Spanish-English, again from WMT11. Eight top-performing systems (as evaluated using case-insensitive BLEU) were used in each language pair. Case insensitive BLEU scores for the individual system outputs on the tuning and test sets are shown in Table 1. About 300 and 800 sentences with four reference translations were available for Arabic-English tune and test sets, respectively, and about 500 and 2500 sentences with a single reference translation were available for both German-English and Spanish-English tune and test sets. The system outputs were lower-cased and tokenized before building confusion networks using the five hypothesis alignment algorithms described above. Unpruned English bi-gram and 5-gram language models were trained with about 6 billion words available for these evaluations. Multiple component language models were trained after dividing the monolingual corpora by source. Separate sets of interpolation weights were tuned for the NIST and WMT experiments to minimize perplexity on the English reference translations of the previous evaluations, NIST MT08 and WMT10. The system combination weights, both bi-gram lattice decoding and 5-gram 300-best list re-scoring weights, were tuned separately for lattices built with each hypothesis alignment algorithm. The final re-scoring outputs were detokenized before computing case insensitive BLEU scores. Statistical significance was computed for each pairwise comparison using bootstrapping (Koehn, 2004).

⁴ http://www.itl.nist.gov/iad/mig/tests/mt/2009/ResultsRelease/indexISC.html

          Decode             Oracle
Aligner   tune    test       tune    test
GIZA      60.06   57.95      75.06   74.47
iTER      59.74   58.63†     73.84   73.20
iTERp     60.18   59.05†     76.43   75.58
iIHMM     60.51   59.27†‡    76.50   76.17
iITGp     60.65   59.37†‡    76.53   76.05

Table 2: Case insensitive BLEU scores for NIST MT09 Arabic-English system combination outputs. Note, four reference translations were available. Decode corresponds to results after weight tuning and Oracle corresponds to graph TER oracle. Dagger (†) denotes statistically significant difference compared to GIZA and double dagger (‡) compared to iTERp and the aligners above it.

The BLEU scores for Arabic-English system combination outputs are shown in Table 2. The first column (Decode) shows the scores on tune and test sets for the decoding outputs. The second column (Oracle) shows the scores for oracle hypotheses obtained by aligning the reference translations with the confusion networks and choosing the path with lowest graph TER (Rosti et al., 2008). The rows representing different aligners are sorted according to the test set decoding scores. The order of the BLEU scores for the oracle translations does not always follow the order for the decoding outputs. This may be due to differences in the compactness of the confusion networks. A more compact network has fewer paths and is therefore less likely to contain significant parts of the reference translation, whereas a reference translation may be generated from a less compact network. On Arabic-English, all incremental alignment algorithms are significantly better than the pairwise GIZA; incremental IHMM and ITG with flexible matching are significantly better than all other algorithms, but not significantly different from each other. The incremental TER and TERp were statistically indistinguishable. Without flexible matching, iITG yields a BLEU score of 58.85 on test. The absolute BLEU gain over the best individual system was between 6.2 and 7.6 points on the test set.
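The pairwise significance comparisons above rely on bootstrap resampling (Koehn, 2004). The following is a minimal sketch of a paired bootstrap comparison, not the evaluation code used for this paper. It assumes a toy corpus-level precision as a stand-in for BLEU, and the per-sentence counts and function names are hypothetical.

```python
import random

def corpus_precision(stats):
    """Toy corpus-level metric: matched words / hypothesis words.

    Stand-in for BLEU (an assumption of this sketch); like BLEU, it is
    computed from per-sentence counts summed over the whole sample.
    """
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total if total else 0.0

def paired_bootstrap(stats_a, stats_b, metric=corpus_precision,
                     n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B."""
    assert len(stats_a) == len(stats_b), "systems must cover the same sentences"
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(n_samples):
        # Draw sentence indexes with replacement; both systems share the sample.
        sample = [rng.randrange(n) for _ in range(n)]
        score_a = metric([stats_a[i] for i in sample])
        score_b = metric([stats_b[i] for i in sample])
        if score_a > score_b:
            wins += 1
    return wins / n_samples

# Hypothetical per-sentence (matched, length) counts for two systems.
system_a = [(9, 10), (7, 10), (8, 10), (6, 10)] * 25
system_b = [(8, 10), (6, 10), (7, 10), (5, 10)] * 25
print(paired_bootstrap(system_a, system_b))
```

In this toy data system A matches one more word than system B on every sentence, so it wins on every resample. With real outputs, a win fraction of at least 0.95 would correspond to significance at the 0.05 level under this procedure.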
          Arabic           German           Spanish
System    tune    test     tune    test     tune    test
A         48.84   48.54    21.96   21.41    27.71   27.13
B         49.15   48.97    22.61   21.80    28.42   27.90
C         49.30   49.50    22.77   21.99    28.57   28.23
D         49.38   49.59    22.90   22.41    29.00   28.41
E         49.42   49.75    22.90   22.65    29.15   28.50
F         50.28   50.69    22.98   22.65    29.53   28.61
G         51.49   50.81    23.41   23.06    29.89   29.82
H         51.72   51.74    24.28   24.16    30.55   30.14

Table 1: Case insensitive BLEU scores for the individual system outputs on the tune and test sets for all three source languages.

Table 3: Case insensitive BLEU scores for WMT11 German-English system combination outputs. Note, only a single reference translation per segment was available. Decode corresponds to results after weight tuning and Oracle corresponds to graph TER oracle. Dagger (†) denotes statistically significant difference compared to iTERp and GIZA.

Table 4: Case insensitive BLEU scores for WMT11 Spanish-English system combination outputs. Note, only a single reference translation per segment was available. Decode corresponds to results after weight tuning and Oracle corresponds to graph TER oracle. Dagger (†) denotes statistically significant difference compared to aligners above iIHMM.
down into insertions/deletions/substitutions/shifts based on the TER scorer.

The error counts at the document level were aggregated. For each document in each collection, the number of errors of each type that resulted from each individual system as well as each system combination was measured, and their difference was computed. If the differences are mostly positive, then it can be said (with some confidence) that system combination has a significant impact in reducing the error of that type. A paired Wilcoxon test was performed, and its p-value was computed; the p-value quantifies the probability that the measured error reduction was achieved under the null hypothesis that the system combination performs as well as the best system.

Table 5 shows all conditions under consideration. All cases where the p-value is below $10^{-2}$ are considered statistically significant. Two observations are in order: (i) all alignment schemes significantly reduce the number of substitution/shift errors; (ii) in the case of insertions/deletions, there is no clear trend; there are cases where the system combination increases the number of insertions/deletions, compared to the individual systems.

Language  Aligner  ins        del        sub      shft
Arabic    GIZA     2.2e-16    0.9999     2.2e-16  2.2e-16
          iHMM     2.2e-16    0.433      2.2e-16  2.2e-16
          iITGp    0.8279     2.2e-16    2.2e-16  2.2e-16
          iTER     4.994e-07  3.424e-11  2.2e-16  2.2e-16
          iTERp    2.2e-16    1          2.2e-16  2.2e-16
German    GIZA     7.017e-12  2.588e-06  2.2e-16  2.2e-16
          iHMM     6.858e-07  0.4208     2.2e-16  2.2e-16
          iITGp    0.8551     0.2848     2.2e-16  2.2e-16
          iTER     0.2491     1.233e-07  2.2e-16  2.2e-16
          iTERp    0.9997     0.007489   2.2e-16  2.2e-16
Spanish   GIZA     2.2e-16    0.8804     2.2e-16  2.2e-16
          iHMM     2.2e-16    1          2.2e-16  2.2e-16
          iITGp    2.2e-16    0.9999     2.2e-16  2.2e-16
          iTER     2.2e-16    1          2.2e-16  2.2e-16
          iTERp    3.335e-16  1          2.2e-16  2.2e-16

Table 5: p-values which show which error types are statistically significantly improved for each language and aligner.

5.2 Relationship between Word Agreement and Translation Error

This set of experiments aimed to quantify the relationship between the translation error rate and the amount of agreement that resulted from each alignment scheme. The amount of system agreement at a level x is measured by the number of cases (confusion network arcs) where x system outputs contribute the same word in a confusion network bin. For example, the agreement at level 2 is equal to 2 in Figure 1 because there are exactly 2 arcs (with words “twelve” and “blue”) that resulted from the agreement of 2 systems. Similarly, the agreement at level 3 is 1, because there is only 1 arc (with word “cars”) that resulted from the agreement of 3 systems. It is hypothesized that a sufficiently high level of agreement should be indicative of the correctness of a word (and thus indicative of lower TER). The agreement statistics were grouped into two values: the “weak” agreement statistic, where at most half of the combined systems contribute a word, and the “strong” agreement statistic, where more than half of the combined systems contribute a word. To signify the fact that real words and “NULL” tokens have different roles and should be treated separately, two sets of agreement statistics were computed.

A regression with a generalized linear model (glm) was performed to compute the coefficients of the agreement quantities (as explained above) for each alignment scheme, using TER as the target variable. Table 6 shows the regression coefficients; they are all significant at p-value < 0.001. As is clear from this table, the negative coefficient of the “strong” agreement quantity for the non-NULL words points to the fact that good aligners tend to result in reductions in translation error. Furthermore, increasing agreements on NULL tokens does not seem to reduce TER.

          non-NULL words     NULL words
          weak     strong    weak     strong
Arabic    0.087    -0.068    0.192    0.094
German    0.117    -0.067    0.206    0.147
Spanish   0.085    -0.134    0.323    0.102

Table 6: Regression coefficients of the “strong” and “weak” agreement features, as computed with a generalized linear model, using TER as the target variable.

6 Conclusions

This paper presented a systematic comparison of five different hypothesis alignment algorithms for MT system combination via confusion network decoding. Pre-processing, decoding, and weight tuning were controlled and only the alignment algorithm was varied. Translation quality was compared quantitatively using case insensitive BLEU scores. The results showed that confusion network decoding yields a significant gain over the best individual system irrespective of the alignment algorithm. Differences between the combination output using different alignment algorithms were relatively small, but incremental alignment consistently yielded better translation quality compared to pairwise alignment based on these experiments and previously published literature. Incremental IHMM and a novel incremental ITG with flexible matching consistently yielded the highest quality combination outputs. Furthermore, an error analysis shows that most of the performance gains from system combination can be attributed to reductions in substitution errors and word re-ordering errors. Finally, better alignments of system outputs, which tend to cause higher agreement rates on words, correlate with reductions in translation error.

References

Necip Fazil Ayan, Jing Zheng, and Wen Wang. 2008. Improving alignments for better confusion networks for combining machine translation systems. In Proc. Coling, pages 33–40.

Srinivas Bangalore, German Bordel, and Giuseppe Riccardi. 2001. Computing consensus translation from multiple machine translation systems. In Proc. ASRU, pages 351–354.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proc. WMT, pages 22–64.

Eugene Charniak, Sharon Goldwater, and Mark Johnson. 1998. Edge-based best-first chart parsing. In Proc. Sixth Workshop on Very Large Corpora, pages 127–133. Morgan Kaufmann.

Jacob Devlin, Antti-Veikko I. Rosti, Shankar Ananthakrishnan, and Spyros Matsoukas. 2011. System combination using discriminative cross-adaptation. In Proc. IJCNLP, pages 667–675.

Jonathan G. Fiscus. 1997. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In Proc. ASRU, pages 347–354.

Robert Frederking and Sergei Nirenburg. 1994. Three heads are better than one. In Proc. ANLP, pages 95–100.

Xiaodong He and Kristina Toutanova. 2009. Joint optimization for machine translation system combination. In Proc. EMNLP, pages 1202–1211.

Xiaodong He, Mei Yang, Jianfeng Gao, Patrick Nguyen, and Robert Moore. 2008. Indirect-HMM-based hypothesis alignment for combining outputs from machine translation systems. In Proc. EMNLP, pages 98–107.

Almut S. Hildebrand and Stephan Vogel. 2008. Combination of machine translation systems via hypothesis selection from combined n-best lists. In Proc. AMTA, pages 254–261.

Shyamsundar Jayaraman and Alon Lavie. 2005. Multi-engine machine translation guided by explicit word matching. In Proc. EAMT.

Damianos Karakos, Jason Eisner, Sanjeev Khudanpur, and Markus Dreyer. 2008. Machine translation system combination using ITG-based alignments. In Proc. ACL, pages 81–84.

Damianos Karakos, Jason R. Smith, and Sanjeev Khudanpur. 2010. Hypothesis ranking and two-pass approaches for machine translation system combination. In Proc. ICASSP.
Dan Klein and Christopher D. Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In Proc. NAACL, pages 40–47.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. EMNLP, pages 388–395.

Gregor Leusch, Nicola Ueffing, and Hermann Ney. 2003. A novel string-to-string distance measure with applications to machine translation evaluation. In Proc. MT Summit 2003, pages 240–247, September.

Chi-Ho Li, Xiaodong He, Yupeng Liu, and Ning Xi. 2009. Incremental HMM alignment for MT system combination. In Proc. ACL/IJCNLP, pages 949–957.

Lidia Mangu, Eric Brill, and Andreas Stolcke. 2000. Finding consensus in speech recognition: Word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4):373–400.

Evgeny Matusov, Richard Zens, and Hermann Ney. 2004. Symmetric word alignments for statistical machine translation. In Proc. COLING, pages 219–225.

Evgeny Matusov, Nicola Ueffing, and Hermann Ney. 2006. Computing consensus translation from multiple machine translation systems using enhanced hypotheses alignment. In Proc. EACL, pages 33–40.

Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Franz J. Och. 2003. Minimum error rate training in statistical machine translation. In Proc. ACL, pages 160–167.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL, pages 311–318.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 2007. Numerical recipes: the art of scientific computing. Cambridge University Press, 3rd edition.

Antti-Veikko I. Rosti, Spyros Matsoukas, and Richard Schwartz. 2007a. Improved word-level system combination for machine translation. In Proc. ACL, pages 312–319.

Antti-Veikko I. Rosti, Bing Xiang, Spyros Matsoukas, Richard Schwartz, Necip Fazil Ayan, and Bonnie J. Dorr. 2007b. Combining outputs from multiple machine translation systems. In Proc. NAACL-HLT, pages 228–235.

Antti-Veikko I. Rosti, Bing Zhang, Spyros Matsoukas, and Richard Schwartz. 2008. Incremental hypothesis alignment for building confusion networks with application to machine translation system combination. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 183–186.

Antti-Veikko I. Rosti, Bing Zhang, Spyros Matsoukas, and Richard Schwartz. 2009. Incremental hypothesis alignment with flexible matching for building confusion networks: BBN system description for WMT09 system combination task. In Proc. WMT, pages 61–65.

Antti-Veikko I. Rosti, Bing Zhang, Spyros Matsoukas, and Richard Schwartz. 2010. BBN system description for WMT10 system combination task. In Proc. WMT, pages 321–326.

Antti-Veikko I. Rosti, Evgeny Matusov, Jason Smith, Necip Fazil Ayan, Jason Eisner, Damianos Karakos, Sanjeev Khudanpur, Gregor Leusch, Zhifei Li, Spyros Matsoukas, Hermann Ney, Richard Schwartz, Bing Zhang, and Jing Zheng. 2011. Confusion network decoding for MT system combination. In Joseph Olive, Caitlin Christianson, and John McCary, editors, Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, pages 333–361. Springer.

Khe Chai Sim, William J. Byrne, Mark J.F. Gales, Hichem Sahbi, and Phil C. Woodland. 2007. Consensus network decoding for statistical machine translation system combination. In Proc. ICASSP.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proc. AMTA, pages 223–231.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy or HTER? Exploring different human judgments with a tunable MT metric. In Proc. WMT, pages 259–268.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proc. ICCL, pages 836–841.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403, September.

Daguang Xu, Yuan Cao, and Damianos Karakos. 2011. Description of the JHU system combination scheme for WMT 2011. In Proc. WMT, pages 171–176.