S2MT Paper
Table 2: Percentage of UCCA divergences according to their types (columns) that have certain properties (rows). All numbers
are percentages computed over all UCCA divergences of the given type. Note that the properties are not mutually exclusive
(see text). Participant and Adverbial divergences are only evaluated on passages with no Scene divergences.
nouvel abordage”. The French noun “victime” describes a result, while the corresponding English “victimized” is an action. The unaligned Scene is in English. It is therefore an English Scene divergence. In the example “He slowly ran”/“Il a couru” we saw above, there are no Scene divergences, but the English Adverbial “slowly” is unaligned, creating an English Adverbial divergence.

7.2 Number of UCCA Divergences

The analysis of Scene divergences is performed manually over the entire set of passages. The analysis of Participant and Adverbial divergences is restricted to passages with no Scene divergences, i.e., with a perfect Scene-to-Scene correspondence (57 passages of the total 154). This permits the capture of lower-level divergences which are not just consequences of the divergences at the Scene level.

We found a total of 112 English Scene divergences and 72 French ones. This amounted to 92.3% of the English Scenes having a French correspondent and 94.9% of the French Scenes having an English correspondent. Only 25% of the sentences (148 out of 583) contain any Scene divergences.

Concerning Participant divergences, we found that 694 out of 738 English Participants (94.0%) have a correspondent in French. 694 of the 728 French Participants (95.3%) have a correspondent in English. 100 out of the 124 English Adverbials (80.6%) have a correspondent in French, and 100 out of the 126 French Adverbials (79.4%) have a correspondent in English. Thus, our results show low rates of UCCA French-English divergences.

We also conduct a preliminary study into the applicability of another semantic scheme, namely AMR, to our domain. We annotate 10 sentence pairs with AMR. Our analysis shows that AMR conserves the main structures in most sentences (7 out of 10), and suggests that other semantic annotations may also be structurally stable. However, semantic roles, used in PropBank and AMR, are often a source of divergences across the languages.

7.3 Properties of UCCA Divergences

In order to examine the causes and semantic types of the different divergences, we manually classified each of them according to three groups of properties, which are not mutually exclusive. The results of the divergence analysis are presented in Table 2.

Translation study: The properties in this group investigate whether a given UCCA divergence can be avoided using a different formulation closer to the one used in the other language. This approach evaluates the translator’s choices and creativity. Properties #1 and #2 check whether a different formulation on the source and target side, respectively, would avoid the UCCA divergence. Results show that many of the divergences can indeed be ascribed to the specific translation selected. For example, fewer than a third of the Scene divergences in each language could not have been avoided through a different translation. We thus speculate that in a more technical and less literary corpus, the number of UCCA divergences will be lower.

Annotation study: These properties study the influence of the annotator’s preferences. Property #3 (conforming analysis) covers cases where UCCA allows another analysis which would have avoided the divergence. While both annotations are permitted, one of them is sometimes preferred, to capture a nuance of meaning conveyed by one language but not the other. Property #4 refers to
Replaced by                Scene Div.        Participant Div.  Adverbial Div.
                           Eng.     Fre.     Eng.     Fre.     Eng.     Fre.
Linker                     6.25     1.39     –        –        8.33     7.69
Ground                     1.79     1.39     –        –        4.17     3.85
Elaborator of Participant  –        –        0        2.94     4.17     19.23
Main Relation              –        –        20.45∗   20.59∗   25.0∗    26.92∗
Parallel Scene             –        –        13.64    2.94     –        –
Participant                –        –        –        –        4.17     11.54
Adverbial                  –        –        6.82     2.94     –        –
2 Participants             –        –        11.36    2.94     –        –
2 Adverbials               –        –        –        –        4.17     0.0
None                       91.96    98.21    47.73    67.65    50.0     30.77
Table 3: Analysis of divergences in terms of replacements by other UCCA categories. Columns correspond to divergence
types, while rows correspond to the category, as defined in Abend and Rappoport (2013b), of the replacing unit. All numbers
are given in percents. Percentage is taken over all UCCA divergences of the same type. ∗ : In these cases, a Participant or an
Adverbial in one of the languages is included in the meaning of the main relation (Process or State) in the other language.
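The percentages in Table 3 are taken over all divergences of one type. As a minimal sketch of that computation, the Python snippet below recomputes the English Scene divergence column. The counts 7, 2, and 103 are not stated in the paper: they are back-computed here from the reported percentages and the total of 112 English Scene divergences given in Section 7.2, so they should be read as an assumption. The function name `replacement_percentages` is likewise hypothetical.

```python
def replacement_percentages(counts):
    """Return each category's share of all divergences, as a percentage rounded to 2 places."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no divergences to analyze")
    return {category: round(100 * n / total, 2) for category, n in counts.items()}


# Assumed counts (back-computed, not reported in the paper): English Scene
# divergences grouped by the category of the replacing unit. They sum to the
# 112 English Scene divergences reported in Section 7.2.
english_scene_counts = {"Linker": 7, "Ground": 2, "None": 103}

percentages = replacement_percentages(english_scene_counts)
for category, pct in percentages.items():
    print(f"{category:<8}{pct:>6.2f}")
```

Under these assumed counts, the computed values agree with the English Scene column of Table 3 (6.25, 1.79, and 91.96).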
divergences resulting from different readings (ambiguity) allowed by the text, where one meaning was selected in one language and another in the other. The results for this group (properties #3 and #4) reveal that most of the Scene and Adverbial divergences could have been avoided had a different annotation been selected. This suggests that more restrictive annotation guidelines or some post-annotation normalization can substantially reduce the number of divergences.

Effect of the unaligned unit: Divergences are often a result of a semantic or pragmatic difference between the source text and its translation. Property #5 addresses cases where additional information is conveyed by the unaligned unit. Property #6 is a sub-case of #5 that specifically addresses tense information. Property #7 addresses cases where the unaligned unit emphasizes some aspect of meaning. The results show that many divergences can be ascribed to a true semantic difference between the source and the translation.

Finally, in some cases, the UCCA divergences simply replace one UCCA category with another (Table 3). In these cases there are unaligned units in the English and the French sides that roughly correspond to one another semantically, but have different UCCA categories. Cases of replacement are common with Participant and Adverbial divergences, but fairly rare in the case of Scene divergences. In the case of Adverbial divergences, many result from the meaning of an Adverbial in one language being included in the meaning of the main relation (Process or State) in the other language. This can be seen as a generalization of demotional/promotional divergences (Dorr, 1994) discussed in Section 4.2. Annotating secondary verbs (e.g., “begin” or “try”) as Adverbials instead of as part of the main relation, as was done in the latest version of UCCA’s guidelines, may considerably reduce this kind of divergence.

To summarize, our study sheds light on the circumstances in which UCCA divergences arise and suggests how many divergences can be avoided. This study also contributes to the understanding of the differences between original and translated texts, which can improve MT (Lembersky et al., 2013).

8 Conclusion

We showed that basic semantic structures can be stably preserved across English-French translations. This means that semantic structures may be more suitable for SMT systems than syntactic ones, which exhibit well-known divergence phenomena. We used the UCCA scheme, but we expect these advantages to generalize to other structured semantic schemes. Future work will address the integration of UCCA into structure-based SMT either by adding UCCA as features to phrase-based and syntax-based systems, or by replacing existing syntactic structures with UCCA structures. We also plan to investigate related tasks that would benefit from UCCA’s stability, such as bilingual alignment and MT evaluation.

Acknowledgments

We would like to thank Roy Schwartz for helpful comments. This research was supported by the Language, Logic and Cognition Center (LLCC) at the Hebrew University of Jerusalem (for the first author) and by the ERC Advanced Fellowship 249520 GRAMPLUS (for the second author).

References

Anne Abeillé, François Toussenel, and Martine Chéradame. 2004. Corpus le Monde Annotation en constituents Guide pour les correcteurs. http://www.llf.cnrs.fr/Gens/Abeille/guide-annot.pdf.
Omri Abend and Ari Rappoport. 2013a. UCCA: A semantic-based grammatical annotation scheme. In Proc. of IWCS-13, pages 1–12.
Omri Abend and Ari Rappoport. 2013b. Universal Conceptual Cognitive Annotation (UCCA). In Proc. of ACL-13, pages 228–238.
Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proc. of ACL-COLING-98, pages 86–90.
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proc. of Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.
Marzieh Bazrafshan and Daniel Gildea. 2013. Semantic roles for string to tree machine translation. In Proc. of ACL-13 (Short Paper), pages 419–423.
Ann Bies, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing Guidelines for Treebank II Style Penn Treebank Project. Linguistic Data Consortium.
Alexandra Birch, Barry Haddow, Ulrich Germann, Maria Nadejde, Christian Buck, and Philipp Koehn. 2013. The feasibility of HMEANT as a human MT evaluation metric. In Proc. of the 8th Workshop on SMT, ACL-13, pages 52–61.
Alexandra Birch. 2011. Reordering metrics for statistical machine translation. Ph.D. thesis, University of Edinburgh.
Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2009. Using FrameNet for the semantic analysis of German: Annotation, representation and automation. In Hans C. Boas, editor, Multilingual FrameNets in Computational Lexicography - Methods and Applications, pages 209–244, New York/Berlin. Mouton de Gruyter.
David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proc. of ACL-05, pages 263–270.
Yuan Ding and Martha Palmer. 2004. Synchronous dependency insertion grammars: a grammar formalism for syntax based statistical MT. In Workshop on Recent Advances in Dependency Grammars, COLING-04.
Robert M.W. Dixon. 2010a. Basic Linguistic Theory: Grammatical Topics, volume 2. Oxford University Press.
Robert M.W. Dixon. 2010b. Basic Linguistic Theory: Methodology, volume 1. Oxford University Press.
Robert M.W. Dixon. 2012. Basic Linguistic Theory: Further Grammatical Topics, volume 3. Oxford University Press.
Bonnie J. Dorr, Lisa Pearl, Rebecca Hwa, and Nizar Habash. 2002. DUSTer: a method for unraveling cross-language divergences for statistical word-level alignment. In Proc. of AMTA-02, pages 31–43.
Bonnie J. Dorr, Eduard H. Hovy, and Lori S. Levin. 2004. Machine translation: interlingual methods. In Encyclopedia of language and linguistics, 2nd edition. ms.939, Brown, Keith.
Bonnie Dorr, Rebecca Passonneau, David Farwell, Rebecca Green, Nizar Habash, Stephen Helmreich, Edward Hovy, Lori Levin, Keith Miller, Teruko Mitamura, Owen Rambow, and Advaith Siddharthan. 2010. Interlingual annotation of parallel text corpora: A new framework for annotation and evaluation. Natural Language Engineering, pages 197–243.
Bonnie Dorr. 1990. Solving thematic divergences in machine translation. In Proc. of ACL-90, pages 127–134.
Bonnie J. Dorr. 1994. Machine translation divergences: a formal description and proposed solution. Computational Linguistics, 20(4):597–635.
Shimon Edelman and Zach Solan. 2009. Machine translation using automatically inferred construction-based correspondence and language models. In Proc. of PACLIC-09, pages 654–661.
Minwei Feng, Weiwei Sun, and Hermann Ney. 2012. Semantic cohesion model for phrase-based SMT. In Proc. of COLING-12, pages 867–878.
Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the Abstract Meaning Representation. In Proc. of ACL-14, pages 1426–1436.
Pascale Fung, Zhaojun Wu, Yongsheng Yang, and Dekai Wu. 2006. Learning of Chinese/English semantic structure mapping. In Workshop on Spoken Language Technology, IEEE/ACL-06, pages 230–233.
Pascale Fung, Zhaojun Wu, Yongsheng Yang, and Dekai Wu. 2007. Learning bilingual semantic frames: shallow semantic parsing vs. semantic role projection. In Proc. of the 11th Conference on Theoretical and Methodological Issues in Machine Translation (TMI 2007), pages 75–84.
Francesca Gola. 2012. An analysis of translation divergence patterns using PanLex translation pairs. Master’s thesis, University of Washington.
Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning. 2011. Multiword expression identification with Tree Substitution Grammars: a parsing tour de force with French. In Proc. of EMNLP-11, pages 725–735.
Roger Hawkins and Richard Towell. 2001. French Grammar and Usage. McGraw-Hill, 2nd edition.
Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-based machine translation with hyperedge replacement grammars. In Proc. of COLING-12, pages 1359–1376.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proc. of ACL-03, pages 423–430.
Mikhail Kozhevnikov and Ivan Titov. 2013. Cross-lingual transfer of semantic role labeling models. In Proc. of ACL-13, pages 1190–1200.
Ronald W. Langacker. 2008. Cognitive Grammar: A Basic Introduction. Oxford University Press, USA.
Gennadi Lembersky, Noam Ordan, and Shuly Wintner. 2013. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023.
Mike Lewis and Mark Steedman. 2013. Unsupervised induction of cross-lingual semantic relations. In Proc. of EMNLP-13, pages 681–692.
Junhui Li, Philip Resnik, and Hal Daumé III. 2013. Modeling syntactic and semantic structures in hierarchical phrase-based translation. In Proc. of NAACL-HLT-13, pages 540–549.
Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proc. of COLING-10, pages 716–724.
Yang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to-string alignment template for statistical machine translation. In Proc. of COLING-ACL-06, pages 609–616.
Daniel Marcu, Wei Wang, Abdessamad Echihabi, and Kevin Knight. 2006. SPMT: Statistical machine translation with syntactified target language phrases. In Proc. of EMNLP-06, pages 44–52.
Haitao Mi, Liang Huang, and Qun Liu. 2008. Forest-based translation. In Proc. of ACL-08 HLT, pages 192–199.
Sergei Nirenburg. 1989. New developments in knowledge-based machine translation. In Alatis J.E., editor, Georgetown University Round Table on Languages and Linguistics 1989: Language teaching, testing, and technology: lessons from the past with a view toward the future, pages 344–357. Georgetown University Press.
Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):149–159.
Tanja Samardžić, Lonneke van der Plas, Goljihan Kashaeva, and Paola Merlo. 2010a. The scope and the sources of variation in verbal predicates in English and French. In Proc. of the 9th International Workshop on Treebanks and Linguistic Theories, pages 199–211.
Tanja Samardžić, Lonneke van der Plas, Goljihan Kashaeva, and Paola Merlo. 2010b. Variation in verbal predicates in English and French. Generative Grammar in Geneva, 6:109–135.
Elior Sulem. 2014. Integration of a cognitive annotation into machine translation: Theoretical foundations and bilingual corpus analysis. Master’s thesis, Hebrew University of Jerusalem.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of HLT-NAACL-03, pages 252–259.
Hiroshi Uchida. 1987. ATLAS: Fujitsu machine translation system. In Machine Translation Summit, Japan.
Zdeňka Urešová, Ondřej Dušek, Eva Fučíková, Jan Hajič, and Jana Šindlerová. 2015. Bilingual English-Czech valency lexicon linked to a parallel corpus. In Proc. of the 9th Linguistic Annotation Workshop (The LAW IX), pages 124–128.
Lonneke van der Plas, Tanja Samardžić, and Paola Merlo. 2010. Cross-lingual validity of PropBank in the manual annotation of French. In Proc. of the 4th Linguistic Annotation Workshop (The LAW IV), pages 113–117.
Jules Verne. 1870. Vingt Mille Lieues Sous les Mers. J. Hetzel. http://fr.wikisource.org/wiki/Vingt_mille_lieues_sous_les_mers.
Jules Verne. 1991. Twenty Thousand Leagues Under the Sea. Translated from the original French by F.P. Walter. http://jv.gilead.org.il/fpwalter.
Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Nianwen Xue, Martha Palmer, Jena D. Hwang, Claire Bonial, Jinho Choi, Aous Mansouri, Maha Foster, Abdel aati Hawwary, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, and Ann Houston. 2012. OntoNotes Release 5.0. Technical report, Linguistic Data Consortium 2013T19.
Dekai Wu and Pascale Fung. 2009. Semantic roles for SMT: a hybrid two-pass model. In Proc. of NAACL-HLT-09 (Short Paper), pages 13–19.
Deyi Xiong, Min Zhang, and Haizhou Li. 2012. Modeling the translation of predicate-argument structure for SMT. In Proc. of ACL-12, pages 902–911.
Nianwen Xue, Ondřej Bojar, Jan Hajič, Martha Palmer, Zdeňka Urešová, and Xiuhong Zhang. 2014. Not an interlingua, but close: comparison of English AMRs to Chinese and Czech. In Proc. of LREC-14, pages 1765–1772.