Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus

Jiang, Yuchen Eleanor; Liu, Tianyu; Ma, Shuming; Zhang, Dongdong; Sachan, Mrinmaya; Cotterell, Ryan

arXiv:2305.11142 (cs)

[Submitted on 18 May 2023]

Abstract: Several recent papers claim human parity at sentence-level Machine Translation (MT), especially in high-resource languages. In response, the MT community has, in part, shifted its focus to document-level translation. Translating documents requires a deeper understanding of the structure and meaning of text, which is often captured by various kinds of discourse phenomena such as consistency, coherence, and cohesion. However, this renders conventional sentence-level MT evaluation benchmarks inadequate for evaluating the performance of context-aware MT systems. This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al. (2022). The new BWB annotation introduces four extra evaluation aspects, i.e., entity, terminology, coreference, and quotation, covering 15,095 entity mentions in both languages. Using these annotations, we systematically investigate the similarities and differences between the discourse structures of source and target languages, and the challenges they pose to MT. We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures. This gives us a new perspective on the challenges and opportunities in document-level MT. We make our resource publicly available to spur future research in document-level MT and its generalization to other language translation tasks.

Comments:	9 pages. arXiv admin note: substantial text overlap with arXiv:2210.14667
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2305.11142 [cs.CL]
	(or arXiv:2305.11142v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.11142
Journal reference:	ACL 2023

Submission history

From: Yuchen Eleanor Jiang
[v1] Thu, 18 May 2023 17:36:41 UTC (5,371 KB)