
Findings of the WMT 2022 Shared Task on Quality Estimation

Chrysoula Zerva1,2, Frédéric Blain3, Ricardo Rei2,4,5, Piyawat Lertvittayakumjorn6,
José G. C. de Souza4, Steffen Eger9, Diptesh Kanojia8, Duarte Alves2, Constantin Orăsan8,
Marina Fomicheva7, André F. T. Martins1,2,4 and Lucia Specia6,7

1 Instituto de Telecomunicações, 2 Instituto Superior Técnico, 3 University of Wolverhampton,
4 Unbabel, 5 INESC-ID, 6 Imperial College London, 7 University of Sheffield,
8 University of Surrey, 9 NLLG, Technische Fakultät, Bielefeld University
[email protected],[email protected], {d.kanojia, c.orasan}@surrey.ac.uk
{chrysoula.zerva,ricardo.rei,duartemalves}@tecnico.ulisboa.pt, [email protected]

Abstract

We report the results of the WMT 2022 shared task on Quality Estimation, in which the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels, without access to reference translations. This edition introduces a few novel aspects and extensions that aim to enable more fine-grained and explainable quality estimation approaches. We introduce an updated quality annotation scheme using Multidimensional Quality Metrics to obtain sentence- and word-level quality scores for three language pairs. We also extend the Direct Assessments and post-edit data (MLQE-PE) to new language pairs: we present a novel and large dataset on English-Marathi, as well as a zero-shot test set on English-Yoruba. Further, we include an explainability sub-task for all language pairs and present a new format of a critical error detection task for two new language pairs. Participants from 11 different teams submitted altogether 991 systems to different task variants and language pairs.

1 Introduction

The 11th edition of the shared task on Quality Estimation (QE) builds on its previous editions and findings to further benchmark methods for estimating the quality of neural machine translation (MT) output at run-time, without the use of reference translations. It includes (sub)tasks that consider the quality of machine translations at the word and sentence levels.

Over the past years, the QE field has been moving towards trainable, large, multilingual models that have been shown to achieve high performance, especially at sentence level (Specia et al., 2021). In this edition, we further expand the provided resources, introducing new low-resource language pairs: a large dataset of English-Marathi, suitable for training, development and testing, and a smaller test set on English-Yoruba for zero-shot approaches. These, as well as previously published datasets for QE, rely mainly on Direct Assessments (DA)[1] and post-edited translations, which provide estimates of quality either by using the human quality score(s) for each segment or by estimating the distance of a translation from a human-provided correction. As these annotations can sometimes obscure the exact location and/or significance of a translation error, we wanted to investigate the feasibility and efficiency of using a more fine-grained annotation schema to obtain quality estimates at the word and sentence levels, namely Multidimensional Quality Metrics (MQM) (Lommel et al., 2014). MQM annotations have been shown to be more trustworthy for the Metrics task (Freitag et al., 2021a,b), motivating us to evaluate their suitability for the QE task. We make available new development and test data on three language pairs using MQM annotations.

[1] We note that the procedure followed for our data diverges from that proposed by Graham et al. (2016) in three ways: (a) we employ fewer but professional translators to score each sentence, (b) scoring is done against the source segment (bilingual annotation) and not the reference, and (c) we provide translators with guidelines on the meaning of ranges of scores.

The aforementioned boost in the performance of QE systems frequently comes at the cost of efficiency and interpretability, since these systems rely heavily on large models with many parameters. As a result, the predicted quality estimates are hard to interpret. At the same time, such high-performance, “black-box” models are frequently susceptible to systematic errors, such as negation omission (Kanojia et al., 2021) and mistranslated entities (Amrhein and Sennrich, 2022). Both phenomena are major concerns for MT quality estimation since they can undermine users’ trust in new technologies and hamper the adoption of such models on a wide scale. To motivate approaches that address these cases, we include an explainability subtask following its first edition at Eval4NLP 2021 (Fomicheva et al., 2021). In this subtask we ask participants to predict

the erroneous words as rationale extraction for a sentence-level quality estimate, without any word-level supervision. By framing error identification as rationale extraction for sentence-level quality estimation systems, this subtask offers an opportunity to study whether such systems behave in the same way as humans would. We also reshape the critical error detection task of last year and build a new corpus to test the ability of QE systems to detect critical errors that simulate hallucinated content with additions, deletions, named entities, polarity changes and numbers. The corpus is created using SMAUG (Alves et al., 2022) and we allow participation in constrained and unconstrained settings. For the constrained setting, participants have to build QE systems without having access to data from SMAUG, whereas participants in the unconstrained task can train their systems using additional data from SMAUG.

In addition to advancing the state-of-the-art at all prediction levels, our main goals are:

• To extend the languages covered in our datasets;

• To further motivate fine-grained quality annotation, informed at word and sentence level using MQM;

• To encourage language-independent and even unsupervised approaches, especially for zero-shot prediction;

• To study and promote explainable approaches for MT evaluation; and

• To revisit critical error detection.

We thus have three tasks:

Task 1 The core QE task, consisting of separate sentence-level and word-level subtasks. For the sentence-level sub-tasks, the goal is to predict a quality score for each segment in the test set, which can be a variant of DA or MQM (§2.1.1). For the word-level sub-tasks, participants have to predict translation errors at word level, via binary quality tags (see §2.1.2).

Task 2 The Explainable QE task, aiming to obtain word-level rationales for sentence-level quality scores (§2.2).

Task 3 The Critical Error Detection task, aiming to predict sentence-level binary scores indicating whether or not a translation contains a critical error (§2.3).

The tasks make use of large datasets annotated by professional translators with either 0-100 DA scoring, post-editing or MQM annotations. We update the training and development datasets of previous editions and provide new test sets for Tasks 1 and 2. Additionally, we provide a novel setup for Task 3, with novel train, development and test data. The datasets and models released are publicly available.[2] Participants are also allowed to explore any additional data and resources deemed relevant, across tasks.

[2] https://github.com/WMT-QE-Task/wmt-qe-2022-data

The shared task uses CodaLab as submission platform, where participants (Section 4) could submit up to 2 submissions a day for each task and language pair (LP), up to a total of 10 submissions. Results for all tasks, evaluated according to standard metrics, are given in Section 5. Baseline systems were trained by the task organisers and entered in the platform to provide a basis for comparison (Section 3). A discussion on the main goals and findings from this year’s task is presented in Section 6.

2 Quality Estimation tasks

In what follows, we briefly describe each subtask, including the datasets provided for them.

2.1 Task 1: Predicting translation quality

Being able to automatically predict the quality of translations at the sentence or word level without access to human references is the core goal of the QE shared task. In this edition, we explored some new approaches towards quality annotations for the sentence and word levels, and redefined the word-level quality labelling scheme, in an attempt to allow participants to employ multi-task approaches and exploit fine-grained quality annotations. Hence, the data was produced in two ways:

1. DA & Post-edit approach: The quality of each source-translation pair is annotated by at least 3 independent expert annotators, using DA on a scale of 0-100. The translation is also post-edited to obtain the closest possible, fully correct translation of the source. Using the post-edited data, we generate Human-mediated
Translation Edit Rate (HTER) (Snover et al., 2006) scores, which are obtained by calculating the minimum edit distance between the machine translation and its manually post-edited version (a minimal sketch of this computation is given after this list). By additionally considering the alignment between the source and the post-edited sentence, we can propagate the errors to the source sentence and annotate the segments that were potentially mistranslated and/or not translated at all. The HTER scores were made available to participants as additional data, but are not used as prediction targets.

2. MQM approach: Each source-translation pair is evaluated by at least 1 expert annotator, and errors identified in the text are highlighted and classified in terms of severity (minor, major, critical) and type (omission, style, mistranslation, etc.).
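A minimal Python sketch of the HTER computation referenced in item 1, assuming whitespace-tokenised segments. Note that this is an illustrative Levenshtein-based simplification: the official scores were produced with the TER tooling (Snover et al., 2006), which additionally models block shifts.

def hter(mt_tokens, pe_tokens):
    """Simplified HTER: token-level edit distance between the MT output and
    its post-edited version, normalised by the post-edit length."""
    m, n = len(mt_tokens), len(pe_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if mt_tokens[i - 1] == pe_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / max(n, 1)

# Example: two substitutions over six post-edited tokens -> HTER = 0.33
print(hter("das ist ein sehr guter Test".split(),
           "dies ist ein sehr guter Satz".split()))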
The DA and MQM data was further processed to (a) obtain normalised quality scores that have the same direction between high and low quality and (b) obtain word-level binary quality labels. We provide more details on the required pre-processing in §2.1.1 and §2.1.2.

DA & Post-edit data: For all language pairs the data provided is selected from publicly available resources. Specifically, for training we used the following language pairs from the MLQE-PE dataset (Fomicheva et al., 2022): English-German (En-De), English-Chinese (En-Zh), Russian-English (Ru-En), Romanian-English (Ro-En), Nepalese-English (Ne-En), Estonian-English (Et-En) and Sinhala-English (Si-En), which are all sampled from Wikipedia, except for the Ru-En pair, which also contains sentences from Reddit. Additionally, the language pairs used for development and testing also originate from Wikipedia: English-Czech (En-Cs), English-Japanese (En-Ja), Khmer-English (Km-En) and Pashto-English (Ps-En).

Finally, the new English-Marathi (En-Mr) data that is made available for training, development and testing this year is sampled from a combination of sources. More specifically, the source-side segments of the English-Marathi data come from three different domains – healthcare, cultural, and general/news. The general domain and cultural domain data were obtained from the English (source side) segments of the IITB English-Hindi Parallel Corpus (Kunchukuttan et al., 2018), whereas the healthcare domain data was obtained from the publicly available NHS monolingual corpus.[3]

[3] The NHS corpus source sentences were crawled from the health directory of the NHS, available here: https://www.nhs.uk/conditions/

All of the data was translated using large transformer-based NMT models with established high performance for the languages in question. Specifically, for the language pairs in the training data (En-De, En-Zh, Et-En, Ne-En, Ru-En, Ro-En, Si-En), all source sentences were translated by a fairseq Transformer (Ott et al., 2019) bilingual model. The exception is English-Marathi, which was translated by the multilingual IndicTrans (En-X) Transformer-based NMT model, trained on the Samanantar parallel corpus (Ramesh et al., 2022).

For the language pairs provided only in the development and test sets, namely En-Cs, En-Ja, Km-En and Ps-En, we use MBART50 (Tang et al., 2020)[4] to translate the source sentences, since it has been found to perform well, especially for low-resource languages (Tang et al., 2020). The En-Mr portion of the development and test data is translated similarly to the training data for this language pair.

[4] https://github.com/pytorch/fairseq/tree/master/examples/multilingual

Zero-shot language pair: This year we introduced a “surprise” language pair, English-Yoruba (En-Yo), which represents a low-resource language pair. The Yoruba language is the third most spoken language in Africa, and it is native to southwestern Nigeria and the Republic of Benin (Eberhard et al., 2020). We extracted 1010 sentences in English from Wikipedia across 7 topics and translated them to Yoruba using Google Translate. Using adjusted guidelines from Fomicheva et al. (2021), we trained annotators to indicate sentence-level DA scores and to highlight erroneous words as word-level explanations for the DA scores.[5] On the 1010 sentences, they obtained agreements of 0.487 Pearson at the sentence level and 0.380 kappa at the word level. Note that in order to further encourage multilingual and unsupervised approaches, the setup for this zero-shot pair was slightly different from the previous edition, since we did not reveal the language pair before the release of the test data, and the zero-shot pair was included only in the multilingual sub-tasks for quality estimation
(as opposed to a standalone subtask for this language pair only).

[5] Annotators were graduate students and native speakers of Yoruba, fluent in English.

MQM data: As training data, we used annotations released for the Metrics shared task, namely the concatenation of the annotations released by Freitag et al. (2021a) with the annotations from last year’s Metrics task (Freitag et al., 2021b). Together, these annotations cover 3 high-resource language pairs, namely Chinese-English (Zh-En), English-German (En-De) and English-Russian (En-Ru), and span two domains (News and TED Talks). In contrast to DA, instead of one translation for each source, we have multiple translations coming from the systems participating in the 2020 and 2021 News translation tasks (Barrault et al., 2020; Akhbardeh et al., 2021). For the development set, however, we follow an approach similar to the one used for the DA data: we translated Newstest 2019 using a single NMT system, namely MBART50. Subsequently, for each language pair we asked an expert translator to provide MQM annotations. The test set was created similarly to the development set, but instead of Newstest 2019 we used Newstest 2022 (the News data from this year’s General MT shared task).

Overall, the released data for Task 1 covers a total of 9 language pairs for training, 4 language pairs for development and 6 language pairs for testing, including 1 zero-shot language pair. Statistics and details for each language pair are provided in Table 1.

2.1.1 Sentence-level quality prediction

There were two competition instances for the sentence-level sub-task. The first one focuses on DA- and the second one on MQM-derived annotations, both including a separate multilingual track. In the future, we aim to consolidate the competition instances into a single one for sentence level, using our findings from this edition to better align the annotation schemes. We provide below the details for each annotation scheme and a comprehensive table with statistics for all annotations (Table 1).

DA annotations: For DA annotations, we followed the annotation and scoring conventions of previous editions. We provided the MLQE-PE data (Fomicheva et al., 2022) used in previous years for training, which includes seven language pairs with ≈ 8,000 segments each. We also provided 26,000 segments of En-Mr which were annotated using the same annotation conventions. All translations were manually annotated for perceived quality, with a quality label ranging from 0 to 100, following the FLORES guidelines (Guzmán et al., 2019). According to the guidelines given to annotators, the 0-10 range represents an incorrect translation; 11-29, a translation with few correct keywords, but whose overall meaning is different from the source; 30-50, a translation with major mistakes; 51-69, a translation which is understandable and conveys the overall meaning of the source but contains typos or grammatical errors; 70-90, a translation that closely preserves the semantics of the source sentence; and 91-100, a perfect translation. For each segment, there were at least three scores from independent raters (four in the case of En-Mr). DA scores were standardised using the z-score by rater, and the z-scores were provided as training targets. Participating systems are required to score sentences according to z-standardised DA scores.
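The per-rater z-standardisation mentioned above can be reproduced in a few lines. A minimal sketch, where the column names and the equal-weight averaging across raters are illustrative assumptions rather than the official preprocessing script:

import pandas as pd

# df has one row per (segment, rater) with columns: "segment_id", "rater", "da"
def z_standardise_da(df: pd.DataFrame) -> pd.DataFrame:
    stats = df.groupby("rater")["da"].agg(["mean", "std"])
    df = df.join(stats, on="rater")
    df["z_da"] = (df["da"] - df["mean"]) / df["std"]
    # final training target: average of the per-rater z-scores for each segment
    return df.groupby("segment_id", as_index=False)["z_da"].mean()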
MQM annotations: As we have seen (§2.1), for the MQM annotations we built on the available Google MQM annotations (Freitag et al., 2021a) that contain annotated data for the En-De and Zh-En data of the WMT 2020 News Translation systems (Barrault et al., 2020), as well as En-De, Zh-En and En-Ru annotations from WMT Metrics 2021 (Freitag et al., 2021b). These annotations, provided as training data, amount to more than 30,000 segments in total (see Table 1 for details per language pair). In addition, we provide newly annotated development and test sets for all three language pairs (En-De, En-Ru, Zh-En), amounting to approximately 1,000 segments per language pair.

Originally, MQM-annotated segments include annotated erroneous text spans on the translation side that are assigned two types of labels: (a) an error severity label {minor, major, critical} and (b) an error category label such as {grammar, style/awkward, omission, mistranslation, ...}. Each error severity is associated with a specific weight; hence a sentence score can be calculated for each segment based on these error weights. We demonstrate an example of MQM annotations and scores in Figure 1.

MQM scores according to the Google weight scheme have the opposite direction to the DA scores, since larger MQM scores denote worse translation quality, i.e., a larger number of errors or more severe errors.
Language Pairs   Sentences (Train / Dev / Test22)   Tokens (Train / Dev / Test22)   DA   PE   MQM   CE   Data Source
En-De 1 9,000 / 1,000 / – 131,499 / 16,545 / – ✓ ✓ Wikipedia
En-Zh 9,000 / 1,000 / – 131,892 / 16,637 / – ✓ ✓ Wikipedia
Ru-En 9,000 / 1,000 / – 94,221 / 11,650 / – ✓ ✓ Reddit
Ro-En 9,000 / 1,000 / – 137,466 / 17,359 / – ✓ ✓ Wikipedia
Et-En 9,000 / 1,000 / – 112,503 / 14,044 / – ✓ ✓ Wikipedia
Ne-En 9,000 / 1,000 / – 120,078 / 15,017 / – ✓ ✓ Wikipedia
Si-En 9,000 / 1,000 / – 125,223 / 15,709 / – ✓ ✓ Wikipedia
En-Mr 26,000 / 1,000 / 1,000 690,532 / 27,049 / 26,253 ✓ ✓
Ps-En – / 1,000 / 1,000 – / 27,045 / 27,414 ✓ ✓ Wikipedia
Km-En – / 1,000 / 1,000 – / 21,981 / 22,048 ✓ ✓ Wikipedia
En-Ja – / 1,000 / 1,000 – / 20,626 / 20,646 ✓ ✓ Wikipedia
En-Cs – / 1,000 / 1,000 – / 20,394 /20,244 ✓ ✓ Wikipedia
En-Yo – / – / 1,010 – / – / 21,238 ✓ ✓
En-De 2 28,909 / 1,005 / 511 839,473 / 24,373 / 13,220 ✓ WMT-newstest
En-Ru 15,628 / 1,005 / 511 357,452 / 24,373 / 13,220 ✓ WMT-newstest
Zh-En 35,327 / 1,019 / 505 1,586,883 / 51,969 / 15,602 ✓ WMT-newstest
En-De 155,511 / 17,280 / 500 8,193,693 / 915,061 / 27,771 ✓ News-Commentary
Pt-En 39,926 / 4,437 / 500 2,281,515 / 253,594 / 29,794 ✓ News-Commentary

Table 1: Statistics of the data used for Task 1 (DA), Task 2 (PE) and Task 3 (CE) (last four rows). The number of
tokens is computed based on the source sentences.

Figure 1: Example of MQM annotations on the target (translation) side, on an English–German (En-De) sentence pair.

To address this inconsistency, we invert the MQM scores and standardise per annotator. For the training data we had access to multiple annotations per segment and calculated an average score after standardisation, keeping also the original MQM scores per annotator, to allow the participants to take full advantage of the different annotations (Basile et al., 2021). For the same reasons, we opted not to aggregate the annotated text spans.
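A minimal sketch of the processing described above. The severity weights follow the commonly used Google MQM weighting (minor = 1, major = 5, critical = 25) and should be treated as an illustrative assumption rather than the exact weights used to build the released targets; the sketch also assumes, for simplicity, that every annotator scored every segment.

from statistics import mean, stdev

SEVERITY_WEIGHT = {"minor": 1.0, "major": 5.0, "critical": 25.0}  # assumed weights

def mqm_sentence_score(errors):
    """errors: list of (severity, category) spans annotated in one segment."""
    return sum(SEVERITY_WEIGHT[sev] for sev, _ in errors)

def invert_and_standardise(scores_by_annotator):
    """scores_by_annotator: {annotator: [raw MQM score per segment]}.
    Raw MQM scores are inverted (so that higher = better) and z-scored per
    annotator; the per-segment target is the average over annotators."""
    z = {}
    for annotator, scores in scores_by_annotator.items():
        inverted = [-s for s in scores]
        mu, sd = mean(inverted), stdev(inverted)
        z[annotator] = [(s - mu) / sd for s in inverted]
    n_seg = len(next(iter(z.values())))
    return [mean(z[a][i] for a in z) for i in range(n_seg)]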
Regarding evaluation, systems in this task (both for DA and MQM) are evaluated against the true z-normalised sentence scores using Spearman’s rank correlation coefficient ρ as the primary metric. This is what was used for ranking system submissions. Pearson’s correlation coefficient r, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) were also computed as secondary metrics, but were not used for the final ranking between systems.
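The sentence-level metrics above can be computed directly with NumPy and SciPy; a minimal sketch (argument names are illustrative):

import numpy as np
from scipy.stats import spearmanr, pearsonr

def sentence_level_metrics(predictions, gold_z_scores):
    pred, gold = np.asarray(predictions), np.asarray(gold_z_scores)
    return {
        "spearman": spearmanr(pred, gold).correlation,  # primary metric
        "pearson": pearsonr(pred, gold)[0],
        "mae": np.abs(pred - gold).mean(),
        "rmse": np.sqrt(((pred - gold) ** 2).mean()),
    }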
2.1.2 Word-level quality prediction

This sub-task focuses on detecting word-level errors in the MT output. The goal is to automatically predict the quality of each token using a binary decision, i.e., using OK as the label for tokens translated correctly and BAD otherwise. We deviate from the annotation pattern of previous years in that we do not consider annotations of the gaps between tokens or source-side annotations. Instead, to account for omission errors, we adopt the following convention: the token on the right side of the omitted text in the translation is annotated as BAD. An additional <EOS> token is appended at the end of every translation segment to account for omissions at the end of each sentence. This provides a unified framework for both the post-edit-originated annotations and the MQM annotations.

We thus use the same source-translation pairs used for the sentence-level tasks and obtain the binary tags as follows (a sketch of the span-to-tag conversion is given after this list):

• For post-edited data, we use TER (Snover et al., 2006) to obtain alignments between the translation and the post-edit and annotate the misaligned tokens as BAD.

• For MQM data, the tokens that fall within the text spans annotated as errors (of any severity or category) are annotated as BAD. If the whitespace between two words is annotated as an error, then this is considered an omission, and the next token is annotated as BAD.
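A minimal sketch of the span-to-tag conversion described in the second bullet, with an <EOS> token catching sentence-final omissions. Token-index-based spans are an illustrative assumption; the released annotations use character offsets on the raw translation.

def word_level_tags(tokens, error_spans):
    """tokens: target tokens; error_spans: list of (start, end) token indices,
    where start == end marks an omission at the gap before token `start`."""
    tokens = tokens + ["<EOS>"]          # catches omissions at the end of the segment
    tags = ["OK"] * len(tokens)
    for start, end in error_spans:
        if start == end:                 # omission annotated on a gap
            tags[min(start, len(tokens) - 1)] = "BAD"
        else:                            # regular error span
            for i in range(start, min(end, len(tokens))):
                tags[i] = "BAD"
    return list(zip(tokens, tags))

# e.g. an error on tokens 0-1 and an omission right before token 3:
# word_level_tags("der Hund lief weg".split(), [(0, 2), (3, 3)])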
Participants were encouraged to submit for each language pair and also for the multilingual variants of each sub-task. For the DA-based sentence-level competition, as well as the word-level sub-task, there was an additional multilingual variant that included the zero-shot language pair (En-Yo). The latter aimed at fostering work on language-independent models, as well as models that are truly multilingual.

For the word-level task, submissions are ranked using the Matthews correlation coefficient (MCC) as the primary metric, while F1-scores are provided as complementary information.

2.2 Task 2: Explainable Quality Estimation

Following the success of the shared task on Explainable Quality Estimation organized by the Eval4NLP workshop in 2021 (Fomicheva et al., 2021), in this sub-task we aim to address translation error identification as rationale extraction from sentence-level quality estimation systems. If a QE system reasonably estimates the quality of a translated sentence, an explanation extracted from the system should indicate word-level translation errors in the input (if any) as reasons for imperfect sentence-level scores. In particular, for each input pair of source and target sentences, participating teams are asked to provide (i) a sentence-level score estimating the translation quality and (ii) a list of continuous word-level scores where the tokens with the highest scores are expected to correspond to translation errors considered relevant by human annotators.

In this explainable QE task, we use all nine language pairs and their word-level test sets from Task 1 (see §2.1.2), with En-Yo being a separate language pair (rather than blending it into the multilingual test set). The participants are allowed to use the sentence-level scores from the datasets in Task 1 to train their sentence-level models in Task 2. However, as Task 2 aims to promote research on the explainability of QE systems, we encourage the participants to use or develop explanation methods to identify the contributions of words or tokens in the input. Unlike Task 1, the participants of Task 2 are not allowed to supervise their models with any token-level or word-level labels or signals (whether from natural or synthetic data) in order to directly predict word-level errors. Consequently, we do not require the participants to convert their word-level scores into predicted binary labels (OK/BAD), since this process usually requires a word-level QE dataset to search for an optimal score threshold.

Concerning the evaluation of this task, we focus on assessing the quality of explanations (i.e., the submitted word-level scores), not the sentence-level predictions. Specifically, we measure how well the word-level scores provided by the participants correspond with human word-level error annotations, which are binary ground-truth labels. Unlike the Eval4NLP 2021 shared task, which ranked participating systems by a combination of three metrics (Fomicheva et al., 2021), we use Recall at Top-K, also known as R-precision in the information retrieval literature (Manning et al., 2008, chapter 8), as the primary metric this year, for two reasons. First, it is preferable to have a single main metric, avoiding confusion as well as potential side effects that combining the three metrics might produce. Second, Recall at Top-K seemed to discriminate best between the participating submissions in the Eval4NLP shared task. Assume that, for a given pair of source and target sentences, there are K words annotated as translation errors by humans. Recall at Top-K equals r/K when r out of the K error words appear in the list of the top-K words ranked by the submitted word-level scores in descending order. In addition, AUC (the area under the receiver operating characteristic curve) and AP (average precision) are used as secondary metrics. Considering the word level, AUC summarises the curve between the true positive rate and the false positive rate, while AP summarises the curve between precision and recall. For both secondary metrics, higher values are better. Although we report metrics for the sentence-level predictions, including Pearson’s and Spearman’s correlations, as additional information, we do not use them for ranking the participants or determining the winner in this explainability task.
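A minimal sketch of the per-sentence Recall at Top-K computation described above (the secondary metrics AUC and AP can be computed with standard libraries; only the primary metric is covered here):

def recall_at_top_k(word_scores, gold_is_error):
    """word_scores: continuous explanation scores, one per target token.
    gold_is_error: binary human annotations (True for error tokens)."""
    k = sum(gold_is_error)
    if k == 0:
        return None  # sentences without annotated errors are skipped
    ranked = sorted(range(len(word_scores)), key=lambda i: word_scores[i], reverse=True)
    top_k = set(ranked[:k])
    r = sum(1 for i, is_err in enumerate(gold_is_error) if is_err and i in top_k)
    return r / k

# e.g. recall_at_top_k([0.9, 0.1, 0.7, 0.2], [True, False, False, True]) == 0.5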
2.3 Task 3: Critical Error Detection

In this sub-task, we reshape the binary classification task introduced in last year’s edition (Specia et al., 2021) to predict whether the translated sentence contains (at least) one critical error.

Following Specia et al. (2021), we consider that a translation contains a critical error if it deviates from the meaning of the source sentence in such a way that it is misleading and may lead to severe implications. As noted by Specia et al. (2021), deviations in meaning can happen in three ways: mistranslation errors have critical content translated incorrectly into a different meaning; hallucination errors introduce critical content in the translation that is not in the source; and deletion errors remove critical content that is in the source from the translation.

In this task, we focus on five critical error categories:

• Additions: The content of the translation is only partially supported by the source.

• Deletions: Part of the source sentence is ignored by the MT engine.

• Named Entities: A named entity (person, organization, location, etc.) is mistranslated into another, incorrect named entity.

• Meaning: The translated sentence either introduces or removes a negation and the sentence meaning is completely reversed.

• Numbers: The MT system translates a number/date/time or unit incorrectly.

For this task, we introduce a new dataset obtained by perturbing a corpus of News articles with SMAUG (Alves et al., 2022) and using humans to validate the perturbations on the test set. The original data for this task is composed of News articles from OPUS News-Commentary (Tiedemann, 2012) for the language pairs English-German and Portuguese-English.

For the English-German language pair, there are no Deviation in Meaning errors, as this perturbation is only available for into-English language pairs. The new dataset is purposefully unbalanced, as these phenomena are rare, containing approximately 5% of translations with critical errors. Table 1 presents the number of records for each language pair.

Since the dataset for this task is artificially generated, the participants were encouraged to submit systems that do not rely on the provided training data. As such, submissions were split into two groups: unconstrained and constrained. In the first group, the participants have access to the training data. In the second, the systems should only be trained on quality scores such as DA, HTER and MQM annotations. With this setting, we aim to evaluate whether systems can identify critical errors while maintaining correlations with human judgements.

In the evaluation of this task, the participants were not required to submit any classification threshold for their systems. For the unconstrained setting, the systems are specifically trained to detect errors and should output high scores for translations containing these errors. As such, for each language pair, we considered as positive predictions the K records with the highest scores, where K is the number of positive records for that language pair in the test set. Regarding the constrained setting, these systems are only trained on quality scores and are expected to assign lower scores to translations with critical errors. Therefore, we considered the K records with the lowest scores as positive predictions. From there, we measured the MCC, Recall and Precision for each submission.
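A minimal sketch of this threshold-free evaluation, assuming a list of system scores and gold binary labels per language pair; scikit-learn is used for the final metrics:

from sklearn.metrics import matthews_corrcoef, precision_score, recall_score

def evaluate_ced(scores, gold_labels, constrained):
    """gold_labels: 1 if the translation contains a critical error, else 0.
    Unconstrained systems flag the K highest-scoring records as errors;
    constrained systems (trained only on quality scores) flag the K lowest."""
    k = sum(gold_labels)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=not constrained)
    flagged = set(order[:k])
    preds = [1 if i in flagged else 0 for i in range(len(scores))]
    return {
        "mcc": matthews_corrcoef(gold_labels, preds),
        "precision": precision_score(gold_labels, preds),
        "recall": recall_score(gold_labels, preds),
    }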
3 Baseline systems

Task 1: Quality Estimation baseline systems: For Task 1, both for the word and the sentence level, we used a multilingual transformer-based Predictor-Estimator approach (Kim et al., 2017), which is described in detail in Fomicheva et al. (2022). For the implementation and training we use the OpenKiwi (Kepler et al., 2019) framework. We trained the baseline model in a multilingual and multi-task setting, training jointly on the sentence-level scores and word-level tags. For the word-level loss, Lword, the weight of BAD tags is multiplied by a factor of λBAD = 3.0, but the sentence- and word-level losses have equal weight in the overall joint loss: L = Lword + Lsent. We trained different baselines for the DA/post-edit-originated language pairs and the MQM-originated language pairs.

For the DA/post-edit baseline, the model was trained using the DA scores as sentence targets and the OK/BAD tags as word targets. For training we used the concatenated data for all language pairs
available under training data and used the concatenation of the additional language pairs that were made available in the development set as validation. We trained two baselines with this setup, using different encoders for the encoding (predictor) part of the architecture: (a) an XLM-R transformer with the xlm-roberta-large model and (b) a RemBERT model, which has been pre-trained on additional languages that include Yoruba and can hence account for the zero-shot language.

For the MQM baseline, the model was trained using the normalised and inverted MQM scores as sentence targets and the OK/BAD tags as word targets. The baseline model was trained using the concatenated training data for all three language pairs and used the concatenated development data for the same pairs as the validation set. The XLM-R transformer with the xlm-roberta-large model was used as the encoder.
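A minimal PyTorch-style sketch of the joint objective used by these baselines. Only the λBAD weighting and the equal-weight sum are taken from the description above; the choice of MSE for the sentence-level objective and the tensor shapes are illustrative assumptions, and the OpenKiwi internals are abstracted away.

import torch
import torch.nn as nn

LAMBDA_BAD = 3.0  # weight multiplier for BAD tags, as in the baseline description
word_loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, LAMBDA_BAD]), ignore_index=-100)
sent_loss_fn = nn.MSELoss()  # assumed sentence-level regression loss

def joint_loss(word_logits, word_tags, sent_pred, sent_score):
    # word_logits: (batch, seq_len, 2); word_tags: (batch, seq_len), -100 for padding
    l_word = word_loss_fn(word_logits.view(-1, 2), word_tags.view(-1))
    l_sent = sent_loss_fn(sent_pred.squeeze(-1), sent_score)
    return l_word + l_sent  # equal weighting of the two objectives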
of 0.5. We select the model with the highest MCC
on the validation data.
Task 2: Explainability baseline systems: We 4 Participants
provide two baseline systems for Task 2. One is
a random baseline where we sampled scores uni- Alibaba-Translate (T1-DA): For the DA subtask,
formly at random from a continuous [0..1) range the team participated in all language pairs
for each target token and for a sentence-level score. except the zero-shot LP. The implemented
The other one is a combination of a supervised system (Wang et al., 2021), uses glass-box
quality annotation model, OpenKiwi (Kepler et al., QE features to estimate the uncertainty of ma-
2019) and LIME (Ribeiro et al., 2016) where chine translation segments and incorporates
OpenKiwi is used to predict sentence-level quality the features into the transfer learning from the
scores while LIME is used to compute, for every large-scale pre-trained model, XLM-R. The
token in the target sentence, its importance for the participants used exclusively the DA data pro-
sentence-level quality score returned by OpenKiwi. vided for this edition of the QE shared task. Of
For the OpenKiwi implementation we used a sim- the provided data, the 7 language pairs except
ilar setup described for the baselines of Task 1, for English-Marathi, were combined to train
but we trained the OpenKiwi model using only a multilingual model. For English-Marathi, a
sentence-level supervision, to align with the task re- separate bilingual model was trained. For the
quirements. We trained two multilingual instances, final submission the participants ensembled
one on DA- and one on MQM-derived data, using multiple checkpoints.
XLM-R large encoder in both cases. (T1-MQM): The submission for sentence-
LIME is a model-agnostic post-hoc explanation level MQM task is based on a multilingual
method which trains a linear model to estimate the unified framework for translation evaluation.
behavior of a target model (i.e., OpenKiwi in our The applied framework UniTE (Wan et al.,
case) around an input example to be explained so 2022) considers three input formats – source-
the weights of the linear model correspond to the only (QE or reference-free metric), reference-
importance of individual input tokens. Because only and source-reference-combined. The par-
higher sentence-level scores in our gold standard ticipants used synthetic datasets with pseudo
mean better translation quality, we invert token- labels during continuous pre-training phase,
level scores generated by LIME so that higher val- and fine-tuned with DA and MQM training
ues correspond to errors as required by the task 6
More precisely we used the wmt21-comet-qe-mqm
description. model

76
datasets from the years 2017 to 2021. To obtain the final model predictions they use the source-only evaluation. For the multilingual phase, they ensembled predictions using two different backbones – one using the XLM-R encoder and the other using InfoXLM. For the ensembling, they picked the best 2 checkpoints on the development dataset.

BJTU-Toshiba (T1-MQM): The BJTU-Toshiba participation focused on ensembling different models and using external data. They ensemble multiple pre-trained models, both monolingual and bilingual. The monolingual models are trained only on text of the target language. Specifically, they use monolingual BERT, RoBERTa, and an ELECTRA discriminator as the monolingual extractors, and XLM-R as the bilingual extractor. They also use in-domain parallel data to fine-tune and adapt the pre-trained models to the target language and domain. The in-domain data is selected by a BERT classifier from the parallel data provided by the news translation task, and for each direction they end up using roughly 1 million sentence pairs for fine-tuning. They explore two styles of fine-tuning, namely Translation Language Model and Replaced Token Detection. For Replaced Token Detection, they use the first 1/3 of the model’s layers as the generator, and after training they drop the generator and only use the discriminator as the feature extractor.

HW-TSC (T1): HW-TSC’s submission follows the Predictor-Estimator framework with a pre-trained XLM-R Predictor, a feed-forward Estimator for the sentence-level QE subtask and a binary classifier Estimator for the word-level QE subtask. Specifically, the Predictor is a cross-lingual language model that receives the source and target tokens concatenated and returns representations that attend to both languages. The WMT 2022 news translation task training data was used to train the Predictor with a cross-lingual masked language model objective. All of the WMT QE 2022 DA and MQM training data are used to train two different multilingual QE models, one for sentence level and another one for word level.

(T2): The language encoder trained for Task 1 is used to obtain source and target token embeddings. After computing the cosine similarity between target and source token embeddings, the maximum cosine similarity of each target token to all the source tokens is selected as its quality score. Intuitively, a low score means the target token is more likely to be an error (lack of a good alignment), so every target word quality score is multiplied by a negative value.

HyperMT - aiXplain (T1-all): The system is trained with AutoML functionalities in the FLAML framework using a lightgbm estimator. It utilizes the COMET-QE score as a feature alongside many other linguistic features extracted with Stanza from the source texts and their translations: the number of tokens, characters, and the average word length of sentences; the frequency of Part-of-Speech and Named Entity Recognition labels; and the frequency of morphological features. The differences in the values of linguistic features between source texts and translations are also included as features. This allows the system to work in multilingual settings as well.

IST-Unbabel (T1-all): The IST-Unbabel team proposed an extension of COMET, dubbed COMET-Kiwi, which includes a word-level layer and can be trained on both sentence-level scores and word-level labels in a multi-tasking fashion. Their final submission for Task 1 is a weighted ensemble of models trained using InfoXLM (Chi et al., 2021) and RemBERT (Chung et al., 2021). All these models are pretrained on the data from the Metrics shared tasks and, for word level, they pretrained on both the QT21 and APE-Quest datasets.

(T2): For the second task they use the COMET-Kiwi framework as the backbone of a sentence-level QE model and add layer- and head-wise parameters to the QE model: for each layer and for each head, they train individual parameters to construct a sparse distribution over the layers/heads to better leverage these representations. They leveraged different encoders – InfoXLM and RemBERT – and used them individually as the backbone of their QE sentence-level models. The models used to extract explanations were multilingual ones trained for DA and MQM separately. The explainability weights were obtained from the
attention weights scaled by the norm of the gradient of the value vectors (Chrysostomou and Aletras, 2022). No word supervision was used and all explanations were extracted relying solely on the models that produced the sentence-level scores. The final submissions are ensembles of explanations from different attention layers/heads, selected according to the validation data. For the zero-shot language pair (En-Yo), they created an ensemble with the attention layers/heads that were among the top-performing ensembles for other language pairs.

(T3): For Task 3 a single model from Task 1, using the InfoXLM encoder and trained on DA annotations, was submitted.

KU X Upstage (T3): KU X Upstage employs an XLM-R large model without leveraging any additional parallel corpus. Instead, they attempt to maximise its capability by adopting prompt-based fine-tuning, which reformulates the Critical Error Detection task as a masked language modelling objective (a pre-training strategy of this model) before training. They generate hard prompts suitable for the QE task through prompt engineering, and the templates consist largely of three types according to the information utilised: a naive template, a template with a contrastive demo, and a template with Google Translate. The final score is obtained by extracting the probability of a word mapped to BAD among the verbalizers. They gain an additional performance boost from a template ensemble, by adding the values from multiple templates.

NJUNLP (T1-all): The NJUNLP submission makes use of pseudo data and multi-task learning. Inspired by DirectQE (Cui et al., 2021), they experiment with several novel methods to generate pseudo data for all three subtasks (MQM, DA, and PE), using a conditional masked language model and the NMT model to generate high-quality synthetic data and pseudo labels. The proposed methods control the decoding process to generate more fluent pseudo translations close to the actual distribution of the gold data. They pre-train the XLM-R large model with the generated pseudo data and then fine-tune this model with the real QE task data, using multi-task learning in both stages. They jointly learn sentence-level scores (with regression and rank tasks) and word-level tags (with a sequence tagging task). For the final submissions they ensemble sentence-level results by averaging all valid output scores and ensemble word-level results using a voting mechanism. For the pseudo-label generation they use publicly available parallel data, specifically: the data provided by the WMT translation task for the En-De (9M), En-Ru (3M), and Zh-En (3M) language pairs; 660K parallel sentences from OPUS[7] for the Km-En language pair; 3.6M parallel sentences from the target translation model[8] for the En-Mr language pair; as well as the WMT2017, WMT2019, and WMT2020 En-De PE data for the En-De language pair.

[7] https://opus.nlpl.eu/
[8] https://indicnlp.ai4bharat.org/indic-trans/

Papago (T1-full): Papago submitted a multilingual and multi-task model, trained to predict jointly at both the sentence and word level. The system’s architecture consists of a pretrained language model with task-independent layers optimized for both sentence- and word-level quality prediction. They propose an auxiliary loss function added to the final objective function to further improve performance. They also augment the training data by either generating (i.e., pseudo) data or collecting open-source data that is deemed to be relevant to the QE task. Finally, they train and select the checkpoints for the final submission with cross-validation for better generalization, and ensemble multiple models for their final submission.

UCBerkeley-UMD (T1:DA): UCBerkeley-UMD used a large-scale multilingual model to back-translate from Czech to English. They assessed the quality of the Czech translation by comparing the back-translation from Czech into English with the original source text in English. This is motivated by literature showing that humans tend to perform such quality checks on translations when they do not understand the target language.

UT-QE (T2): The UT-QE team used XLMR-Score (Azadi et al., 2022) as an unsupervised sentence-level metric, which is computed like BERTScore but in a cross-lingual manner, using the XLM-R model. The matched
ID Affiliations
Alibaba Translate DAMO Academy, Alibaba Group & University of Sci- (Bao et al., 2022)
ence and Technology of China & CT Lab, University
of Macau, China & National University of Singapore,
Republic of Singapore
BJTU-Toshiba Beijing Jiaotong University, China & Toshiba Co., Ltd. (Huang et al., 2022)
HW-TSC Huawei Translation Services Center & Nanjing Univer- (Su et al., 2022)
sity, China
HyperMT - aiXplain aiXplain –
IST-Unbabel INESC-ID & Instituto de Telecomunicações & Instituto (Rei et al., 2022)
Superior Técnico & Unbabel, Portugal
KU X Upstage Korea University, Korea & Upstage (Eo et al., 2022)
NJUNLP Huawei Translation Services Center, China (Geng et al., 2022)
Papago Papago, Naver Corp (Lim and Park, 2022)
UCBerkeley-UMD University of California, Berkeley & University of Mary- (Mehandru et al., 2022)
land
UT-QE University of Tehran, Iran (Azadi et al., 2022)
Welocalize-ARC/NKUA Welocalize Inc, USA & National Kapodistrian Univer- (Zafeiridou and Sofianopoulos, 2022)
sity & Athena RC, Greece

Table 2: Participants in the WMT22 Quality Estimation shared task.

token distances in this metric were used as token-level scores. In order to alleviate mismatching issues, they also try to fine-tune the XLM-R model on word alignments from parallel corpora, to make it represent aligned words in different languages closer to each other, and use the fine-tuned model instead of XLM-R for scoring sentences and tokens.

Welocalize-ARC/NKUA (T1-DA): Welocalize-ARC/NKUA’s submission for Task 1 follows the Predictor-Estimator framework (Kim et al., 2017) with a regression head on top to estimate the z-standardised DA. More specifically, they use a pre-trained Transformer for feature extraction and then concatenate the extracted features with additional glass-box features. The glass-box features are also produced using pre-trained models, by applying multiple techniques to estimate different types of uncertainty for each translated sentence. The final features are then used as input for the QE regression model, which is a simple sequential neural network with a linear output layer. Finally, the performance of the model is optimised by employing Monte Carlo Dropout during both training and inference. Regarding the data, they use only the provided datasets (the MLQE-PE train/dev sets along with the additional dataset for the Marathi language) as well as some of the provided additional training resources of the Metrics shared task.

Table 2 lists all participating teams submitting systems to any of the tasks, and Table 3 reports the number of successful submissions to each of the sub-tasks and language pairs. Each team was allowed up to ten submissions for each task variant and language pair (with a limit of two submissions per day). In the descriptions above, participation in specific tasks is denoted by a task identifier (T1 = Task 1, T2 = Task 2, T3 = Task 3).

5 Results

In this section, we present and discuss the results of our shared task. Please note that for all three subtasks we used statistical significance testing with p = 0.05.

5.1 Task 1

As described for Task 1 (§2.1.1), submissions are evaluated against the true z-normalised sentence scores using Spearman’s rank correlation coefficient ρ, along with the following secondary metrics: Pearson’s correlation coefficient r, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). Nonetheless, the final ranking between systems is calculated using the primary metric only (Spearman’s ρ). Also, statistical significance was computed using Williams’ test.[9]

[9] https://github.com/ygraham/mt-qe-eval

For the Task 1 word-level task, the submissions are ranked using the Matthews correlation coeffi-
Task/LP # submission in quality estimation, but can constitute a limita-
Task 1 – Sent-level Direct Assessment 161 tion for performance in truly multi-lingual scenar-
Multilingual w/o En-Yo 21
Multilingual w En-Yo 23
ios where the target languages are not seen during
English-Marathi 24 pre-training. Additionally, many final submissions
English-Czech 33 consisted of ensembles combining different large
English-Japanese 22
Pashto-English 16 pretrained models increasing even further the total
Khmer-English 22 number of model parameters.
Task 1 – Sent-level MQM 402
Multilingual 38 Another trend that seems to carry on from pre-
English-German 65 vious editions of the task is the incorporation of
English-Russian 62
Chinese-English 76 additional features in QE models (glass-box fea-
Task 1 – Word-level 247 tures were incorporated in Alibaba’s DA systems
Multilingual w/o En-Yo 18 while linguistic features were incorporated in aiX-
Multilingual w En-Yo 17
English-Czech 32 plain QE system), however in this edition such ap-
English-Japanese 27 proaches were outperformed by models that put
English-Marathi 24
Pashto-English 13
more emphasis on pre-training, using auxiliary
Khmer-English 28 tasks and external data.
English-German 28
English-Russian 18 For the sentence-level sub-tasks, participants
Chinese-English 27 managed to achieve high correlations for the major-
Task 2 – Explainable QE 161
English-Czech 14 ity of language pairs, especially for the DA origi-
English-Japanese 14 nated data, with the exception of En-Ja. The results
English-Marathi 13 show an improvement compared to the last edition,
Pashto-English 30
Khmer-English 25 although it is hard to draw a direct comparison due
English-German 17 to changes in the available train/development data.
English-Russian 12 However, it is interesting to note that performance
Chinese-English 12
English-Yoruba 12 for En-Mr, for which we provided considerable
Task 3 – Sent-Level Critical Error Det. 20 more data than for the other language pairs is still
Constrained
English-German 2
in the same range as results for the other language
Portuguese-English 2 pairs. It would thus be interesting to investigate fur-
Unconstrained ther which properties render a language pair harder
English-German 10
Portuguese-English 6
to evaluate.
Total 991 For the MQM data the overall correlations
achieved were lower in comparison to the DA ones
Table 3: Number of submissions to each sub-task and although still meaningful. Note that compared to
language-pair at the WMT22 Quality Estimation shared the DA data, the MQM language pairs were high-
task. resource ones, which could also influence perfor-
mance. Additionally, small discrepancies between
the annotation guidelines in the train set and the
cient (MCC). F1-scores are provided as comple- dev/test sets could have further complicated the
mentary information only and statistical signif- task. We intend to further investigate the MQM
icance was computed using randomisation tests potential in future editions, with the addition of
(Yeh, 2000) with Bonferroni correction (Abdi, new language pairs and more annotated data.
2007) for each language pair. For the word-level subtask, IST-Unbabel,
The majority of participants implemented mul- NJUNLP and Papago tied at the top for most lan-
tilingual models and the top performing sys- guage pairs, and we can observe that correlations
tems adopted a multi-tasking approach, learning are moderate across language pairs (both DA and
the sentence- and word-level targets jointly (IST- MQM originated ones). It is important to note that
Unbabel, Papago, NJUNLP). It is important to note no team seems to have submitted predictions us-
that all participants relied on large pre-trained en- ing a word-level only supervision; instead all the
coders (XLM-R, RemBERT, BERT, ELECTRA), participants of this task used a multi-task approach,
which seems to be the norm for high-performance learning jointly word and sentence level scores.
80
Model Multi Multi (w/o En-Yo) En-Cs En-Ja En-Mr Km-En Ps-En
IST-Unbabel 0.572 0.605 0.655 0.385 0.592 0.669 0.722
Papago 0.502 0.571 0.636 0.327 0.604 0.653 0.671
Alibaba Translate – 0.585 0.635 0.348 0.597 0.657 0.697
Welocalize-ARC/NKUA 0.448 0.506 0.563 0.276 0.444 0.623 –
BASELINE 0.415 0.497 0.560 0.272 0.436 0.579 0.641
lp_sunny‡ 0.414 0.485 0.511 0.290 0.395 0.611 0.637
HW-TSC – – 0.626 0.341 0.567 0.509 0.661
aiXplain – – 0.477 0.274 0.493 – –
NJUNLP – – – – 0.585 – –
UCBerkeley-UMD* – – 0.285 – – – –

Table 4: Spearman correlation with Direct Assessments for the submissions to WMT22 Quality Estimation Task 1.
For each language pair, results marked in bold correspond to the winning submissions, as they are not significantly
outperformed by any other system according to the Williams Significance Test (Williams, 1959). Baseline systems
are highlighted in grey; ‡ indicates Codalab username of participants from whom we have not received further
information and * indicates late submissions that were not considered for the official ranking of participating systems

Model Multi En-De En-Ru Zh-En
IST-Unbabel 0.474 0.561 0.519 0.348
NJUNLP 0.468 0.635 0.474 0.296
Alibaba-Translate 0.456 0.550 0.505 0.347
Papago 0.449 0.582 0.496 0.325
lp_sunny ‡ 0.415 0.495 0.453 0.298
BASELINE 0.317 0.455 0.333 0.164
BJTU-Toshiba – 0.621 0.434 0.299
HW-TSC – 0.494 0.433 0.369
aiXplain – 0.376 0.338 0.194
pu_nlp ‡ – 0.611 – –

Table 5: Spearman correlation with MQM for the submissions to WMT22 Quality Estimation Task 1. For each language pair, results marked in bold correspond to the winning submissions, as they are not significantly outperformed by any other system according to the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey; ‡ indicates Codalab username of participants from whom we have not received further information.

Best performers The scores in Tables 4-6 show the participant scores for the main metric, ordered by the best performance in the multilingual subtasks. IST-Unbabel is the clear winner for the multilingual subtasks, but for the individual language pairs results vary and multiple participants are tied at the top. All top-performing approaches (IST-Unbabel, Papago, NJUNLP and Alibaba) share some common characteristics: (1) they are multilingual and multi-task approaches; (2) they use external data during pre-training, either adapted from other tasks (such as the Metrics task (Freitag et al., 2022)) or generated artificially (pseudo data); and (3) they use ensembling for the final submission (a brief illustrative ensembling sketch is given at the end of Section 5.2).

5.2 Task 2

Three teams participated in Task 2: IST-Unbabel, HW-TSC and UT-QE. IST-Unbabel participated in all 9 language pairs, HW-TSC in all language pairs except English-Yoruba, and UT-QE only in Khmer-English and Pashto-English. As shown in Table 7, IST-Unbabel wins 7 of the 9 LPs according to the Recall at Top-K metric, and HW-TSC the remaining 2. With Bonferroni correction, IST-Unbabel wins 4 LPs, HW-TSC wins 2, and both are indistinguishable on the remaining 3 LPs. Average precision (AP) yields the same team ranking as Recall at Top-K. There is one difference in terms of winners according to the AUC metric: HW-TSC wins English-Japanese. Finally, all participating teams beat both baselines in all cases.

For sentence-level performance (see Appendix D), IST-Unbabel wins all LPs according to Pearson's correlation, and all LPs according to Spearman's correlation except for Khmer-English, which HW-TSC wins. Not all teams beat all baselines in terms of sentence-level performance.

The winning teams obtain the lowest sentence-level correlations for Chinese-English, English-Japanese and English-Yoruba, and the highest correlations for Khmer-English and English-German. This may be related to the quality of the annotations and of the MT systems involved. For word-level explainability scores, the lowest Recall at Top-K scores are obtained for English-Yoruba and English-Marathi, whereas the highest scores are obtained for Pashto-English and Khmer-English. The fact that the winning systems obtain low sentence- and word-level scores for English-Yoruba and high scores for Khmer-English may indicate that the tasks are correlated (as one may intuitively expect): a QE system that yields better sentence-level scores also highlights word-level errors more correctly.
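Returning to the ensembling ingredient listed under "Best performers" above, the sketch below shows one common form of prediction ensembling: (optionally z-normalised) weighted averaging of sentence scores from several models or checkpoints. The function and its weighting scheme are illustrative assumptions, not any particular team's recipe.

```python
import numpy as np

def ensemble_scores(score_lists, weights=None, z_normalise=True):
    """Combine sentence-level QE predictions from several models/checkpoints.

    score_lists: list of 1-D score sequences, one per model, aligned on the
    same test segments. Because the ranking metrics are scale-invariant,
    z-normalising each model's scores before (weighted) averaging is a cheap
    and common way to combine systems with different score ranges.
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in score_lists])
    if z_normalise:
        stacked = (stacked - stacked.mean(axis=1, keepdims=True)) / (
            stacked.std(axis=1, keepdims=True) + 1e-8
        )
    weights = np.ones(len(stacked)) if weights is None else np.asarray(weights, float)
    weights = weights / weights.sum()
    return (weights[:, None] * stacked).sum(axis=0)

# Illustrative use: three checkpoints scoring the same four segments.
print(ensemble_scores([[0.2, 0.8, 0.5, 0.9], [0.1, 0.7, 0.6, 0.95], [0.3, 0.75, 0.4, 0.85]]))
```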
Model Multi Multi (w/o En-Yo) En-Cs En-Ja En-Mr Kh-En Ps-En En-De En-Ru Zh-En
IST-Unbabel 0.341 0.361 0.436 0.238 0.392 0.425 0.424 0.303 0.427 0.360
Papago 0.317 0.343 0.396 0.257 0.418 0.429 0.374 0.319 0.421 0.351
BASELINE 0.235 0.257 0.325 0.175 0.306 0.402 0.359 0.182 0.203 0.104
HW-TSC – 0.218 0.424 0.258 0.351 0.353 0.358 0.274 0.343 0.246
NJUNLP – – – – 0.412 0.421 – 0.352 0.390 0.308

Table 6: Matthew Correlation Coefficient (MCC) for the submissions to WMT22 Quality Estimation Task 1
(word-level). For each language pair, results marked in bold correspond to the winning submissions, as they are
not significantly outperformed by any other system based on randomisation tests with Bonferroni correction (Yeh,
2000). Baseline systems are highlighted in grey.
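Since several of the winners' sets above are decided by randomisation tests with Bonferroni correction (Yeh, 2000), the sketch below illustrates the idea for the word-level MCC metric. It is a simplified illustration under assumptions of our own (item-level swapping, 10,000 trials), not the organisers' exact procedure, and the helper names are ours.

```python
import numpy as np

def mcc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Matthews correlation for binary OK(0)/BAD(1) word tags."""
    tp = np.sum((pred == 1) & (gold == 1)); tn = np.sum((pred == 0) & (gold == 0))
    fp = np.sum((pred == 1) & (gold == 0)); fn = np.sum((pred == 0) & (gold == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else float(tp * tn - fp * fn) / denom

def paired_randomisation_p(pred_a, pred_b, gold, trials=10_000, seed=0):
    """Approximate randomisation test: repeatedly swap the two systems'
    predictions on a random subset of items and count how often the absolute
    MCC difference is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    pred_a, pred_b, gold = map(np.asarray, (pred_a, pred_b, gold))
    observed = abs(mcc(pred_a, gold) - mcc(pred_b, gold))
    hits = 0
    for _ in range(trials):
        swap = rng.random(len(gold)) < 0.5          # which items to exchange
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        if abs(mcc(a, gold) - mcc(b, gold)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# With m pairwise comparisons against a given system, a Bonferroni-corrected
# procedure keeps it in the winners' set only if no rival has p < alpha / m.
```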

Model En-Cs En-Ja En-Mr En-Ru En-De En-Yo Km-En Ps-En Zh-En
IST-Unbabel 0.561 0.466 0.317 0.390 0.365 0.234 0.665 0.672 0.379
HW-TSC 0.536 0.462 0.280 0.313 0.252 – 0.686 0.715 0.220
BASELINE (OpenKiwi+LIME) 0.417 0.367 0.194 0.135 0.074 0.111 0.580 0.615 0.048
BASELINE (Random) 0.363 0.336 0.167 0.148 0.124 0.144 0.565 0.614 0.093
UT-QE – – – – – – 0.622 0.668 –

Table 7: Recall at Top-K for the submissions to the WMT22 Quality Estimation Task 2 (Explainable QE). For
each language pair, results marked in bold correspond to the winning submissions, as they are not significantly
outperformed by any other system based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline
systems are highlighted in grey.
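For readers unfamiliar with the Task 2 metric, the sketch below shows one common formulation of Recall at Top-K for a single sentence, following the Eval4NLP-style setup in which K equals the number of gold error tokens; the corpus-level score is then typically an average over sentences. This is an illustrative reading of the metric, not the official scorer.

```python
import numpy as np

def recall_at_top_k(importance: np.ndarray, gold_error: np.ndarray) -> float:
    """Recall at Top-K for one target sentence.

    importance: continuous explanation scores, one per target token (higher =
    more responsible for low quality); gold_error: binary gold error tags.
    K is set to the number of gold error tokens, so the score is the fraction
    of gold error tokens ranked among the K highest-scored tokens.
    """
    k = int(gold_error.sum())
    if k == 0:
        return float("nan")  # sentences without errors are typically skipped
    top_k = np.argsort(-importance)[:k]
    return float(gold_error[top_k].sum()) / k

# Example: 3 gold error tokens, 2 of them ranked in the top 3 => recall 0.67
imp = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7])
gold = np.array([0, 1, 1, 1, 0, 0])
print(recall_at_top_k(imp, gold))
```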

5.3 Task 3

In this task, we divide participants into unconstrained and constrained settings, and address each group separately. As in the previous year, this task attracted few participants, which we attribute to the recency of the task.

In the unconstrained setting, there are two participants: KU X Upstage and HyperMT - aiXplain. The first achieved very high values for the measured metrics, and is the best performer in this setting for both language pairs. The second obtained lower values, falling below the baseline on both language pairs.

In the constrained setting, a single submission was received: IST-Unbabel. Their system outperformed the baseline on both language pairs.

Model En-De (Cons) En-De (UNcons) Pt-En (Cons) Pt-En (UNcons)
KU X Upstage – 0.964 – 0.984
IST-Unbabel 0.564 – 0.721 –
BASELINE 0.074 0.855 -0.001 0.934
aiXplain – 0.219 – 0.179

Table 8: Matthews Correlation Coefficient (MCC) for the submissions to the WMT22 Quality Estimation Task 3 (Critical Error Detection). For each language pair, results marked in bold correspond to the winning submissions, as they are not significantly outperformed by any other system based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

6 Discussion

In what follows, we discuss the main findings of this year's shared task based on the goals we had previously identified for it.

General progress Participating systems achieved very promising results for most languages, including the newly introduced language pairs as well as the new annotation style (MQM). The best performing submissions showed moderate to strong correlations for the sentence-level DA and MQM prediction tasks. While it is hard to draw direct comparisons with the previous editions, the overall correlation scores obtained are similar or improved for the common language pairs. In combination with the outcomes of previous editions, it seems that multilingual and multi-task systems that are able to take advantage of multiple resources are showing better and more robust results. However, word-level quality prediction is still a challenging task and there is ample room for improvement. Along the same lines, further exploring explainability tasks that support the sentence-level predictions with word-level scores seems a promising path towards motivating finer-grained approaches to word-level quality annotations.
DA vs MQM annotations To further understand the observed discrepancies between top performances in the DA and MQM sub-tasks for sentence-level quality estimation, we analyse the distributions of predicted scores vs gold scores for each language pair, as presented in Figure 2. We can see in the scatter plots that there are multiple test segments which are annotated as perfect translations (maximum possible normalised MQM score) but fail to be classified accordingly, as indicated by the top parts of the MQM scatter plots in Figure 2. Overall, even with DA annotations, language pairs with a more balanced distribution between high- and low-quality segments (Km-En, Ps-En) are those for which QE systems obtain better correlations, compared to more skewed language pairs (En-Mr, En-Ja). Additionally, we can see that the MQM scores are significantly skewed towards higher scores, with long tails of a few very low-quality instances. This motivates revisiting the quantification of MQM annotations used to generate sentence-level scores (an illustrative scoring sketch is given at the end of this section), as well as further experiments on consolidating MQM annotations from different annotators. Furthermore, perhaps providing access to finer-grained MQM annotations (using the category or severity labels as targets) could aid in obtaining more meaningful outcomes. In future editions we intend to further expand the coverage of languages for MQM annotations, which will allow us to draw further conclusions and push the state-of-the-art further in this track.

Zero shot predictions We found that even without development data or prior knowledge about the language pair, the systems that submitted predictions for En-Yo still achieved meaningful correlations. For the quality assessment and explainability tasks, the achieved correlations are lower compared to the “seen” language pairs, but still comparable. The scatter plot distributions also show that the correlation obtained by the top-performing system is comparable with the other DA distributions. However, we noticed that the availability of the zero-shot languages in the frequently used pretrained encoders posed an additional challenge for the participants, as the performance on En-Yo seemed dependent on whether the pretrained language model had seen Yoruba text during pre-training. In future editions, we hope that mixing different zero-shot languages will further motivate unsupervised approaches.

Explainable quality estimation The performance of the baselines in Task 2 suggests that applying a model-agnostic explanation method (i.e., LIME) to a relatively good sentence-level QE system (i.e., OpenKiwi) straightforwardly may not result in plausible explanations. In particular, the OpenKiwi+LIME baseline got higher Recall at Top-K than the random baseline for only 5 LPs. Using randomisation tests with Bonferroni correction, we found that the OpenKiwi+LIME baseline significantly outperforms the random baseline for only 2 LPs (En-Cs and En-Ja). Despite its higher Pearson's correlation at the sentence level, OpenKiwi+LIME yielded random-like (or even worse) explanations for the MQM language pairs. This also calls for a stronger baseline for the future edition of the QE shared task. Additional signals/heuristics might be added to the future shared task's baselines, such as sparsity of the rationales (as used by IST-Unbabel) and alignments between source and target sentences (as used by HW-TSC and UT-QE).

Critical error detection By comparing the performance of the submitted systems, in particular the baselines, we see that the difficulty of the constrained setting is much higher. We attribute this discrepancy to the fact that the artificially generated data follows a specific set of patterns, which can be captured by current methods when given enough examples. The HyperMT - aiXplain submission seems to be an exception. However, although this system is unconstrained, it is composed of fine-tuned decision trees where the base features are constrained. We consider that these features are unable to provide sufficient information for the decision trees to identify critical errors, even when fine-tuned on the provided training data. Due to the scarcity of annotated data containing critical errors, we argue that the constrained setting presents a much more realistic challenge, where systems are trained to correlate with human judgements but are tested for robustness to critical errors. For a future edition of this task, we envision a design that simultaneously considers both correlation with human judgements and robustness to critical errors when evaluating a QE system. This can be combined with Task 1, where, besides the current evaluation method, participants would also receive a robustness score for their systems, measured on a test set with critical errors. We hope that this configuration would both attract more participants to this task (as it would not require training a specific system for critical error detection) and further motivate the treatment of critical errors in the development of QE systems.
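To make the point about quantifying MQM annotations (see the "DA vs MQM annotations" paragraph above) concrete, here is a minimal sketch of one common way to collapse MQM error annotations into a sentence-level score. The severity weights and the length normalisation are illustrative assumptions, not the exact scheme used to produce the shared-task gold scores.

```python
# Illustrative severity weights; the exact weights and normalisation used for
# the shared-task MQM scores are an assumption here, not taken from this paper.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def mqm_sentence_score(errors, num_tokens, weights=SEVERITY_WEIGHTS):
    """Collapse a list of MQM error annotations for one translation into a
    single sentence-level score: sum the severity penalties, normalise by
    length, and negate so that higher = better (0 = perfect translation)."""
    penalty = sum(weights[severity] for _category, severity in errors)
    return -penalty / max(num_tokens, 1)

# A 20-token translation with one minor fluency error and one major accuracy error:
print(mqm_sentence_score(
    [("fluency/grammar", "minor"), ("accuracy/mistranslation", "major")],
    num_tokens=20,
))
```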
Figure 2: Scatter plots for the predictions against true DA/MQM scores for the top-performing system for each
language pair. The histograms show the corresponding marginal distributions of predicted and true scores.

7 Conclusions

This year's edition of the QE Shared Task introduced a number of new elements: new low-resource language pairs (Marathi and Yoruba), new annotation conventions for sentence- and word-level quality (MQM), new test sets, and new versions of the explainability and critical error detection subtasks. The tasks attracted a steady number of participating teams and we believe the overall results are a great reflection of the state-of-the-art in QE.

We have made the gold labels and all submissions to all tasks available for those interested in further analysing the results, while newly interested participants can still access the competition instances on Codalab and directly compare their performance to other models. We aspire for future editions to continue the efforts set in this and previous years and to expand the resources and coverage of QE, while further exploring recent and more challenging subtasks such as fine-grained QE, explainable QE and critical error detection.

Acknowledgments

Ricardo Rei and José G. C. de Souza are supported by the P2020 program (MAIA: contract 045909) and by European Union's Horizon Europe Research and Innovation Actions (UTTER: contract 101070631).

André Martins and Chrysoula Zerva are supported by the P2020 program (MAIA: contract 045909), by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020.

Marina Fomicheva and Lucia Specia were supported by funding from the Bergamot project (EU H2020 Grant No. 825303).

References

Hervé Abdi. 2007. The Bonferroni and Šidák corrections for multiple comparisons. Encyclopedia of Measurement and Statistics, 3:103–107.

Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.

Duarte M. Alves, Ricardo Rei, Ana C. Farinha, José G. C. de Souza, and André F. T. Martins. 2022. Robust MT evaluation with sentence-level multilingual data augmentation. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Chantal Amrhein and Rico Sennrich. 2022. Identifying weaknesses in machine translation metrics through minimum Bayes risk decoding: A case study for COMET. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Online. Association for Computational Linguistics.

Fatemeh Azadi, Heshaam Faili, and Mohammad Javad Dousti. 2022. Mismatching-aware unsupervised translation quality estimation for low-resource languages. arXiv preprint arXiv:2208.00463.

Keqin Bao, Yu Wan, Dayiheng Liu, Baosong Yang, Wenqiang Lei, Xiangnan He, Derek F. Wong, and Jun Xie. 2022. Alibaba-Translate China's submission for WMT 2022 quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Valerio Basile, Federico Cabitza, Andrea Campagner, and Michael Fell. 2021. Toward a perspectivist turn in ground truthing for predictive computing. arXiv preprint arXiv:2109.04270.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576–3588, Online. Association for Computational Linguistics.

George Chrysostomou and Nikolaos Aletras. 2022. An empirical study on explanations in out-of-domain settings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6920–6938, Dublin, Ireland. Association for Computational Linguistics.

Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2021. Rethinking embedding coupling in pre-trained language models. In International Conference on Learning Representations.

Qu Cui, Shujian Huang, Jiahuan Li, Xiang Geng, Zaixiang Zheng, Guoping Huang, and Jiajun Chen. 2021. DirectQE: Direct pretraining for machine translation quality estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12719–12727.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2020. Ethnologue: Languages of the World (2020). URL: https://www.ethnologue.com/ (visited on Apr. 11, 2020).

Sugyeong Eo, Chanjun Park, Hyeonseok Moon, Jaehyung Seo, and Heuiseok Lim. 2022. KU X Upstage's submission for the WMT22 Quality Estimation: Critical Error Detection shared task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Marina Fomicheva, Piyawat Lertvittayakumjorn, Wei Zhao, Steffen Eger, and Yang Gao. 2021. The Eval4NLP shared task on explainable quality estimation: Overview and results. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 165–178, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Marina Fomicheva, Shuo Sun, Erick Fonseca, Chrysoula Zerva, Frédéric Blain, Vishrav Chaudhary, Francisco Guzmán, Nina Lopatina, Lucia Specia, and André F. T. Martins. 2022. MLQE-PE: A multilingual quality estimation and post-editing dataset. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4963–4974, Marseille, France. European Language Resources Association.

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021a. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.

Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, George Foster, Craig Stewart, Tom Kocmi, Eleftherios Avramidis, Alon Lavie, and André F. T. Martins. 2022. Results of the WMT22 metrics shared task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021b. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online. Association for Computational Linguistics.

Xiang Geng, Yu Zhang, Shujian Huang, Shimin Tao, Hao Yang, and Jiajun Chen. 2022. NJUNLP's participation for the WMT2022 quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, FirstView:1–28.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6098–6111, Hong Kong, China. Association for Computational Linguistics.

Hui Huang, Hui Di, Chunyou Li, Hanming Wu, Kazushige Oushi, Yufeng Chen, Jian Liu, and Jin'an Xu. 2022. BJTU-Toshiba's submission to WMT22 quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Diptesh Kanojia, Marina Fomicheva, Tharindu Ranasinghe, Frédéric Blain, Constantin Orăsan, and Lucia Specia. 2021. Pushing the right buttons: Adversarial evaluation of quality estimation. In Proceedings of the Sixth Conference on Machine Translation, pages 625–638, Online. Association for Computational Linguistics.

Fabio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, and André F. T. Martins. 2019. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.

Hyun Kim, Jong-Hyeok Lee, and Seung-Hoon Na. 2017. Predictor-estimator using multilevel task learning with stack propagation for neural quality estimation. In Proceedings of the Second Conference on Machine Translation, pages 562–568.

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay English-Hindi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Seunghyun S. Lim and Jeonghyeok Park. 2022. Papago's submission to the WMT22 quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Arle Lommel, Aljoscha Burchardt, Maja Popović, Kim Harris, Eleftherios Avramidis, and Hans Uszkoreit. 2014. Using a new analytic measure for the annotation and analysis of MT errors on real data. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation, pages 165–172.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, USA.

Nikita Mehandru, Marine Carpuat, and Niloufar Salehi. 2022. Quality estimation by backtranslation at the WMT 2022 quality estimation task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. 2022. Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Transactions of the Association for Computational Linguistics, 10:145–162.

Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan van Stigt, Craig Stewart, Pedro Ramos, Taisiya Glushkova, André F. T. Martins, and Alon Lavie. 2021. Are references really needed? Unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C. Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022. CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231.

Lucia Specia, Frédéric Blain, Marina Fomicheva, Chrysoula Zerva, Zhenhao Li, Vishrav Chaudhary, and André F. T. Martins. 2021. Findings of the WMT 2021 shared task on quality estimation. In Proceedings of the Sixth Conference on Machine Translation, pages 684–725, Online. Association for Computational Linguistics.

Chang Su, Miaomiao Ma, Shimin Tao, Hao Yang, Min Zhang, Xiang Geng, Shujian Huang, Jiaxin Guo, Minghan Wang, and Yinglu Li. 2022. CrossQE: HW-TSC 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics.

Y. Tang, C. Tran, Xian Li, P. Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. ArXiv, abs/2008.00401.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).

Yu Wan, Dayiheng Liu, Baosong Yang, Haibo Zhang,
Boxing Chen, Derek Wong, and Lidia Chao. 2022.
UniTE: Unified translation evaluation. In Proceed-
ings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Pa-
pers), pages 8117–8127, Dublin, Ireland. Association
for Computational Linguistics.
Jiayi Wang, Ke Wang, Boxing Chen, Yu Zhao, Weihua
Luo, and Yuqi Zhang. 2021. QEMind: Alibaba’s sub-
mission to the WMT21 quality estimation shared task.
In Proceedings of the Sixth Conference on Machine
Translation, pages 948–954, Online. Association for
Computational Linguistics.
Evan J. Williams. 1959. Regression Analysis, vol-
ume 14. Wiley, New York, USA.
Alexander Yeh. 2000. More Accurate Tests for the
Statistical Significance of Result Differences. In
Coling-2000: the 18th Conference on Computational
Linguistics, pages 947–953, Saarbrücken, Germany.
Eirini Zafeiridou and Sokratis Sofianopoulos. 2022.
Welocalize-ARC/NKUA’s Submission to the WMT
2022 Quality Estimation Shared Task. In Proceed-
ings of the Seventh Conference on Machine Trans-
lation, Abu Dhabi. Association for Computational
Linguistics.

A Official Results of the WMT22 Quality Estimation Task 1 (Direct Assessment)
Tables 9, 10, 11, 12, 13, 14 and 15 show the results for all language pairs and the multilingual variants,
ranking participating systems best to worst using Spearman correlation as primary key for each of these
cases.
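As a reading aid for the columns in the tables below, here is a minimal sketch of how Spearman correlation (the primary key), RMSE and MAE can be computed from a system's segment-level predictions and the gold scores; this is not the official scoring script.

```python
import numpy as np
from scipy.stats import spearmanr

def task1_metrics(pred, gold):
    """Compute the ranking key and error metrics reported for Task 1 systems."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    rho = spearmanr(pred, gold).correlation            # primary ranking key
    rmse = float(np.sqrt(np.mean((pred - gold) ** 2)))  # root mean squared error
    mae = float(np.mean(np.abs(pred - gold)))            # mean absolute error
    return {"spearman": rho, "rmse": rmse, "mae": mae}

# Toy example with four segments.
print(task1_metrics([0.1, 0.4, 0.7, 0.9], [0.2, 0.3, 0.8, 0.85]))
```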
Model Spearman RMSE MAE Disk footprint (B) # Model params
• IST-Unbabel 0.572 0.689 0.539 2,260,735,089 583,891,109
Papago 0.502 2.404 2.077 2,243,044,839 560,713,447
Welocalize-ARC/NKUA 0.448 0.794 0.632 2,307,101,417 576,733,248
BASELINE 0.415 0.979 0.820 2,280,011,066 564,527,011
lp_sunny ‡ 0.414 1.054 0.898 2,356,736,392 580,792,183

Table 9: Official results of the WMT22 Quality Estimation Task 1 Direct Assessment for the Multilingual variant.
Teams marked with "•" are the winners, as they are not significantly outperformed by any other system according to
the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates Codalab
usernames of participants from whom we have not received further information.

Model Spearman RMSE MAE Disk footprint (B) # Model params


• IST-Unbabel 0.605 0.671 0.521 2,260,735,089 583,891,109
Alibaba Translate 0.587 0.675 0.533 2,191,440 560,981,507
Papago 0.571 1.793 1.451 2,243,044,839 560,713,447
Welocalize-ARC/NKUA 0.506 0.733 0.571 2,307,068,585 576,725,041
BASELINE 0.497 0.748 0.585 2,280,011,066 564,527,011
lp_sunny ‡ 0.485 0.757 0.596 2,356,736,392 580,792,183

Table 10: Official results of the WMT22 Quality Estimation Task 1 Direct Assessment for the Multilingual (w/o
English-Yoruba) variant. Teams marked with "•" are the winners, as they are not significantly outperformed by any
other system according to the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in
grey. ‡ indicates Codalab usernames of participants from whom we have not received further information.

Model Spearman RMSE MAE Disk footprint (B) # Model params


• IST-Unbabel 0.655 0.720 0.545 2,260,735,089 583,891,109
• Papago 0.636 1.371 1.081 2,243,044,839 560,713,447
Alibaba Translate 0.635 0.746 0.607 2,191,440 560,981,507
HW-TSC 0.626 0.712 0.545 540,868,112 222,353,517
Welocalize-ARC/NKUA 0.563 0.785 0.610 2,307,068,585 576,725,041
BASELINE 0.560 0.804 0.608 2,280,011,066 564,527,011
lp_sunny ‡ 0.511 0.786 0.614 2,356,736,392 580,792,183
aiXplain 0.477 0.825 0.679 745,679,835 12,345
UCBerkeley-UMD* 0.285 1.252 0.961 – 177,853,440

Table 11: Official results of the WMT22 Quality Estimation Task 1 Direct Assessment for the English-Czech
dataset. Teams marked with "•" are the winners, as they are not significantly outperformed by any other system
according to the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates
Codalab usernames of participants from whom we have not received further information and * indicates late
submissions that were not considered for the official ranking of participating systems

Model Spearman RMSE MAE Disk footprint (B) # Model params
• IST-Unbabel 0.385 0.689 0.528 2,260,735,089 583,891,109
Alibaba Translate 0.348 0.673 0.522 2,191,440 560,981,507
HW-TSC 0.341 0.726 0.555 540,868,112 222,353,517
Papago 0.327 2.253 1.957 2,243,044,839 560,713,447
lp_sunny ‡ 0.290 0.718 0.556 2,356,736,392 580,792,183
Welocalize-ARC/NKUA 0.276 0.755 0.579 2,307,068,585 576,725,041
aiXplain 0.274 0.704 0.547 745,679,835 12,345
BASELINE 0.272 0.747 0.576 2,280,011,066 564,527,011

Table 12: Official results of the WMT22 Quality Estimation Task 1 Direct Assessment for the English-Japanese
dataset. Teams marked with "•" are the winners, as they are not significantly outperformed by any other system
according to the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates
Codalab usernames of participants from whom we have not received further information.

Model Spearman RMSE MAE Disk footprint (B) # Model params


• Papago 0.604 0.658 0.514 2,243,044,839 560,713,447
• Alibaba Translate 0.597 0.456 0.349 2,191,440 560,981,507
• IST-Unbabel 0.592 0.498 0.365 6,932,353,559 583,891,109
• NJUNLP 0.585 0.617 0.414 3,264,730,349 560,145,557
HW-TSC 0.567 0.506 0.372 222,353,517 540,868,112
aiXplain 0.493 0.540 0.396 745,679,835 12,345
Welocalize-ARC/NKUA 0.444 0.534 0.401 2,307,068,585 576,725,041
BASELINE 0.436 0.628 0.461 2,280,011,066 564,527,011
lp_sunny ‡ 0.395 0.570 0.443 2,356,736,392 580,792,183

Table 13: Official results of the WMT22 Quality Estimation Task 1 Direct Assessment for the English-Marathi
dataset. Teams marked with "•" are the winners, as they are not significantly outperformed by any other system
according to the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates
Codalab usernames of participants from whom we have not received further information.

Model Spearman RMSE MAE Disk footprint (B) # Model params


• IST-Unbabel 0.669 0.714 0.569 2,260,735,089 583,891,109
Alibaba Translate 0.657 0.778 0.596 2,191,440 560,981,507
Papago 0.653 2.786 2.291 2,243,044,839 560,713,447
Welocalize-ARC/NKUA 0.623 0.794 0.619 2,307,068,585 576,725,041
lp_sunny ‡ 0.611 0.784 0.621 2,356,736,392 580,792,183
BASELINE 0.579 0.774 0.616 2,280,011,066 564,527,011
HW-TSC 0.509 1.043 0.804 222,353,517 540,868,112

Table 14: Official results of the WMT22 Quality Estimation Task 1 Direct Assessment for the Khmer-English
dataset. Teams marked with "•" are the winners, as they are not significantly outperformed by any other system
according to the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates
Codalab usernames of participants from whom we have not received further information.

Model Spearman RMSE MAE Disk footprint (B) # Model params
• IST-Unbabel 0.722 0.719 0.575 2,260,735,089 583,891,109
Alibaba Translate 0.697 0.720 0.594 2,191,440 560,981,507
Papago 0.671 0.763 0.646 2,243,044,839 560,713,447
HW-TSC 0.661 0.729 0.592 540,868,112 222,353,517
BASELINE 0.641 0.788 0.663 2,280,011,066 564,527,011
lp_sunny ‡ 0.637 0.954 0.775 2,356,736,392 580,792,183

Table 15: Official results of the WMT22 Quality Estimation Task 1 Direct Assessment for the Pashto-English
dataset. Teams marked with "•" are the winners, as they are not significantly outperformed by any other system
according to the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates
Codalab usernames of participants from whom we have not received further information.

B Official Results of the WMT22 Quality Estimation Task 1 (MQM)
Tables 16, 17, 18 and 19 show the results for all language pairs and the multilingual variant, ranking
participating systems best to worst using Spearman correlation as primary key for each of these cases.

Model Spearman RMSE MAE Disk footprint (B) # Model params


• IST-Unbabel 0.474 0.973 0.559 2,260,735,089 583,891,109
NJUNLP 0.468 0.945 0.579 3,264,730,349 560,145,557
Alibaba Translate 0.456 0.855 0.493 2,260,733,079 565,137,999
Papago 0.449 1.332 0.990 2,243,044,839 560,713,447
lp_sunny ‡ 0.415 0.952 0.536 2,356,736,392 580,792,183
BASELINE 0.317 1.041 0.575 2,280,011,066 564,527,011

Table 16: Official results of the WMT22 Quality Estimation Task 1 MQM for the Multilingual variant. Baseline
systems are highlighted in grey. ‡ indicates Codalab usernames of participants from whom we have not received
further information.

Model Spearman RMSE MAE Disk footprint (B) # Model params


• NJUNLP 0.635 0.838 0.594 3,264,730,349 560,145,557
• BJTU-Toshiba 0.621 0.818 0.545 2,239,711,849 559,893,507
pu_nlp ‡ 0.611 0.997 0.716 1,326,455,799 237,846,178
Papago 0.582 0.906 0.556 2,243,044,839 560,713,447
IST-Unbabel 0.561 0.854 0.521 2,260,743,851 565,139,485
Alibaba Translate 0.550 0.769 0.466 2,260,733,079 565,137,999
lp_sunny ‡ 0.495 0.875 0.534 2,356,736,392 580,792,183
HW-TSC 0.494 0.953 0.612 470,693,617 117,653,760
BASELINE 0.455 0.970 0.576 2,280,011,066 564,527,011
aiXplain 0.376 0.995 0.747 368,857,948 12,345

Table 17: Official results of the WMT22 Quality Estimation Task 1 MQM for the English-German dataset. Teams
marked with "•" are the winners, as they are not significantly outperformed by any other system according to
the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates Codalab
usernames of participants from whom we have not received further information.

Model Spearman RMSE MAE Disk footprint (B) # Model params


• IST-Unbabel 0.519 0.963 0.531 2,260,743,915 565,139,485
• Alibaba Translate 0.505 0.961 0.590 2,260,733,079 565,137,999
• Papago 0.496 1.428 1.126 2,243,044,839 560,713,447
• NJUNLP 0.474 0.997 0.666 3,264,730,349 560,145,557
lp_sunny ‡ 0.453 0.915 0.548 2,356,736,392 580,792,183
BJTU-Toshiba 0.434 1.011 0.659 2,239,711,849 559,893,507
HW-TSC 0.433 1.257 0.809 2,260,780,823 565,137,436
aiXplain 0.338 1.116 0.785 368,857,948 12,345
BASELINE 0.333 1.051 0.606 2,280,011,066 564,527,011

Table 18: Official results of the WMT22 Quality Estimation Task 1 MQM for the English-Russian dataset. Teams
marked with "•" are the winners, as they are not significantly outperformed by any other system according to
the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates Codalab
usernames of participants from whom we have not received further information.

Model Spearman RMSE MAE Disk footprint (B) # Model params
• HW-TSC 0.369 1.163 0.770 2,260,780,823 565,137,436
• IST-Unbabel 0.348 1.073 0.559 2,260,735,089 583,891,109
• Alibaba Translate 0.347 0.989 0.490 2,260,733,079 565,137,999
• Papago 0.325 0.980 0.397 2,243,044,839 560,095,633
• BJTU-Toshiba 0.299 1.128 0.612 1,736,199,083 434,015,235
lp_sunny ‡ 0.298 1.064 0.525 2,356,736,392 580,792,183
NJUNLP 0.296 0.999 0.476 3,264,730,349 560,145,557
aiXplain 0.194 1.481 1.079 368,857,948 12,345
BASELINE 0.164 1.102 0.543 2,280,011,066 564,527,011

Table 19: Official results of the WMT22 Quality Estimation Task 1 MQM for the Chinese-English dataset. Teams
marked with "•" are the winners, as they are not significantly outperformed by any other system according to
the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey. ‡ indicates Codalab
usernames of participants from whom we have not received further information.

C Official Results of the WMT22 Quality Estimation Task 1 (Word-level)
Tables 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29 show the results for all language pairs and the multilingual
variants, ranking participating systems best to worst using Matthews correlation coefficient (MCC) as
primary key for each of these cases.

Model MCC Recall Precision Disk footprint (B) # Model params


• IST-Unbabel 0.341 0.466 0.810 2,260,744,555 565,139,485
Papago 0.317 0.422 0.787 2,241,394,304 560,301,035
BASELINE 0.235 0.356 0.765 2,280,011,066 564,527,011

Table 20: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the Multilingual task. Teams
marked with "•" are the winners, as they are not significantly outperformed by any other system according to the
Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• IST-Unbabel 0.361 0.494 0.830 2,260,744,555 565,139,485
• Papago 0.343 0.451 0.858 2,241,394,304 560,301,035
BASELINE 0.257 0.378 0.838 2,280,011,066 564,527,011
HW-TSC 0.218 0.404 0.628 2,336,352,552 612,368,384

Table 21: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the Multilingual w/o English-
Yoruba task. Teams marked with "•" are the winners, as they are not significantly outperformed by any other
system according to the Williams Significance Test (Williams, 1959). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• IST-Unbabel 0.436 0.578 0.852 2,260,744,555 565,139,485
• HW-TSC 0.424 0.570 0.848 2,260,780,823 565,137,436
• Papago 0.396 0.549 0.739 2,240,570,795 560,095,834
BASELINE 0.325 0.426 0.870 2,280,011,066 564,527,011

Table 22: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the English-Czech dataset.
Teams marked with "•" are the winners, as they are not significantly outperformed by any other system based on
randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• HW-TSC 0.258 0.497 0.728 2,260,780,823 565,137,436
• Papago 0.257 0.502 0.699 2,241,394,304 560,301,035
• IST-Unbabel 0.238 0.491 0.687 2,260,743,979 565,139,485
BASELINE 0.175 0.375 0.795 2,280,011,066 564,527,011

Table 23: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the English-Japanese dataset. Teams marked with "•" are the winners, as they are not significantly outperformed by any other system based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params
• Papago 0.418 0.420 0.951 2,241,394,304 560,301,035
• NJUNLP 0.412 0.472 0.939 3,264,730,349 560,145,557
• IST-Unbabel 0.392 0.414 0.947 2,260,744,107 565,139,485
HW-TSC 0.351 0.428 0.917 2,260,780,823 565,137,436
BASELINE 0.306 0.282 0.946 2,280,011,066 564,527,011

Table 24: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the English-Marathi dataset.
Teams marked with "•" are the winners, as they are not significantly outperformed by any other system based on
randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• Papago 0.429 0.762 0.660 2,241,394,304 560,301,035
• IST-Unbabel 0.425 0.779 0.555 2,260,744,107 565,139,485
• NJUNLP 0.421 0.744 0.677 3,264,730,349 560,145,557
BASELINE 0.402 0.769 0.567 2,280,011,066 564,527,011
HW-TSC 0.353 0.759 0.395 2,260,780,823 565,137,436

Table 25: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the Khmer-English dataset.
Teams marked with "•" are the winners, as they are not significantly outperformed by any other system based on
randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• IST-Unbabel 0.424 0.691 0.733 2,260,744,107 565,139,485
Papago 0.374 0.646 0.723 2,241,394,304 560,301,035
BASELINE 0.359 0.695 0.628 2,280,011,066 564,527,011
HW-TSC 0.358 0.699 0.597 2,260,780,823 565,137,436

Table 26: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the Pashto-English dataset.
Teams marked with "•" are the winners, as they are not significantly outperformed by any other system based on
randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• NJUNLP 0.352 0.351 0.980 3,264,730,349 560,145,557
• Papago 0.319 0.336 0.960 2,241,394,304 560,301,035
• IST-Unbabel 0.303 0.317 0.956 2,260,744,107 565,139,485
HW-TSC 0.274 0.292 0.954 2,260,780,823 565,137,436
BASELINE 0.182 0.213 0.970 2,280,011,066 564,527,011

Table 27: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the English-German dataset.
Teams marked with "•" are the winners, as they are not significantly outperformed by any other system based on
randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• IST-Unbabel 0.427 0.468 0.958 2,260,743,915 565,139,485
• Papago 0.421 0.381 0.966 2,241,394,304 560,713,447
• NJUNLP 0.390 0.440 0.949 3,264,730,349 560,145,557
HW-TSC 0.343 0.396 0.945 2,260,780,823 565,137,436
BASELINE 0.203 0.144 0.960 2,280,011,066 564,527,011

Table 28: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the English-Russian dataset.
Teams marked with "•" are the winners, as they are not significantly outperformed by any other system based on
randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params
• IST-Unbabel 0.360 0.327 0.966 2,260,743,915 565,139,485
• Papago 0.351 0.338 0.973 2,241,394,304 560,713,447
• NJUNLP 0.308 0.303 0.988 3,264,730,349 560,145,557
HW-TSC 0.246 0.181 0.910 2,260,780,823 565,137,436
BASELINE 0.104 0.123 0.965 2,280,011,066 564,527,011

Table 29: Official results of the WMT22 Quality Estimation Task 1 (word-level) for the Chinese-English dataset.
Teams marked with "•" are the winners, as they are not significantly outperformed by any other system based on
randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

D Official Results of the WMT22 Quality Estimation Task 2 (Explainable QE)
Tables 30, 31, 32, 33, 34, 35, 36, 37 and 38 show the results for all language pairs, ranking participating
systems best to worst using “Recall at Top-K” on target sentences as primary key for each of these cases.

Model Word-level (Target sentence) Sentence-level


Recall at Top-K AUC AP Pearson’s Spearman’s
• IST-Unbabel 0.561 0.725 0.659 0.548 0.511
• HW-TSC 0.536 0.709 0.632 0.314 0.323
BASELINE (OpenKiwi+LIME) 0.417 0.537 0.500 0.342 0.352
BASELINE (Random) 0.363 0.493 0.453 0.011 0.016

Table 30: Official results of the WMT22 Quality Estimation Task 2 for the English-Czech dataset. Teams marked
with "•" correspond to the winning submissions, as they are not significantly outperformed by any other system
based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model Word-level (Target sentence) Sentence-level


Recall at Top-K AUC AP Pearson’s Spearman’s
• IST-Unbabel 0.466 0.641 0.557 0.252 0.243
• HW-TSC 0.462 0.651 0.547 0.132 0.148
BASELINE (OpenKiwi+LIME) 0.367 0.509 0.451 0.202 0.217
BASELINE (Random) 0.336 0.503 0.418 0.028 0.019

Table 31: Official results of the WMT22 Quality Estimation Task 2 for the English-Japanese dataset. Teams
marked with "•" correspond to the winning submissions, as they are not significantly outperformed by any other
system based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in
grey.

Model Word-level (Target sentence) Sentence-level


Recall at Top-K AUC AP Pearson’s Spearman’s
• IST-Unbabel 0.317 0.667 0.448 0.585 0.467
• HW-TSC 0.280 0.625 0.412 0.317 0.426
BASELINE (OpenKiwi+LIME) 0.194 0.479 0.310 0.336 0.372
BASELINE (Random) 0.167 0.489 0.296 0.043 0.017

Table 32: Official results of the WMT22 Quality Estimation Task 2 for the English-Marathi dataset. Teams marked
with "•" correspond to the winning submissions, as they are not significantly outperformed by any other system
based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model Word-level (Target sentence) Sentence-level


Recall at Top-K AUC AP Pearson’s Spearman’s
• IST-Unbabel 0.390 0.747 0.511 0.416 0.459
HW-TSC 0.313 0.686 0.422 0.369 0.426
BASELINE (Random) 0.148 0.527 0.256 0.022 0.015
BASELINE (OpenKiwi+LIME) 0.135 0.428 0.230 0.252 0.330

Table 33: Official results of the WMT22 Quality Estimation Task 2 for the English-Russian dataset. Teams marked
with "•" correspond to the winning submissions, as they are not significantly outperformed by any other system
based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model Word-level (Target sentence) Sentence-level
Recall at Top-K AUC AP Pearson’s Spearman’s
• IST-Unbabel 0.365 0.776 0.490 0.559 0.553
HW-TSC 0.252 0.689 0.361 0.375 0.435
BASELINE (Random) 0.124 0.504 0.212 -0.049 -0.043
BASELINE (OpenKiwi+LIME) 0.074 0.442 0.172 0.370 0.414

Table 34: Official results of the WMT22 Quality Estimation Task 2 for the English-German dataset. Teams marked
with "•" correspond to the winning submissions, as they are not significantly outperformed by any other system
based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model Word-level (Target sentence) Sentence-level


Recall at Top-K AUC AP Pearson’s Spearman’s
• IST-Unbabel 0.234 0.671 0.359 0.309 0.321
BASELINE (Random) 0.144 0.514 0.246 -0.086 -0.101
BASELINE (OpenKiwi+LIME) 0.111 0.442 0.218 0.085 0.160

Table 35: Official results of the WMT22 Quality Estimation Task 2 for the English-Yoruba dataset. Teams marked
with "•" correspond to the winning submissions, as they are not significantly outperformed by any other system
based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model Word-level (Target sentence) Sentence-level


Recall at Top-K AUC AP Pearson’s Spearman’s
• HW-TSC 0.686 0.720 0.751 0.601 0.610
IST-Unbabel 0.665 0.660 0.751 0.617 0.598
UT-QE 0.622 0.628 0.694 0.222 0.190
BASELINE (OpenKiwi+LIME) 0.580 0.520 0.653 0.417 0.430
BASELINE (Random) 0.565 0.498 0.633 -0.048 -0.045

Table 36: Official results of the WMT22 Quality Estimation Task 2 for the Khmer-English dataset. Teams marked
with "•" correspond to the winning submissions, as they are not significantly outperformed by any other system
based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model Word-level (Target sentence) Sentence-level


Recall at Top-K AUC AP Pearson’s Spearman’s
• HW-TSC 0.715 0.716 0.777 0.393 0.418
IST-Unbabel 0.672 0.612 0.740 0.593 0.601
UT-QE 0.668 0.643 0.727 0.409 0.402
BASELINE (OpenKiwi+LIME) 0.615 0.503 0.676 0.378 0.403
BASELINE (Random) 0.614 0.497 0.662 -0.002 0.002

Table 37: Official results of the WMT22 Quality Estimation Task 2 for the Pashto-English dataset. Teams marked
with "•" correspond to the winning submissions, as they are not significantly outperformed by any other system
based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

Model Word-level (Target sentence) Sentence-level


Recall at Top-K AUC AP Pearson’s Spearman’s
• IST-Unbabel 0.379 0.785 0.475 0.103 0.190
HW-TSC 0.220 0.652 0.315 0.097 0.159
BASELINE (Random) 0.093 0.463 0.162 0.041 -0.010
BASELINE (OpenKiwi+LIME) 0.048 0.388 0.126 -0.007 0.159

Table 38: Official results of the WMT22 Quality Estimation Task 2 for the Chinese-English dataset. Teams marked
with "•" correspond to the winning submissions, as they are not significantly outperformed by any other system
based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are highlighted in grey.

E Official Results of the WMT22 Quality Estimation Task 3 (Critical Error Detection)
Tables 39, 40, 41 and 42 show the results for all language pairs and the multilingual variants, ranking
participating systems best to worst using Matthews correlation coefficient (MCC) as primary key for each
of these cases.
Model MCC Recall Precision Disk footprint (B) # Model params
• IST-Unbabel 0.564 0.619 0.619 2,260,735,025 565,137,435
BASELINE 0.074 0.191 0.191 2,277,430,785 569,330,715

Table 39: Official results of the WMT22 Quality Estimation Task 3 (Critical Error Detection) for the English-
German (Constrained) dataset. Teams marked with "•" are the winners, as they are not significantly outperformed
by any other system based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are
highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• KU X Upstage 0.964 0.968 0.968 2,244,861,551 559,890,432
BASELINE 0.855 0.873 0.873 2,260,734,129 565,137,435
aiXplain 0.219 0.318 0.318 2,052,963,739 12,345

Table 40: Official results of the WMT22 Quality Estimation Task 3 (Critical Error Detection) for the English-
German (UNconstrained) dataset. Teams marked with "•" are the winners, as they are not significantly outper-
formed by any other system based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems
are highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• IST-Unbabel 0.721 0.761 0.761 2,260,735,025 565,137,435
BASELINE -0.001 0.141 0.141 2,277,430,785 569,330,715

Table 41: Official results of the WMT22 Quality Estimation Task 3 (Critical Error Detection) for the Portuguese-
English (Constrained) dataset. Teams marked with "•" are the winners, as they are not significantly outperformed
by any other system based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are
highlighted in grey.

Model MCC Recall Precision Disk footprint (B) # Model params


• KU X Upstage 0.984 0.986 0.986 2,244,861,551 559,890,432
BASELINE 0.934 0.944 0.944 2,260,734,129 565,137,435
aiXplain 0.179 0.296 0.296 9,395,107 12,345

Table 42: Official results of the WMT22 Quality Estimation Task 3 (Critical Error Detection) for the Portuguese-
English (UNconstrained) dataset. Teams marked with "•" are the winners, as they are not significantly outperformed
by any other system based on randomisation tests with Bonferroni correction (Yeh, 2000). Baseline systems are
highlighted in grey.

