
Content Translation: Computer-assisted translation tool for Wikipedia articles

Niklas Laxström (University of Helsinki, Dept. of Modern Languages, and Wikimedia Foundation)
Pau Giner (Wikimedia Foundation)
Santhosh Thottingal (Wikimedia Foundation)

Abstract

The quality and quantity of articles in each Wikipedia language vary greatly. Translating from another Wikipedia is a natural way to add more content, but the translation process is not properly supported in the software used by Wikipedia. Past computer-assisted translation tools built for Wikipedia are not commonly used. We created a tool that adapts to the specific needs of an open community and to the kind of content in Wikipedia. Qualitative and quantitative data indicate that the new tool helps users translate articles more easily and faster.

© 2015 The authors. This article is licensed under a Creative Commons 3.0 licence, attribution, CC BY.

1 Introduction

Wikipedia is the most multilingual encyclopedic knowledge archive, with over 280 languages with varying amounts of content. Knowledge available for a user is limited by the languages used to access it. Translation has been a common way to expand knowledge across languages in Wikipedia. The editing activity of the top 46 language editions of Wikipedia shows that 25% of edits by multilingual users are for the same article in different languages (Hale, 2013).

It is not necessary to use any tool to translate Wikipedia articles. However, it is a complicated process and mainly done by experienced Wikipedia editors.

There were many attempts to build tools to support translation of articles. None has seen widespread use: in our research only a few users reported using those tools when translating Wikipedia articles.

In this paper we present a new approach to support translation which has been designed taking into account the unique needs of Wikipedia content and its community. Content Translation (CX) is a new tool that automates many steps of the translation process and validates the approach in practice. It was first enabled on 8 Wikipedias as an opt-in feature to create new articles in January 2015. Selected language pairs have machine translation (MT) support.

2 Previous work

MediaWiki, the software powering Wikipedia, is translated to hundreds of languages using the Translate extension. No such solution was available for translating Wikipedia articles, leaving a gap in the translation support.

There were at least ten instances of translation tools built for Wikipedia¹. Those tools can be divided into two groups based on whether the tool creators already possessed MT software. The first group is composed of companies such as Google and Microsoft, but also smaller companies and researchers. The other group of tools has been created just for Wikipedia article translation, mostly by volunteers.

Among the earliest tools were GTT by Google and WikiBhasha by Microsoft, using their own MT services (García and Stevenson, 2009; Kumarana et al., 2011). Later came Casmacat, for professionals and researchers (Alabau et al., 2013), and CoSyne, for multilingual MediaWikis (Bronner et al., 2012), unlike Wikipedias, which are monolingual.

Common to all such tools is that they are not integrated into Wikipedia. To use them one needs to go to another website or install software. CX is integrated into Wikipedia and provides a WYSIWYG editor (what you see is what you get).

¹ Details collected at https://meta.wikimedia.org/wiki/Machine_translation

3 Designing the translation experience

The design of CX was aimed at improving the existing process users followed when translating. Following the principles of User-Centered Design (Norman and Draper, 1986), we organised periodic user research sessions to (a) better understand the user needs during the existing translation process, and (b) validate new ideas on how to improve this process.

3.1 User research

We recruited 106 participants using a survey². From their responses we identified dictionaries (76% of participants used them) and Wikipedia (60%) as their most used tools when translating. MT (53%), spell checkers (48%) and glossaries (42%) were also common. Less than 6% of the participants mentioned tools specifically aimed at Wikipedia article translation, such as those described in Section 2, and no tool was mentioned by more than one participant.

We organised 16 research sessions. Sessions were organised in two parts. In the first part, contextual inquiry techniques (Beyer and Holtzblatt, 1998) were applied to observe user behaviour while translating, and identify their needs. The second part was a usability testing study (Nielsen, 1994) to evaluate different design ideas in the form of prototypes.

² https://goo.gl/iKQIDh

3.2 Design principles

The research sessions were instrumental in guiding the design of the translation experience³. The following design principles summarise the approach we followed when designing the tool.

³ A detailed design specification is available at https://www.mediawiki.org/wiki/CX

Freedom of translation

There is a significant diversity in Wikipedia content across languages. On average, two articles from different languages on the same topic have just 41% of common content (Hecht and Gergle, 2010). In contrast to other kinds of content, such as software user interface strings or documentation, Wikipedia articles are not intended to be exact translations that are always kept in sync. In order to support that content diversity, CX does not force users to translate the full article. As illustrated in Figure 1, users add content to the translation one paragraph at a time. When a paragraph is added, an initial translation based on MT is provided for the user to edit. MT is used if available, but the user can also start with the source text or an empty paragraph if that is preferred.

Unlike other tools that define a strong boundary to translate on a per sentence basis, working at a paragraph level allows users to reorganise sentences and accommodate different editing patterns.

Figure 1: The source and translated content side-by-side and additional tools on the right.

Provide context information

In CX the original article and the translation are shown side-by-side. Each paragraph is dynamically aligned vertically with the corresponding translated paragraph, regardless of the difference in length. This allows users to quickly have an overview of what has already been translated and what has not.

Contextual information reduces the need for the user to navigate and reorient. When translating a sentence, the corresponding sentence in the original document is highlighted. In addition, when manipulating the content, options are provided anticipating the user's next steps. In Figure 1, based on the user's text selection, the user can explore the article related to the selected text (in the source or target languages), or turn the selected text into a link. A dictionary can be accessed inside the tool by selecting a word or by using the search box in the tools column.

Focus on the translation

We identified many steps in the translation process that could be automated. Users spend time making sure each link they translated points to the correct article in the target Wikipedia, and recreating the text formatting that was lost when using an external translation service. They also look for categories available in the target Wikipedia to classify the translated article, and save constantly during the process to avoid losing their work.

CX deals with those aspects automatically. When adding a paragraph, the initial translation preserves the text format. Modifications to the translated content are saved automatically. Links point to the right articles if they exist, and existing categories are added to the article thanks to Wikidata⁴, a structured data knowledge repository that maps corresponding concepts across languages. As those aspects are automated, users can focus on adapting content for the initial version of the article rather than on technical and formatting tasks.

⁴ https://www.wikidata.org

Quality is key

One of the concerns raised early by the participants was about MT quality. Users were concerned about the potential proliferation of low quality content in Wikipedia articles.

In order to respond to that concern, CX keeps track of the amount of text that is added from MT without further modification by the users. When a given threshold is exceeded, a warning is shown to users encouraging them to focus on quality more than quantity.

4 WYSIWYG implementation

MediaWiki's wikitext is not standardised. For a long time, the only way to use wikitext was to render it to HTML with MediaWiki. Parsoid⁵ is a Wikimedia project that implements a second parser for wikitext. To follow the "focus on the translation" principle, we only provide limited editing and formatting options and side-step a lot of the complexity of Wikipedia article structure without negatively affecting the translation process. CX is the first translation tool that provides a WYSIWYG editor using the annotated HTML provided by Parsoid.

Some MT services neither support HTML input nor provide reordering information. Preserving markup is an essential requirement for CX because wikitext adaptation and WYSIWYG editing are based on the markup. We devised an algorithm that can reconstruct the reordering information by making the MT service do some additional work.

⁵ https://www.mediawiki.org/wiki/Parsoid

5 MT evaluation

We use subjective evaluations of MT quality for a given language pair to decide whether we will include an MT service for that language pair in the tool. To evaluate an MT service for a given language pair, we ask the potential future users of the tool to translate articles using it and tell us whether it was useful for them or not.

We run an MT service on our servers using the open-source Apertium project (Forcada et al., 2011), but we support other MT providers as well.
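As a rough illustration of the markup problem described in Section 4, the following Python sketch shows one plausible way to recover where inline markup belongs when an MT service accepts only plain text and returns no reordering information. It is not the algorithm used in CX, whose details are not given in this paper. The sketch assumes that translating an annotated fragment on its own yields a string that also appears verbatim in the translated sentence. The names `toy_mt`, `PHRASE_TABLE` and `transfer_markup` are hypothetical and invented for this example, and the extra `translate` calls stand in for the additional work asked of the MT service.

```python
from typing import Callable, List, Tuple

# (start, end, tag): a non-overlapping inline annotation over the source plain text.
Span = Tuple[int, int, str]

# Hypothetical phrase table standing in for a plain-text-only MT service.
PHRASE_TABLE = {
    "the red car is fast": "el coche rojo es veloz",
    "red": "rojo",
    "car": "coche",
    "fast": "rapido",  # deliberately differs from the in-sentence choice "veloz"
}


def toy_mt(text: str) -> str:
    """Toy translator: exact phrase lookup, returning the input when unknown."""
    return PHRASE_TABLE.get(text, text)


def transfer_markup(source: str, spans: List[Span],
                    translate: Callable[[str], str]) -> str:
    """Translate `source` and re-apply inline tags at their new positions.

    Each annotated fragment is translated separately (the extra MT calls) and
    then searched for in the translated sentence. Fragments that cannot be
    found verbatim lose their markup.
    """
    target = translate(source)
    located = []
    for start, end, tag in spans:
        piece = translate(source[start:end])
        pos = target.find(piece)
        if pos != -1:
            located.append((pos, pos + len(piece), tag))
    # Insert tags from right to left so earlier offsets remain valid.
    result = target
    for pos, stop, tag in sorted(located, reverse=True):
        result = f"{result[:pos]}<{tag}>{result[pos:stop]}</{tag}>{result[stop:]}"
    return result


if __name__ == "__main__":
    # Source markup: the <b>red</b> <a>car</a> is fast
    print(transfer_markup("the red car is fast",
                          [(4, 7, "b"), (8, 11, "a")], toy_mt))
    # prints: el <a>coche</a> <b>rojo</b> es veloz
```

The verbatim-match assumption often fails in practice (inflection, agreement, context-dependent word choice). In the toy phrase table, "fast" translated alone gives "rapido" while the sentence uses "veloz", so an annotation on "fast" would be dropped. A real implementation would need approximate matching and a fallback for fragments that cannot be located.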
6 Evaluation

Currently the tool is only available to self-selected users (most of them experienced editors), hence the results cannot be generalised to the whole community. Further studies on the resulting quality over a long term will help.

The low deletion ratio for articles created using CX suggests that there are no major problems in terms of quality. In three months of exposing the tool as an opt-in feature, 900 articles were published using CX with an overall deletion ratio lower than 1% across all languages, which is lower than the deletion rate for all new articles.

We noticed that there is a significant difference between the number of created articles in different target language Wikipedias, which cannot be explained by the number of active users, the number of available articles to translate, or the availability of MT. For example, in three months the Catalan Wikipedia saw 455 articles created by translating from Spanish with CX, but the Portuguese Wikipedia only 25. Both language pairs have MT provided by Apertium. Statistics about the tool are collected publicly⁶.

We have not yet made precise measurements of translation time savings, but we got positive reports from our users. In a roundtable⁷ organised with editors of the Catalan Wikipedia, an experienced editor reported a 70% time saving.

We found that English is the most used source language, consistent with Hale's findings on multilingual user behaviour (2013).

⁶ https://www.mediawiki.org/wiki/Content_translation/analytics
⁷ https://blog.wikimedia.org/2014/09/29/round-table-with-editors-from-the-catalan-wikipedia/

7 Conclusions

We developed a tool that addresses the specific needs of an open community and the specifics of the kind of content in Wikipedia. CX is a computer-assisted translation tool with a WYSIWYG editor and automatic link adaptation. CX supports multiple MT providers, but by integrating the open source Apertium project we were able to quickly provide MT for multiple language pairs. We developed MT education and tracking features to address community concerns about proliferation of poor quality translations.

User feedback for CX is supportive, and data also shows that the quality of the published translations is good, alleviating the community concerns. The low translation activity in multiple languages where the tool is already available needs further research. Close integration in Wikipedia allows CX to recruit users and suggest articles to translate in ways not possible with the previous tools.

References

Alabau, Vicent, Ragnar Bonk, Christian Buck, Michael Carl, Francisco Casacuberta, Mercedes García-Martínez, Jesús González, Philipp Koehn, Luis Leiva, Bartolomé Mesa-Lao, et al. 2013. Casmacat: An open source workbench for advanced computer aided translation. The Prague Bulletin of Mathematical Linguistics, 100:101–112.

Beyer, H. and K. Holtzblatt. 1998. Contextual Design: Defining Customer-centered Systems. Interactive Technologies Series. Morgan Kaufmann.

Bronner, Amit, Matteo Negri, Yashar Mehdad, Angela Fahrni, and Christof Monz. 2012. Cosyne: Synchronizing multilingual wiki content. In Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration, WikiSym '12, pages 33:1–33:4, New York, NY, USA. ACM.

Forcada, Mikel L, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M Tyers. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine translation, 25(2):127–144.

García, Ignacio and Vivian Stevenson. 2009. Reviews-google translator toolkit. Multilingual computing & technology, 20(6):16.

Hale, Scott A. 2013. Multilinguals and wikipedia editing. CoRR, abs/1312.0976.

Hecht, Brent and Darren Gergle. 2010. The tower of babel meets web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 291–300, New York, NY, USA. ACM.

Kumarana, Narend, S Ashwani, and D Vikram. 2011. Wikibhasha: Our experiences with multilingual content creation tool for wikipedia. In Proceedings of Wikipedia Conference India, Wikimedia Foundation.

Nielsen, Jakob. 1994. Usability Engineering. Interactive technologies. AP Professional.

Norman, D.A. and S.W. Draper. 1986. User centered system design: new perspectives on human-computer interaction. New Perspectives on Human-Computer Interaction Series. Lawrence Erlbaum Associates.