Localizing Apps A Practical Guide For Translators and Translation Students
The software industry has undergone rapid development since the beginning of
the twenty-first century. These changes have had a profound impact on translators
who, due to the evolving nature of digital content, are under increasing pressure
to adapt their ways of working. Localizing Apps looks at these challenges by
focusing on the localization of software applications, or apps. In each of the five
core chapters, Johann Roturier examines:
With practical tasks, suggestions for further reading and concise chapter
summaries, Localizing Apps takes a comprehensive look at the transformation
processes and tools used by the software industry today.
This text is essential reading for students, researchers and translators working
in the areas of translation and creative digital media.
Johann Roturier
First published 2015
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
711 Third Avenue, New York, NY 10017
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2015 Johann Roturier
The right of Johann Roturier to be identified as the author of this
work has been asserted by him in accordance with sections 77 and 78
of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or
reproduced or utilized in any form or by any electronic, mechanical,
or other means, now known or hereafter invented, including
photocopying and recording, or in any information storage or retrieval
system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks
or registered trademarks, and are used only for identification and
explanation without intent to infringe.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested
ISBN: 978-1-138-80358-9 (hbk)
ISBN: 978-1-138-80359-6 (pbk)
ISBN: 978-1-315-75362-1 (ebk)
Typeset in Goudy
by HWA Text and Data Management, London
Contents
List of figures xi
List of listings xii
Acknowledgments xiv
1 Introduction 1
1.1 Context for this book 1
1.1.1 Everything is an app 1
1.1.2 The language challenge 3
1.1.3 The need for localization 4
1.1.4 New challenges affecting the localization industry 6
1.2 Why a new book on this topic? 8
1.3 Conceptual framework and key terminology 9
1.4 Who is this book for? 10
1.5 Book structure 12
1.6 What this book does not cover 14
1.7 Conventions 15
2 Programming basics 16
2.1 Software development trends 17
2.2 Programming languages 18
2.3 Encodings 21
2.3.1 Overview 21
2.3.2 Dealing with encodings using Python 23
2.4 Software strings 26
2.4.1 Concatenating strings 28
2.4.2 Special characters in strings 32
2.5 Files 33
2.5.1 PO 33
2.5.2 XML 34
2.6 Regular expressions 38
2.7 Tasks 39
2.7.1 Setting up a working Python environment 40
2.7.2 Executing Python statements using a command prompt 41
2.7.3 Creating a small Python program 43
2.7.4 Running a Python program from the command line 43
2.7.5 Running Python commands from the command line 44
2.7.6 Completing a tutorial on regular expressions 45
2.7.7 Performing contextual replacements with regular expressions
(advanced) 45
2.7.8 Dealing with encodings (advanced) 46
2.8 Further reading and resources 46
3 Internationalization 49
3.1 Global apps 50
3.1.1 Components 50
3.1.2 Reuse 52
3.2 Internationalization of software 55
3.2.1 What is internationalization? 55
3.2.2 Engineering tasks 56
3.2.3 Traditional approach to the i18n and l10n of software strings 60
3.2.4 Additional internationalization techniques 64
3.3 Internationalization of content 68
3.3.1 Global content from a structural perspective 68
3.3.2 Global content from a stylistic perspective 70
3.4 Tasks 76
3.4.1 Evaluating the effectiveness of global gateways 77
3.4.2 Internationalizing source Python code 77
3.4.3 Extracting text from an XML file 79
3.4.4 Checking text with LanguageTool 80
3.4.5 Assessing the impact of source characteristics on machine
translation 80
3.4.6 Creating a new checking rule 81
3.5 Further reading 81
4 Localization basics 85
4.1 Introduction 85
4.2 Localization of software content 86
4.2.1 Extraction 86
4.2.2 Translation and translation guidelines 86
4.2.3 Merging and compilation 89
4.2.4 Testing 92
4.2.5 Binary localization 94
4.2.6 Project updates 95
4.2.7 Automation 96
4.2.8 In-context localization 97
4.3 Localization of user assistance content 98
4.3.1 Translation kit creation 100
4.3.2 Segmentation 100
4.3.3 Content reuse 103
4.3.4 Segment-level reuse 104
4.3.5 Translation guidelines 105
4.3.6 Testing 106
4.3.7 Other documentation components 106
4.4 Localization of information content 108
4.4.1 Characteristics of online information content 108
4.4.2 Online machine translation 109
4.5 Conclusions 109
4.6 Tasks 110
4.6.1 Localizing software strings using an online localization
environment 110
4.6.2 Translating user assistance content 112
4.6.3 Evaluating the effectiveness of translation guidelines 113
4.7 Further reading and resources 114
7 Conclusions 185
7.1 Programming 185
7.2 Internationalization 186
7.3 Localization 188
7.4 Translation 189
7.5 Adaptation 190
7.6 New directions 191
7.6.1 Towards real-time text localization 191
7.6.2 Beyond text localization 192
Bibliography 194
Index 204
Acknowledgments
A lot of people have helped me write this book. Writing this book was an
incredible journey, so I would like to start by thanking my wife, Gráinne, for
her patience, help and support, as well as family members and friends for their
encouragement. I would also like to thank the Series editors (Dr Sharon O’Brien
and Dr Richard Kelly Washbourne) for their patience and insightful comments
throughout the process. Assistance provided by Dr Kevin Farrell during
the editing phase was also greatly appreciated. I would also like to thank the
following organizations for allowing me to use screenshots of their applications:
Transifex, PythonAnywhere, Tilde, the Participatory Culture Foundation and
the Mozilla Foundation. My special thanks are extended to all people involved
in the open-source or standardization projects mentioned in this book. Finally
I would like to acknowledge all members from the ACCEPT, ConfidentMT,
CNGL and Symantec Localization teams, in particular Fred Hollowood, for all of
the stimulating conversations on localization-related topics over the years.
1 Introduction
This introductory chapter is divided into seven sections, covering the overall
context for this book, some justifications as to why a new book is required on
the topic of localization, a brief explanation of the key terminology used, the
intended audience, an overview of the book’s structure, the scope of the book,
and the conventions used throughout this book.
[Figure: the components of an app: interface (strings and content), help content, conversations, marketing collaterals, app functionality, and input, output and processing content.]
[Figure: app globalization, combining internationalization (i18n) and localization of strings, content, formats and functionality; localization involves translation and non-translation adaptation, operations on strings and content, input and output, testing, location and access.]
Notes
1 http://www.gnu.org/software/gettext/manual/gettext.html#Concepts
2 http://www.commonsenseadvisory.com/AbstractView.aspx?ArticleID=1416
3 http://www.cipherion.com/en/news/243-more-irish-hotels-catering-for-non-english-
speaking-tourists
4 http://www.libreoffice.org/community/localization/
5 http://bit.ly/x3NmJH
6 http://www.culturalpolicies.net/web/ireland.php?aid=519
7 http://www.culturalpolicies.net/web/germany.php?aid=518
8 http://1.usa.gov/1wzTgsX
9 http://www.oscca.gov.cn/index.htm
10 http://nerds.airbnb.com/launching-airbnb-jp/
11 http://www.telegraph.co.uk/technology/apple/9039008/Apple-iPad-outselling-HP-
PCs.html
12 http://www.gartner.com/newsroom/id/2623415
13 http://translate.twttr.com/welcome
14 http://support.microsoft.com/
15 http://www.wordfast.net/
16 http://www.linkedin.com/groups/Why-is-so-difficult-find-44105.S.42456766
17 http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
18 http://hg.python.org/peps/rev/76d43e52d978
19 A recent announcement by Adobe in fact confirmed it is stopping the development of
its Flash Player plug-in for mobile devices, since the alternative HTML5 technology
is universally supported: http://blogs.adobe.com/conversations/2011/11/flash-focus.
html
20 The code (including commands) is provided ‘as is’, without warranty of any kind,
express or implied, including but not limited to the warranties of merchantability,
fitness for a particular purpose and non-infringement. In no event shall the authors
or copyright holders be liable for any claim, damages or other liability, whether in an
action of contract, tort or otherwise, arising from, out of or in connection with the
code or the use or other dealings in the code.
21 http://localizingapps.com
2 Programming basics
1 import re
2 name = "Johann"
3 print "Hello from " + name #print text to standard output
2.3 Encodings
This semi-technical section is divided into two parts. The first one provides a
general overview of encodings, including a discussion on popular encoding
formats. The second part provides some hands-on examples on how to deal with
encodings using the Python programming language.
2.3.1 Overview
The previous section touched on key programming concepts, including
statements, variables and strings, using as an example a high-level programming
language, Python. In order to tackle more complex concepts, such as text
file manipulation, some clarification must be provided around the concept of
encoding. According to Wikipedia, an encoding ‘consists of a code that pairs
each character from a given repertoire with something else, such as a bit pattern
(…) in order to facilitate the transmission of data (generally numbers or text)
through telecommunication networks or for data storage’.4 In the very simple
example provided earlier in Listing 2.1, the data used were already present in the
program itself (e.g. “Johann”). Most of the time, however, the data that should
be manipulated comes from external sources, such as files. In these cases, it is
important to know what the encoding of these files is in order to process the
data accurately. This task may sound trivial because most programs (such as word
processing programs or text editors) often guess the encoding of files when they
open them. But when you are working with a programming language, you often
have to specify which encoding should be used. Before presenting how encoding
works in the Python programming language, the next section provides additional
background information on the concept of encoding, based on content found in
two comprehensive online resources.5, 6
Encodings must be understood in order to avoid character corruption issues
in localization projects. Such issues can occur when the original program does
not accommodate encodings other than the one used in the source language.
In such cases, there is very little a translator can do. However, an issue can also
occur when files are manipulated by a large number of stakeholders, including
people (such as translators) and systems. If a file is saved in an encoding that
differs from what is specified in localization guidelines, problems may occur later
on in the localization workflow. When dealing with multilingual text, problems
related to encodings must be addressed. As mentioned in the previous section,
computers only understand series of bits (1 or 0). These bits are grouped in bytes,
a byte being a group of precisely 8 bits used to encode a single character of text
in a computer. Most humans only understand a few natural languages, which
consist of a number of characters, possibly using a number of alphabets. For
instance, Japanese speakers will be familiar with ideograms (Kanjis), but will also
rely on phonetic syllabaries (such as Katakana and Hiragana) to express certain
words (such as loan words or function words). Foreign words (such as English
words) may also sometimes occur in the middle of Japanese text, so all of these
characters must be representable in a common format so that information can
be smoothly exchanged between a computer in Japan and another computer,
say, in Germany. These days, the Unicode standard allows for the exchange of
such multilingual information using a number of encodings.7 For example, the
Universal Character Set Transformation Format 8-bit (UTF-8) encoding is now
the preferred encoding for Web pages.8
The situation was, however, very different years ago, when computers were not
networked (and thus encoding incompatibilities were far less frequent). In order to
translate bytes (which do not have any meaning by themselves) into characters, a
convention is required. For example, the alphabet used by the English language relies
on a limited number of characters, which, for many years could be encoded using
a small, compact code called ASCII (American Standard Code for Information
Interchange). This code assigns a single byte to a specific character (for example,
66 for the upper case letter B). Similar codes existed for other languages, but each
was only good for representing one small slice of human language. For example,
8859-1 offered full coverage for languages such as German or Swedish but only
partial coverage for French (since characters such as œ were missing).
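To make this concrete, the following short illustration (a constructed example, assuming the Python 2 environment used throughout this chapter) shows that characters map to numbers and that not every character can be represented in every legacy encoding:

# -*- coding: utf-8 -*-
print ord("B")                  #prints 66, the code assigned to upper case B
print u"œuvre".encode("utf-8")  #works: œ can be represented in UTF-8
try:
    print u"œuvre".encode("iso-8859-1")
except UnicodeEncodeError:
    print "œ cannot be represented in ISO 8859-1"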
Besides, while this approach works well for languages that rely on a limited
number of characters (fewer than 256), it fails for those that require thousands
of characters (such as Chinese and Japanese). In Japan and China, this problem
was solved by the DBCS system (the double byte character set) in which some
letters were stored in one byte and others took two. These DBCS encodings then
evolved into multi-byte character sets (such as Shift-JIS, GB2312 and Big5)
which fall outside of the Unicode code page. The latter contains more than
65,536 possible characters.
In Unicode, a letter maps to a code point, which is a theoretical concept. For
every alphabet, the Unicode consortium assigns every letter a special number of
the form U+0639. This special number is called a code point and Unicode has
capacity for 1.1 million code points. While 110,000 of these are already assigned,
there is room to handle future growth. These code points must, however, be
encoded into bytes to be understood by computers. Multiple encodings exist,
including the traditional two-byte method called UCS-2 (because it has two
bytes) or UTF-16 (because it has 16 bits). A third encoding is the popular UTF-8
standard, which was mentioned earlier. Unicode code points can also be encoded
in legacy encoding schemes, but with the following caveat: some of the letters
might disappear. If there is no equivalent for the Unicode code point in the
target encoding, a question mark ? may appear instead.
Listing 2.3 Decoding the content of a file into a Unicode string using Python 2.x
Python provides a dedicated module for this purpose. To access this module, it must first be imported, as shown on line 9. The
next statement (on line 11) is very similar to the one used on line 9 in Listing
2.2. This time, however, the file is opened using a specified encoding (UTF-8) so
that the decoding is done at the same time. If we check the length of the resulting
object on line 20, 5 is obtained again, showing that both approaches generate the
same result.
In the example in Listing 2.3, we have assumed that the encoding of the file
was UTF-8 (because this is the encoding we used when saving the file). However,
it would be very easy to come across an encoding problem if we tried to use the
wrong encoding when opening the file (say, UTF-16).
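By way of illustration, the general approach described above might look as follows in Python 2 (the file name used here is an assumption made for this sketch rather than the file used in Listing 2.3):

import codecs

#Open the file and decode its content from UTF-8 while reading it
with codecs.open("example.txt", "r", encoding="utf-8") as input_file:
    content = input_file.read()

print len(content) #length of the resulting Unicode string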
Listing 2.4 Selecting specific characters from a string using their position
1 #Small game program asking a user to find a random number
2
3 #Tell program to import the "random" module
4 import random
5
6 #Generate a random number between 0 and 5
7 secret_number = random.randint(0,5)
8
9 #Question to the user
10 question = "Guess the number between 0 and 5 and press Enter."
11
12 while int(raw_input(question).strip()) != secret_number:
13     pass
14 #Tell user that they have won the game
15 print "You've found it! Congratulations"
up to the fifth character of the original string (but not including it). This may be a
bit confusing at first, especially since the first character has an index of 0, but this
is something that becomes easier to remember with a bit of practice.
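For example (using a string chosen purely for this illustration):

my_string = "localization"
print my_string[0]   #prints "l", the character at index 0
print my_string[0:5] #prints "local", i.e. the characters at indices 0 to 4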
Strings can be used in a number of contexts, not only to record textual
information for processing, but also to help the users of a program interact with
the program itself. For example, let’s look at the content of a small program called
secret.py, shown in Listing 2.5.
This program is very simple and, as mentioned previously, utilizes programmer’s
comments in the lines starting with the # symbol. For instance the first line
tells us that this is a game program that asks a user to find a random number.
The first line of code (statement) is actually on the fourth line, with the import
of functionality providing mechanisms to generate random numbers. The next
statement is on line 7 where a random number between 0 and 5 is generated. The
subsequent statement is on line 10 where a string is used. This string is going to
be used in the question that will be presented to the user of the program. The
next part of the program (line 12) is the core of the program. Computers are
very good at repetitions, so multiple statements are sometimes grouped together
in a sequence that is specified once but that may be executed several times in
succession. Such a sequence is known as a loop and this program uses a while
loop. This loop performs several steps:
The fifth step has two possible outcomes: if the given answer does not match
the secret number (the comparison being made with the != operator), line 13 is
$ python secret.py
Guess the number between 0 and 5 and press Enter.2
Guess the number between 0 and 5 and press Enter.5
You’ve found it! Congratulations
executed and the program passes. This means that the loop will return to the first
step and present the user with the question again. However, if there is a match,
the loop will be exited and the next line will be executed. In this case, success
will be achieved and the user will be notified. The second string of this program
occurs on line 15 as part of the print statement that lets the user know that they
have won the game. An example of what the user will see when playing the game
is shown in Listing 2.6.
Multiple lines are present in Listing 2.6. The first line, which starts with a dollar
sign character ($) corresponds to the command prompt followed by the command
that is used to execute the program. More information on command-line prompts
is provided in Section ‘Setting up a local working Python environment’. The
next two lines correspond to text that was shown to the user (starting with Guess
and finishing with Enter.) and user input (in this case, 2 and 5). In this example,
the user found the secret number after two attempts. When the program was first
run, the user was presented with the question and the answer they typed was
2. This did not match the secret number so the while loop was run again and
the question was posed again. The second time the answer given was 5, which
happened to match the secret number. The loop was therefore exited and the user
was greeted with a congratulatory message.
This simple program works fine, but there are a few modifications that can be
made in order to make it more flexible and easier to maintain in the future. These
modifications will help us introduce an important topic in programming and in
localization: the combination (or concatenation) of strings.
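For instance, once the maximum number is chosen by the user rather than fixed in advance, the question string can no longer be completely hard-coded. One naive way of building it (a sketch rather than the exact code of secret2.py) is to concatenate smaller strings:

max_number = int(raw_input("Select a maximum number:"))
question = "Guess the number between 0 and " + str(max_number) + \
           " and press Enter."

As discussed later in this book, this kind of concatenation can cause serious problems once strings have to be translated, which is why substitution approaches such as the ones shown below are often preferred.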
$ python secret2.py
Select a maximum number:10
Guess the number between 0 and 10 and press Enter.2
Guess the number between 0 and 10 and press Enter.5
Guess the number between 0 and 10 and press Enter.8
Guess the number between 0 and 10 and press Enter.1
You’ve found it! Congratulations
lg = "is"
2f = "Substitution"
3my_string = "#
/«s #
/0s fun" #
/« (f, g)
lg = "is"
2f = "Substitution"
3my_string = "#
/,(topic)s #
/,(copula)s fun" °/0 {"topic": f, "copula": g>
int(raw_input(question).strip())
$ python secret2a.py
Select a maximum number:10
Guess the number between 0 and 10 and press Enter.
2
Guess the number between 0 and 10 and press Enter.
5
Guess the number between 0 and 10 and press Enter.
8
Guess the number between 0 and 10 and press Enter.
1
You've found it! Congratulations
question = "Guess the number between \"%d\" and \"%d\" and press
Enter.\n" \
% (min_number, max_number)
If the user specified 0 and 10 as input numbers, the string would appear as follows
when the program executes the statement from the while loop:
Guess the number between "0" and "10" and press Enter.
2.5 Files
Translators working in the localization industry must have an advanced
understanding of file formats. For example, the previous section focused on the
files used by the Python programming language, files usually ending with a .py
extension. It is, however, unlikely that translators will be given such files to
translate directly. As we will see in Section 3.2.3, files containing source code
are usually analysed by a program in order to extract translatable resources.
Such resources are then made available to translators in a container, which is
sometimes referred to as the translation kit (or transkit). This translation kit can
be passed by the client to a number of stakeholders, including language service
providers and translators. This transkit should ideally contain translatable strings,
but also any resources to be used during the translation process, such as glossaries,
translation memory matches, possibly machine-translation suggestions, and
translation guidelines. Depending on the amount of information and content
they contain, translation kits can vary in nature: some of them may be made
available to translators via an online application; others may be encapsulated
in a proprietary file format that can only be opened by a proprietary desktop
application. Finally, some projects may be encapsulated in an open format, such
as the Portable Object (PO) or XLIFF formats, which are discussed in the two
following sections.
2.5.1 PO
The PO format originates from the open-source GNU gettext project, which has
been used extensively to localize multiple applications making use of programming
languages such as C, PHP or Python (often in a Linux environment).13 PO files,
which are known as catalog files or catalogs, are text files that can be edited
with a simple text editor or a dedicated PO editor such as Poedit.14 The overall
structure of a PO entry is as follows:
blank line
# comments-by-translators
#. comments-extracted-from-source-code
#: origin-of-source-code-string
#, options-such-as-fuzzy
#| msgid previous-source-string
msgid "source-string"
msgstr "target-translated-string"
2.5.2 XML
XLIFF, the XML Localization Interchange File Format, is a type of XML
document which is used to exchange information during a localization
project.15 In order to understand better what XLIFF is, it is necessary to first
explain what XML is. XML can be described as a markup language, which,
as pointed out by Savourel (2001), is composed of two different components.
The first one is a metalanguage, with syntactic rules allowing the definition
of multiple formats. The second component is an optional document type
definition (DTD) which defines a format for a specific purpose using a pre-
defined number of keywords (known as the vocabulary). XML is one of these
metalanguages, which explains why multiple types of XML exist, serving very
different purposes. For example, XSL (Extensible Stylesheet Language) is a
type of XML which is used to transform XML documents into other formats,
while SVG (Scalable Vector Graphics) can be used to handle text and vector-
based graphics. In the software publishing industry, XML is commonly used to
create source documents (such as How To topics) because once the document
has been created, it can be transformed into multiple output formats, such as
1 <para>
2  <indexterm xml:id="tiger-desc" class="startofrange">
3   <primary>Big Cats</primary>
4   <secondary>Tigers</secondary></indexterm>
5  The tiger is a very large cat indeed...
6 </para>
7 ...
8 <para>
9  So much for tigers<indexterm startref="tiger-desc" class="endofrange"/>.
10  Let's talk about leopards.
11 </para>
an HTML page or a PDF file. This means that it is not necessary to create the
same information twice. Examples of popular XML DTDs used for source text
authoring include DITA (Darwin Information Typing Architecture), Docbook
and oManual.16, 17, 18 Listing 2.15 shows what a Docbook snippet looks like,
with text surrounded by markup.19
In this example provided under the terms of the GNU Free Documentation
License, a number of tags are used.20 A tag starts with a < character and ends
with a > character and contains some of the DTD’s pre-defined keywords. A
tag consists of a name, such as indexterm, and may also have attributes (such
as class). Attributes are additional properties, which provide supplementary
information about the tag or the tag’s contents (which may be textual).
Attributes have values, such as startofrange, which may be used to store
metadata (i.e. information about the data). These values may be pre-defined
or used in a customized manner. Let’s examine each line one at a time to
understand better what each tag does.
The first line contains an opening para tag without any attribute. This tag is
used to create a standard paragraph element (say within a chapter or an article).
The second line contains an opening indexterm tag, which is nested below the
para element (as shown by the indentation). Since the indexterm element
belongs to the para element, this relationship is often described as a parent/child
relationship. In this example, the indexterm element is a child element of the
para element. Such an element is used to identify text that must be placed in
the index of the document. This indexterm element has a couple of attributes:
the first one is xml:id and the second one is class. These attributes have the
tiger-desc and startofrange values respectively. As mentioned earlier, these values
provide additional information (known as metadata) about the actual content
contained in the XML structure. The value of the xml:id attribute and the value
of the class attribute indicate that this indexterm points to a document range
(rather than a single point in the document). The third line contains another
opening tag, this time a primary tag which is a child element of the indexterm
element. This primary element does not have any attribute but contains textual
content (Big Cats), which would appear in the index of the document. Finally, a
closing primary tag is used to indicate the end of the primary element. A closing tag
resembles an opening tag, except that the < character is followed by a forward
slash character. Unlike an opening tag, a closing tag cannot have any attribute.
The fourth line contains another child element of indexterm, a secondary
element. This element comprises an opening tag, textual content (Tigers) and a
closing tag. Finally this line contains the closing tag for the indexterm element.
In XML documents, the syntactic structure is created by the tags rather than the
line breaks or the indentation, which is why multiple elements are sometimes
present on the same line. This closing tag marks the beginning of the range
which is linked to this indexterm. This range starts on line 5, with the textual
content of the para element. Unsurprisingly this content refers to a tiger, with
the text starting with The tiger is a very large cat indeed…. This para element
ends on line 6 with a closing tag. Line 7 contains multiple elliptical dots which
indicate that the document may contain additional content, which still belongs
to the range specified earlier with startofrange. A new paragraph starts on line
8 with an opening para tag, followed by textual content on line 9, which still
refers to tigers. Line 9 finishes with an indexterm element, which happens to
be an empty element. An empty element contains some information (such as
attribute values), but does not contain any textual content. Such elements are
easily identifiable with a forward slash preceding the closing > character. This
indexterm element is used here to indicate the end of the tiger-desc range which
had been created on line 2. This is confirmed by the textual content on line 10,
which mentions leopards. Finally, the second paragraph of this example finishes
on line 11, with the closing para tag. This example shows that XML markup
can be very useful to create (invisible) boundaries which span multiple logical
sections (such as paragraphs).
In the localization industry, XML is also extremely prevalent, with DTDs
such as TMX or XLIFF. TMX is the Translation Memory eXchange format,
which was initially developed by a special interest group of the now defunct
Localization Industry Standards Association (LISA).21 This format can be
used to export the content of a translation memory database into another
application. This scenario is likely to occur when multiple stakeholders are
involved. Some of these stakeholders may have a preference with regard to the
application that should be used during the translation process. In order to reuse
previous work stored in a different application, however, one needs to be able
to export and import translation memory segments. This is when TMX comes
to the rescue, by providing a container (DTD) which is understood by most
modern translation memory applications. Listing 2.16 shows an example of
such a document provided by the Okapi framework under a Creative Commons
3.0 BY-SA license.22, 23
Based on the detailed description provided for Listing 2.15, the XML structure
presented in Listing 2.16 should be quite straightforward to understand. This
example contains two tu elements, which correspond to translation units. Each tu
contains two child tuv elements, which differ based on the value of their xml:lang
attributes. For each translation unit, the first tuv element has an attribute value of
en-us while the second has a de-de value. These values refer to the American English
1 <?xml version="1.0" encoding="UTF-8"?>
2 <tmx version="1.4"><header creationtool="oku_alignment" creationtoolversion="1" segtype="sentence" o-tmf="okp" adminlang="en" srclang="en-us" datatype="x-stringinfo"></header><body>
3 <tu tuid="APCCalibrateTimeoutAction1_s12">
4 <prop type="Txt::FileName">file1_en.info</prop>
5 <prop type="Txt::GroupName">APCCalibrateTimeoutAction1</prop>
6 <prop type="Att::Test">TestValue</prop>
7 <tuv xml:lang="en-us"><seg>Follow the instructions on the screen.</seg></tuv>
8 <tuv xml:lang="de-de"><seg>Den Anweisungen auf dem Bildschirm folgen.</seg></tuv>
9 </tu>
10 <tu tuid="APCControlNotStableAction2_s10">
11 <prop type="Txt::FileName">file1_en.info</prop>
12 <prop type="Txt::GroupName">APCControlNotStableAction2</prop>
13 <prop type="Att::Test">TestValue</prop>
14 <tuv xml:lang="en-us"><seg>Repeat steps 2. and 3. until the alarm no longer recurs.</seg></tuv>
15 <tuv xml:lang="de-de"><seg>Schritte 2 und 3 wiederholen, bis der Alarm nicht mehr auftritt.</seg></tuv>
16 </tu>
17 </body>
18 </tmx>
and German (from Germany) locales, as shown by the textual content present in
the respective tu elements. Each tu element also contains additional information
(metadata) in the value of its tuid attribute and in child prop elements (such as
the name of the file where the segment originated from).
As mentioned earlier, XLIFF is popular in the localization industry. For example,
a software publisher or language service provider may look after the extraction
of translatable content from source files (including code and documentation).
However, the actual translation may be done by a translator so content must flow
from one stakeholder to another as smoothly as possible (without information
loss). XLIFF may be used in this context to allow the transport of the information
from one system to another. Systems that make use of the XLIFF standard
sometimes need to extend it to add system-specific information. To achieve this,
the namespace mechanism may be used, whereby vocabularies from several DTDs
may be used in a single XML document. This can add complexity in some cases
because to make use of non-XLIFF information, systems must be aware of these
extra DTDs (which may not always be the case). Listing 2.17 shows an example
of an XLIFF document also provided by the Okapi framework under a Creative
Commons 3.0 BY-SA license.24, 25
The example provided in Listing 2.17 should be quite familiar after the
examples provided in Listing 2.15 and Listing 2.16. The first two lines of the
document refer to the version of the XLIFF DTD and namespace being used
(XLIFF 1.2). The third and fourth lines contain project-level information (a
file element with original, source-language and target-language attributes).
1 <?xml version="1.0" encoding="UTF-8" ?>
2 <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
3 <file datatype="x-sample" original="sample.data"
4  source-language="EN-US" target-language="FR-FR">
5  <body>
6   <trans-unit id="1" resname="Key1">
7    <source xml:lang="EN-US">Untranslated text.</source>
8   </trans-unit>
9   <trans-unit id="2" resname="Key2">
10   <source xml:lang="EN-US">Translated but un-approved text.</source>
11   <target xml:lang="FR-FR">text traduit mais pas encore approuvé.</target>
12  </trans-unit>
13  <trans-unit id="3" resname="Key3" approved="yes">
14   <source xml:lang="EN-US">Translated and <g id='1'>approved</g> text.</source>
15   <target xml:lang="FR-FR">Texte traduit et <g id='1'>approuvé</g>.</target>
16  </trans-unit>
17  <trans-unit id="4" resname="Key4">
18   <source xml:lang="EN-US">Some other text.</source>
19   <alt-trans>
20    <source xml:lang="EN-US">Other text.</source>
21    <target xml:lang="FR-FR">Autre text.</target>
22   </alt-trans>
23  </trans-unit>
24 </body>
25 </file>
26 </xliff>
2.7 Tasks
This section contains six basic tasks and two advanced tasks:
1 Setting up a working Python environment
2 Executing Python statements using a command prompt
3 Creating a small Python program
4 Running a Python program from the command line
5 Running Python commands from the command line
6 Completing a tutorial on regular expressions
7 Performing contextual replacements with regular expressions (advanced)
8 Dealing with encodings (advanced)
$ python /home/j3r/scrap/first.py
hello world
$ cd /home/j3r/scrap/
$ python first.py
hello world
our program). For this command to succeed, however, we need to make sure
that the Python interpreter can find the first.py program. If this program is not
located in the current working directory, the error shown in Listing 2.22 will
occur.
To solve this problem, two solutions exist. The first solution consists in
providing the absolute name of the file (including the directory where it is
located). If you are using a Windows system, you should include double quotation
characters before and after the file name (e.g. “C:\user\My Documents\first.py”).
The second one consists in changing the working directory to the directory
containing the first.py file (using the cd command). The two solutions are shown
in Listing 2.23.
If you have decided to use PythonAnywhere as your working environment,
you will need to start a different console, a Bash console as shown in Figure 2.3.
When you do so, you will be presented with a command-line environment, in
which you can run your Python program. Note that this environment allows you
to upload files or even create files using a Web interface (using the Files tab from
Figure 2.3.)
1 Import the modules giving you access to codecs and regular expressions
functionality.
2 Read the content of a TMX file as UTF-8 and store its content in a variable.
3 Define a contextual regular expression to find all occurrences of a target
language word. This expression should be defined in such a way that source
language words will not be found.
4 Replace all occurrences of this word with a word of your choice and print the
resulting content to screen.
Once you have created this program and saved it in a file, you should be able
to run it from the command line.
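One possible sketch of such a program is shown below; it assumes Python 2 (as elsewhere in this chapter), a TMX file called sample.tmx that is similar to Listing 2.16, German as the target language and Alarm as the word to replace (all of these choices are assumptions made for this example):

# -*- coding: utf-8 -*-
#Step 1: import the codecs and regular expressions modules
import codecs
import re

#Step 2: read the content of a TMX file as UTF-8
with codecs.open("sample.tmx", "r", encoding="utf-8") as tmx_file:
    content = tmx_file.read()

#Step 3: contextual expression matching "Alarm" only inside de-de segments,
#so that occurrences in the en-us source segments are left untouched
pattern = re.compile(ur'(<tuv xml:lang="de-de"><seg>[^<]*?)Alarm')

#Step 4: replace all occurrences and print the resulting content to screen
print pattern.sub(ur'\1Signal', content)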
Notes
1 http://www.codecademy.com/tracks/python
2 http://www.bleepingcomputer.com/tutorials/windows-command-prompt-
introduction
3 http://www.ee.surrey.ac.uk/Teaching/Unix
4 http://en.wikipedia.org/wiki/Character_encoding
5 http://nedbatchelder.com/text/unipain.html
6 http://www.joelonsoftware.com/articles/Unicode.html
7 http://www.unicode.org/standard/standard.html
8 http://www.w3.org/QA/2008/05/utf8-web-growth
9 https://www.pythonanywhere.com
10 This example, like other code snippets from this section, can be found on the book’s
companion Web site.
11 http://en.wikipedia.org/wiki/Hard_coding
12 Error messages can be sometimes slightly cryptic, especially when one starts learning
a language. Copying and pasting these error messages into a search engine often
provides valuable information since it is quite frequent for a problem to have been
previously experienced by other users.
13 http://www.gnu.org/software/gettext/manual/gettext.html#PO-Files
14 http://www.poedit.net/
15 At the time of writing, version 1.2 was the official OASIS standard (http://docs.oasis-
open.org/xliff/xliff-core/xliff-core.html) but version 2.0 was on the verge of replacing it.
16 http://dita.xml.org
17 http://docbook.org
18 http://www.omanual.org/standard.php
19 http://www.docbook.org/tdg5/en/html/ch02.html#ch02-makefrontback
20 http://www.docbook.org/tdg5/
21 http://www.gala-global.org/oscarStandards/tmx/tmx14b.html
22 https://code.google.com/p/okapi/source/browse/website/sample14b.tmx
23 http://creativecommons.org/licenses/by-sa/3.0/
24 https://code.google.com/p/okapi/source/browse/website/sample12.xlf
25 http://creativecommons.org/licenses/by-sa/3.0/
26 Accessible from the book’s companion site.
27 http://www.python.org/images/terminal-in-finder.png
28 http://docs.python.org/2/using/windows.html#installing-python
29 http://www.python.org/download/releases/
30 http://windows.microsoft.com/en-US/windows7/Command-Prompt-frequently-
asked-questions
31 http://docs.python.org/2/using/windows.html#configuring-python
32 http://showmedo.com/videotutorials/video?name=960000&fromSeriesID=96
33 https://www.pythonanywhere.com
34 http://ipython.org
35 http://notepad-plus-plus.org/
36 Accessible from the book’s companion site.
37 http://en.wikipedia.org/wiki/GB_18030
38 https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
39 http://greenteapress.com/thinkpython/html/
40 https://developers.google.com/edu/python/
41 http://www.codecademy.com/tracks/python
3 Internationalization
3.1.1 Components
The application used as an example in this book is a very simple Web application
that can be accessed by any Web browser.2 These days a lot of applications are
written in such a way in order to reach a wide range of users regardless of the
operating system they are using. In our example, the Web application itself is
written using a combination of technologies, including the Python programming
language that was introduced in Chapter 2, HTML (which is the main markup
language for displaying Web pages), and JavaScript libraries (namely JQuery,
JQuery UI and JQuery mobile).3 JavaScript is another programming language
which can be interpreted by Web browsers in order to create rich user interfaces
and make Web pages more dynamic. In our example the application is
accompanied by a set of additional HTML and PDF pages, which are generated
from XML content. While these pages could easily be generated by the Web
application itself, it seems important to introduce a number of technologies and
file formats to show and discuss multiple internationalization and localization
strategies.
The main component of our Web application is powered by the Python
programming language thanks to functionality made available within the Django
Web framework. The Django framework is an open-source project whose goal is
to make it easier and quicker to build Web applications with less code.4 Without
going into detail of this framework, it is important to present some of its key
components. The Django framework makes it possible to build applications
in a reusable manner based on a clear distinction between content storage,
manipulation and presentation. This approach is very different from earlier Web
sites (say, static HTML pages), which used to mix these three components, making
content maintenance and updates very difficult. Besides providing this modular
approach, the Django framework also offers great support for internationalization
and localization, which allows us to show the differences between an application
that is not internationalized and an application that is internationalized. In order
to develop our Web application, the following steps were required:
1 Decide how to store the data (content) used by the Web application. In our
case, the content is a set of sport (basketball) news items generated by a news
provider. These items are stored in a database for easy retrieval.
2 Decide how to present the content to the user. In our application, this
is done using a list, but other methods could be employed (e.g. a table, a
carousel). Since this presentation layer may change independently of the
data, templates (such as the ones used by the Django framework) are often
used to allow for quick modifications of the final appearance of the HTML
page.
3 Decide which functionality to make available to users. Our simple application
has only limited functionality since the only actions that can be performed
from the page include filtering the news items based on specific words and
going to the news provider’s Web site to read more about a particular news
item or player.
4 Give a name to this application. Since its purpose is to provide news items
related to the National Basketball Association (NBA) to a large audience,
the name NBA4ALL was chosen.
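To give a concrete, if deliberately simplified, idea of how these decisions translate into code, the sketch below shows a minimal Django model and view. The class, field and template names are assumptions made for this illustration and do not correspond to the actual NBA4ALL code:

from django.db import models
from django.shortcuts import render

class Headline(models.Model):
    #Storage: each news item is kept in the database
    title = models.CharField(max_length=200)
    description = models.TextField()
    published = models.DateTimeField()

def home(request):
    #Functionality: retrieve the ten most recent items...
    headlines = Headline.objects.order_by("-published")[:10]
    #...and hand them over to a template (presentation)
    return render(request, "home.html", {"headlines": headlines})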
3.1.2 Reuse
Reusing some (or most) components during the development and publishing
of an app is a core principle of the software publishing industry. Whenever
possible, software developers will reuse existing functionality instead of creating
it from scratch. Existing functionality can be found in previous apps or in
external collections of functionality (libraries or frameworks), which may be
licensed commercially or freely. There are cases when it does make sense to start
from scratch, but the myriad of open-source projects available on sites such as
Github or Bitbucket are a great place to start reusing somebody else’s code (if the
license allows it of course).5, 6 Reuse is such a pervasive element in the software
development lifecycle that it has a major impact on (at least) two aspects of a
global application. First of all, some of the text strings used to create the user
interface may be reused from one place to another in order to save precious time
and lines of code. As explained in the second part of this chapter this strategy can
work well in some cases, but it may have serious consequences when the context
changes. Second, some content (such as a file) can be written once but reused
multiple times in a variety of contexts. The previous section already covered
this scenario since the NBA4ALL application relies on HTML content that is
generated using the same Python code regardless of the target device used to
access the application.
A similar reuse approach can be used to generate a number of documentation
files from a single source file. In the past, some of the documentation of a software
product was created in a word processing application, such as Microsoft Word or
Adobe FrameMaker, without necessarily following a strict template or schema.
The source files were then transformed into an output format such as a PDF file,
whose layout often had to be tweaked by desktop publishers before it could be
published. These days, source file formats based on markup languages such as
XML are regularly used for the creation of documentation sets. Adding structure
to the source content makes it easier to manage (and reuse); these tasks can be
supported by commercial programs.7, 8 A format such as XML also
presents the advantage of being easy to manipulate by (automatic) systems,
which means that adjustments to the generated output files are not as frequent as
they used to be in the past. XML can be used to create several output types,
including HTML and PDF as examples in this section. The first step in creating
global content is to start with a source file (which can be created using a text
editor or dedicated XML editor), as shown in Listing 3.1.
The format presented in Listing 3.1 should look familiar based on what was
introduced in Section 2.5.2. This document starts with an XML declaration to
refer to a specific version of the DocBook standard. It is then composed of an article
<?xml version="l.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http ://docbook.org/xml/4.2/docbookx.dtd">
<article>
<t itle>NBA4ALL Documentat ion</t itle>
<sectl>
<title>Filtering a list of headlines</title>
<para>
By default, ten headlines are shown in the application’s main page.
In order to filter this list, a word can be entered in the Search
box.
The list will change as soon as you start typing in the text box.
</para>
<note><title>Limitations</title>
<para>It is currently not possible to search for multiple
words.</para>
</note>
<tip><title>Tip</title>
<para>Both titles and descriptions are searched.
Searching for generic words may return more results than originally
thought.</para>
</tip>
</sectl>
</article>
element, which contains a title and a section (sect1). This section comprises a
title, a paragraph, a note and a tip. Both the note and the tip contain a title and
a paragraph. The purpose of the document is to describe the core functionality of
the NBA4ALL application. Even though the application is extremely basic, some
of its characteristics may be worth describing to aid novice users. For example,
the search functionality of the application only supports one word, so a note
element is used to mention this limitation. An additional recommendation is
also provided in the tip element.
It is worth pausing for a moment to reflect on the very narrow focus of this
document, which is about filtering a list of headlines. Creating documents
with a narrow focus on a specific topic is a key characteristic of global content
publishing. Once again, one of the main advantages of such an approach is
that these small chunks of information can be reused in multiple contexts.
For example, it is quite common for a software product to have a short Getting
Started guide, a longer user guide, and possibly an even longer administration
guide. Depending on the target audience(s), parts of these documents may be
common to all documents. Rather than creating monolithic documents, it is
therefore preferable to break these documents into smaller chunks (or topics)
with a view to using them more than once. Obviously, creating a large number of
chunks can lead to information management issues (e.g. is a chunk really suitable
for multiple contexts? Is the chunk management system powerful enough to
ensure that it is more efficient to look for an existing chunk instead of creating a new one?).
$ xsltproc -o doc.html /usr/share/xml/docbook/stylesheet/nwalsh/xhtml/docbook.xsl doc.xml
$ xsltproc -o doc.fo /usr/share/xml/docbook/stylesheet/nwalsh/fo/docbook.xsl doc.xml
Making portrait pages on USletter paper (8.5inx11in)
$ fop -pdf doc.pdf doc.fo
[Figure: the HTML and PDF outputs generated from the DocBook source, each showing the NBA4ALL Documentation title, a table of contents entry for "Filtering a list of headlines", the Limitations note ("It is currently not possible to search for multiple words.") and the Tip ("Both titles and descriptions are searched. Searching for generic words may return more results than originally thought.").]
Formats
It is also very important to make sure that locale-specific information (such as
dates and times) is handled correctly by an application. The Django framework
provides such functionality given that internationalization and localization are at
the core of this project’s philosophy.16 By activating a feature in the application’s
configuration settings, dates and times are subsequently displayed using the
format specified for the current locale. In our scenario, translations have yet to be
performed for most locales, but date-related strings appear in all languages when
the user selects a language from the gateway’s language list.17
When applications cannot rely on a framework, such as Django, to
provide internationalization support, they sometimes have to rely on the
internationalization of the platform on which the application is executed. For
example, a desktop application can leverage some of the settings offered by an
operating system (such as Windows or Linux). Dedicated resources also exist for
programming languages that do not handle standards such as Unicode by default
(e.g. those provided through the ICU project).18
Manipulating data in a range of languages is no trivial task. Most programming
languages provide core functionality to perform basic text manipulation tasks
regardless of the language being manipulated (e.g. extracting the first character
of a text string as demonstrated in Listing 2.4 in Section 2.4). However, more
advanced functionality will sometimes be limited to certain languages. For
example, let us consider sorting a list of text strings in alphabetical order based on
the first character of each string. If the function sort is limited to characters from
the English alphabet (a to z), this function will fail or return incorrect results for
languages that use accented characters or do not use any English character. Dealing
with this type of issue falls under the remit of internationalization engineering or
functional adaptation (rather than translation tasks), but it is worthwhile being
aware of them. In some cases, adding support for additional languages may require
some translation or adaptation tasks in which translators or language engineers
may be involved. This point will be covered in greater detail in Section 6.3.3.
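A small constructed example (assuming Python 2) shows the kind of behaviour that can surprise users:

# -*- coding: utf-8 -*-
words = [u"zèbre", u"école", u"abricot"]
print sorted(words)
#Prints [u'abricot', u'zèbre', u'école']: the default sort compares Unicode
#code points, so u'école' ends up after u'zèbre', which is not the order a
#French reader would expect.

Locale-aware collation (for example through Python's locale module or the ICU libraries mentioned above) is one way of addressing this type of issue.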
when the domain names were developed, they were seen as a tool to enable
the navigation of the network – to facilitate communication among the
network’s connected computers. They were not intended to communicate
anything in themselves. In the past fifteen years, however, TLDs and ccTLDs,
in particular, have, by their use and governance, constructed a space that
outwardly communicates cultural identities and values.
This is why the list of top-level domain names is now more than twice as
long as it was in the 1980s (with meaningful TLDs such as .works or .yokohama
being sponsored by specific entities).22 Even though some people will argue that
the ability to register internationalized top-level names is motivated by financial
considerations (i.e. in 2012, the initial price to apply for a new gTLD was
$185,000), one must admit that allowing non-ASCII characters in addresses is
long overdue.23 Since registering multiple domain names can be expensive, the
ISO codes are sometimes used as a prefix (e.g. http://de.mydummydomainname.com
or http://fr.mydummydomainname.com).
When looking at the strings in Figure 3.4, we can see that most of the
extracted strings, such as published or subheading do not appear in the actual
NBA4ALL application so they should not have been extracted (because they are
not translatable). To work around this problem, it is possible to look at the code
to check whether strings are translatable or ask the author of the application.26
Both solutions are time consuming compared to the one described below.
A typical software internationalization and localization workflow therefore
involves a number of steps:
1 marking translation strings in the source code
2 extracting them into a translation-ready format
3 translating them
4 compiling the resource containing the translated strings
5 loading the translated resources into the application.
Since the focus of this chapter is on internationalization, we will concentrate
on the first two steps in the present section and the next two sections. The
last three steps will be covered in Chapter 4. As mentioned earlier, the Django
framework makes it easy for Web developers to internationalize their applications
by marking text strings that require translations. Such marking is required in at
least two types of files: the Python code itself and the templates that are being
used to generate the final HTML pages.
In order to identify or mark translatable strings in the Python code of a Django
application, a special function is imported, translatable strings are prefixed using
an underscore character _ and wrapped within brackets, as shown in Listing 3.3.
This code snippet should look familiar to you by now. A special function
ugettext is imported on line 1 from the django.utils.translation
module. In order to avoid repeating the typing of ugettext in front of every
string (which would increase the size of the program), it is mapped to the
underscore character as a shortcut. The underscore character is used on line 6 to
wrap the text string assigned to the subheading variable (Your latest NBA headlines).
You may have noticed in this example, however, that two strings are not marked
with the underscore characters (headlines and published on line 3). These strings
are not translatable because they are used internally by the application to perform
specific tasks (that are invisible to the end-user of the application). Using the
ugettext function is therefore crucial in order to identify with confidence
strings that are translatable from strings that are not translatable (even though
they might look like they are).
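A minimal sketch that is consistent with this description (rather than an exact reproduction of Listing 3.3) is shown below:

from django.utils.translation import ugettext as _

def home(request, collection="headlines", direction=-1, key_sport="published"):
    #"headlines" and "published" are internal keys, so they are not marked
    subheading = _("Your latest NBA headlines.")
    #... the rest of the view builds and returns the HTTP response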
It is worth pointing out that some extra lines of code are required to
internationalize the application. By default (and based on what was presented
in the previous chapter), the variable subheading would simply be defined by
assigning the text string directly: subheading = "Your latest NBA headlines."
In the corresponding HTML template (Listing 3.4), the contents of some of the lines are not included in {% trans %} blocks. The string
on lines 5 and 11 is the name of the application (NBA4ALL), so in this scenario
it is deemed to be non-translatable. Obviously this decision is debatable because
application (or even brand) names are sometimes translated or adapted during
localization. Various approaches to handle this issue, which is specific to digital
content, are discussed by Baruch (2012). A special code block is also present on
line 23: {{ subheading }}. This block is used to insert the content of the
subheading variable that was defined in the Python code itself in Listing 3.3.
This example illustrates how Django’s templating system works. Variables present
in the HTML document (from Listing 3.4) can be replaced with content (e.g.
strings) defined in the Python code (Listing 3.3). This approach is extremely
popular in modern Web applications because it allows back-end developers (e.g.
Python developers) and front-end developers (e.g. HTML designers) to focus on
what they know and do best. When text gets created by two (or more) different
individuals, however, consistency issues may arise, which is why additional
internationalization techniques are presented in Section 3.2.4.
Variations of such an i18n and l10n workflow are possible depending on
the programming language or framework that is used to develop the source
application. For instance, rather than marking translation strings in the source
code and extracting them into a translation-ready format in a separate step, one
may decide to externalize translation strings directly into a strings-only file.
Listing 3.3 showed that the developer of an internationalized Django application
could still define source strings in the middle of source code. Other programming
languages and frameworks rely on separate files to completely isolate source
strings. Listing 3.5 shows how this could be achieved in the Python programming
language by slightly adapting the code shown in Listing 3.3.
1 # The source_strings.py file contains: subheading = "Your latest NBA headlines."
2 import source_strings
3
4 def home(request, collection="headlines", direction=-1, key_sport="published"):
5
6     print source_strings.subheading
In this adapted example, an external file (source_strings.py) is used to store all
strings, which can then be used by other parts of the program by (i) referencing
the external file (in this case, by importing the module on line 2 in Listing 3.5)
and (ii) accessing specific strings using arbitrary names (e.g. on line 6 with
source_strings.subheading).
This approach is commonly used in Windows applications that rely on the
.NET framework. In this framework, the external files containing translatable
source strings are called .RESX files (because the XML format is used to store
these resources). Similarly Java programs rely on properties files.27 It is also
possible to come across proprietary formats used by software publishers who have
decided not to rely on existing formats or could not do otherwise because the
language or framework did not provide a standard internationalization method.
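For instance, a Java properties file holding the same source string would contain
little more than key-value pairs along the following lines (a sketch, not taken
from a real application):
# messages_en.properties
subheading = Your latest NBA headlines.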
Deciding whether two steps should be used instead of one will largely depend
on the framework or programming language used during development. From a
translation perspective, there should be minimal impact on the actual translation
work, but it does not do any harm to know which upstream steps were used to
generate the file requiring some translation.
To some extent, these factors are not specific to the ICT sector, since similar
issues are often reported in the film industry (e.g. confidentiality). Regardless
of the motivation for not providing any context or comments to translators,
localization-specific issues are likely to occur, especially when the product is
large and the number of target languages is high. Such issues include mistranslations of
ambiguous source strings or truncated translated strings because of length issues.
These issues often have to be resolved during a localization quality assurance
step, but they could easily be avoided if more time was spent preparing the source
strings in the first place. Besides providing comments, other examples of source
string preparation include avoiding string concatenation, using meaningful
variable names (as discussed in Section 2.2), and paying special attention to the
way the plural form is generated.
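As far as plural forms are concerned, the gettext family of tools provides an
ngettext function that lets the translated resource, rather than the developer,
decide how each plural form is phrased. A minimal sketch using Python's standard
gettext module (the strings are illustrative):
from gettext import ngettext

count = 3
# ngettext() picks the singular or plural msgid based on count; once a
# translation is loaded, the target language's own plural rules apply.
message = ngettext("%(num)d headline", "%(num)d headlines", count) % {"num": count}
print(message)  # 3 headlines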
The topic of string concatenation was introduced in Section 2.4.1. This
technique can be very appealing to application developers because it means
that they have less code to type. It is easy to see why this approach can lead to
significant translation problems where languages whose word order differs from
the source language are concerned. In the example in Listing 3.6, three short
strings defined on line 2 are combined into a single string on line 4.
This approach may be tempting from a reuse perspective, because if the
topic of the application was changed from basketball to American football
(i.e. from NBA to NFL), two strings might be reusable in English (Your latest
and headlines). Similarly, if the application also contained a section on tweets,
it might be possible to reuse Your latest and NBA to form the string Your latest
NBA tweets. However, major problems are bound to happen either during the
1 # In:
2 first, second, third = "Your latest", "NBA", "headlines"
3 # In:
4 subheading = "%s %s %s." % (first, second, third)
5 # In:
6 print subheading
7 # Out:
8 # Your latest NBA headlines.
9 # In:
10 first, second, third = "Vos derniers", "NBA", "titres"
11 # In:
12 subheading = "%s %s %s." % (first, second, third)
13 # In:
14 print subheading
15 # Out:
16 # Vos derniers NBA titres.
17 # In:
18 subheading = "%(first)s %(third)s %(second)s." % {"first": first, "second": second, "third": third}
19 # In:
20 print subheading
21 # Out:
22 # Vos derniers titres NBA.
The rule in this example makes use of a character class. The character class (defined with the opening
and closing brackets) contains an initial caret character (^) which negates the
following characters (i.e. the * and + characters) to express the fact that any
character but the * and + character is allowed. While this rule is human-readable,
its target user is likely to be a program (e.g. a translation program) configured to
check that this rule is adhered to by entities manipulating this document (e.g. a
human translator during a translation step in a localization workflow).
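To give a rough idea of what such a check involves, the following Python sketch
applies the same character class (the sample strings are invented):
import re

# Allow any character except "*" and "+", as in the rule discussed above.
allowed = re.compile(r"^[^*+]*$")

for text in ["Your latest NBA headlines.", "Press * to continue"]:
    if allowed.match(text):
        print(text + ": allowed")
    else:
        print(text + ": contains a disallowed character")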
While the present section focused on the structure of documents, the following
section presents some content internationalization principles from a stylistic
perspective.
Language checkers
A CL checker may be defined as an application designed to flag linguistic
structures that do not comply with a predefined list of formalized CL rules.
Traditionally, most checkers have operated at a sentence level. For instance,
Clémencin (1996: 34) states that ‘the EUROCASTLE checker works at the
sentence level and has very little knowledge of the context.’ This can obviously
be problematic if some of the structures to identify include phenomena such as
anaphora (which may require resolution at the paragraph or document level).
Simpler programs, also known as proofreading or style checking programs, can
also be extended to perform some controlled language rule checks. The open-
source LanguageTool program falls into this category.43 It is defined by its author
as proofreading software that ‘finds many errors that a simple spell checker
cannot detect and several grammar problems.’ This tool is available in multiple
forms, ranging from an extension for open-source word processing programs
such as OpenOffice.org and LibreOffice to a standalone application. It checks
plain text content in a number of languages and detects text patterns (using a
number of techniques including regular expressions). Most rules are defined
in an XML format so that they can be easily edited and refined by end-users.
More complex rules can also be created using the Java programming language.
Figure 3.6 shows both input text and the results of the check in LanguageTool’s
graphical interface.
The results of the check (i.e. the rule violations) are reported by LanguageTool
in the bottom part of the user interface. Once the text present in the top part of
the interface is checked by the program, results are reported to the user including:
1 The position of the character that violates a particular rule. For example, the
first problem appears on line 8 in column 1 (where column means the first
character of the line).
2 The description of the rule (e.g. Sentence is over 40 words long, consider
revising).
3 Some context, including all the words that match this particular rule,
previous characters and following characters.
This example shows that the detection rules have descriptions that read like
suggestions. It is therefore down to the user to decide whether implementing the
change will improve the overall quality of their text. It must be said, however,
that some of the rules can sometimes over-generate by triggering in contexts that
are perfectly legitimate (these are known as false alarms). When the precision of
the rule is too low, the rule can even become a source of frustration, which is why
it is sometimes possible to disable (or deactivate) a particular rule. The opposite
scenario is also possible. When a rule does not trigger in a context where it would
be expected to trigger, the rule is said to lack perfect recall. This
can be explained by a number of reasons: since rules tend to be created by people,
it is possible that these people have not thought of all possible combinations a
rule should cover. Another reason concerns the tools and resources that are being
used to power the checking procedure. In the example above, some of the rules
are more complex than others. For instance, the rule that detects a series of three
nouns has to rely on an external tool to determine what a noun is. Such a tool is
known as a part-of-speech tagger since it assigns a part-of-speech (POS) to each
word (or token) in a particular segment. LanguageTool allows users to assign POS
tags to their input text and to see the results of this process in the bottom part of
the interface, as shown in Figure 3.7.
Each word from the input text is followed by bracketed information (including
possible dictionary forms and part-of-speech tags separated with a / character).
In the example above, we can see that the sequence basketball news application
is detected as a series of three nouns in Figure 3.6 while the sequence basketball
news list is not. This is due to the fact that the word list is ambiguous and can
be assigned a verb part-of-speech in certain contexts. This is confirmed by the
output shown in Figure 3.7, where list was tagged as a noun (with the NN tag) but
also as a verb (with the tags VB VBP). Because of this ambiguity the rule did not
trigger in this particular context. This example shows that the right balance must
be found between false alarms and silence in order to make sure that the expected
benefits of using a language checker are obtained (e.g. improving the text’s
readability or machine translatability). Some of the tasks presented in the next
section focus on addressing specific problems associated with language checking
rules (e.g. evaluating the impact of source modifications on translation quality).
3.4 Tasks
This section contains three basic tasks and three advanced tasks:
Once you have made your modifications, you could run the xgettext tool
using the following command in a Linux environment:
xgettext -c secret3.py
$ xgettext -a secret3.py
secret3.py:36: warning: ’msgid’ format string with unnamed arguments cannot be
properly localized:
The translator cannot reorder the arguments.
Please consider using a format string with named arguments,
and a mapping instead of a tuple for the arguments.
secret3.py:47: warning: ’msgid’ format string with unnamed arguments cannot be
properly localized:
The translator cannot reorder the arguments.
Please consider using a format string with named arguments,
and a mapping instead of a tuple for the arguments.
#. TRANSLATORS: This question asks the user to pick a number and press a key
#: secret3.py:39
#, python-format
msgid "Guess the number between 0 and 7,(max_number)s and press ‘
/.(key)s."
msgstr ""
#. TRANSLATORS: This string tells the user that they have found the number after a certain number of attempts
#: secret3.py:50
#, python-format
msgid ""
"You’ve found ’"/,(secret_number)d’ in "/.(attempts)d attempts! Congratulations!"
msgstr ""
This command should create a messages.po file in your working directory (where
the -c parameter instructs the program to extract comments preceding lines with
translatable strings). You can then open this file using a text or dedicated editor
to check that all strings have been extracted alongside their comments. Ideally it
should look more or less like the solution file from Listing 3.11.
To check some text, you can type your own text in the top window of the
standalone program or in the demo window of the online application. If you
use the standalone program, you can check text files such as the solution file
provided for the previous exercise: udoc.out.46 If you use the online version,
you can copy and paste the content of this solution file into the demo window.
Take some time to edit your input text based on the problems identified by
LanguageTool. While doing so, you should ask yourself whether some of your
changes may lead to new problems if you triggered another check. If that was
the case, what should you do?
Towards the end of the chapter, some natural language processing concepts
(such as part-of-speech tagging) were briefly mentioned. Further information on
this topic can be found in Bird et al. (2009) or Perkins (2010) with examples
using the Python programming language.
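For instance, part-of-speech tagging can be performed with NLTK in a few lines
(the model names and the exact tags returned depend on the NLTK version installed):
import nltk

# One-off downloads of the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The basketball news list is updated daily.")
print(nltk.pos_tag(tokens))
# Possible output: [('The', 'DT'), ('basketball', 'NN'), ('news', 'NN'),
#                   ('list', 'NN'), ('is', 'VBZ'), ...]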
Notes
1 https://msdn.microsoft.com/en-us/library/ekyft91f(v=vs.90).aspx
2 http://app1.localizingapps.com
3 http://jquerymobile.com/
4 https://www.djangoproject.com/
5 https://github.com/
6 https://bitbucket.org/
7 http://www.madcapsoftware.com
8 http://www.adobe.com/ie/products/robohelp.html
9 http://idpf.org/epub
10 http://xmlsoft.org/XSLT/xsltproc2.html
11 https://help.ubuntu.com/community/DocBook#DocBook_to_PDF
12 http://sourceforge.net/projects/docbook/
13 http://www.w3.org/wiki/Its0504ReqKeyDefinitions
14 http://www.w3.org/International/questions/qa-choosing-encodings#useunicode
15 http://support.apple.com/kb/HT4288
16 https://docs.djangoproject.com/en/1.7/topics/i18n/formatting#overview
17 Similar benefits can be achieved for traditional Python programs using the Babel
internationalization library: http://babel.pocoo.org/
18 http://site.icu-project.org
19 A’ Design Award & Competition, Onur Müştak Çobanlɪ and Farhat Datta: http://
www.languageicon.org/
20 http://www.w3.org/TR/i18n-html-tech-lang#ri20040808.173208643
21 https://www.iso.org/obp/ui/#search
22 https://www.iana.org/domains/root/db
23 http://en.wikipedia.org/wiki/Generic_top-level_domain#Expansion_of_gTLDs
24 https://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html
25 http://virtaal.translatehouse.org/
26 http://www.framasoft.net/IMG/pdf/tutoriel_python_i18n.pdf
27 http://docs.oracle.com/javase/tutorial/i18n/intro/steps.html
28 http://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms
29 https://docs.djangoproject.com/en/1.7/topics/i18n/translation/#pluralization
30 http://translate.sourceforge.net/wiki/l10n/pluralforms
31 http://msdn.microsoft.com/en-us/library/aa292178(v=vs.71).aspx
32 https://launchpad.net/fakelion
33 http://www.w3.org/International/articles/article-text-size
34 http://www.w3.org/
35 http://www.w3.org/TR/html-alt-techniques#sec4
36 http://www.w3.org/TR/UNDERSTANDING-WCAG20/visual-audio-contrast-text-
presentation.html
37 http://www.whatwg.org/specs/web-apps/current-work/multipage/the-video-element.
html#the-track-element
38 http://html5videoguide.net/code_c9_3.html
39 http://www.w3.org/TR/its20/
40 http://www.w3.org/TR/its20#potential-users
41 http://www.w3.org/TR/its20#datacategory-description
42 http://www.w3.org/TR/2013/REC-its20-20131029/examples/xml/EX-allowedCharacters-global-1.xml Copyright © [20131029] World Wide Web
Consortium, (Massachusetts Institute of Technology, European Research Consortium
for Informatics and Mathematics, Keio University, Beihang). All Rights Reserved.
http://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231
43 http://languagetool.org/
44 http://languagetool.org/languages/
45 https://www.languagetool.org/
46 Accessible from the book’s companion site.
47 Accessible from the book’s companion site.
48 http://www.accept-portal.unige.ch
49 http://itranslate4.eu
50 https://translate.google.com/
51 http://www.bing.com/translator/
52 http://www.diffchecker.com
53 http://languagetool.org/ruleeditor/
54 http://www.unicode.org/conference/about-conf.html
55 http://www.slideshare.net/YamagataEurope/dita-translatability-best-practices
56 http://code.google.com/p/pseudolocalization-tool
57 http://onlamp.com/pub/a/php/2002/11/28/php_i18n.html
58 http://help.transifex.com/features/formats.html
59 http://www.localisation.ie/resources/courses/summerschools/2012/WindowsPhone
Localisation.pdf
4 Localization basics
4.1 Introduction
Various steps in traditional globalization workflows were introduced in the
previous chapter, specifically in Section 3.2.3. The word traditional is used here
to refer to proven and scalable workflows, which have been used extensively by
multiple companies for the publishing of localized products in multiple languages.
An example of such workflow is shown in Figure 4.1.
The steps of this workflow include the internationalized creation
(including the marking) of source content (be it software strings or structured
documentation), the possible extraction of this content into a format that can
be easily translated into one or multiple target languages, the actual translation
of the content, the merging of the translated content back into the original
file(s) and finally some post-processing (including quality assurance testing) to
make sure that no problems were introduced during any of the previous steps.
Since internationalization was covered in the previous chapter, the present
chapter focuses on all of the localization-related steps: extraction, translation,
merging, building and testing, when applied to various content types pertaining
to an application’s ecosystem, including software content, user assistance and
information content. This chapter focuses on localization steps and processes
rather than on the translation technology tools that may be used to perform or
support the actual translation task, which will be the focus of Chapter 5.
Figure 4.1 A traditional localization workflow: I18N (1 Create); Localization (2 Extract, 3 Translate, 4 Merge, 5 Build, 6 Test)
4.2.1 Extraction
By default the xgettext string extraction tool introduced in Section 3.2.3
generates a catalog file using the PO format for each file containing source code
or templates. This approach can be quite cumbersome for projects containing a
large number of files so it is often preferable to group all translatable strings into
a single package. This grouping can be achieved very easily with the Django
framework thanks to the makemessages tool, which examines every file from
a project and extracts any string that is marked for translation.3 While doing so,
it creates or updates a catalog file in a specific directory, specifically the
locale/language_code/LC_MESSAGES directory where language_code corresponds to
the language code of a particular locale (say de for German). Once catalog files
have been created, they are ready to be translated as discussed in the next section.
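Assuming a project whose strings have already been marked for translation, the
commands involved might look as follows (the language code is illustrative):
$ django-admin makemessages -l de
$ ls locale/de/LC_MESSAGES
django.po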
• placeholders
• markers for hotkeys
• HTML fragments
• tone
• abbreviations
• terminology.
This slightly modified example shows that the image element has been enhanced
for accessibility reasons by adding alternative text as suggested in Section 3.3.1.
In this example, the value of the alt attribute must be translated to make sense
in the target language. A similar approach has to be adopted for the values of
href attributes in hyperlink or a elements.11 In the example above, should this
text be translated into Spanish, one could consider replacing the href=“http://
dummysource.com” part with href=“http://dummysource.es” so that users are
directly brought to a relevant section of the target site (i.e. without requiring
them to make an extra selection in the global gateway of the target site). Whether
such a replacement is necessary or desirable should be clearly indicated in the
translation guidelines.
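As a simplified sketch (the URL, file name and text are illustrative), the kind of
fragment being discussed might look like this:
<p>
  <a href="http://dummysource.com/headlines">
    <img src="logo.png" alt="Your latest NBA headlines"/>
  </a>
</p>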
Tone is also a common area of focus in software strings’ translation guidelines
since the target user must be addressed in a consistent manner that corresponds
to their expectations. Formality levels may vary from one language to another (or
from one application to another). For instance, the Spanish translation guidelines
used at Twitter advocate referring to ‘the user as tú, not vos or usted, (…) [k]
eep[ing] the tone informal, but [without] us[ing] local or regional slang or words
that may not mean the same in all countries.’12 On the other hand, the German
translation guidelines for the Microsoft Windows phone platform recommend a
style that is both direct and personal: ‘For German, the formal second person is
to be used (Sie instead of du), as the target audience prefers to be addressed in a
formal, professional way and is not likely to want to see du all over their mobile
phone.’13
Abbreviations are especially relevant as far as mobile applications are
concerned as space constraints may require specific strings to be shortened
during the translation process. Official abbreviated forms may therefore be
provided in translation guidelines. Finally, specific guidance is likely to be
provided for application- or domain-specific terminology, be it in a specific
section of the guidelines or as a terminology glossary. While further discussion
will be provided on this topic in Section 5.4, it is worth keeping in mind that
for specific applications, technical accuracy is one, if not the most, important
characteristic of the translation process so that the user is able to navigate and use
the application as it was originally intended by the developer(s). For this reason,
Dr. International (2003: 325) reminds us that ‘without some in-depth knowledge
of the product, a localizer won’t be able to make sense of the source text, and thus
won’t be able to translate the text accurately in the target language.’ Obviously
some applications are much more technical than others, so advanced technical
skills are not always critical.
$ ls locale
de/LC_MESSAGES:
django.mo  django.po
es/LC_MESSAGES:
django.mo  django.po
fr/LC_MESSAGES:
django.mo  django.po
hotkeys or shortcut keys that are often used in desktop applications (rather than
mobile applications where touch input is favoured). Such keys are associated with
specific word letters (e.g. F from File so that a file menu can be accessed using a
mnemonic key combination such as ALT + F instead of using the mouse to click
the menu). Hotkeys can be expressed in various different ways depending on
the programming language and graphical user interface (GUI) framework used.
For instance, some applications tend to rely on the & or _ character in a string
to indicate that the following letter is a hotkey (e.g. &File). From a translation
perspective, this character needs to be preserved in the target language by making
sure that conflicts do not occur. The developer who creates strings in the source
language has to make sure that hotkey letters do not get duplicated. For instance,
if an application contains an Actions menu and an About menu, two hotkey
letters must be identified, as shown in Figure 4.2.
The minimalistic application presented in Figure 4.2 is written in Python using
the TkInter graphical user interface toolkit (which is one of the toolkits used to
build portable Python desktop applications). Even though it is very simple, this
example shows how hotkeys can be associated with different word letters in each of
the menus. To accomplish this disambiguation and avoid conflicts, the position of
specific word letters has to be specified in the source code as shown in Listing 4.2.
The code shown in Listing 4.2 is more complex than any of the examples
provided up to now in this book so some parts may be difficult to understand. This
code is presented, however, so that issues can be avoided during the translation
process itself. The lines of interest are lines 11, 12, 13 and 14. Two of these lines
are comments that indicate to translators the positions of the hotkeys in the
menu strings. Even though these positions are determined by the values of the
underline parameters on lines 12 and 14 (i.e. 0 and 1), these positions would not
be accessible to a translator who does not have access to the source code. This is
confirmed when extracting translatable strings and generating a messages.po file
as shown in Listing 4.3.
1 import Tkinter
2 import sys
3 from gettext import gettext as _
4
5 class App(Tkinter.Tk):
6     def __init__(self):
7         Tkinter.Tk.__init__(self)
8         menu_bar = Tkinter.Menu(self)
9         file_menu = Tkinter.Menu(menu_bar, tearoff=False)
10         file_menu2 = Tkinter.Menu(menu_bar, tearoff=False)
11         # Translators: hotkey is on first letter
12         menu_bar.add_cascade(label=_("Actions"), underline=0, menu=file_menu)
13         # Translators: hotkey is on second letter
14         menu_bar.add_cascade(label=_("About"), underline=1, menu=file_menu2)
15         file_menu.add_command(label="Quit", command=quit, accelerator="Ctrl+Q")
16         file_menu2.add_command(label="Exit", command=quit, accelerator="Ctrl+E")
17         self.config(menu=menu_bar)
18
19         self.bind_all("<Control-q>", self.quit)
20         self.bind_all("<Control-e>", self.quit)
21
22     def quit(self, event):
23         print "See you soon!"
24         sys.exit(0)
25
26 if __name__ == "__main__":
27     app = App()
28     app.title("Hotkeys")
29     app.mainloop()
$ xgettext -c tk.py
$ tail messages.po
#. Translators: hotkey is on first letter
#: tk.py:12
msgid "Actions"
msgstr ""
The first two lines in Listing 4.3 show the commands that are used to (i)
generate the messages.po file and (ii) view its content using the Linux tail
command so that only the last ten lines of the file are displayed. This file
contains two strings to translate with the position constraint clearly indicated
in the comments. In this particular example, a translator would have to come
up with two translations by taking into account the fixed position of the hotkey.
Finding an acceptable translation could, of course, become challenging if the
hotkey was in a position that could not be reached by the target string. This is
specifically the type of problem that would appear when testing the application,
as described in the next section. In conclusion, best practices would suggest that
it is the responsibility of the developer to clearly indicate how hotkeys should
be handled by the translators, who should in turn make sure that they follow
the recommendations that are provided either in comments or guidelines. More
complex situations can emerge as shown in lines 15, 16, 19 and 20 in Listing 4.2.
These lines are currently not marked for translation, which is why they do not
appear in Listing 4.3. A close inspection, however, reveals that they do contain
translatable strings on lines 15 and 16 (Quit, Exit, Ctrl+Q and Ctrl+E). Such
strings are associated with a different type of key combination. Whereas the
first example used the position of specific characters, these strings rely on the
values passed on lines 19 and 20, specifically “<Control-q>” and “<Control-e>”.
When these key combinations are used, the program is exited. In this example,
the translation process is slightly more challenging. As in the first example the
mnemonic association should ideally be preserved in the target language by
avoiding conflicts. However, there is an additional constraint since the chosen
key combination must be supported by the GUI toolkit and the environment
of target end-users. If a special key is chosen during the translation process (e.g.
a key corresponding to an accented character), problems may arise if the GUI
toolkit cannot process such keys (because it has not been fully internationalized)
or if one of the target end-users uses a different keyboard from the one used by the
translator. Again, this type of problem can be detected during a quality assurance
testing step, which is the focus of the next section.
4.2.4 Testing
When the architecture of an application does not follow internationalization
principles or best practices, unexpected problems are likely to arise during the
quality assurance process (assuming a quality assurance process is in place in
the global delivery workflow). A localization quality assurance step can also
be referred to as localization testing because it may not be sufficient to check
that translated text displays correctly in a localized application. The quality
assurance process can be broken down into several areas, including functional
testing, compliance testing, compatibility testing, load testing and localization
testing. Actually separating localization testing from other testing types can
be misleading because every aspect of an application (be it its functionality,
compliance with norms or standards, or integration with other applications) may
be impacted by the localization process. For instance, some core functionality
may be adapted during the localization process, as explained in Section 6.3.3.
Whenever an application undergoes such adaptation, additional
testing is required. Compliance with norms or standards may also be impacted by
the localization process because some norms may be locale-specific. Finally, the
integration with third-party services requires specialized testing when third-party
services exhibit specific characteristics. For example, an application integrating
with online banking systems may require various testing configurations depending
on the countries where the banking systems are located. Examples of tests to be
performed on a localized application may include the following:
In the example used earlier in this chapter, the NBA4ALL application would
have to be tested using a combination of operating systems and Web browsers to
ensure that the core functionality is working regardless of the combination used.
For instance, the language list should appear whenever a user clicks or touches
the language icon. The fact that the user is using a localized operating system or
Web browser should not affect this core functionality. Other types of checks are
related to user input and output. For instance the NBA4ALL application allows
users to filter items based on keywords so that any character provided by the
user should be handled correctly regardless of the language used. Testing for all
of these potential issues can be time-consuming, especially if the application is
being localized into multiple languages and if multiple updates to the source code
happen during a project as explained in Section 4.2.6.
While these aspects are crucial in releasing truly global applications,
translation-related testing often focuses on checking on the display of translated
text. As mentioned in Section 3.2.4, problems resulting from string concatenation
and expansion (or worse, lack of translation) have an immediate negative visual
impact, so it is easy to fix these first. But very often, more fundamental problems
may exist (e.g. wrong text direction for languages such as Arabic or Hebrew)
and such problems can truly affect the end-user’s experience. Assigning severity
levels to all of these issues is therefore an integral part of the localization quality
assurance process. In order to solve such problems, various types of testing can
be used, ranging from manual to fully automated. Manual testing involves going
through the various screens or pages of an application to check that translated
text displays correctly and that it is not misleading for an end-user. After all, the
translation process may have occurred in a context-free environment, so it is not
unusual to find mistranslations in localized applications, especially when strings are
short and ambiguous (e.g. does the string Share drive refer to the sharing of a drive
or to a drive containing a share?). To work around translation issues originating
from a lack of context, an alternative localization approach will be presented
in Section 4.2.8. A manual testing step may also include functionality-related
checks to ensure that the application behaves according to local conventions.
For instance, if one of the application’s screens allows the user to sort some
information (e.g. in a tabular format), then the sorted results should correspond
to what’s expected in the target locale (i.e. the order should not necessarily be the
same as the one used in the source locale). Obviously such manual tests are prone
to error and extremely tedious (especially when the application changes very
often), so it is common to resort to semi-automated or fully automated testing
procedures to verify the functionality and display of localized applications. An
example of a tool that can be used to automate this process is Huxley.14 This
tool can be used to automatically monitor browser activity, taking screenshots for
each visited page and informing the user when these pages change. This means
testing can be performed on subsets of an application instead of re-testing an
application from scratch every time a new build is available. Another cloud-
based service that can be used to automate testing on multiple combinations of
platforms and Web browsers is the service offered by Saucelabs.15
Solving clipped text problems related to string expansion can be achieved
in a number of ways. As mentioned in Section 3.2.4, the best way to avoid this
type of problem is to use a responsive format that does not use fixed dimensions
in the source. If this is not possible, translations may have to be shortened by
possibly using abbreviated forms. Another option is to use custom layouts for
the target languages that require longer or shorter strings. While some European
languages will be prone to string expansion (e.g. French and German when the
source language is English), some Asian languages (e.g. Chinese) tend to be more
compact so using a one-size-fits-all approach is often sub-optimal. Custom layouts
can be created by resizing some of the User Interface elements, an approach which
was made very popular by dedicated localization tools, namely Alchemy Catalyst
and SDL Passolo, when access to the entire source code or binary is possible as
explained in the next section.
4.2.7 Automation
To conclude this section, it is worth highlighting that most of these steps are
often automated in localization workflows. Having to manually create a set of
catalog files by running a given command or having to merge translated files
into master files by running another command can be a tedious, error-prone
process. For these reasons, these steps are often automated using programs or
scripts, which can be either scheduled on a regular basis (say every day at a given
time) or triggered when a specific action occurs. For instance, it is very common
to link the activities taking place within a version control system (used by
developers) to those of an online translation management system. One possible
way to fully automate this sequence of actions would be to set up the execution
of a script (for example, Django’s makemessages tool) every time changes are
validated (or committed) in the version control system used to manage the source
files of a global application. This script could also validate that all files have been
successfully generated and upload them to an online translation management
system. Another script could then monitor this translation management system
at regular intervals to check whether new translations are available, and if they
are, download the translated files and execute the compilemessages tool to
make them available to the application. Multiple variations of this set-up are
possible, but the key point is that manual touch points can easily be avoided
to speed up the localization process. A variation of this approach is to have the
translation management system monitor the version control system to detect any
file change. When a file change is detected, the translation management system
can automatically update the translation projects containing those source strings
that have been modified. Translators who usually work on these projects can then
be automatically notified that new translations are required.
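A bare-bones version of such an automation script might look like the following
sketch, where the tms command stands in for whichever client the chosen
translation management system actually provides (it is not a real tool):
import subprocess

def push_new_source_strings():
    # Regenerate the catalog files for every configured language.
    subprocess.check_call(["django-admin", "makemessages", "--all"])
    # Upload the updated .po files ("tms push" is a placeholder command).
    subprocess.check_call(["tms", "push", "locale/"])

def pull_finished_translations():
    # Download newly completed translations ("tms pull" is again a placeholder).
    subprocess.check_call(["tms", "pull", "locale/"])
    # Compile the .po files into .mo files so the application can load them.
    subprocess.check_call(["django-admin", "compilemessages"])

if __name__ == "__main__":
    push_new_source_strings()
    pull_finished_translations()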
A final possibility is to manage the localization workflow directly from the
environment where code is developed. This is the approach taken by localization
providers such as Get Localization or Microsoft, which offer tools to keep the
files containing translatable strings synchronized with the translated files.17, 18
Such tools give developers the ability to automatically upload resources requiring
translation to an online localization project repository. This approach presents the
advantage of eliminating a number of steps and stakeholders between developers
and translators, but may introduce unnecessary translations if the source strings
are changed regularly before an actual product release.
4.2.8 In-context localization
The previous sections have focused on a rather sequential localization model,
whereby source strings are extracted, translated and then merged back into target
resources. While this model offers certain benefits, such as scale, it also has flaws.
The first one is that many stakeholders are involved in the process, which means
that problems can occur at various points, especially if a strong quality assurance
component is not in place. The second flaw, possibly the most serious one, is that
translation often occurs out-of-context, which means that the final linguistic
quality of the content may not match customers’ expectations. Obviously some
of these linguistic problems can be resolved by having a quality assurance process
in place as well as flexible translators (who may have to re-translate strings that
have been mistranslated), but this is not as efficient as getting good translations
from the start. To work around this problem specifically for Web applications,
a new model has recently emerged whereby translatable source strings are
extracted from the rendered pages of an application using techniques such as CSS
selectors or XPath expressions (Alabau and Leiva 2014: 153). These extraction
techniques, which can be described as surface techniques, may then be coupled
with just-in-time or in-context translation tools.
For example, the Mozilla Foundation launched a project called Pontoon,
which is a Web-based, What-You-See-Is-What-You-Get (WYSIWYG)
localization (l10n) tool that can be used to localize Web content. This project is
based on open-source tools such as gettext and offers translators the possibility
to translate strings by looking at the Web page containing these strings. The
project offers an online demo site, where users can provide test translations for a
simple page.19 Figure 4.3 shows how a Web page can be split into two parts: the
content part at the top and the translation toolbar at the bottom.
The translation toolbar offers several features, including the ability to use
machine translation and translation memory tools as well as the suggestions
from other users. This toolbar can be easily minimized to navigate to parts of
the page that have yet to be translated. While the toolbar can be useful to use
external tools, it does not inform translators about potential layout problems
that may result from their translations. This is where the interactive translation
functionality of Pontoon comes in. Pontoon leverages the power of HTML5,
which can easily transform any read-only element into an editable one using the
contenteditable content attribute.20 Figure 4.4 shows how a textual page element
can be clicked, edited and saved. Once the text is saved, the page displays the
updated text, which may reveal some layout issues, e.g. the text does not fit
into the original element. At this point, the translator may decide to find an
alternative, shorter translation or report the problem to the developer in order to
try to have them increase the element’s size.
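As a minimal illustration (the element and its content are invented), making a
heading editable in place only requires setting this attribute:
<h1 contenteditable="true">Your latest NBA headlines.</h1>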
More information on how to accomplish several tasks (such as publishing the
translation results) can be found on the project’s Web site.21 It remains to be
seen whether this in-context localization model will prove as successful in the
long term as it was for the localization of desktop applications as discussed in
Section 4.2.5.
To conclude this section, it is worth highlighting that it is now more and
more difficult to differentiate software strings from user assistance content since
user assistance content is sometimes embedded in the application itself. This is
specifically the case for Web applications, which can rely on graphical elements
such as tooltips or pop-ups to provide context-sensitive help. It is also possible to
include getting started information the first time an application is started so that the
user is given a quick tour of the main features of an application.
4.3.2 Segmentation
Regardless of the ultimate goal sought when creating a translation kit, the
segmentation step is extremely important. The segmentation step is used to break
the original content into smaller chunks so that the reuse from a translation
memory becomes more effective. Another role of the segmentation step is to
ensure that translatable elements are identified, by possibly relying on pre-defined
or custom filters.32 Depending on the final translation kit format used, however,
segmentation may not be consistent. For instance, at the time of writing Rainbow
did not support the segmentation of source content into PO packages.33
As explained in the next two sections, the splitting or segmenting of the
original source content can have a profound impact both on the translation
process and translation leveraging process. This means that special attention
must be paid to the way the segmentation is performed. While a naive approach
to segmentation for languages such as English would consist of using punctuation
marks such as full stops or exclamation marks followed by a space, problems
arise with abbreviations (such as Dr.) or unusual product names containing
punctuation marks (such as Yahoo!). Sentence segmentation is therefore often
dependent on the type of text that is being translated, and custom rules are often
required to adapt existing rules to new text types.
Segmentation rules can be created in a number of ways, either by using a data-
driven approach or using a rules-based approach. An example of a data-driven
approach is presented by Bird et al. (2009), whereby sentence segmentation is
handled as a classification task for punctuation. Whenever a character that could
possibly end a sentence is encountered, such as a full stop or a question mark,
a decision is made as to whether it terminates the preceding sentence.34 This
approach relies on a corpus of already segmented texts, from which characteristics
(or features) are extracted. Such characteristics may include information such as
the token preceding a given token in a sentence, whether the token following
a given token in a sentence is capitalized or not, or whether the given token is
actually a punctuation character. These features are then used to label all of the
characters that could act as sentence delimiters. Once a labelled feature set is
available, a classifier can be created to determine whether a character is likely
to be a sentence delimiter in a given context. This classifier can then be used to
segment new texts.
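A pre-trained model built along these lines ships with the NLTK toolkit; the
following sketch assumes NLTK is installed and downloads the English Punkt model
on first use (the sample text is invented):
import nltk
nltk.download("punkt")  # one-off download of the pre-trained Punkt model
from nltk.tokenize import sent_tokenize

text = "Dr. Smith updated the NBA4ALL app. Yahoo! covered the story."
print(sent_tokenize(text))
# "Dr." should not trigger a break; "Yahoo!" may well do so, which is exactly
# why custom rules are sometimes needed for new text types.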
Segmentation rules can also be manually defined using the SRX Segmentation
Rule eXchange Standard, which is an XML-based standard that allows for the
exchange of rules from one system to another.35 SRX is defined in two parts:
a specification of the segmentation rules that are applicable for each language,
represented by the languagerules element, and a specification of how the
segmentation rules are applied to each language, represented by the maprules
element. Using this standard, two types of rules can be defined: rules that identify
characters that indicate a segmentation break and rules that indicate exceptions.
For example, one breaking rule can be defined to identify a full stop followed by
any number of spaces and a non-breaking rule can be defined to list a number of
abbreviated words. Examples of such rules are presented in Figure 4.5, where the
Okapi Ratel program is used to create and test rules.36
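In SRX, breaking and non-breaking rules are expressed as pairs of beforebreak/
afterbreak patterns. A heavily simplified sketch for English (only a fragment, not
a complete SRX document) might read:
<languagerule languagerulename="English">
  <!-- Exception: no break after common abbreviations such as "Dr." -->
  <rule break="no">
    <beforebreak>\b(Dr|Mr|Mrs|Prof)\.</beforebreak>
    <afterbreak>\s</afterbreak>
  </rule>
  <!-- Break after a full stop followed by one or more spaces -->
  <rule break="yes">
    <beforebreak>\.</beforebreak>
    <afterbreak>\s+</afterbreak>
  </rule>
</languagerule>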
SRX rules can be created using regular expressions conforming to the ICU
(International Components for Unicode) syntax.37 Rules will obviously vary from
one natural language to another. While a common set of rules may be reused for
certain languages, language-specific exception rules will be required. In Figure 4.5,
the English example shows how the default segmentation rules provided by the
102 Localization basics
• capitalization
• spacing
• punctuation
• tone (formal versus familiar)
• voice (active versus passive; direct versus indirect)
• gender and articles (especially for loan words)
• compounding
• terminology.
Most of these categories are self-explanatory, so the best advice for translators
is to get familiar with the guidelines that have been defined by the translation
requester or buyer. In some situations, these guidelines are also referred to as best-
practices that may have been shaped by a community of translators. For instance,
specific conventions for the translation of Mozilla support documentation into
French include choices for using the imperative in lists of steps and the infinitive
in headings.40 It should be stressed, however, that the amount of reference
materials provided to translators can sometimes be overwhelming, especially
if some conflicts occur. For instance, Microsoft (2011: 7) lists four normative
sources to consult from a spelling and grammar perspective, advising that ‘[w]hen
more than one solution is allowed in these sources, [translators should] look for
the recommended one in other parts of the Style Guide.’ This problem can be
compounded by the fact that additional translation materials (such as translation
memory segments) provided to translators may be at odds with such guidelines.
Knowing what usage to adopt in a specific project can therefore be challenging so
checking with the translation requester is usually recommended. Locale-specific
guidelines can also be used to tone down the meaning of the source text. For
instance, Microsoft (2011: 40) warns translators that ‘[a]bsolute expressions
leaving no room for exceptions or failure, like solves all issues, fully secure, at any
time are a serious legal risk on the French, Canadian, and Belgian markets.’
4.3.6 Testing
In the same way that translated software strings require functional and visual
testing once they are merged back into an application’s code base, translated
user assistance content must be validated and tested to ensure that no problems
were introduced during the translation process (e.g. deleting important XML
elements). Examples of validation include checking that the translated files
are properly encoded, that they are well-formed and can be used to render the
final documents. Tools such as Rainbow can assist with such checks, for instance
with the validation of XML files.41 Additional testing will also be required if the
final documentation contains special components, such as an index or a search
functionality as described in the next section.
4.5 Conclusions
This chapter covered a lot of ground, focusing on the localization of various
digital content types. With the advent of modern Web applications, the
distinction between software strings, documentation content and information
content is no longer always clear-cut. These types of content were reviewed in
order to detail key localization processes and introduce some of the tools that
can be used to facilitate such processes. Regardless of the content type, typical
localization processes involve three fundamental steps: extraction, translation
and merging. As shown in this chapter, modern localization processes try to
abstract most of the complexity that was characteristic of large localization
projects in the 1990s and 2000s. These modern processes tend to rely on
flexible workflows where content updates are handled continuously. In-context
localization techniques are also popular in order to minimize the amount of
quality assurance effort required to develop quality products. Such techniques
benefit the translation community, who instead of relying on isolated chunks
of words to translate, can focus on maximizing the end-user’s experience by
producing translations that fully match the context in which source strings
occur.
Documentation and information content, however, may not be limited to
textual content. As mentioned by Hammerich and Harrison (2002: 2), the term
content refers to the ‘written material on a Web site’, whereas the ‘visuals refer to
all forms of design and graphics’. This type of content will be covered in detail
in Section 6.1.
4.6 Tasks
This section is divided into three tasks:
If you cannot find a .po file for a project that looks interesting, you should proceed
to the next step.
Notes
1 http://www.alchemysoftware.com/products/alchemy_catalyst.html
2 http://www.sdl.com/products/sdl-passolo/
3 https://docs.djangoproject.com/en/1.7/topics/i18n/translation#localization-how-to-
create-language-files
4 https://translate.evernote.com/pootle/pages/guidelines/
5 https://www.mozilla.org/en-US/styleguide/communications/translation/
6 http://msdn.microsoft.com/library/aa511258.aspx
7 https://translate.evernote.com/pootle/pages/guidelines/
8 A less permissive version of HTML, known as XHTML (Extensible HyperText
Markup Language), exists. This version will be parsed by XML processors so syntax
errors will matter.
9 http://www.w3.org/TR/html-markup/strong.html
10 http://www.w3.org/TR/html-markup/img.html
11 http://www.w3.org/TR/html-markup/a.html
12 https://translate.twitter.com/forum/forums/spanish/topics/3337
13 http://www.microsoft.com/Language/en-US/StyleGuides.aspx
14 https://github.com/facebook/huxley
15 https://saucelabs.com
16 https://blogs.oracle.com/translation/entry/agile_localization_more_questions_than
17 http://blog.getlocalization.com/2012/05/07/get-localization-sync-for-eclipse/
18 http://msdn.microsoft.com/en-us/library/windows/apps/jj569303.aspx
19 https://pontoon-dev.mozillalabs.com/en-US
20 http://www.whatwg.org/specs/web-apps/current-work#contenteditable
21 https://developer.mozilla.org/en-US/docs/Localizing_with_Pontoon
22 http://officeopenxml.com/
23 http://opendocument.xml.org/
24 http://www.mediawiki.org/wiki/Help:Formatting
25 http://docutils.sourceforge.net/rst.html
26 http://johnmacfarlane.net/pandoc/
27 https://readthedocs.org/
28 http://www.xml.com/pub/a/2007/02/21/oaxal-open-architecture-for-xml-authoring-
and-localization.html
29 http://itstool.org
30 http://manpages.ubuntu.com/manpages/gutsy/man1/xml2pot.1.html
31 http://www.opentag.com/okapi/wiki/index.php?title=Rainbow
32 http://www.opentag.com/okapi/wiki/index.php?title=HTML_Filter
33 http://www.opentag.com/okapi/wiki/index.php?title=Rainbow_TKit_-_PO_Package
34 http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html#sec-further-examples-of-
supervised-classification
35 http://www.ttt.org/oscarstandards/srx/srx20.html
36 http://www.opentag.com/okapi/wiki/index.php?title=ratel
37 http://userguide.icu-project.org/strings/regexp
38 http://www.opentag.com/okapi/wiki/index.php?title=Scoping_Report_Step
39 All Microsoft style guides are available from: http://www.microsoft.com/Language/
en-US/StyleGuides.aspx
40 https://support.mozilla.org/fr/kb/bonnes-pratiques-traduction-francophone-sumo
41 http://www.opentag.com/okapi/wiki/index.php?title=XML_Validation_Step
42 http://news.cnet.com/8301-1023_3-57422613-93/google-translate-boasts-64-
languages-and-200m-users/
43 http://www.welocalize.com/dell-welocalize-the-biggest-machine-translation-
program-ever
44 http://bit.ly/dell-alienware-us
45 http://bit.ly/dell-alienware-fr
46 https://github.com
47 https://bitbucket.org
48 The word ‘project’ is used instead of ‘product’ because of the uneven maturity level of
the code posted on these platforms.
49 http://wiki.maemo.org/Internationalize_a_Python_application#With_poEdit
50 https://www.transifex.com/signup/
51 http://translate.evernote.com/pootle/projects/kb_evernote
52 http://pootle.translatehouse.org/
53 https://translate.evernote.com/pootle/pages/getting-started/
54 http://translate.sourceforge.net/wiki/pootle/live_servers#public_pootle_servers
Note, however, that some of these projects may contain software strings projects
rather than user assistance projects
55 https://translate.evernote.com/pootle/pages/guidelines/
56 http://docs.translatehouse.org/projects/pootle/en/latest/developers/contributing.html
57 https://translate.twitter.com/forum/categories/language-discussion At the time
of writing, specific English to target language guidelines could be obtained from
this URL by clicking a language and then a link starting with Style guidelines for
translating Twitter into
58 https://support.twitter.com/
59 http://developer.android.com/resources/tutorials/localization/index.html
60 http://developer.apple.com/library/ios#referencelibrary/GettingStarted/
RoadMapiOS/chapters/InternationalizeYourApp/InternationalizeYourApp/
InternationalizeYourApp.html
61 https://www.drupal.org/
62 http://www.joomla.org/
63 http://office.microsoft.com/sharepoint/
64 https://www.drupal.org/project/lingotek
5 Translation technology
The goal of this chapter is to focus on the technology that is linked to content
translation from one language into another. Translation management systems and
translation environments are the focus of the first two sections of this chapter
since they provide most of the infrastructure required for the translation step in
localization workflows. However, it is difficult to introduce translation management
systems without presenting specific translation workflows. Very often translation
management systems provide a workflow engine used to define a series of steps that
allow content to flow up and down the localization chain. Without such systems,
translation processes tend to be inefficient. This does not mean, however, that using
such a system will guarantee smooth localization projects. If a system is chosen for
the wrong reasons or is deployed in a hasty manner without providing appropriate
support to its users, its adoption and subsequent use may lead to inefficiencies.
Understanding the main characteristics of such systems is therefore crucial for
anybody who is in charge of using or managing localization workflows.
The third and fourth sections of this chapter cover the tools that are used to reuse previous translations and to handle terminology during localization processes. While terminology is at the core of most translation tasks, it is particularly
crucial in the localization of Web and mobile applications, since users tend to
interact with applications through translated strings. The fifth section of this
chapter focuses on machine translation, which is used increasingly to support,
enhance, and in some cases replace the translation step in localization workflows.
When used correctly, this controversial technology can boost translation
productivity and increase translation consistency. When used incorrectly, this
technology can have serious consequences (ranging from generating humorous
translations to producing life-threatening inaccurate translations). From a
translation buyer and translator’s perspective, it is therefore essential to know
when and how this technology should be used. The sixth section of this chapter is
dedicated to a workflow step that is closely related to machine translation: post-
editing. With the growing popularity of machine translation, post-editing is also
becoming more and more mainstream. This topic is discussed in a separate section
because it somewhat differs from the traditional act of generating a translation. A
review of post-editing tasks and tools is provided in this section. The last section
extends the section on post-editing by covering quality assurance tasks that are
performed in localization workflows, especially during the translation process.
While the concept of translation verification is not specific to localization,
localization-specific characteristics require the use of dedicated tools to ensure
that quality standards are used and adhered to throughout a localization project –
the ultimate goal being the release of quality localized applications.
• Finding out in which countries the app is used even though it has not been localized into the languages that are primarily used in those countries.
• Identifying similar apps that are popular in countries where their application
is not yet available.
• Selecting which target languages to translate into.
• Identifying and placing an order with a professional translation vendor who
will be able to complete the translation of their application’s strings.
• Communicating with the translator(s) to clarify any questions that may arise
during the translation process.
• Downloading the file(s) containing translated strings.
As far as the localization of apps is concerned, the first and third environments
are unlikely to be used for reasons that have been detailed in the previous chapter.
The choice of a translation environment (or a combination of translation
environments) depends on multiple factors, including:
5.2.1 Web-based
Some of the translation management systems that have already been covered
in Chapter 4 (e.g. Transifex and Pootle) have their own Web-based translation
environment.16, 17
These systems allow translators to accomplish (some of) the following tasks:
• Translate segments that have been extracted from a source content set (be it
a set of software strings or a set of help content).
• Connect to third-party systems that will provide translation suggestions, such as dictionaries, translation memory systems or machine translation systems. If
these systems have been correctly configured, they should help make the
translation process more efficient.
• Download a translation package containing both source content and
translation suggestions to work offline.
• Upload a translation package once the work has been completed offline.
• Check their translations to help produce the quality level that meets
customers’ requirements. Checks may include the detection of spelling,
grammar or style mistakes, as well as the identification of problems that
would affect the build process (e.g. missing or broken tags, duplicated hotkey
markers).
• Get paid for the work produced.
5.2.2 Desktop-based
A large number of translation environment tools exist, ranging from free open-
source programs (such as Virtaal or OmegaT) to large, commercial suites such
as SDL Trados Studio.19 Some of these programs are based on a client/server
architecture, which means that the translations and translation resources that
are generated and used can be synchronized across a network. Some of the functionality of these programs is sometimes made available from within standard word processing environments (such as MS Word), which are favoured by a number of
translators for productivity reasons (Lagoudaki 2009). One of the most common
translation features used in this manner is that of translation memory lookup,
which allows translators to translate a document in the environment of their
choice while leveraging a translation memory database, as briefly described in
the next section.
The first three segments are semantically identical while differing in terms
of punctuation, case, lexical choice and word order. The meaning of the fourth
segment, however, is completely different from that of the other segments but
it shares many words (and sequences of words) with the first segment. From
a translation productivity perspective, leveraging the translation of the first
segment when translating the second segment seems beneficial since no (or little)
editing would be required. Leveraging the translation of the first segment when
translating the fourth segment, however, may not be as effective because of the
semantic differences that exist between the two source segments. This cognitive challenge is likely to be exacerbated when translating into morphologically rich languages with case-based inflection. For instance, two word sequences may be identical in one language but different in another, depending on the grammatical role that each sequence plays (e.g. subject vs. object). Another cognitive challenge
may arise if the translation memory tool does not include any word alignment
visualization between the source and target segments. In the example above, one
of the differences between the first and fourth segments is the word click. It might
be useful for a translator to know that this word is missing from the fourth segment
when leveraging the first segment. Having access to this information (possibly
through some colour-coding visualization scheme) may help the translator decide
whether this word should be removed from the translation suggestion. However,
it might be equally (or even more) useful to know where the translation of this
word is in the translation suggestion. Having to read (or scan) the translation
suggestion to find (and possibly delete) the translation of the word click does not
seem the most efficient use of technology.
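To make this discussion more concrete, the following minimal sketch uses Python’s standard difflib module to compare a new segment against a stored translation memory segment at the word level (both segments are invented for illustration purposes, and commercial tools rely on their own, usually more sophisticated, matching algorithms):

import difflib

tm_source = "To save the file, click the Save button."
new_segment = "To save the file, select the Save button."

matcher = difflib.SequenceMatcher(None, tm_source.split(), new_segment.split())
print("Fuzzy score: {:.0%}".format(matcher.ratio()))
# Show which words differ between the stored segment and the new segment
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, tm_source.split()[i1:i2], "->", new_segment.split()[j1:j2])

A colour-coding visualization of the kind mentioned above can be built on top of this type of word-level comparison.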
Another aspect to keep in mind when selecting a translation memory tool is
its ability to export the contents of a project so that they can be used in another
environment. While the most important parts of a translation unit are the source
and target segments, it can sometimes be useful to export additional metadata
as well (e.g. creation date of the translation unit, author of the source segment,
author of the target segment, number of times the translation unit has been
leveraged in translation projects). Exporting translation memory data is often
performed using the TMX standard that was covered in Section 2.5.2.
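Since TMX is an XML vocabulary, exported data can also be inspected programmatically. The following sketch, which relies on Python’s standard library only, parses a deliberately simplified TMX document (the segments and metadata values are invented, and real exports typically carry a richer header and additional attributes):

import xml.etree.ElementTree as ET

tmx = """<tmx version="1.4">
 <header creationtool="example" creationtoolversion="0.1" segtype="sentence"
         o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
 <body>
  <tu creationdate="20150101T120000Z" usagecount="3">
   <tuv xml:lang="en"><seg>Click the Save button.</seg></tuv>
   <tuv xml:lang="fr"><seg>Cliquez sur le bouton Enregistrer.</seg></tuv>
  </tu>
 </body>
</tmx>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
root = ET.fromstring(tmx)
for tu in root.iter("tu"):
    # Each translation unit holds one variant per language plus optional metadata
    segments = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
    print(tu.get("creationdate"), tu.get("usagecount"), segments)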
5.4 Terminology
This section focuses on terminology, which is at the heart of the translation process in the localization of applications. It is divided into the following subsections: first, the importance of terminology is discussed from a localization perspective. The second and third subsections focus on the extraction of terminology, or more precisely the extraction of candidate terms. The fourth subsection covers various ways in which translations can be acquired once candidate terms have been validated. The final subsection explains how extracted terms and their translations can be made available in terminology glossaries, which can then be used during the translation or quality assurance process.
Listing 5.1 Candidate terms and frequencies extracted with Rainbow
16 sentence
15 The
12 documentation
11 user
10 installation
9 Before
8 After
8 Application Platform
8 Enterprise Application Platform
8 If
8 It
8 JBoss Enterprise Application Platform
8 Notes
8 Platform
8 developers
to keep shorter strings instead of longer strings. This decision is often influenced
by the way in which terms are going to be translated in various target languages.
In order to avoid undesirable concatenation issues (whereby the translation of
term A and the translation of term B cannot be glued together to produce a
correct translation for term AB), it is sometimes preferable to keep longer terms
when validating term candidates.
Another issue can be seen in Listing 5.1, whereby some candidate terms seem
to contain unlikely terms, such as Before. This is due to the fact that Rainbow
does not use any linguistic knowledge to extract candidate terms, so the output
tends to be noisy, especially if stopwords, which are (undesirable) words filtered out before or after processing text, are not used to refine the results. Examples of stopwords typically include function words (such as the or during) and common content words that are not domain-specific. To work around this problem, more sophisticated tools can be used to label each word with a part-of-speech tag before performing the actual extraction. This is the approach that is used
by LanguageTool, which was introduced in Chapter 3 in Section ‘Language
checkers’. The extraction of candidate terms based on part-of-speech tags may
be available in commercial or open-source tools. As mentioned in Chapter 2, the
Python programming ecosystem is rich in terms of additional, focused tools that
supplement the core language. These tools are often known as libraries since they
provide specific functionality, which would take a significant amount of time
to develop from scratch. One of these libraries is the Natural Language Toolkit
(NLTK) (Bird et al. 2009). This library allows users to perform in sequence
some of the tasks that are required to extract candidate terms, including text
segmentation, sentence tokenization, part-of-speech tagging and chunking.23
These techniques allow the creation of chunk patterns to extract only those
substrings that correspond to specific sequences of tags, such as sequences of
nouns (e.g. at least one common or proper noun, either in singular or plural
form). Once these strings are extracted, a final step is required to group them in
such a way that term variants (e.g. singular and plural) are merged together before
displaying frequency information. Merging singular and plural forms of strings
can be described as a normalization process, whereby the canonical, dictionary
form of a word is used to identify variants. This process, which is known as
lemmatization, can be achieved with NLTK using the WordNet resource.24 As
shown in Listing 5.2, the candidate terms and frequencies obtained using this
approach differ substantially from those presented in Listing 5.1.
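As an illustration, the following sketch shows how such a pipeline might be assembled with NLTK (the input text is a made-up example, the chunk grammar only captures sequences of nouns, and the relevant NLTK data packages, namely the tokenizer models, the tagger model and WordNet, must have been downloaded beforehand):

from collections import Counter
import nltk
from nltk.stem import WordNetLemmatizer

text = "Install the server. The servers must be restarted after the installation."

lemmatizer = WordNetLemmatizer()
chunker = nltk.RegexpParser("NP: {<NN.*>+}")  # one or more common or proper nouns
candidates = Counter()

for sentence in nltk.sent_tokenize(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for chunk in chunker.parse(tagged).subtrees(lambda t: t.label() == "NP"):
        # Lemmatization merges variants such as 'server' and 'servers'
        lemma = " ".join(lemmatizer.lemmatize(word.lower()) for word, tag in chunk.leaves())
        candidates[lemma] += 1

print(candidates.most_common())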
Once candidate source terms are extracted, translations must be identified if
the objective of the extraction is to provide translators with a glossary of preferred
translations. This step is described in the next section.
Listing 5.2 Candidate terms and frequencies obtained using NLTK
sentence 16
developer 8
user 8
server 7
Notes 6
documentation 6
directory 6
chapter 5
JBoss Enterprise Application Platform 5
something 5
information 5
CDs 4
test lab 4
voice 4
installation 4
For the sake of simplicity, the script is used in this example with the -t option so that it stops after ten seconds. It is possible to let the script run for much
longer but this may not be advisable in a low-resource computing environment.
Also, the files used in this example had not been tokenized. Once the script is
run, the results can be found in a text file called any.out. This file can then be
searched to look for term translations as shown in Listing 5.3.
By default the output of Anymalign contains three values when two input files
are selected. The first and second values are translation probabilities, where the
first value is the probability of the target given the source and the second value
the probability of the source given the target. The third value (which is used to
sort the results) corresponds to an absolute frequency.
In Listing 5.3 the grep tool is used with the -P option to look for patterns
in the output file using regular expressions. Since the output may contain
phrases containing multiple words, the ^ and \t delimiters are used to narrow down the results.

Listing 5.3 Searching the Anymalign output for term translations
$ grep -P "^developer\t" any.out
developer développeur - 1.000000 0.800000 4
$ grep -P "^server\t" any.out
server serveur - 0.941176 0.592593 16
server Serveur - 0.058824 0.041667 1
$ grep -P "^user\t" any.out
user utilisateur - 0.454545 0.277778 5
user l ’utilisateur - 0.363636 0.800000 4
user user - 0.090909 1.000000 1
user nom d ’utilisateur. - 0.090909 1.000000 1
$ grep -P "^production use\t" any.out

The three commands used for the terms developer, server and
user return up to three phrase pairs, with the most frequent one looking like a
good translation in the first three cases. The fourth command used for the term
production use does not return any result but this is not too surprising considering
(i) the script was run for a very short time and (ii) the data files used for the
bilingual extraction (i.e. KDE documentation) do not fully match the topic of
the file used for the monolingual extraction (i.e. JBoss). This second issue is very
common in localization projects (and more generally in translation projects)
since new terms that have never been translated before will keep appearing.
In this case, it will be the responsibility of a translator to provide a translation
(using traditional translation techniques such as borrowing or equivalence).
Defining the translation of a new, frequent term early on in a project is often
effective in order to avoid having to resolve translation inconsistencies at a later
stage.
As an alternative to the detailed process presented here, one may consider a
tool such as poterminology for the extraction of terminology from PO files.27
Ultimately, one has to decide whether one is looking for a one-click solution
(which may or may not be customizable and extensible) or a framework to refine
existing approaches. While the latter is more demanding than the former in
terms of initial investment, it may pay off in the long term. Regardless of the
method chosen, it does not do any harm to have a detailed understanding of how
things work once a button is clicked.
While some time will be spent on these steps, Allen (2001) argues that it is
time well spent before post-editing is actually started if translation productivity
gains are to be achieved.
Finally, a post-processing module may be used to automatically correct (or
post-edit) the output of a machine-translation system. Several approaches exist
to accomplish this goal, ranging from rule-based to statistical. The concept of
automated post-editing was first presented by Knight and Chander (1994) and
further explored by Allen (1999) with a view to fixing systematic errors committed
by an MT system. When these MT errors cannot be fixed with dictionary entries,
they may be fixed using global search and replace patterns and regular expressions
(Roturier 2009).
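A very simple automated post-editing component of this kind could look as follows (the two correction rules are purely hypothetical; in practice, rules are derived from an analysis of the recurrent errors made by a specific MT system on a specific type of content):

import re

# Each rule maps a recurrent (hypothetical) MT error to its correction
rules = [
    (re.compile(r"\bthe informations\b"), "the information"),
    (re.compile(r"\bplease to\b"), "please"),
]

def post_edit(mt_output):
    for pattern, replacement in rules:
        mt_output = pattern.sub(replacement, mt_output)
    return mt_output

print(post_edit("For more details, please to read the informations below."))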
The statistical methods used for machine translation are briefly covered in the
next section.
Data acquisition
The first step in building an SMT system is to determine which data to use
to create models that will be used for subsequent translations. For example, a
translator working in the pharmaceutical industry may be interested in creating
a system that will specialize in translating instruction leaflets for medicines.
As a general requirement, such a system must be able to deal with the lexical,
grammatical, stylistic and textual characteristics of this technical text type.
Using parallel data originating from a completely different domain or text type
(e.g. sports news) would therefore be almost useless since sports news terms
(and their associated translations) would be unlikely to appear in instruction
leaflets. There can be exceptions, of course, when sports news materials refer
to medicines used by athletes in certain contexts (e.g. in a doping scandal),
but in general the two domains and text types would be too different to
provide sufficient overlap. One should not forget that the phrase-based SMT
approach relies on phrases rather than structures, so phrases must have been seen at least once if they are to be translated into the target language. Once a
precise translation scenario has been identified, it is possible to start looking for
relevant training materials.
Most (if not all) statistical machine translation systems expect a set of parallel
sentence pairs in order to compute alignment probabilities between source and
target segments. Translation memories are of course good sources to find such
sentence pairs but having access to translation memories that are large enough
to be useful can be a challenge. For example, the LetsMT! service recommends
at least 1 million parallel sentences for training a translation model and at least
5 million sentences for training a language model.38 These recommendations
are based on productivity tests showing productivity increases when larger
training sets are used (Vasiļjevs et al. 2012). Even for a freelance translator who has worked for a number of years in a specific field of specialization, these numbers are quite large if only the translation memories built over the years are taken into account. For many, including larger language service
providers or corporate users, it is therefore necessary to leverage other data
sources to supplement a default set of translation memories.
Various types of data sources exist, ranging from open to closed. Open data
sources include the aforementioned OPUS corpus. The SMT community also
organizes some translation competitions from time to time and they often make
data sets available in an open manner.39 Some of these data sets may be useful
in bulk or in parts to supplement existing translation memories. Other data
repositories operate in a closed approach, whereby data is only made available
to members (who may or may not have to pay subscription or download fees
to make use of the data). One such repository is hosted by the Translation
Automation User Society.40 This system is based on data upload from members
in order for these members to download specific data sets. Some services operate
in a hybrid manner whereby public and private translation memories are made
available (e.g. MyMemory).41 As mentioned in Section 5.3, the assumption
that translation memories contain high quality translations is not always true,
especially if translation memories are not maintained over time.
Data processing
Once relevant data has been identified, it must be converted into a format that
is compatible with the tools that will be used to build the models. In some
cases, parallel data is not available at the segment level, but at the document
level. For instance, some Web sites may contain relevant document pairs when
they have been localized into at least one target language. Some data processing
tools specialize in the acquisition of such Web sources, using some heuristics to
transform such documents into smaller parallel units as described in Smith et al.
(2013) and Bel et al. (2013).
Parallel data need to be prepared before they are used in training. This
involves tokenizing the text and converting tokens to a standard case. Some
heuristics may also be used to remove sentence pairs that appear to be misaligned, as well as overly long sentences. All of these steps are necessary to ensure that reliable alignment probabilities are extracted from the training data. For instance, case standardization is used to ensure that word or term variants do not dilute the probability of an alignment. If the training data contained
multiple variants of a source word (e.g. email and Email), probabilities
would be shared among these variants, thus possibly resulting in less reliable
translations.
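The following sketch illustrates, in a very rough manner, what this preparation may involve for an English–French sentence pair (the tokenization rule and the filtering thresholds are arbitrary; toolkits such as Moses ship with their own, much more robust, preparation scripts):

import re

def tokenize(text):
    # Crude rule: split words and keep punctuation marks as separate tokens
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

def prepare(source, target):
    return [t.lower() for t in tokenize(source)], [t.lower() for t in tokenize(target)]

def keep(source_tokens, target_tokens, max_length=80, max_ratio=3.0):
    # Discard pairs that are empty, too long or of very different lengths (possible misalignment)
    if not source_tokens or not target_tokens:
        return False
    if max(len(source_tokens), len(target_tokens)) > max_length:
        return False
    ratio = max(len(source_tokens), len(target_tokens)) / min(len(source_tokens), len(target_tokens))
    return ratio <= max_ratio

pair = prepare("Send an Email to the administrator.", "Envoyez un email à l'administrateur.")
print(keep(*pair), pair)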
Obviously, some of these steps are language-dependent. For example, some
rough tokenization can be achieved for languages such as English by relying on
a small number of rules (using word spaces, punctuation marks and a small list
of abbreviations). For languages such as Chinese or Japanese, however, these
techniques will not work since these languages do not use spaces to separate
words. Instead, advanced dictionary-based word segmenters are required, which
may have an impact on the performance of the system (in terms of speed). For
languages that make heavy use of compounds (e.g. German), it is also often
preferable to use decomposition rules to make sure that good word alignment
probabilities are extracted. This is due to the fact that long, complex words tend
to appear less frequently than shorter words.
Training
The training is divided into two main parts: the training of the translation model
and the training of the language model. In order to train a translation model,
word alignment information must be extracted from sentence pairs. Once
these parallel sentences have been pre-processed, they can be word-aligned,
using a tool such as GIZA++ (Och and Ney 2003), which implements a set
of statistical models developed at IBM (Brown et al. 1993). Within the SMT
framework all possible alignments between each sentence pair are considered
and the most likely alignments are identified. These word alignments are then
used to extract phrase translations, before probabilities can be estimated for
these phrase translations using corpus-wide statistics.
The next step consists in training a language model, which is a statistical
model built using monolingual data in the target language. Since a language
model provides the likelihood that a target string is actually a valid sentence
in a given language, it offers a model of the monolingual training corpus and
a method for calculating the probability of a new string using that model. This
model is used by the SMT decoder to ensure the fluency of the translation output.
Moses relies on external toolkits for language model building, such as IRSTLM
(Federico et al. 2008) or SRILM (Stolcke 2002). One important factor to take
into account when building a language model is the maximum length of the
substrings (in terms of number of words or tokens) that should be used when
estimating probabilities. Such sequences of words (or tokens) are known as
n-grams, where n corresponds to the length of the phrase (e.g. two for a bigram). In
order to be able to differentiate between fluent and disfluent sentences, it is often
necessary to build models that rely on longer substrings from the training corpus.
While sequences of two or three words tend to be more useful than sequences of one word, longer sequences suffer from a major problem: they occur too infrequently in the training corpus to yield reliable estimates. It
is, however, possible to combine multiple language models built using different
string lengths in order to balance the need for flexibility and context sensitivity
(Hearne and Way 2011).
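The following sketch estimates bigram probabilities from a tiny (invented) training corpus using add-one smoothing; real language modelling toolkits such as IRSTLM or SRILM implement far more effective smoothing and back-off techniques:

from collections import Counter

corpus = [["the", "file", "is", "saved"], ["the", "file", "is", "open"]]

bigrams, histories = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    histories.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

vocabulary_size = len(set(token for sentence in corpus for token in sentence) | {"</s>"})

def bigram_probability(previous, word):
    # Add-one smoothing prevents unseen bigrams from receiving a zero probability
    return (bigrams[(previous, word)] + 1) / (histories[previous] + vocabulary_size)

sentence = ["the", "file", "is", "saved"]
probability = 1.0
for previous, word in zip(["<s>"] + sentence, sentence + ["</s>"]):
    probability *= bigram_probability(previous, word)
print(probability)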
Tuning
Tuning is the slowest part of the process of building an SMT system even
though it only requires a small amount of parallel data (e.g. 2000 sentences).
This step is used to refine the weights that should be used to combine the
various features of an SMT system. In the previous section, the focus was on
two of these features: the translation model and the language model, but other
features are often used, such as a word penalty to control the length of the
target sentence (Hearne and Way 2011). The tuning process tries to solve an
optimization problem by using a set of sentences corresponding to an ideal
scenario. In this scenario, each sentence is associated with a good translation
(or possibly a set of good translations) so various weight combinations can be
tried and evaluated in order to determine the one that will produce translations
that are the closest to the reference translations. Such a technique is known as Minimum Error Rate Training (MERT), which was proposed by Och (2003). The reliability of this technique is highly dependent on the method
that is used to determine whether two translations are close to one another.
As mentioned in Section 5.3, translations that are semantically equivalent or
related are not always close at a lexical level. Finding a reliable metric that
captures both meaning and structure acceptability is therefore an open research
question. This challenge is due to the fact that human evaluation itself is often
not 100 per cent reliable due to the many possible translations (Arnold et al.
1994). A number of metrics have been proposed over the years to try to address
the problem of evaluating machine translation in an automatic manner, as
discussed in the next section.
Evaluation
In order to bypass the alleged issues that are inherent to human evaluations
(i.e. cost and time), several automatic evaluation methods have been developed in recent years. Most of these automatic evaluation methods focus on
the similarity or divergence existing between an MT output and one or several
reference translations. Generally the scores produced by these MT metrics are
meaningful at the corpus level (i.e. by generating a global score for a tuning
or evaluation set), rather than at the segment level. Examples of automatic
metrics include BLEU (Papineni et al. 2002), Meteor (Denkowski and Lavie
2011), HTER (Snover et al. 2006) or MEANT (Lo and Wu 2011). While
all of these metrics try to provide an assessment of the quality of translations produced by MT systems, they differ substantially because they capture different aspects of translation quality. For instance, BLEU focuses on the overlap of n-grams (i.e. sequences of words) between the MT output and the reference translations, thus being more informative about the fluency of a translation than about its adequacy. Meteor is a tuneable metric that, by using
external resources, tries to address some of the weaknesses of BLEU (which
relies on surface forms). These resources include synonyms, paraphrases and
stemming that are used to avoid penalizing good translations that are not close to a reference translation from an edit distance perspective. HTER’s goal is
different since it measures the amount of editing that a human translator would
have to perform to transform an MT output into a valid reference translation
(by counting edit types such as insertions, deletions, substitutions and shifts).
Finally, MEANT evaluates the utility of a translation by matching semantic role
fillers associated with the MT output and reference translations, with a view to
capturing the semantic fidelity of a translation (instead of its lexical proximity
with a reference translation). Automatic evaluation metrics are often said to be
an inexpensive alternative to human evaluation (Papineni et al. 2002). However,
new sets of data require reference translations, which might be more expensive
to produce than performing a manual evaluation of the MT output, especially if
several reference translations are required to make the results more reliable. The
approach suggested by MEANT somewhat alleviates this requirement since it
relies on annotations provided by untrained monolingual participants. Despite
all of the research work that has been done in the area of machine translation
evaluation, no solution can provide a perfect way to gauge the quality of
individual translations. These approximations, however, can be used during the
tuning process to attribute weights to components of an SMT system and give
SMT developers a way to check whether their changes are bringing about some
improvements. While most of these tools do not have their own graphical user
interface, the Asiya Online toolkit provides an easy way to generate multiple
scores once files have been uploaded.42
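For illustration purposes, the following sketch computes a BLEU score for a single (invented) sentence pair using the implementation available in NLTK; as noted above, BLEU scores are normally computed and interpreted at the corpus level, and smoothing is needed to avoid zero scores on individual sentences:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["click the Save button to keep your changes".split()]
hypothesis = "click the Save button to save your changes".split()

score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 2))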
In some cases, however, relying on a corpus-level score is not sufficient to
understand why an MT system generated a given sentence, or whether some
source modifications can have an impact on the MT output. In these situations,
different tools are required to visualize aligned sentences at the word level.
The X-ray tool is one of these tools, since it leverages some word alignment
information generated by the Meteor metric to identify differences between two
strings (Denkowski and Lavie 2011).
Tools
Some of these steps may seem daunting to people who are new to machine
translation. The good news is that it is now much simpler to build an SMT system than it was at the beginning of the 2000s, thanks to the huge amount
of work that has been done by the SMT community. For instance, the Moses
framework is equipped with an automated pipeline that allows users to build and
evaluate SMT systems by running a very small number of commands.43 Some
detailed video tutorials are also available to guide users through each of the
steps that may be required to run or re-run specific commands.44 New graphical
tools, such as DoMT, have also recently emerged to hide some of the complexity
associated with some of the training, tuning and evaluation steps, at the desktop
or server level.45
Finally, cloud-based services, such as LetsMT!, KantanMT or Microsoft
Translator Hub, are now also available, almost turning the building of SMT
systems into a one-click process.46, 47, 48 The approach offered by Microsoft
Translator Hub differs from the one proposed by KantanMT and LetsMT! since
the former offers the customization of an existing, generic system while the other
two offer the creation of brand new systems. While the first approach offers
translations for generic words or phrases out-of-the-box, it is unclear how much
additional training data is required to force the translation of specific phrases
or terms. In specialized domains, it is common for some words to take on new
meanings. Occurrences of this new meaning may appear in the additional data
set that is used to customize an existing, generic system, but these occurrences
may not be sufficiently frequent to outweigh the occurrences that were used to
compute the original models. The second approach may offer more control for
the translation of specific domain terms but it is likely to suffer from a coverage
issue if the training data do not fully match the data that should be translated
with newly-built models.
5.6 Post-editing
In a machine translation context, the term post-editing (or postediting or postedition)
is used to refer to the ‘correction of a pre-translated text rather than translation
from scratch’ (Wagner 1985: 1). This definition is complemented by that of Allen
(2003: 207), who explains that the ‘task of the post-editor is to edit, modify and/
or correct pre-translated text that has been processed by a machine translation
system from a source language into (a) target language(s)’. As mentioned in the
previous section, the translation that is produced by machine translation systems
(even customized ones) is often not of sufficient quality to be published. Some
editing is therefore required to fix some of the errors that may have been generated
or introduced by a machine translation system. While some of the translation
suggestions generated by an MT system may be perfectly acceptable translations
(i.e. preserving the meaning of the original sentence and using a fluent style in
the target language), many suggestions contain errors that would be noticed by a
native speaker of the target language. The post-editing task differs from the task
of editing translation memory matches, because the target segments proposed
by a translation memory system tend to be fluent. Reading such segments to
identify missing or extra information is therefore not too demanding from a
cognitive point of view. On the other hand, machine translation output can be
extremely disfluent, which increases the cognitive load since post-editors have
to be able to: (i) identify whether some parts of the machine translation output are worth preserving, and (ii) decide how best to transform an incorrect translation into a correct one. This task becomes even more challenging when ‘post-editors
become so accustomed to the phrasing produced by the MT output that they will
no longer notice when something is wrong with it’ (Krings 2001: 11). In order
to guide post-editors in making editing choices, various post-editing models and
guidelines have been proposed over the years, as discussed in the next section.
5.7.1 Actors
The following actor types are among the most common ones in translation
processes:
• translation buyers
• language service providers
• translators
• translation revisers
• in-country reviewers
• translation users.
As shown in Figure 5.5, some of the checks are specific to user interface
strings. For instance, some checks look for the presence of new line characters or variable sequences in the translation. Other checks relate to characteristics of
markup content, such as the presence of URLs or HTML tags. These checks can
be crucial because the absence of such entities in the translation may result in
reduced functionality, or worse, in a broken application.
This process can be extremely useful to identify those files that contain high
priority violations. These tools usually give users the possibility to define the
severity of the problems based on their requirements, making it easy to select only
those checks that are relevant for a given project. Examples of checks include
repetitions of words or spaces, corrupted characters, differences in terms of inline
codes or tags, or missing translations. Figure 5.6 shows a list of violations obtained
with the CheckMate tool.62
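The sketch below illustrates, in a very simplified way, the kind of bilingual check such tools perform, using a naive regular expression to verify that the variables and tags present in a source string also appear in its translation (the pattern and the example strings are invented and would need to be adapted to the formats used in an actual project):

import re

placeholder = re.compile(r"%\w+|\{\d+\}|</?\w+[^>]*>")  # printf-style variables, {0}-style variables, simple tags

def check_segment(source, target):
    """Report placeholders or tags that are present in the source but missing from the target."""
    issues = []
    for token in placeholder.findall(source):
        if token not in target:
            issues.append('missing "{}"'.format(token))
    if source.strip() and not target.strip():
        issues.append("missing translation")
    return issues

print(check_segment("Click <b>%s</b> to continue.", "Cliquez sur %s pour continuer."))
# -> ['missing "<b>"', 'missing "</b>"']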
Very often rules must be tweaked to deal with domain or project characteristics in order to avoid false positives. The process used to adjust rules is
similar to the one described in Section 3.4.6. For instance, CheckMate can be
configured to leverage the rules offered by LanguageTool. Instead of checking
text in a monolingual context, translated texts can be checked using bilingual
rules, such as rules detecting false friends only when both the source and the
target contain the false friends terms.63 Such bilingual checks can be extremely
powerful in order to detect violations in a context-sensitive manner (e.g. a
translated sentence must not contain the phrase XYZ if the source sentence
contains W). CheckMate also gives the user the possibility to remove pre-
defined patterns and to create new ones using regular expressions, as shown in
Figure 5.7.
This approach can be extremely useful to check file formats that may be using
specific patterns. For example, the reStructuredText format in Section 4.3 uses a
notation that may not be covered by existing checking tools out-of-the-box.64
Source content written using this format, however, may have to be translated
Figure 5.6 Checking a TMX file with CheckMate
Listing 5.4 Annotating an issue in XML with ITS local standoff markup
5.8 Conclusions
This chapter has covered many aspects of one of the core localization processes:
translation. While localization is not limited to translation, localization would
not be possible without it. This chapter reviewed some of the tools and standards
that are commonly used in localization-based translation workflows, including
translation management systems, translation environments, terminology
extractors, machine translation and quality assurance tools. While these tools
can often speed up the translation process (and the overall localization process),
they must be carefully selected depending on the workflow that is being used.
Once again, it must be emphasized that localization workflows can range from
simple operations involving a handful of stakeholders to extremely complex ones
where responsibilities are shared among multiple actors. Regardless of the size
of these operations, the common objective of localization workflows is to adapt
digital content for a number of locales that differ from the one for which the
original content was created.
So far the discussion of adaptation has been extremely limited in this book.
Yes, some adaptation is sometimes required to generate effective translations in
a target language (e.g. using equivalent idiomatic phrases). Yes, some adaptation
is required to ensure that time and currencies display properly based on the
conventions of the target locale. But perhaps more importantly, adaptation often
needs to go beyond the act of translating software strings or documentation
content. While an application that allows users to select their preferred language
to display a graphical interface can be useful, it is not necessarily as useful as
having the features that are expected by those users. In other words, translated
strings are only one aspect of a truly multilingual application. Other aspects,
which include the ability to manipulate and process content in any language,
will be discussed in Section 6.3.3.
5.9 Tasks
This section is divided into four tasks, covering the topics of translation
management systems, translation environments, machine translation and post-
editing, and translation quality assurance.
Notes
1 An API is a specification indicating how software components should interact with
each other. For instance, a collection of public functions included in a software
library can be described as an API. In other situations, an API corresponds to the
remote function calls that can be made by client applications to remote systems.
2 http://www.linport.org/
3 http://www.ttt.org/specs/
4 http://gengo.com/
5 http://developers.gengo.com/
6 http://android-developers.blogspot.ie/2013/11/app-translation-service-now-
available.html
7 https://play.google.com/apps/publish/
8 http://android-developers.blogspot.co.uk/2013/10/improved-app-insight-by-linking-
google.html
9 https://developer.apple.com/internationalization/
10 https://developer.mozilla.org/en-US/Apps/Build/Localization/Getting_started_with_
app_localization
11 https://translations.launchpad.net/ubuntu/+translations
12 https://www.transifex.com/projects/p/disqus/
13 https://translate.twitter.com/welcome
14 https://www.facebook.com/?sk=translations
15 https://about.twitter.com/company/translation
16 http://support.transifex.com/customer/portal/articles/972120-introduction-to-the-
web-editor
17 http://docs.translatehouse.org/projects/pootle/en/stable-2.5.1/features/index.
html#online-translation-editor
18 http://www.translationtribulations.com/2014/01/the-2013-translation-environment-
tools.html
19 http://www.translationzone.com/products/sdl-trados-studio/
20 http://developer.android.com/distribute/googleplay/publish/localizing.html
21 http://blogs.adobe.com/globalization/2013/06/28/five-golden-rules-to-achieve-agile-
localization/
22 http://www.jboss.org/ The source files for this guide are provided under a
Creative Commons CC-BY-SA license: https://github.com/pressgang/pressgang-
documentation-guide/blob/master/en-US/fallback_content/section-Share_and_
Share_Alike.xml
23 http://www.nltk.org/book/ch07.html
24 http://wordnet.princeton.edu/
25 http://anymalign.limsi.fr#download
26 http://opus.lingfil.uu.se/KDE4.php
27 http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/
poterminology.html#poterminology
28 http://www.eurotermbank.com/
29 http://www.termwiki.com/
30 https://www.microsoft.com/Language/en-US/Default.aspx
31 http://blogs.technet.com/b/terminology/archive/2013/10/01/announcing-the-
microsoft-terminology-service-api.aspx
32 https://www.microsoft.com/Language/en-US/Terminology.aspx
33 https://www.microsoft.com/Language/en-US/Translations.aspx
34 http://www.ttt.org/oscarStandards/tbx/tbx_oscar.pdf
35 http://www.olif.net/
36 http://www.aamt.info/english/utx/
37 http://www.tbxconvert.gevterm.net/
38 https://www.letsmt.eu/Start.aspx
39 http://www.statmt.org/wmt09/translation-task.html
40 https://www.tausdata.org/index.php/data
41 http://mymemory.translated.net
42 http://asiya.cs.upc.edu/demo/asiya_online.php
43 http://www.statmt.org/moses/?n=FactoredTraining.EMS
44 https://labs.taus.net/mt/mosestutorial
45 http://www.precisiontranslationtools.com/products/
46 https://www.letsmt.eu
47 http://www.kantanmt.com/
48 https://hub.microsofttranslator.com
49 https://evaluation.taus.net/resources/guidelines/post-editing/machine-translation-
post-editing-guidelines
50 http://msdn.microsoft.com/en-us/library/hh847650.aspx
51 http://www.matecat.com/wp-content/uploads/2013/01/MateCat-D4.1-V1.1_final.
pdf
52 http://symeval.sourceforge.net
53 www.cen.eu/
54 http://www.lics-certification.org/downloads/04_CertScheme-LICS-EN15038v40_2011-09-01-EN.pdf
55 http://www.huffingtonpost.com/nataly-kelly/ten-common-myths-about-
tr_b_3599644.html
56 The LISA QA Model was initially developed by the now defunct Localization
Industry Standards Association (LISA). Since this model was not a standard, it is no
longer officially maintained.
57 https://evaluation.taus.net/resources-c/guidelines-c/best-practices-on-sampling
58 http://www.dog-gmbh.de/software-produkte/errorspy.html?L=1
59 http://www.qa-distiller.com/
60 http://www.xbench.net/
61 http://www.opentag.com/okapi/wiki/index.php?title=CheckMate
62 http://opus.lingfil.uu.se/KDE4.php
63 http://wiki.languagetool.org/checking-translations-bilingual-texts
64 http://docutils.sourceforge.net/rst.html
65 http://sphinx.readthedocs.org/en/latest/intl.html
66 http://www.digitallinguistics.com/ReviewSentinel.pdf
67 https://github.com/lspecia/quest
68 http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox_baseline_17
69 http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox
70 http://www.quest.dcs.shef.ac.uk/quest_files/features_glassbox
71 https://evaluation.taus.net/
72 http://standards.sae.org/j2450_200508/
73 http://www.qt21.eu/launchpad/content/multidimensional-quality-metrics
74 http://www.w3.org/TR/its20
75 http://www.w3.org/TR/its20/examples/xml/EX-locQualityIssue-global-2.xml
Copyright © [29 October 2013] World Wide Web Consortium, (Massachusetts
Institute of Technology, European Research Consortium for Informatics and
Mathematics, Keio University, Beihang). All Rights Reserved. http://www.w3.org/
Consortium/Legal/2002/copyright-documents-20021231
76 http://www.w3.org/TR/its20#lqrating
77 http://www.w3.org/TR/its20/#mtconfidence
78 http://okapi.googlecode.com/git/okapi/examples/java/myFile.html
79 https://www.letsmt.eu/Register.aspx
80 https://evaluation.taus.net/resources/guidelines/post-editing/machine-translation-
post-editing-guidelines
81 http://www.statmt.org/moses/?n=Moses.Baseline
82 http://www.gala-global.org/LTAdvisor/
83 https://directories.taus.net/
84 http://www.internationalwriters.com/toolbox/
6 Advanced localization
Figure 6.1 Phases of the app user journey: Discovery (Search, Experience), Acquisition (Purchase, Download, Install) and Usage (Use, Get Help, Learn)
6.1.1 Screenshots
Some of the graphics present in user assistance content are screenshots or screen
captures, showing specific parts of an environment in which something happens
or needs to be done. The term environment refers here to the Graphical User
Interface (GUI) of a program or set of programs. In user assistance content, some
sections are often illustrated with screenshots whose purpose is to guide users
in step-by-step procedures, such as activating a particular function, modifying
certain settings, or removing an application. Since screenshots sometimes
perform the same function as text instructions, one may wonder why one is
used instead of the other, or why both are sometimes used together. Partial answers to this question may be found in a study (Fukuoka et al. 1999), which found that American and Japanese users believe that more graphics, rather than
fewer, make instructions easier to follow. This study also revealed that users
prefer a combination of text and graphics, which they believe would be more
effective than text-only instructions. From a semiotic perspective, screenshots
play an iconic role (Dirven and Verspoor 1998), because they provide users with
a replication of the environment with which they are interacting.
Screenshots may also provide an illustration of some of the steps users should
follow to fix a problem. Technical support screenshots can sometimes be edited
by content developers to provide extra information to users. Information can
be added using text or graphical drawings, such as arrows or circles, to draw the
attention of the user to a certain part of the replicated GUI. These elements are
examples of an indexing principle (Dirven and Verspoor 1998: 5) because they
draw the attention of the user to a particular action that should be performed, or
to the result of an action. This principle allows users to isolate the component
of the GUI which requires action. As a result of this quick link between
form and meaning, screenshots may replace procedural sentences containing
instructions to find the location of a graphical item, be it a button, a tab, a
pane, a window, a menu bar, a menu item, or a radio button. These elements
may also have an iconic function by replacing the action that the user should
perform on one of these items: to click, to check, to uncheck, or to enter a
word. From a multilingual communicative perspective, those screenshots should
of course be in the language of the users so that their primary iconic function
can be fully performed. However, this is not always possible, because third-party
English applications are not always localized. The handling of screenshots is
therefore a complex localization process. It is sometimes difficult or impossible
for a human translator or quality assurance specialist to find the corresponding
screenshot in his or her own language. The time required to perform such a
search during the translation process should therefore not be underestimated.
This is not the only drawback of screenshots when they are included in technical
support documents. Screenshots can also create accessibility issues for users with
eyesight-related difficulties. If screenshots are not accompanied by alternative
text as discussed in Section 3.3.1, they may be ignored by accessibility tools
such as screen narrators. Besides, a screenshot may affect the reliability of a document over time, or at least baffle users running older versions of the
product for which the document was originally intended. This situation can
happen when the GUI changes over time. For instance, a document may apply to several versions of an operating system as long as the text used in the document does not focus on any particular version. If a screenshot is introduced, the document
may be perceived as version-specific by certain users. If the screenshot does not
exactly match their environment, certain users may come to the conclusion that
the document does not apply to them.
Screenshots may also be included in other document types. For example,
they are increasingly used in pages associated with the description of mobile or
platform-specific applications that can be downloaded from specific Web sites
(often referred to as app stores), as shown in Figure 6.2.
These descriptions, which are consulted by prospective users during the
discovery phase, may contain a mix of text and graphics, so having content that
seems relevant to potential users is essential. When screenshots are used in this
context, their main function is to promote an application by giving users a quick
view of the application’s main functionality. Since users’ decisions to select a
particular application in a given application category (e.g. a calendar application)
are made quickly based on the increasingly large number of applications
available, screenshots should be both engaging and relevant. For example, if the
application has been localized from English into French and German, providing
screenshots with English text (either in the User Interface or in input fields) may
be detrimental to future uptake in French- and German-speaking locales. It is also
not always sufficient to provide localized screenshots if the examples contained
in the screenshots are not relevant. In the case of a restaurant recommendation
application targeting Japanese-speaking users, providing an example of a search
for restaurants in San Francisco may not be as powerful as a search for restaurants
in Tokyo.1
To some extent, this characteristic also applies to video clips (or videos) that
are sometimes linked to technical support documents. These video clips contain
step-by-step tutorials designed to help users find an answer to their question.
From a localization perspective, this type of element is even more complex than
static screenshots, while having the same pragmatic function as plain text. Some
aspects of the localization of this type of content are covered in Section 6.1.3
once other graphic types have been discussed.
Video subtitling
While it is not necessary to have access to a voice-over transcript to create
localized subtitles, its presence can simplify the translation process. For example,
a translation memory could be used to leverage previous translations based on
an analysis of the source transcript. Three steps are required to generate video
subtitles: the actual creation of the subtitles, the synchronization of the subtitles
with the audio track and a final review to refine the translation. All of these steps
can be performed using dedicated software, such as the online Amara service.2
This service is maintained by the Participatory Culture Foundation, which is a
‘non-profit organization building free and open tools for a more democratic and
decentralized media’.3 This online service allows users to generate subtitles in the
language of their choice using the interface shown in Figure 6.3.
The goal of the first step is to type translations for the words that correspond
to the words spoken in the audio track. In the case of a product tutorial, these
words are spoken by an instructor who may be describing steps to achieve a
particular objective (such as installing a product or using a particular product
feature to perform a task). The Amara software automatically stops every eight
seconds to make sure that the narrated text is broken down into manageable,
easy-to-remember chunks. During this first step, typing mistakes can be made
While the first guideline applies to subtitles that are aimed at hard-of-hearing
users, the second guideline is extremely important because this text cannot
be localized without re-shooting the video. While this guideline can be easily
applied when the number of signs is small, it becomes much more challenging
when a user interface is being recorded in the context of a tutorial (screencast).
Having to add subtitles for all GUI labels that are being clicked by a user may
be impossible, especially when the user is also describing the actions they are
performing. In such an extreme case, it would seem preferable to record the video
in the target language with a target GUI. The objective of the third guideline is
to improve the final user experience, by making sure that sentences are not split
in an unusual way. Unless short sentences are used, however, the implementation
of this guideline might result in having the user read a substantial amount of
information on screen in one go. From a comprehensibility perspective, it seems
preferable to avoid having to remember what was said in previous subtitles.
Having standalone subtitles does not only improve comprehensibility, it also
improves translatability as explained in the following section.
• Finding a list of keywords and search terms that users use in order to find
applications.
• Understanding the keywords that are used by competing applications.
• Identifying popular search terms that are currently matched by a small
number of applications (or even better, by no application).
Once these steps are completed, the identified terms can be used in the
application’s description or in any field that is indexed by the application
repository’s search engine. As these steps show, such an activity is very different in nature from a traditional approach to translation, hence the need to categorize it under adaptation. Two other types of textual adaptation are
discussed in the remainder of this section, transcreation and personalization.
6.2.1 Transcreation
Transcreation was briefly introduced in Section 1.3. The challenge with this
concept is that it sometimes overlaps with translation. After all, adaptation is
one of the translation techniques that translators rely on to transfer elements of
a source text into a target text. Examples of such elements include ‘prices [that]
should be in local currency and phone numbers [that] should reflect national
conventions’ in translated texts (DePalma 2002: 67). The adaptation of such elements can be challenging: changing one currency symbol to another and applying a standard conversion rate is unlikely to be sufficient because of specific, local pricing strategies. Changing a phone number by adding a prefix is also
unlikely to be sufficient because a phone number in Germany is not going to
be useful for a customer based in Japan who is looking for technical support in
Japanese during the Usage phase. Determining whether such equivalent phone
numbers or addresses are available may not always be straightforward because
some locales may not have dedicated local support teams, especially if support
teams are shared across multiple locales. Obviously, this type of adaptation should
be identified early in the source content creation process, so that specific measures
can be taken. For instance, specific supporting materials may be provided to
translators as part of the translation guidelines or this type of content can be
excluded from the translation process altogether. So if adaptation is part of the
translation process, why is a new term such as transcreation required?
When a few adaptation issues are scattered across an informative document
(say, a user guide), a standard translation process such as the one presented in
Section 4.3 can be used. When such adaptation issues appear throughout a
document that is trying to trigger a reaction from the user, however, another
strategy may have to be considered. Rather than being specific about how the
content should be translated (e.g. using tools, guidelines and reference assets),
translators may be given carte blanche to create a document in the
target language that matches the intent of the source text. For instance, the
Mozilla foundation provides the following adaptation guidelines for Web site
content, campaigns and other communications intended for a general audience:
‘Localized content should [not] be a literal translation, but it should capture the
same meaning and sentiment. So feel free to pull it apart and put it back together;
replace an English expression with one from your native language; Mozilla-fy it
for your region.’ 8
Ray and Kelly (2010: 3) indicate that ‘typical projects that require
transcreation include Web campaigns that do not attract customers in other
markets, ads that are based on wordplay, humour that is directly related to just
one language or culture, or products and services that need to be marketed to
diverse demographics within the same market’. Obviously this creative process
requires more time than standard translations because multiple variants may have
to be considered before an acceptable wording is found in the target language.9
In these situations, translation is often going to be inadequate, which is why
transcreation is required. While translation endeavours to somehow reuse some
aspects of the source text (e.g. information structure), transcreation seeks to have
the target text achieve the same high-level goal as the source text or brief (e.g.
convincing users to buy a product). In such a scenario, the source words, phrases,
sounds and structure no longer matter: it is all about leveraging target cultural
norms and expectations.
It is worth mentioning that the use of transcreation may create some challenges,
in terms of cost and brand protection, as mentioned by the head of Marketing
Localization at Adobe: ‘The challenge here is the balance between giving more
flexibility and freedom of expression to the regions and the use of productivity
tools such as Translation Memory. If we want to leverage the savings that TMs
and other tools offer to localization (and we do), we can offer some flexibility in
the target content but not as much as sometimes the regions would like to have.
[…] We are very protective when it comes to the Adobe brand and although the
regional offices are given some flexibility in terms of creating some of their own
marketing materials (in their original language), Adobe’s Brand team normally
is involved to make sure the materials follow the established international brand
guidelines.’10 A more detailed discussion of the techniques that can be used when
dealing with text types such as marketing or advertising documents can be found
in Torresi (2010).
6.2.2 Personalization
Another type of textual adaptation is personalization. Personalization can be
achieved using a couple of approaches, but only one of them is relevant to the
present discussion. The first approach consists of linking content to
other content based on specific attributes. For instance, some content may be
recommended to a user who has previously consumed specific content. This type
of personalization does not need to take into account local knowledge.
The second type of personalization, which is the focus of this section, refers
to the adaptation process that is required to meet the expectations or needs of
specific individuals (or even of a single individual). Such individuals can be
grouped into personas whose specific characteristics guide the content creation
or personalization processes.11 The advantage of this approach is that it is not
constrained by pre-defined characteristics, such as the location of a user. While it
might be tempting to assume that users from a specific region are likely to behave
in a similar manner, one should not forget that other factors can come into play,
such as age or fields of interest. For instance, a user based in Germany (who
happens to study English in college) may have more in common with an American
user of the same age group than with another German person from a different
age group. This means that targeting users by focusing exclusively on location
can be sub-optimal. Rather than assuming that a user should be presented with
content in a language based on their geographical location (e.g. German if the IP
address of the system making a Web request is associated with Germany), content
publishers can take into account users’ linguistic preferences. Such preferences,
which are captured in the language preference settings of a Web browser, are
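As an illustration of how such browser preferences could be honoured programmatically, here is a minimal sketch in Python (the raw Accept-Language value and the list of supported languages are assumptions made for the example):

def pick_language(accept_language, supported, default='en'):
    # Parse an Accept-Language value such as 'de-DE,de;q=0.8,en;q=0.5'
    preferences = []
    for part in accept_language.split(','):
        pieces = part.strip().split(';q=')
        language = pieces[0].strip().lower()
        quality = float(pieces[1]) if len(pieces) > 1 else 1.0
        preferences.append((quality, language))
    # Try the languages in decreasing order of preference
    for quality, language in sorted(preferences, reverse=True):
        if language in supported:
            return language
        base = language.split('-')[0]
        if base in supported:
            return base
    return default

print(pick_language('de-DE,de;q=0.8,en;q=0.5', ['en', 'de', 'fr']))  # prints: de

A content publisher relying on such a function would serve German content to this user even if the request originated from an IP address located outside Germany.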
6.3.2 Services
More and more applications (whether they are Web, mobile or desktop
applications) are connected to Web services in order to provide functionality
that may not be practical or desirable to provide in the application itself. For
instance, it is currently not practical to access a generic search engine on a mobile
device without being connected to the Internet. The computing power required
to perform a standard Web search is well beyond the capability of most modern
mobile devices. It is also convenient for software publishers to make some of their
functionality available as Web services instead of packaging it in standalone
applications. Even if standalone applications are published in closed, proprietary
formats, it is always possible to reverse engineer them and access their source
code. Making key functionality available as a Web service therefore allows
software publishers to keep their code away from curious eyes.
Examples of such Web services include services providing weather forecasts,
news information, stock market values, text or speech translation,
information search results, etc. Some Web services are obviously more popular
than others depending on the locale where they are available. According to
an online report, while most users worldwide tend to favour the Google search
engine, most Chinese users tend to rely on the Baidu search engine and most
Russian users on the Yandex search engine.18 For an application to
have the expected impact in any given locale, such local preferences have to be
taken into account. For instance, if an application (such as a word processing
application or a reference management application) allows users to perform
searches using a specific search engine service, this functionality may have to
be adapted to either support additional search engine services or replace the
existing one with a local service. Online services, such as eBay or Google News,
may not be as popular (or even available) in other locales, so adaptation may
be required to tailor this list for specific locales. Such adaptation work may be
labour-intensive, especially if such services do not rely on industry standards to
receive and respond to requests. The work can also be further complicated if some
services are not documented in the language of the developer who is responsible
for the adaptation. A good example of service adaptation was provided by Apple
in 2012 with one of the releases of its OS X operating system. This complex
piece of software was specifically adapted for the Chinese market, so that its users
could select Baidu search in the Web browser, set up their contacts, mail and
calendar with service providers such as QQ, 126.com and 163.com, or upload
videos to the Youku and Tudou Web sites.19
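As a purely illustrative sketch of this kind of functional adaptation (the locale mapping and the query URL templates below are assumptions made for the example, not a description of how any specific product implements this), an application written in Python could select its default search service per locale as follows:

try:
    from urllib import quote_plus          # Python 2
except ImportError:
    from urllib.parse import quote_plus    # Python 3

# Hypothetical mapping of locales to preferred search services
SEARCH_SERVICES = {
    'zh-CN': 'https://www.baidu.com/s?wd={query}',
    'ru-RU': 'https://yandex.ru/search/?text={query}',
}
DEFAULT_SERVICE = 'https://www.google.com/search?q={query}'

def build_search_url(query, locale):
    # Fall back to the default service when no local preference is recorded
    template = SEARCH_SERVICES.get(locale, DEFAULT_SERVICE)
    return template.format(query=quote_plus(query))

print(build_search_url('localization', 'zh-CN'))

Keeping such locale preferences in data rather than in code means that adding support for a new local service does not require changing the application's logic.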
Another example of service adaptation that may be required to meet
the expectations of local users is related to the way local payments are made.
While some credit card brands are very popular in many countries, other forms
of payment exist. When popular forms of payment are unsupported, users are
left frustrated and customers are lost. As an example, the publisher of the Clash
of Clans mobile game encountered an issue with Chinese users because it was
soliciting in-application payments through a specific market store application
which was unsupported in China.20 For this reason, it is now becoming customary
for global businesses to support local payment providers, such as allpago in Brazil
or Alipay in China.21, 22
6.5 Conclusions
This chapter covered a number of topics that may not be of primary concern to
translators whose main day-to-day activity is translation. However, for those
translators who are seeking to diversify their activities by offering additional
services to customers, concepts such as transcreation should be extremely
relevant. Globalization project managers should also be particularly interested in
all of the topics covered in this chapter, since crucial business decisions related
to topics such as culture and location must be taken before delving
into a traditional localization process centred around translation. Once again, it
is worth emphasizing that the translation act is only relevant if it serves a specific
need. Whether the need is related to the generation of local content used to
convince a customer to purchase an application or service or to the generation of
support content used to assist customers, the expectations of the target content
consumer(s) should always be made a priority by the person involved in the
translation process. This chapter has hopefully demonstrated that in specific
cases, translation is not sufficient to meet the expectations of a target customer.
Various levels of adaptation (be it at a cosmetic or a functional level) are often
required to truly localize an application so that it can be competitive against
native applications in a specific domain. The following section offers two tasks
that are related to the topics introduced and discussed in this chapter.
6.6 Tasks
This section is divided into two tasks, covering the topics of transcreation and
functional adaptation.
Notes
1 http://thenextweb.com/insider/2013/03/23/how-we-tripled-our-user-base-by-getting-
localization-right/
2 http://www.amara.org
3 http://pculture.org
4 http://www.ted.com/pages/translation_quick_start
5 http://youtubecreator.blogspot.fr/2013/02/get-your-youtube-video-captions.html
6 https://blogs.adobe.com/globalization/adobe-flash-guidelines/
7 http://makeappmag.com/iphone-app-localization-keywords/
8 https://www.mozilla.org/en-US/styleguide/communications/translation/
9 Since more time is required, the activity may be paid by the hour instead of the
word: http://www.smartling.com/blog/2014/07/21/six-ways-transcreation-differs-
translation/
10 http://blogs.adobe.com/globalization/marketing-localization-at-adobe-what-works-
whats-challenging/
11 http://thecontentwrangler.com/2011/08/23/personas-in-user-experience/
12 http://www.w3.org/International/questions/qa-lang-priorities
13 http://www.w3.org/International/questions/images/fr-lang-settings-ok.png Copyright
© [2012-08-20] World Wide Web Consortium, (Massachusetts Institute of
Technology, European Research Consortium for Informatics and Mathematics,
Keio University, Beihang). All Rights Reserved. http://www.w3.org/Consortium/
Legal/2002/copyright-documents-20021231
14 Copyright © [2012-08-20] World Wide Web Consortium, (Massachusetts Institute
of Technology, European Research Consortium for Informatics and Mathematics,
Keio University, Beihang). All Rights Reserved. http://www.w3.org/Consortium/
Legal/2002/copyright-documents-20021231
15 https://developer.amazon.com/appsandservices/apis/manage/ab-testing
16 https://www.optimizely.com
17 http://www.w3.org/TR/html5/text-level-semantics.html#the-span-element
18 http://returnonnow.com/internet-marketing-resources/2013-search-engine-market-
share-by-country/
19 http://support.apple.com/kb/ht5380
20 http://techcrunch.com/2013/12/07/gamelocalizationchina/
21 http://www.allpago.com/
22 http://www.techinasia.com/evernote-china-alipay/
23 http://www.nltk.org/nltk_data/
24 http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison
25 http://docs.mongodb.org/manual/reference/text-search-languages#text-search-
languages
26 http://www.basistech.com/text-analytics/rosette/base-linguistics/
27 http://www.oracle.com/us/technologies/embedded/025613.htm
28 http://www.clsp.jhu.edu/user_uploads/seminars/Seminar_Pedro.pdf
29 http://en.wikipedia.org/wiki/Hop_(networking)
30 http://www.smartling.com/translation-software-solutions
31 http://localize.reverso.net/Default.aspx?lang=en
32 http://techcrunch.com/2013/05/07/evernote-launches-yinxiang-biji-business-taking-
its-premium-business-service-to-china/
33 http://blog.evernote.com/blog/2012/05/09/evernote-launches-separate-chinese-
service/
34 http://mashable.com/2014/01/26/south-korea-5g/
35 http://bit.ly/ms-xp-support-end
36 https://languagetool.org/languages/
37 http://wiki.apertium.org/wiki/List_of_language_pairs
7 Conclusions
The global software industry, including the localization industry, is going through
many changes, which makes it very different from what it was at the beginning
of the 2000s (or even 2010s). Some of these changes, such as continuous
localization, are extremely disruptive and have a profound impact on the daily
work of translators and localizers. In this last chapter, the topics that have been
covered in Chapters 2, 3, 4, 5 and 6 will be briefly revisited in the light of current
and future trends, such as mobile and cloud computing. As much as possible,
additional research opportunities will also be identified. The second part of this
chapter will attempt to be even more future-facing and briefly discuss some of the
new directions that global application publishers could embrace in the years to
come in order to understand the impact they may have on the world of translators.
7.1 Programming
In Chapter 2, basic programming concepts were introduced for two reasons: to
introduce localizers to key software development concepts (so that they become
more comfortable with technical aspects of the localization process), but also
to introduce technically-oriented localizers to some programming and text
processing techniques that could boost their productivity.
Software development practices are shifting increasingly towards a continuous
delivery model. Regular version updates are increasingly being replaced by a
stream of incremental or disruptive updates. As far as end-users are concerned,
product version numbers do not matter so much – what matters most to them is
the functionality that is provided by a given application. Version numbers are
unlikely to completely disappear since they are useful to determine why users
may be experiencing certain issues. However, software publishers are increasingly
aligning the release of updates with usage data. This is especially the case for
hosted or cloud-based applications, since application or service providers are able
to test the impact of an update in live conditions with only a subset of their user
base. If the outcome of the test is deemed negative, then the update to the whole
user base can be postponed or even discarded.
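One hedged illustration of how such a partial rollout could be implemented (the 10 per cent threshold and the user identifiers below are assumptions, not a description of any particular vendor's system) is to assign users to a test group deterministically:

import hashlib

def in_test_group(user_id, percentage=10):
    # Hash the user identifier so that the assignment is stable across sessions
    digest = hashlib.md5(user_id.encode('utf-8')).hexdigest()
    return int(digest, 16) % 100 < percentage

for user_id in ('alice', 'bob', 'carol'):
    print('{0}: {1}'.format(user_id, in_test_group(user_id)))

Because the assignment is derived from the identifier itself, the same users keep seeing the updated version for the duration of the test, which makes usage data easier to interpret.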
As far as industry trends are concerned, mobile and cloud computing are
affecting not only the IT industry but also the lives of a large proportion of the
world’s population. There have never been more mobile handsets in use in the
world and this increase is unlikely to stop any time soon. A couple of platforms
are currently dominating the mobile market, namely Android and iOS, but the
Windows Phone operating system is likely to challenge these two thanks to
Microsoft’s recent acquisition of Nokia. Other open-source platforms, such as
Firefox OS and Ubuntu, could grow in popularity, especially in emerging markets.
From a translator’s perspective, the proliferation of platforms can only have a
positive impact. If more than one platform exists, then localized resources will be
required for each of them. Obviously such platforms can share localized resources, such as
those offered by the Unicode Common Locale Data Repository.1 But additional
resources, such as interface strings or help content, will have to be localized.
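For instance, the Babel library packages CLDR-derived locale data for Python programs. The minimal sketch below (assuming Babel has been installed, for example with pip install Babel) shows how dates and currency amounts can be formatted per locale without writing any locale-specific code:

from datetime import date
from babel.dates import format_date
from babel.numbers import format_currency

release_date = date(2015, 3, 31)
for locale in ('en_US', 'de_DE', 'ja_JP'):
    # Both functions draw their formats and symbols from CLDR data
    print(format_date(release_date, locale=locale))
    print(format_currency(9.99, 'EUR', locale=locale))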
Cloud computing is another trend that has been affecting both the IT industry
and the language service industry. What used to be performed by local servers
can now be accomplished more easily using scalable, online infrastructures. This
perceived ease of accomplishing computing tasks must, however, be offset against
some of the privacy and security risks that still characterize cloud-based services.
Some entities are still reluctant to fully trust such services and regular data leaks
or surveillance scandals do not improve the situation. Surveillance scandals
could actually have a profound negative impact on the localization industry since
governments, companies or individuals may be tempted to favour local providers
for trust reasons instead of relying on global providers offering localized services.2
As far as the future of the Python programming language is concerned, the debate
about versions 2 and 3 is bound to continue for some time. The choice that was
made and justified in Section 1.5 to introduce version 2 in this book is supported
by voices in the Python community, who maintain that version 3 (especially its
support for Unicode) is not as ideal as it was once presented to be.3 In any
case, the support for the 2.x series of Python has recently been extended until
2020. This is both good news and bad news for the Python community. On the
one hand, it gives library developers or code maintainers more time to port their
code base to a new version. On the other hand, it deepens the division between
two community camps, which could lead to an unresolved situation. As far as
novice programmers are concerned, this situation should be acknowledged but
not necessarily seen as a blocker to embracing the language. It has never been easier
to get started with the language in an exploratory and collaborative manner,
thanks to online services such as those presented in Chapter 2 (or others such
as Wakari.IO or nbviewer), so the entry barrier has never been lower.4, 5 Even
if the goal in learning a programming language is not necessarily to compete
with experienced coders, it must be emphasized that being able to automate tasks
using a language without having to rely on a developer can be a great advantage.
7.2 Internationalization
The concept of internationalization was the focus of Chapter 3. The discussion
focused mainly on the way content may be internationalized in order to make
downstream processes (such as translation or adaptation) more efficient.
Specifically, global writing guidelines were reviewed from a perspective of
(technical) content authoring. Special emphasis was also placed on the way user
interface strings should be handled in source code so that they can be easily
extracted and localized in multilingual applications. The discussion on functional
adaptation in Chapter 6 also highlighted the advantage of using mature frameworks
and libraries in order to leverage functionality that can handle multiple language
inputs (be it from a text, speech or even graphical perspective) and formats. While
it is possible to argue that mature internationalized frameworks and libraries exist
(such as the Django framework that was introduced in Chapter 3), one may
regret that the use of internationalization features is not enabled by default in
most programming languages. For instance, the Python programming language
allows developers to declare string variables without having to mark them
explicitly for extraction using the gettext mechanism. Similarly,
the Java programming language does not force developers to use properties files
and resource bundles by default. Since the use of these mechanisms requires
extra typing (and potentially extra overhead if it is not required), it is easy to
understand why it is often ignored when applications are first developed. This is
especially true when applications originate from research prototypes, which are
often developed in a quick and unstructured manner without any guarantee they
can be turned into successful, global solutions. In short, the benefits that can
be achieved by using such mechanisms are often not clear to the person who is
writing the source code. The situation is obviously very different if the use of such
mechanisms is mandated as part of a list of requirements. Again, it is often the
case that the first version of a product will not accommodate such a requirement
for two main reasons:
Staggering success may seem counter-intuitive, but quick success can have
unwanted consequences. First, the infrastructure supporting a service may not
be ready to accommodate thousands or millions of users, so it may be preferable
to restrict access by not supporting certain languages. Second, a company
having to justify growth to shareholders or venture capitalists may prefer to
distribute registrations or downloads over time. Obviously these two reasons
do not mean that internationalization techniques have to be ignored. A careful
global planning process may mandate the use of these techniques and postpone
localization activities to a later phase. However, it is extremely easy to neglect
such techniques when the requirement to serve global users competes with other
equally important requirements (e.g. improving the user experience, the stability
of an infrastructure or the security of an application or service).
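To make the earlier point about string marking concrete, the minimal sketch below shows a Python string marked for extraction with the gettext mechanism mentioned above (the 'myapp' domain and the 'locale' directory are assumptions made for the example; an extraction tool would then be pointed at the _() calls):

import gettext

# Load a catalogue if one exists; fall back to the original strings otherwise
translation = gettext.translation('myapp', localedir='locale',
                                  languages=['fr'], fallback=True)
_ = translation.gettext

title = 'Untitled document'      # unmarked: invisible to extraction tools
print(_('Untitled document'))    # marked: extractable and translatable

The extra call and catalogue plumbing illustrate why developers who are not given an explicit internationalization requirement tend to write the unmarked form.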
To the author’s knowledge, no programming language has been designed
with default internationalization principles in mind, but this situation may
change in the future. Designing such a language would be challenging because
its capability may be limited by the environment (e.g. the operating system) on
which it would be executed. However, this prospect may be more comforting
than the situation that affects many programming languages, even very popular
ones. One good example is JavaScript, whose level of internationalization
maturity is very low (due to issues caused by legacy browsers, the lack of well-
established specifications and a multitude of tools and utilities that tend to
reinvent the wheel). This situation is problematic from multiple perspectives:
first, it may discourage developers from providing internationalized support by
default because navigating the complexity of specifications and libraries can be
an extremely daunting task. Second, it can lead to a situation where it seems
easier for a developer to come up with their own scheme, which contributes
to this unfortunate status quo. Even if the possibility of having a global
programming language does not materialize, developing a resource repository
recommending best practices in terms of internationalization per programming
language would be extremely valuable. While a working group from the W3C
consortium specializes in internationalization-related topics as far as Web
technologies are concerned, its work is mostly limited to markup languages,
such as HTML and XML, so there seems to be a gap as far as programming
languages are concerned. The Unicode consortium, whose goal is to enable
people around the world to use computers in any language, may have a role to
play in such an endeavour.
From a research perspective, certain internationalization-related questions
are still open. While the scanning of source code to detect unmarked strings is
well understood (and supported by multiple tools), detecting internationalization
issues due to the use of existing or new functions may be less straightforward to
accomplish. For instance, a developer writing a global travel booking application
may use functions that process user input. For example, a normalization function
may be used to automatically correct spelling mistakes that may have been
made by a user when searching for a destination. From an internationalization
perspective, should (and if so, how?) the developer be alerted when they write
code in a locale-specific manner? One could argue that this type of work is the
remit of quality assurance, but global efficiency gains could be realized if these
issues were taken care of during the development process.
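As a rough illustration of the kind of scanning mentioned above (a deliberately crude, regex-based sketch; real extraction tools rely on proper tokenization of the source code), the following snippet flags Python lines containing string literals that are not wrapped in a _() call:

import re

STRING_LITERAL = re.compile(r"""('[^']*'|"[^"]*")""")

def find_unmarked_strings(lines):
    findings = []
    for number, line in enumerate(lines, start=1):
        for match in STRING_LITERAL.finditer(line):
            # Flag the literal unless it appears immediately inside _( )
            if '_(' + match.group(0) not in line:
                findings.append((number, match.group(0)))
    return findings

sample = ['print(_("Saved"))', 'label = "Cancel"']
for number, literal in find_unmarked_strings(sample):
    print('line {0}: unmarked string {1}'.format(number, literal))

Detecting locale-specific behaviour hidden inside functions, as discussed above, would require far more than such pattern matching, which is precisely why it remains an open research question.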
7.3 Localization
Chapter 4 focused on the language-based localization processes affecting mostly
text present in an application’s user interfaces and its associated documentation
content. These processes fall into two categories. The first one is related to
sequential workflows involving the extraction of strings for translation before they
can be merged back into resources required to build a target or multilingual
application. The second category relies on a more visual approach so
that translation can be performed in-context. While the second approach is
currently limited to desktop or Web-based applications, it is likely to gain in
popularity, especially if mobile applications can be localized in such a way in the
future (by possibly leveraging emulators, which can duplicate the behaviour of a
mobile application, say, in a Web browser). The proliferation of cloud services is
obviously also favouring this second approach since test or staging environments
where translators can translate in-context can now be set up in seconds (instead
of weeks or days).
7.4 Translation
In Chapter 5, multiple translation technologies were discussed, since these
technologies tend to make translators more productive. Expectations around
translation turnaround times will keep becoming more aggressive, which is
to be expected because the timeliness of a translation contributes greatly to its
usefulness. Obviously, other factors such as quality are to be taken into account,
which is why the use of machine translation technology is often coupled
with a post-editing process, during which translators are expected to validate
or edit translations. Such a task is obviously very different in nature from a
translation task whose goal is to create target text that does not strictly follow
the structure of the source text. While recent advances in machine translation
have made its use ubiquitous (especially in situations when users are unwilling
to pay for any direct human intervention), human validation is still required in
situations where information accuracy is critical. Recent research efforts have
therefore focused on investigating whether it is possible to (i) identify those
parts of documents that require editing and (ii) possibly determine whether
the editing would require more effort than translating from scratch. MT quality
estimation has made progress in recent years (Soricut et al. 2012; Rubino et
al. 2013). However, more work is required to improve the accuracy of such
systems, especially as far as the second task is concerned. Without relying on
external characteristics (such as the domain knowledge of the post-editor, how
familiar and enthusiastic the post-editor is about post-editing, or even how fresh
the post-editor is when completing the task), it seems very difficult to rely purely
on textual or system-dependent features to determine how much effort would
be required. Ultimately, one could argue that it should not matter whether a
translator decides to post-edit or re-translate from scratch a segment that has
been deemed to be of insufficient quality by a prediction system. But it does
matter if the translator is not paid fairly for the amount of time spent on the
task. The compensation of post-editing is indeed a topic of debate because it is
poorly supported by the traditional model based on word count and translation
memory matching. For this reason, it is difficult to imagine how sustainable
human post-editing services offering flat fees of a few cents per word will be.6
The amount of time spent on a task seems a more appropriate way
to compensate workers. Obviously time-tracking is not without its pitfalls (e.g.
what happens when somebody takes a coffee break or answers an email about
a new task request?), but some post-editing systems, such as the ACCEPT
system, have the ability to track how much time was spent on a given segment
(Roturier et al. 2013). Another aspect of post-editing, which may require further
investigation in the years to come, is the environment where the post-editing
task is conducted. Traditional translation environments have been desktop-
based for years and have recently been challenged by Web-based environments
thanks to increased network speeds. However, mobile-based environments
are now being considered as an alternative to mature environments favoured
by professional translators. For instance, the first version of a post-editing
application specifically designed for a mobile environment, Kanjingo, was
recently tested with a few users and received positive feedback (although areas
of improvement, such as the ability to leverage auto-completion and synonym
functionality, were mentioned) (O'Brien et al. 2014).
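Returning to the question of measuring post-editing effort discussed above, a very simple illustration of the kind of signal such research builds on (a sketch based on a character-level similarity ratio, not the approach used in the systems cited above) is to compare the raw machine translation output with its post-edited version:

import difflib

def editing_effort(mt_output, post_edited):
    # 0.0 means the segment was left untouched, 1.0 means it was fully rewritten
    ratio = difflib.SequenceMatcher(None, mt_output, post_edited).ratio()
    return 1.0 - ratio

mt = 'The update can been postponed or discarded.'
pe = 'The update can be postponed or even discarded.'
print('Estimated effort: {0:.2f}'.format(editing_effort(mt, pe)))

Published quality estimation systems rely on much richer feature sets, but even such a crude proxy highlights why word counts alone are a poor basis for compensating post-editing work.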
7.5 Adaptation
Adaptation is a very generic term that can encompass many different activities
required to create a multilingual application (or transform an existing
application into a multilingual one). The emergence of transcreation as an
activity is not surprising since the competition between global and local
companies has never been as fierce. While some companies may have gotten
away with source-oriented translated messaging in the past (simply because
there was little or no competition), locally targeted or personalized messages
must now be used in order to win. As far as video content is concerned,
in-context subtitling has become a mature process. One of the next challenges
would be to investigate the feasibility of in-context dubbing, since this mode
of communication may be favoured in certain locales. Chapter 6 also showed
that adaptation is not limited to the content used to convince users to buy a
particular application or service. Some resources, which may be core to the
functionality of an application, sometimes have to be adapted in order to truly
meet (or even exceed) the needs and expectations of global users. Whether
the adaptation of functionality should be considered as a localization-related
activity (as it was in this book) or an internationalization activity is a moot
point. What matters for end-users is that the applications they have decided
to install (possibly after purchasing them) behave in a way that is consistent
with the environment in which they operate. As surprising as it may seem,
the localization literature focuses predominantly on the textual aspect of the
localization process rather than on its functional aspect. More research seems
therefore required to understand better how a global application differs from
a native application in terms of behaviour. Even though the feature lists of
competing applications can be used as a starting point to identify overlaps and
gaps, thorough functional evaluations could be envisaged in order to highlight
areas where systems or applications perform significantly differently. Since
many systems now expose functionality through APIs, it might even be possible
to semi-automate such comparative evaluations.7
Notes
1 http://cldr.unicode.org/
2 http://www.reuters.com/article/2015/02/25/us-china-tech-exclusive-idUSKBN0LT1B020150225
3 http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/
4 https://www.wakari.io/
5 http://nbviewer.ipython.org/
6 https://www.unbabel.com/
7 http://www.programmableweb.com/
8 https://www.google.ie/mobile/translate/
9 http://readwrite.com/2014/04/16/microsoft-cortana-siri-google-now
Bibliography
Adams, A., Austin, G., and Taylor, M. (1999). Developing a resource for multinational
writing at Xerox Corporation. Technical Communication, pages 249–54.
Adriaens, G. and Schreurs, D. (1992). From Cogram to Alcogram: Toward a controlled
English grammar checker. In Proceedings of the 14th International Conference on
Computational Linguistics, COLING 92, pages 595–601, Nantes, France.
Aikawa, T., Schwartz, L., King, R., Corston-Oliver, M., and Lozano, C. (2007). Impact
of controlled language on translation quality and post-editing in a statistical machine
translation environment. In Proceedings of MT Summit XI, pages 1–7, Copenhagen,
Denmark.
Alabau, V. and Leiva, L. A. (2014). Collaborative Web UI localization, or how to build
feature-rich multilingual datasets. In Proceedings of the 17th Annual Conference of the
European Association for Machine Translation (EAMT'14), pages 151–4, Dubrovnik,
Croatia.
Alabau, V., Leiva, L. A., Ortiz-Martínez, D., and Casacuberta, F. (2012). User evaluation of
interactive machine translation systems. In Proceedings of the 16th EAMT Conference,
pages 20–3, Trento, Italy.
Allen, J. (1999). Adapting the concept of ‘translation memory’ to ‘authoring memory’ for a
controlled language writing environment. In Translating and the Computer 21: Proceedings
of the Twenty-First International Conference on ‘Translating and the Computer’, London.
Allen, J. (2001). Post-editing: an integrated part of a translation software program.
Language International, April, pages 26–9.
Allen, J. (2003). Post-editing. In Somers, H., editor, Computers and Translation: A
Translator’s Guide, pages 297–317, John Benjamins Publishing Company, Amsterdam.
Amant, K. S. (2003). Designing effective writing-for-translation intranet sites. IEEE
Transactions on Professional Communication, 46(1): 55–62.
Arnold, D., Balkan, L., Meijer, S., Humphreys, R., and Sadler, L. (1994). Machine
Translation: an Introductory Guide. Blackwells-NCC, London.
Austermuhl, F. (2014). Electronic Tools for Translators. Routledge, London.
Aziz, W., Castilho, S., and Specia, L. (2012). PET: a tool for post-editing and assessing
machine translation. In Calzolari, N., Choukri, K., Declerck, T., Dogan, M. U.,
Maegaard, B., Mariani, J., Odijk, J., and Piperidis, S., editors, Proceedings of the Eighth
International Conference on Language Resources and Evaluation (LREC-2012), pages
3982–7, Istanbul, Turkey. European Language Resources Association (ELRA), Paris.
Barrachina, S., Bender, O., Casacuberta, F., Civera, J., Cubel, E., Khadivi, S., Lagarda,
A. L., Ney, H., Tomás, J., Vidal, E., and Vilar, J. M. (2009). Statistical approaches to
computer-assisted translation. Computational Linguistics, 35(1): 3–28.
Barreiro, A., Scott, B., Kasper, W., and Kiefer, B. (2011). OpenLogos machine translation:
philosophy, model, resources and customization. Machine Translation, 25(2): 107–26.
Baruch, T. (2012). Localizing brand names. MultiLingual, 23(4): 40–2.
Bel, N., Papavasiliou, V., Prokopidis, P., Toral, A., and Arranz, V. (2013). Mining and
exploiting domain-specific corpora in the panacea platform. In BUCC 2012, The 5th
Workshop on Building and Using Comparable Corpora: “Language Resources for Machine
Translation in Less-Resourced Languages and Domains”, pages 24–6, Istanbul, Turkey.
Bernstein, M. S., Little, G., Miller, R. C., Hartmann, B., Ackerman, M. S., Karger, D.
R., Crowell, D., and Panovich, K. (2010). Soylent: a word processor with a crowd
inside. In Proceedings of the 23nd Annual ACM Symposium on User Interface Software
and Technology, pages 313–22, ACM, New York.
Bernth, A. (1998). EasyEnglish: Preprocessing for MT. In Proceedings of the Second
International Workshop on Controlled Language Applications (CLAW 1998), pages 30–41,
Pittsburgh, PA.
Bernth, A. and Gdaniec, C. (2002). MTranslatability. Machine Translation, 16: 175–218.
Bernth, A. and McCord, M. C. (2000). The effect of source analysis on translation
confidence. In White, J., editor, Envisioning Machine Translation in the Information
Future: Proceedings of the 4th Conference of the Association for MT in the Americas,
AMTA 2000, Cuernavaca, Mexico, pages 89–99, Springer-Verlag, Berlin, Germany.
Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. O’Reilly
Media, Inc., Sebastopol, CA, 1st edition.
Blatz, J., Fitzgerald, E., Foster, G., Gandrabur, S., Goutte, C., Kulesza, A., Sanchis, A., and
Ueffing, N. (2004). Confidence estimation for machine translation. In Proceedings of the
20th International Conference on Computational Linguistics, pages 315–21, Association
for Computational Linguistics, Stroudsburg, PA.
Bowker, L. (2005). Productivity vs quality? A pilot study on the impact of translation
memory systems. Localisation Focus, 4(1): 13–20.
Brown, P. E., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics
of statistical machine translation: Parameter estimation. Computational Linguistics, 19:
263–311.
Bruckner, C. and Plitt, M. (2001). Evaluating the operational benefit of using machine
translation output as translation memory input. In MT Summit VIII, MT evaluation:
who did what to whom (Fourth ISLE workshop), pages 61–5, Santiago de Compostela,
Spain.
Byrne, J. (2004). Textual Cognetics and the Role of Iconic Linkage in Software User
Guides. PhD thesis, Dublin City University, Dublin, Ireland.
Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012).
Findings of the 2012 workshop on statistical machine translation. In Proceedings of the
Seventh Workshop on Statistical Machine Translation, pages 10–51, Montreal, Canada,
Association for Computational Linguistics, New York.
Carl, M. (2012). Translog-II: a program for recording user activity data for empirical
reading and writing research. In Proceedings of the Eighth International Conference on
Language Resources and Evaluation (LREC-2012), pages 4108–12, Istanbul, Turkey,
European Language Resources Association (ELRA), Paris.
Casacuberta, F., Civera, J., Cubel, E., Lagarda, A. L., Lapalme, G., Macklovitch, E.,
and Vidal, E. (2009). Human interaction for high-quality machine translation.
Communications of the ACM – A View of Parallel Computing, 52(10): 135–8.
Chandler, H. M., Deming, S. O., et al. (2011). The Game Localization Handbook. Jones &
Bartlett Publishers, Sudbury, MA.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation.
In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
(ACL ’05), pages 263–70, Morristown, NJ, Association for Computational Linguistics,
New York.
Choudhury, R. and McConnell, B. (2013). TAUS translation technology landscape
report. Technical report, TAUS, Amsterdam.
Clémencin, G. (1996). Integration of a CL-checker in an operational SGML authoring
environment. In Proceedings of the First Controlled Language Application Workshop
(CLAW 1996), pages 32–41, Leuven, Belgium.
Collins, L. and Pahl, C. (2013). A service localisation platform. In SERVICE
COMPUTATION 2013, The Fifth International Conferences on Advanced Service
Computing, pages 6–12, IARIA, Wilmington, NC.
Conati, C., Hoque, E., Toker, D., and Steichen, B. (2013). When to adapt: Detecting
user’s confusion during visualization processing. In Proceedings of 1st International
Workshop on User-Adaptive Visualization (WUAV 2013), Rome, Italy.
D’Agenais, J. and Carruthers, J. (1985). Creating Effective Manuals. South-Western Pub,
Co, Cincinnati, OH.
Deitsch, A. and Czarnecki, D. (2001). Java Internationalization. O’Reilly Media, Inc,
Sebastopol, CA.
Denkowski, M. and Lavie, A. (2011). Meteor 1.3: Automatic metric for reliable
optimization and evaluation of machine translation systems. In Proceedings of the
EMNLP 2011 Workshop on Statistical Machine Translation, Edinburgh, U.K.
DePalma, D. A. (2002). Business Without Borders. John Wiley & Sons, Inc., New York.
DePalma, D., Hegde, V., and Stewart, R. G. (2011). How much does global contribute to
revenue? Technical report, Common Sense Advisory, Lowell, MA.
Dirven, R. and Verspoor, M. (1998). Cognitive Exploration of Language and Linguistics. John
Benjamins Publishing, Amsterdam.
Dombek, M. (2014). A study into the motivations of internet users contributing to
translation crowdsourcing: the case of Polish Facebook user-translators. PhD thesis,
Dublin City University.
Drugan, J. (2014). Quality in Professional Translation. Bloomsbury Academic, London.
Dunne, K. (2011a). From vicious to virtuous cycle: customer-focused translation quality
management using ISO 9001 principles and agile methodologies. In Dunne, K. J. and
Dunne, E., editors, Translation and Localization Project Management: The Art of the
Possible, pages 153–88, John Benjamins Publishing, Amsterdam
Dunne, K. (2011b). Managing the fourth dimension: Time and schedule in translation and
localization projects. In Dunne, K. J. and Dunne, E., editors, Translation and Localization
Project Management: The Art of the Possible, pages 119–52, American Translators
Association Scholarly Monograph Series, John Benjamins Publishing, Amsterdam.
Dunne, K. J. and Dunne, E. S. (2011). Translation and Localization Project Management:
The Art of the Possible. John Benjamins Publishing, Amsterdam.
Elming, J. and Bonk, R. (2012). The Casmacat workbench: a tool for investigating the
integration of technology in translation. In Proceedings of the International Workshop
on Expertise in Translation and Post-editing – Research and Application, Copenhagen,
Denmark.
Esselink, B. (2000). A Practical Guide to Localization. John Benjamins Publishing, Amsterdam.
Esselink, B. (2001). Web design: Going native. Language International, 2: 16–18.
Esselink, B. (2003a). The evolution of localization. The Guide from Multilingual Computing
& Technology: Localization, 14(5): 4–7.
Esselink, B. (2003b). Localisation and translation. In Somers, H., editor, Computers and
Translation: A Translator’s Guide, pages 67–86, John Benjamins Publishing, Amsterdam.
Federico, M., Bertoldi, N., and Cettolo, M. (2008). IRSTLM: an open source toolkit for
handling large scale language models. In Interspeech ’08, pages 1618–21.
Federmann, C., Eisele, A., Uszkoreit, H., Chen, Y., Hunsicker, S., and Xu, J. (2010).
Further experiments with shallow hybrid MT systems. In Proceedings of the Joint Fifth
Workshop on Statistical Machine Translation and MetricsMATR, pages 77–81, Uppsala,
Sweden, Association for Computational Linguistics, New York.
Flournoy, R. and Duran, C. (2009). Machine translation and document localization
production at Adobe: From pilot to production. In Proceedings of the Machine Translation
Summit XII, Ottawa, Canada.
Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., O’Regan, J., Ortiz-Rojas, S., Pérez-Ortiz,
J. A., Sánchez-Martínez, F., Ramírez-Sánchez, G., and Tyers, F. M. (2011). Apertium:
a free/open-source platform for rule-based machine translation. Machine Translation,
25(2): 127–44.
Friedl, J. (2006). Mastering Regular Expressions. O’Reilly Media, Inc., Sebastopol, CA, 3rd
edition.
Fukuoka, W., Kojima, Y., and Spyridakis, J. (1999). Illustrations in user manuals: Preference
and effectiveness with Japanese and American readers. Technical Communication,
46(2): 167–76.
Gallup, O. (2011). User language preferences online. Technical report, European
Commission, Brussels.
Gauld, A. (2000). Learn to Program Using Python: A Tutorial for Hobbyists, Self-Starters, and
All Who Want to Learn the Art of Computer Programming. Addison-Wesley Professional,
Reading, MA
Gdaniec, C. (1994). The Logos translatability index. In Technology Partnerships for
Crossing the Language Barrier: Proceedings of the First Conference of the Association for
Machine Translation in The Americas, pages 97–105, Columbia, MD.
Gerson, S. J. and Gerson, S. M. (2000). Technical Writing: Process and Product. Prentice
Hall, Upper Saddle River, NJ.
Giammarresi, S. (2011). Strategic views on localisation project management: The
importance of global product management and portfolio management. In Dunne, K. J.
and Dunne, E., editors, Translation and Localization Project Management: The Art of the
Possible, pages 17–50, American Translators Association Scholarly Monograph Series,
John Benjamins Publishing Company, Amsterdam.
Godden, K. (1998). Controlling the business environment for controlled language. In
Proceedings of the Second Controlled Language Application Workshop (CLAW), pages
185–9, Pittsburgh, PA
Godden, K. and Means, L. (1996). The controlled automotive service language (CASL)
project. In Proceedings of the First Controlled Language Application Workshop (CLAW
1996), pages 106–14, Leuven, Belgium.
Hall, B. (2009). Globalization Handbook for the Microsoft .Net Platform. CreateSpace.
Hammerich, I. and Harrison, C. (2002). Developing Online Content: The Principles of
Writing and Editing for the Web. John Wiley & Sons, Inc., Toronto, Canada.
Hayes, P., Maxwell, S., and Schmandt, L. (1996). Controlled English advantages for
translated and original English documents. In Proceedings of the First Controlled Language
Application Workshop (CLAW 1996), pages 84–92, Leuven, Belgium.
He, Y., Ma, Y., Roturier, J., Way, A., and van Genabith, J. (2010). Improving the post-
editing experience using translation recommendation: a user study. In Proceedings of the
Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010),
pages 247–56, Denver, CO, Association for Machine Translation in the Americas.
Hearne, M. and Way, A. (2011). Statistical machine translation: A guide for linguists and
translators. Language and Linguistics Compass, 5(5): 205–26.
International, D. (2003). Developing International Software. Microsoft Press, Redmond,
WA, 2nd edition.
Jiménez-Crespo, M. A. (2011). From many one: Novel approaches to translation quality
in a social network era. In O’Hagan, M., editor, Linguistica Antverpiensia New Series –
Themes in Translation Studies: Translation as a Social Activity – Community Translation
2.0, pages 131–52, Artesis University College, Antwerp.
Jiménez-Crespo, M. A. (2013). Translation and Web Localization. Routledge, London.
Kamprath, C., Adolphson, E., Mitamura, T., and Nyberg, E. (1998). Controlled language
for multilingual document production: Experience with caterpillar technical English.
In CLAW ’98: 2nd International Workshop on Controlled Language Applications,
Pittsburgh, PA.
Kaplan, M. (2000). Internationalization with Visual Basic: The Authoritative Solution. Sams
Publishing, Indianapolis, IN.
Karsch, B. I. (2006). Terminology workflow in the localization process. In Dunne, K.
J., editor, Perspectives on Localization, pages 173–91, John Benjamins Publishing,
Amsterdam.
Kelly, N., Ray, R., and DePalma, D. A. (2011). From crawling to sprinting: Community
translation goes mainstream. In O’Hagan, M., editor, Linguistica Antverpiensia New
Series – Themes in Translation Studies: Translation as a Social Activity – Community
Translation 2.0, pages 75–94, Artesis University College, Antwerp, 10th edition.
Knight, K. and Chander, I. (1994). Automated post-editing of documents. In Proceedings
of the Twelfth National Conference on Artificial Intelligence (Vol. 1), pages 779–84,
American Association for Artificial Intelligence, Seattle, WA.
Koehn, P. (2010a). Enabling monolingual translators: Post-editing vs. options. In Proceedings
of Human Language Technologies: The 2010 Annual Conference of the North American
Chapter of the Association for Computational Linguistics, pages 537–45, Los Angeles, CA.
Koehn, P. (2010b). Statistical Machine Translation. Cambridge University Press, Cambridge.
Koehn, P., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Moran,
C., Dyer, C., Constantin, A., and Herbst, E. (2007). Moses: Open source toolkit for
statistical machine translation. In ACL-2007: Proceedings of Demo and Poster Sessions,
Prague, Czech Republic.
Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In
Proceedings of the 2003 Conference of the North American Chapter of the Association for
Computational Linguistics on Human Language Technology – NAACL ’03, pages 48–54,
Association for Computational Linguistics, Morristown, NJ.
Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M. (2009). Controlled
experiments on the web: survey and practical guide. Data Mining and Knowledge
Discovery, 18(1): 140–81.
Kohl, J. R. (2008). The Global English Style Guide: Writing Clear, Translatable Documentation
for a Global Market. SAS Institute, Cary, NC.
Krings, H. P. (2001). Repairing Texts: Empirical Investigations of Machine Translation Post-
Editing Process. The Kent State University Press, Kent, OH.
Kumaran, A., Saravanan, K., and Maurice, S. (2008). wikiBABEL: community creation
of multilingual data. In Proceedings of the Fourth International Symposium on Wikis,
WikiSym ’08, New York, NY, ACM, New York.
Künzli, A. (2007). The ethical dimension of translation revision, an empirical study. The
Journal of Specialised Translation, 8: 42–56.
Lagoudaki, E. (2009). Translation editing environments. In MT Summit XII, The Twelfth
Machine Translation Summit: Beyond Translation Memories: New Tools for Translators
Workshop, Ottawa, Canada.
Langlais, P. and Lapalme, G. (2002). TransType: Development-evaluation cycles to boost
translator’s productivity. Machine Translation, 17(2): 77–98.
Lardilleux, A. and Lepage, Y. (2009). Sampling-based multilingual alignment. In
International Conference on Recent Advances in Natural Language Processing (RANLP
2009), Borovets, Bulgaria.
Lo, C.-K. and Wu, D. (2011). MEANT: An inexpensive, high-accuracy, semi-automatic
metric for evaluating translation utility via semantic frames. In Proceedings of the
49th Annual Meeting of the Association for Computational Linguistics: Human Language
Technologies-Volume 1, pages 220–9, Association for Computational Linguistics,
Morristown, NJ.
Lombard, R. (2006). A practical case for managing source-language terminology.
In Dunne, K. J., editor, Perspectives on Localization, pages 155–71, John Benjamins
Publishing, Amsterdam.
Lutz, M. (2009). Learning Python. O’Reilly & Associates, Inc., Sebastopol, CA, 4th
edition.
Lux, V. and Dauphin, E. (1996). Corpus studies: a contribution to the definition of a
controlled language. In Proceedings of the First Controlled Language Application Workshop
(CLAW 1996), pages 193–204, Leuven, Belgium.
McDonough Dolmaya, J. (2011). The ethics of crowdsourcing. In O’Hagan, M., editor,
Linguistica Antverpiensia New Series – Themes in Translation Studies: Translation as a
Social Activity – Community Translation 2.0, pages 97–110, Artesis University College,
Antwerp, 10th edition.
McNeil, J. (2010). Python 2.6 Text Processing: Beginners Guide. Packt Publishing Ltd,
Birmingham.
Melby, A. K. and Snow, T. A. (2013). Linport as a standard for interoperability between
translation systems. Localisation Focus, 12(1): 50–55.
Microsoft (2011). French Style Guide. Microsoft, Redmond, WA.
Mitamura, T., Nyberg, E. and Carbonell, J. (1991). An efficient interlingua translation
system for multilingual document production. In Proceedings of the Third Machine
Translation Summit, Washington, DC, pages 2–4.
Moore, C. (2000). Controlled language at Diebold Incorporated. In Proceedings of the
Third International Workshop on Controlled Language Applications (CLAW 2000), pages
51–61, Seattle, WA.
Moorkens, J. (2011). Translation memories guarantee consistency: Truth or fiction? In
Proceedings of ASLIB 2011, London.
Moorkens, J. and O’Brien, S. (2013). User attitudes to the post-editing interface.
In O’Brien, S., Simard, M., and Specia, L., editors, Proceedings of MT Summit XIV
Workshop on Post-editing Technology and Practice, pages 19–25, Nice, France.
Muegge, U. (2001). The best of two worlds: Integrating machine translation into standard
translation memories. a universal approach based on the TMX standard. Language
International, 13(6): 26–9.
Myerson, C. (2001). Global economy: Beyond the hype. Language International, 1: 12–15.
Nielsen, J. (1999). Designing Web Usability: The Practice of Simplicity. New Riders
Publishing, Thousand Oaks, CA.
Nyberg, E., Mitamura, T., and Huijsen, W. O. (2003). Controlled language for authoring
and translation. In Somers, H., editor, Computers and Translation: A Translator’s Guide,
pages 245–81, John Benjamins Publishing Company, Amsterdam.
O’Brien, S. (2002). Teaching post-editing: A proposal for course content. In 6th EAMT
Workshop ‘Teaching Machine Translation’, Manchester, pages 99–106.
O’Brien, S. (2003). Controlling controlled English: An analysis of several controlled
language rule sets. In Proceedings of EAMT-CLAW-03, pages 105–14, Dublin, Ireland.
O’Brien, S. (2014). Error typology benchmarking report. Technical report, TAUS Labs,
Amsterdam.
O’Brien, S. and Schäler, R. (2010). Next generation translation and localization: Users
are taking charge. In Proceedings of Translating and the Computer 32, Aslib, London.
O’Brien, S., Moorkens, J., and Vreeke, J. (2014). Kanjingo – a mobile app for post-editing.
In EAMT2014: The Seventeenth Annual Conference of the European Association for
Machine Translation (EAMT), pages 137–41, Dubrovnik, Croatia.
Och, F. J. (2003). Minimum error rate training in statistical machine translation. In
Proceedings of the 41st Annual Meeting on Association for Computational Linguistics,
volume 1, pages 160–7, Sapporo, Japan.
Och, F. J. and Ney, H. (2002). Discriminative training and maximum entropy models for
statistical machine translation. In Proceedings of the 40th Annual Meeting on Association
for Computational Linguistics, pages 295–302, Association for Computational
Linguistics, Stroudsburg, PA.
Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment
models. Computational Linguistics, 29(1): 19–51.
Ogden, C. K. (1930). Basic English: A General Introduction with Rules and Grammar. Paul
Treber, London.
O’Hagan, M. and Ashworth, D. (2002). Translation-mediated Communication in a Digital
World: Facing the Challenges of Globalization and Localization, volume 23, Multilingual
Matters, Clevedon.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic
evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL 2002), pages 311–18, Philadelphia, PA.
Pedersen, J. (2009). A subtitler’s guide to translating culture. MultiLingual, 20(3): 44–48.
Perez, F. and Granger, B. E. (2007). IPython: a system for interactive scientific computing.
Computing in Science & Engineering, 9(3): 21–9.
Perkins, J. (2010). Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing,
Birmingham.
Pfeiffer, S. (2010). The Definitive Guide to HTML5 Video. Apress, New York.
Pym, A. (2004). The Moving Text: Localization, Translation, and Distribution, volume 49,
John Benjamins Publishing, Amsterdam.
Pym, P. J. (1990). Preediting and the use of simplified writing for MT: an engineer’s
experience of operating an MT system. In Mayorcas, P., editor, Translating and the
Computer 10: The Translation Environment 10 Years on, pages 80–96, ASLIB, London.
Raman, M. and Sharma, S. (2004). Technical Communication: Principles and Practice.
Oxford University Press, Oxford.
Ray, R. and Kelly, N. (2010). Reaching New Markets Through Transcreation. Common
Sense Advisory, Lowell, MA.
Richardson, S. D. (2004). Machine translation of online product support articles using
a data-driven MT system. In Frederking, R. and Taylor, K., editors, Proceedings of the
6th Conference of the Association for MT in the Americas, AMTA 2004, pages 246–51,
Washington, DC, Springer-Verlag, New York.
Rockley, A., Kostur, P., and Manning, S. (2002). Managing Enterprise Content: A Unified
Content Strategy. New Riders, Indianapolis, IN.
Roturier, J. (2006). An investigation into the impact of controlled English rules on
the comprehensibility, usefulness and acceptability of machine-translated technical
documentation for French and German users. PhD thesis, Dublin City University,
Ireland.
Roturier, J. (2009). Deploying novel MT technology to raise the bar for quality: A review
of key advantages and challenges. In MT Summit XII: Proceedings of the Twelfth Machine
Translation Summit, Ottawa, Canada.
Roturier, J. and Lehmann, S. (2009). How to treat GUI options in IT technical texts for
authoring and machine translation. The Journal of Internationalisation and Localisation,
1: 40–59.
Roturier, J., Mitchell, L., and Silva, D. (2013). The ACCEPT post-editing environment:
a flexible and customisable online tool to perform and analyse machine translation
post-editing. In O’Brien, S., Simard, M., and Specia, L., editors, Proceedings of the MT
Summit XIV Workshop on Post-editing Technology and Practice (WPTP 2013), Nice,
France.
Rubino, R., Wagner, J., Foster, J., Roturier, J., Samad Zadeh Kaljahi, R., and Hollowood,
F. (2013). DCU-Symantec at the WMT 2013 quality estimation shared task. In
Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 392–7, Sofia,
Bulgaria.
Savourel, Y. (2001). XML Internationalization. Sams, Indianapolis, IN.
Schwitter, R. (2002). English as a formal specification language. In Proceedings of the 13th
International Workshop on Database and Expert Systems Applications, pages 228–32.
Senellart, J., Yang, J., and Rebollo, A. (2003). Systran intuitive coding technology. In
Proceedings of MT Summit IX, New Orleans, LA.
Simard, M., Ueffing, N., Isabelle, P., and Kuhn, R. (2007). Rule-based translation with
statistical phrase-based post-editing. In Proceedings of the Second Workshop on Statistical
Machine Translation – StatMT ’07, pages 203–6, Association for Computational
Linguistics, Morristown, NJ.
Smith, J., Saint-Amand, H., Plamada, M., Koehn, P., Callison-Burch, C., and Lopez, A.
(2013). Dirt cheap web-scale parallel text from the Common Crawl. In Proceedings of
ACL 2013, Sofia, Bulgaria.
Smith-Ferrier, G. (2006). .NET Internationalization: The Developer’s Guide to Building
Global Windows and Web Applications. Addison-Wesley Professional, Upper Saddle
River, NJ.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. (2006). A study of
translation edit rate with targeted human annotation. In Proceedings of the Seventh
Conference of the Association for Machine Translation of the Americas, Cambridge, MA.
Somers, H. (2003). Machine translation: Latest developments. In Mitkov, R., editor, The
Oxford Handbook of Computational Linguistics, pages 512–28, Oxford University Press,
New York.
Soricut, R., Bach, N., and Wang, Z. (2012). The SDL language weaver systems in
the WMT12 quality estimation shared task. In Proceedings of the Seventh Workshop
on Statistical Machine Translation, pages 145–51, Association for Computational
Linguistics, Morristown, NJ.
Souphavanh, A. and Karoonbooyanan, T. (2005). Free/Open Source Software: Localization.
United Nations Development Programme–Asia Pacific Development Information
Programme.
Spyridakis, J. (2000). Guidelines for authoring comprehensible web pages and evaluating
their success. Technical Communication, 47(3): 301–10.
Steichen, B. and Wade, V. (2010). Adaptive retrieval and composition of socio-semantic
content for personalised customer care. In International Workshop on Adaptation in
Social and Semantic Web, pages 1–10, Honolulu, HI.
Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. In Proceedings of the
Seventh International Conference on Spoken Language Processing (ICSLP 2002), Denver,
CO.
Surcin, S., Lange, E., and Senellart, J. (2007). Rapid development of new language pairs at
Systran. In Proceedings of MT Summit XI, pages 10–14, Copenhagen, Denmark.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Calzolari, N., Choukri,
K., Declerck, T., Dogan, M. U., Maegaard, B., Mariani, J., Odijk, J., and Piperidis,
S., editors, Proceedings of the Eighth International Conference on Language Resources and
Evaluation (LREC’12), Istanbul, Turkey, European Language Resources Association
(ELRA), Paris.
Toker, D., Conati, C., Steichen, B., and Carenini, G. (2013). Individual user characteristics
and information visualization: connecting the dots through eye tracking. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, pages 295–304,
ACM.
Torresi, I. (2010). Translating Promotional and Advertising Texts. St. Jerome Publishing,
Manchester.
Turian, J., Shen, L., and Melamed, D. (2003). Evaluation of machine translation and its
evaluation. In Proceedings of MT Summit IX, pages 61–3, Edmonton, Canada.
Underwood, N. and Jongejan, B. (2001). Translatability checker: a tool to help decide
whether to use MT. In Proceedings of MT Summit VIII, Santiago de Compostela, Spain.
Van Genabith, J. (2009). Next generation localisation. Localisation Focus: The International
Journal of Localisation, 8(1): 4–10.
Vasiļjevs, A., Skadiņš, R., and Tiedemann, J. (2012). LetsMT!: A cloud-based platform
for do-it-yourself machine translation. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics (ACL 2012), pages 43–8, Jeju, Republic of
Korea.
Vatanen, T., Väyrynen, J. J., and Virpioja, S. (2010). Language identification of short text
segments with n-gram models. In Calzolari, N., Choukri, K., Maegaard, B., Mariani,
J., Odijk, J., Piperidis, S., Rosner, M., and Tapias, D., editors, Proceedings of the Seventh
International Conference on Language Resources and Evaluation (LREC-2010), Valletta,
Malta, European Language Resources Association, Paris.
Wagner, E. (1985). Post-editing Systran: A challenge for Commission translators.
Terminologie & Traduction, 3.
Wass, E. S. (2003). Addressing the World: National Identity and Internet Country Code
Domains. Rowman & Littlefield, Lanham, MD.
Wojcik, R. and Holmback, H. (1996). Getting a controlled language off the ground at
Boeing. In Proceedings of the First Controlled Language Application Workshop (CLAW
1996), pages 114–23, Leuven, Belgium.
Yang, J. and Lange, E. (2003). Going live on the internet. In Somers, H., editor, Computers
and Translation: A Translator’s Guide, pages 191–210, John Benjamins Publishing,
Amsterdam.
Yunker, J. (2003). Beyond Borders: Web Globalization Strategies. New Riders, San Francisco,
CA.
Yunker, J. (2010). The Art of the Global Gateway. Byte Level Research LLC, Ashland, OR.
Zouncourides-Lull, A. (2011). Applying PMI methodology to translation and localization
projects: Project integration management. In Dunne, K. J. and Dunne, E., editors,
Translation and Localization Project Management: The Art of the Possible, pages 71–94,
American Translators Association Scholarly Monograph Series, John Benjamins
Publishing Company, Amsterdam.
Index
.NET framework 49, 64, 67, 94
ACCEPT 80, 146, 190
Accept-Language HTTP header 176
access: challenge 6; to context during translation 65; to Web content 68; via the global gateway 58–60
adaptation 1, 9–10, 165; audio 169–70; functionality 57–8, 177–180; graphics 166–9; location 180–2; strings 20; textual 173–7; video 170–2
Adobe: Flash 14–15; FrameMaker 52; globalization at 129; MT post-editing at 105; transcreation at 175
AECMA 72, 74
agile 17, 95
Amara 170–1
Android 18, 186; app localization 114; App Translation Service 120–1; speech-based applications 7, 180; translation guidelines 128
Anymalign 132–3
API 118, 120, 133–4, 161, 190–1
app 1–3; global 9–10, 50–5; lifecycle 164; stores 167–8; Web 18; see also Android; iOS
Apple see iOS, OSX
application see app
ASCII: character-encoding scheme 22; non-ASCII characters 56, 60
Bash see command line
bigram see n-gram
BLEU 142
bugs 19, 151
CASMACAT 147
CAT see Computer Assisted Translation
catalog file 33–4; compilation 89; generation 86–7, 96; see also PO; RESX
CDN 180
characters: corrupted 152; display 46; escape 32–3; language 22, 107; processing 58, 100; selection of 26–7; sequences of 20; shortcut 62; syntax 99; wildcard 39; see also ASCII; encodings; hotkeys; input; ITS; tag; translation memory; Unicode
checking: appropriateness 169; controlled language 74; language 74–7, 80, 179; rule creation 81; terminology 134–5; translation quality 152–7, 161
CheckMate 152–4
CL see controlled language
CNGL 145
command line: accessing an online Bash 44; building an SMT system using Moses and 161; listing directories using 39; running a Python program from 43–4; starting a Python prompt using 40–1
compilation: code 20, 94; documentation 107; resources 61, 89, 96
Computer Assisted Translation 9; see also machine translation; post-editing; translation environment; translation memory
controlled language 71–5, 109
DBCS 22
dictionary: for segmentation 140; machine translation 136–7; normalization 137; Python 30; search 147; word form 76, 131, 135
DITA 35, 83
Django framework 50–1; internationalization 57, 61–4; merging and compilation 89; pseudo-localization 67; string extraction 86
Docbook 35, 52, 69, 107
domain name 59–60
DTD 34–7
DTP 52, 117
DVD 8
email: address of technical support contact 10; classification of 179; for communication 123; for sending documents to translation 117; templates localization 6
encodings 1, 21–6, 46, 56, 107, 181; see also ASCII; GB18030; UTF-8
EPUB 54
evaluation: frameworks 157; machine translation 141–3; post-editing 148; quality 122; translation 152
Facebook 121–2
files 8, 11, 33; see also catalog file; encodings; HTML; PO; RESX; TBX; TMX; XML
FO see XSL
formats: data exchange 119; date 57–8; file 7, 14, 22, 64, 98–9, 178; source code 112; see also files; strings; terminology
FTP 117
171; translation 86–9, 105–6, 150; writing 70–1
hardware 4, 14, 57
hotkeys 87, 90–2, 95, 112, 125
HTER 142
HTML 7, 18; annotation 157; audio 69; conversion from text to 99; Django templates 62–3; documentation 35, 50, 54–5; editing 98; HTML5 14, 69, 98; internationalization 56, 188; static Web sites using 51; translation 87–8, 105, 152, 159; for Web apps 50, 52
hyperlink 88, 108
i18n see internationalization
icon 59, 93, 107, 169
ICU 57, 101
iOS 18, 114, 121, 186
IME 57
input 2, 25, 28, 32, 56–7, 192–3
internationalization 1, 9, 49; content 68–71; Python strings 60–8; software 55–8
IP address 59
IPython 42–3
IRSTLM 141, 156
ISO code 59–60
ITS 69–70, 157
Java 64, 83, 187; see also LanguageTool
JavaScript 50–1, 188