
Localizing Apps

The software industry has undergone rapid development since the beginning of
the twenty-first century. These changes have had a profound impact on translators
who, due to the evolving nature of digital content, are under increasing pressure
to adapt their ways of working. Localizing Apps looks at these challenges by
focusing on the localization of software applications, or apps. In each of the five
core chapters, Johann Roturier examines:

• the role of translation and other linguistic activities in adapting software to
the needs of different cultures (localization);
• the procedures required to prepare source content before it gets localized
(internationalization);
• the measures taken by software companies to guarantee the quality and
success of a localized app.

With practical tasks, suggestions for further reading and concise chapter
summaries, Localizing Apps takes a comprehensive look at the transformation
processes and tools used by the software industry today.
This text is essential reading for students, researchers and translators working
in the areas of translation and creative digital media.

Johann Roturier works as a senior principal research engineer in Symantec
Research Labs. He worked for ten years in the localization industry, where he
held various positions ranging from freelance translator and linguistic quality
assurance tester to researcher and open-source project manager. His research
interests include multilingual text analysis and human factors in machine
translation.
Translation Practices Explained
Series Editor: Kelly Washbourne
Translation Practices Explained is a series of coursebooks designed to help self-
learners and students on translation and interpreting courses. Each volume
focuses on a specific aspect of professional translation practice, in many cases
corresponding to courses available in translator-training institutions. Special
volumes are devoted to well consolidated professional areas, to areas where
labour-market demands are currently undergoing considerable growth, and to
specific aspects of professional practices on which little teaching and learning
material is available. The authors are practicing translators or translator trainers
in the fields concerned. Although specialists, they explain their professional
insights in a manner accessible to the wider learning public.
These books start from the recognition that professional translation practices
require something more than elaborate abstraction or fixed methodologies. They
are located close to work on authentic texts, and encourage learners to proceed
inductively, solving problems as they arise from examples and case studies.
Each volume includes activities and exercises designed to help learners
consolidate their knowledge (teachers may also find these useful for direct
application in class, or alternatively as the basis for the design and preparation
of their own material.) Updated reading lists and website addresses will also help
individual learners gain further insight into the realities of professional practice.

Titles in the series:

Translating for the European Union Institutions 2e
Emma Wagner, Svend Bech and Jesús M. Martínez

Subtitling Through Speech Recognition
Pablo Romero-Fresco

Revising and Editing for Translators 3e
Brian Mossop

Translating Promotional and Advertising Texts
Ira Torresi

Audiovisual Translation
Frederic Chaume

Audiovisual Translation, Subtitling
Jorge Diaz-Cintas and Aline Remael

Scientific and Technical Translation Explained
Jody Byrne

Medical Translation Step by Step
Vicent Montalt and Maria González-Davies

Translation-Driven Corpora
Federico Zanettin

Notetaking for Consecutive Interpreting
Andrew Gillies

Translating Official Documents
Roberto Mayoral Asensio

Electronic Tools for Translators
Frank Austermuhl

Conference Interpreting Explained
Roderick Jones

Introduction to Court Interpreting
Holly Mikkelson

Legal Translation Explained
Enrique Alcaraz, Brian Hughes

For more information on any of these titles, or to order, please go to
www.routledge.com/linguistics
Localizing Apps
A practical guide for translators and
translation students

Johann Roturier
First published 2015
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
711 Third Avenue, New York, NY 10017
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2015 Johann Roturier
The right of Johann Roturier to be identified as the author of this
work has been asserted by him in accordance with sections 77 and 78
of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or
reproduced or utilized in any form or by any electronic, mechanical,
or other means, now known or hereafter invented, including
photocopying and recording, or in any information storage or retrieval
system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks
or registered trademarks, and are used only for identification and
explanation without intent to infringe.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested
ISBN: 978-1-138-80358-9 (hbk)
ISBN: 978-1-138-80359-6 (pbk)
ISBN: 978-1-315-75362-1 (ebk)
Typeset in Goudy
by HWA Text and Data Management, London
Contents

List of figures xi
List of listings xii
Acknowledgments xiv

1 Introduction 1
1.1 Context for this book 1
1.1.1 Everything is an app 1
1.1.2 The language challenge 3
1.1.3 The need for localization 4
1.1.4 New challenges affecting the localization industry 6
1.2 Why a new book on this topic? 8
1.3 Conceptual framework and key terminology 9
1.4 Who is this book for? 10
1.5 Book structure 12
1.6 What this book does not cover 14
1.7 Conventions 15

2 Programming basics 16
2.1 Software development trends 17
2.2 Programming languages 18
2.3 Encodings 21
2.3.1 Overview 21
2.3.2 Dealing with encodings using Python 23
2.4 Software strings 26
2.4.1 Concatenating strings 28
2.4.2 Special characters in strings 32
2.5 Files 33
2.5.1 PO 33
2.5.2 XML 34
2.6 Regular expressions 38
2.7 Tasks 39
2.7.1 Setting up a working Python environment 40
2.7.2 Executing Python statements using a command prompt 41
2.7.3 Creating a small Python program 43
2.7.4 Running a Python program from the command line 43
2.7.5 Running Python commands from the command line 44
2.7.6 Completing a tutorial on regular expressions 45
2.7.7 Performing contextual replacements with regular expressions
(advanced) 45
2.7.8 Dealing with encodings (advanced) 46
2.8 Further reading and resources 46

3 Internationalization 49
3.1 Global apps 50
3.1.1 Components 50
3.1.2 Reuse 52
3.2 Internationalization of software 55
3.2.1 What is internationalization? 55
3.2.2 Engineering tasks 56
3.2.3 Traditional approach to the i18n and l10n of software strings 60
3.2.4 Additional internationalization techniques 64
3.3 Internationalization of content 68
3.3.1 Global content from a structural perspective 68
3.3.2 Global content from a stylistic perspective 70
3.4 Tasks 76
3.4.1 Evaluating the effectiveness of global gateways 77
3.4.2 Internationalizing source Python code 77
3.4.3 Extracting text from an XML file 79
3.4.4 Checking text with LanguageTool 80
3.4.5 Assessing the impact of source characteristics on machine
translation 80
3.4.6 Creating a new checking rule 81
3.5 Further reading 81

4 Localization basics 85
4.1 Introduction 85
4.2 Localization of software content 86
4.2.1 Extraction 86
4.2.2 Translation and translation guidelines 86
4.2.3 Merging and compilation 89
4.2.4 Testing 92
4.2.5 Binary localization 94
4.2.6 Project updates 95
4.2.7 Automation 96
4.2.8 In-context localization 97
4.3 Localization of user assistance content 98
4.3.1 Translation kit creation 100
4.3.2 Segmentation 100
4.3.3 Content reuse 103
4.3.4 Segment-level reuse 104
4.3.5 Translation guidelines 105
4.3.6 Testing 106
4.3.7 Other documentation components 106
4.4 Localization of information content 108
4.4.1 Characteristics of online information content 108
4.4.2 Online machine translation 109
4.5 Conclusions 109
4.6 Tasks 110
4.6.1 Localizing software strings using an online localization
environment 110
4.6.2 Translating user assistance content 112
4.6.3 Evaluating the effectiveness of translation guidelines 113
4.7 Further reading and resources 114

5 Translation technology 116


5.1 Translation management systems and workflows 117
5.1.1 High-level characteristics 117
5.1.2 API-driven translation 120
5.1.3 Integrated translation 120
5.1.4 Collaborative translation 121
5.1.5 Crowdsourcing-based translation 122
5.2 Translation environment 123
5.2.1 Web-based 124
5.2.2 Desktop-based 125
5.3 Translation memory 126
5.4 Terminology 127
5.4.1 Why terminology matters 127
5.4.2 Monolingual extraction of candidate terms 129
5.4.3 Acquisition of term translations 132
5.4.4 Terminology repositories and glossaries 133
5.5 Machine translation 135
5.5.1 Rules-based machine translation 136
5.5.2 Statistical machine translation 137
5.5.3 Hybrid machine translation 143
5.6 Post-editing 144
5.6.1 Types of post-editing 145
5.6.2 Post-editing tools 146
5.6.3 Post-editing analysis 148
5.7 Translation quality assurance 149
5.7.1 Actors 149
5.7.2 Manual checks 151
5.7.3 Rules-based checks 152
5.7.4 Statistical checks 155
5.7.5 Machine learning-based checks 155
5.7.6 Quality standards 157
5.8 Conclusions 158
5.9 Tasks 159
5.9.1 Reviewing the terms and conditions of an online translation
management system 159
5.9.2 Becoming familiar with a new translation environment 159
5.9.3 Building a machine translation system and doing some post-
editing 160
5.9.4 Checking text and making global replacements 161
5.10 Further reading and resources 161

6 Advanced localization 164


6.1 Adaptation of non-textual content 166
6.1.1 Screenshots 166
6.1.2 Other graphic types 168
6.1.3 Audio and video 169
6.2 Adaptation of textual content 173
6.2.1 Transcreation 174
6.2.2 Personalization 175
6.3 Adaptation of functionality 177
6.3.1 Regulatory compliance 177
6.3.2 Services 178
6.3.3 Core functionality 179
6.4 Adaptation of location 180
6.5 Conclusions 182
6.6 Tasks 182
6.6.1 Understanding transcreation 182
6.6.2 Adapting functionality 183

7 Conclusions 185
7.1 Programming 185
7.2 Internationalization 186
7.3 Localization 188
7.4 Translation 189
7.5 Adaptation 190
7.6 New directions 191
7.6.1 Towards real-time text localization 191
7.6.2 Beyond text localization 192

Bibliography 194
Index 204
Figures

1.1 Components of a software application’s ecosystem 2


1.2 Conceptual framework for app globalization 10
2.1 Creating a Unicode string in Python 23
2.2 Saving a file as UTF-8 in a text editor 24
2.3 Selecting a console in PythonAnywhere 42
2.4 Writing a Python program using a text editor 43
3.1 Documentation in HTML format 55
3.2 Documentation in PDF format 55
3.3 Global gateway 58
3.4 Output of xgettext viewed in Virtaal 61
3.5 Pseudo-localized application 67
3.6 Rule violations detected by LanguageTool 75
3.7 Tagging a text to better understand checking results 76
3.8 False alarms reported on XML content 79
3.9 Creating a simple rule using regular expressions 82
3.10 Results provided by a newly created rule 82
4.1 Steps in traditional globalization workflow 85
4.2 Hotkeys in Python application using the TkInter toolkit 90
4.3 Translating strings in Pontoon 97
4.4 Translating strings interactively in Pontoon 98
4.5 Editing segmentation rules in Ratel 102
4.6 Online Pootle translation environment 113
5.1 Visualizing project progress in Transifex 119
5.2 Online translation environment in Transifex 125
5.3 Extracting candidate terms with Rainbow 130
5.4 The SymEval interface 148
5.5 Selecting check options in Transifex 153
5.6 Checking a TMX file with CheckMate 154
5.7 Configuring CheckMate patterns using regular expressions 154
5.8 Specifying data sets for the training phase 160
6.1 Application life cycle: a user perspective 164
6.2 Add-on repository for the Firefox Web browser 168
6.3 Amara subtitling environment 170
6.4 Language preferences 176
Listings

2.1 Example of Python code from the developer’s perspective 19


2.2 Reading the content of a file using Python 2.x 24
2.3 Decoding the content of a file into a Unicode string using
Python 2.x 25
2.4 Selecting specific characters from a string using their position 26
2.5 Secret game program: the developer’s view 27
2.6 Secret game program: the user’s view 28
2.7 Revised secret game program 29
2.8 Secret game program: the user’s view 29
2.9 Example of string formatting 30
2.10 Second example of string formatting 30
2.11 Final example of string formatting 31
2.12 Combining objects in Python 31
2.13 Secret game program: another user view 32
2.14 Structure of an entry in PO file 34
2.15 Example of a Docbook snippet 35
2.16 Part of a TMX file 37
2.17 Example of an XLIFF file 38
2.18 Use of wildcard to find specific files in a folder 39
2.19 Starting a Python prompt from the command line 41
2.20 Python syntax error 42
2.21 Using an IPython environment 43
2.22 Python program not found 44
2.23 Running a Python program from the command line 44
2.24 Running a Python command from the command line 45
2.25 Running multiple commands from the command line 45
3.1 Documentation in source XML file 53
3.2 Documentation transformation commands using XSL 54
3.3 Internationalization of Python code in a Django application 62
3.4 Internationalization of an HTML template in a Django application 63
3.5 Externalization of source strings 64
3.6 Bad use of string concatenation 65
3.7 Using substitution markers 66
3.8 Using the ITS translate data category 69
3.9 Using an ITS data category to specify excluded characters 70
3.10 Output of xgettext 78
3.11 Expected output 78
4.1 Viewing the contents of a locale directory 89
4.2 Use of hotkeys in a TkInter application 91
4.3 Use of hotkeys in a TkInter application 92
5.1 Extracted candidate terms with Rainbow 130
5.2 Extracted candidate terms using a custom script 131
5.3 Searching term translations in Anymalign output 133
5.4 Annotating an issue in XML with ITS local standoff markup 158
Acknowledgements

A lot of people have helped me write this book. Writing this book was an
incredible journey, so I would like to start by thanking my wife, Gráinne, for
her patience, help and support, as well as family members and friends for their
encouragement. I would also like to thank the Series editors (Dr Sharon O’Brien
and Dr Richard Kelly Washbourne) for their patience and insightful comments
throughout the process. Assistance provided by Dr Kevin Farrell during
the editing phase was also greatly appreciated. I would also like to thank the
following organizations for allowing me to use screenshots of their applications:
Transifex, PythonAnywhere, Tilde, the Participatory Culture Foundation and
the Mozilla Foundation. My special thanks are extended to all people involved
in the open-source or standardization projects mentioned in this book. Finally
I would like to acknowledge all members from the ACCEPT, ConfidentMT,
CNGL and Symantec Localization teams, in particular Fred Hollowood, for all of
the stimulating conversations on localization-related topics over the years.
1 Introduction

This introductory chapter is divided into seven sections, covering the overall
context for this book, some justifications as to why a new book is required on
the topic of localization, a brief explanation of the key terminology used, the
intended audience, an overview of the book’s structure, the scope of the book,
and the conventions used throughout this book.

1.1 Context for this book


Desktop computers were introduced in the 1980s and during that decade hardware
manufacturers and software publishers realized that in order to sell their products
in other markets or countries, they would need to adapt them so that they would
still be functional in different environments. This adaptation became known as
localization since target countries or groups of countries were also referred to as
locales. Localization was required because computers at the time relied on very
different character sets so a program written in, say, Spanish and encoded in
a Western encoding would not run properly on a Japanese operating system.
Since then, localization processes have become more sophisticated and are often
coupled with internationalization processes, which aim at preparing a product for
localization. Note that, because of the length of the words internationalization
and localization, the words are commonly shortened to i18n and l10n. These
acronyms are ‘quoting the first and last letter of each word, and replacing the
run of intermediate letters by a number merely telling how many such letters
there are.’1 Adapting a product to a specific market or locale is of course not
specific to the Information Technology (IT) industry, as any business hoping to
operate successfully on a global scale is likely to rely on transformation processes
so that their equipment, medicines or food products meet, or even exceed local
regulations, customs and expectations. In this book, however, the focus is on
software applications, which are also referred to as applications or apps.
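As a brief illustration of the abbreviation rule just described, the following minimal sketch (in Python, the language adopted later in this book; it is not taken from the original text) derives the i18n and l10n forms mechanically:

    def numeronym(word):
        # first letter + number of intermediate letters + last letter
        return word[0] + str(len(word) - 2) + word[-1]

    print(numeronym('internationalization'))  # i18n
    print(numeronym('localization'))          # l10n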

1.1.1 Everything is an app


To an end-user, an app is often limited to the visual interface they use to
accomplish a specific task. Depending on the complexity of the task(s) an app

is supposed to perform, additional components may become apparent over time.
Such components can be referred to as an application’s digital ecosystem, a non-
exhaustive example of which is presented in Figure 1.1.

Figure 1.1 Components of a software application’s ecosystem (the app surrounded by its interface strings and content, help content, conversations, marketing collaterals, and its functionality covering input, processing and output)

Most of these components should be familiar to anybody who has ever come
across a software application in their life (be it a desktop application, a mobile
application or a Web-based application). An application is obviously equipped
with an interface, which is composed of textual strings (e.g. menu items) and
content (e.g. informative content such as news items or pictures). While most
app users probably know that help content is also available, finding and using such
content is not as frequent as using the actual application’s functionality (which is
why help or support often takes place through online or physical conversations).
An application’s functionality often relies on user input which must be processed
using procedures or algorithms in order to generate some output. Depending on
the type of application, some marketing, training and sales-related content may
also be generated, but this may not be as relevant from an end user’s perspective.
Figure 1.1 shows, however, that the ecosystem surrounding an application can
become quite large if the application turns out to be a success. As far as this
book is concerned, the focus is placed on those components that make apps
different from other content types (e.g. a perfume’s marketing brochure or a drugs
information leaflet), thus requiring specific processes such as localization.
While software localization emerged in the IT sector, it is now prevalent
in other sectors, especially those that have an online presence, be it through
Web sites (which are very often indistinguishable from Web applications) or
Web services. Any online digital content that is generated by online systems
or apps, can now be subject to some form of localization in order to reach as
many users as possible. In this sense, no conceptual distinction is made in this
book between Web sites, mobile apps or desktop programs: all of these are apps,
whose digital ecosystem may vary in size depending on its user base. It should
be stressed that contributions to this digital ecosystem need not solely originate
from an authoritative app developer or publisher. App users are now increasingly
directly and indirectly taking part in various aspects of an application’s lifecycle,
ranging from funding and suggesting features to testing and writing reviews. For
instance, Kohavi et al. (2009: 177) explain that ‘software organizations shipping
classical software developed a culture where features [were] completely designed
prior to implementation. In a Web world, we can integrate customer feedback
directly through prototypes and experimentation.’ With the Web 2.0 paradigm,
publishing cycles have also been dramatically reduced, thanks to easy-to-use
online services including collaboration tools. These tools and services have
democratized the content creation process, which in turn has had an impact on
localization-related processes.

1.1.2 The language challenge


Users are spending more and more time online, thus requiring content to be
available in a language they can understand. A recent User language preferences
survey conducted upon the request of the European Commission’s Directorate-
General for Information Society and Media (Gallup Organization 2011: 7) found
that ‘a great majority of Internet users in the EU used the Internet on a daily
basis in the past four weeks: 54 per cent said they had gone online several times
a day in that time frame and 30 per cent said it had been about once a day.’ Such
figures suggest that online opportunities exist for those companies that are able to
reach users, despite potential language barriers. The area of Web localization had
already been perceived as the ‘fastest-growing area in the translation sector’ more
than a decade ago by O’Hagan and Ashworth (2002: xi) and it has never been
more relevant than today. This is no surprise considering the ever-increasing
amount of content to be translated in a very limited period of time. Even though
localization does not only involve translation, publishers are often striving for a
simultaneous publication of their information in multiple languages.
As far as multilingual Web sites are concerned, Esselink (2001: 17) warns that
‘the frequency of updates has raised the challenge of keeping all language versions
in sync (…), requiring an extremely quick turnaround time for translations.’
However, providing information before it becomes obsolete is sometimes not
possible for publishers, and some content is published exclusively in the language
in which it has been authored. Yunker (2003: 75) remarks that ‘unless the target
audience consists of only bilinguals, this approach is bound to leave people feeling
left out.’ The lack of global distribution and accessibility has been highlighted
by Pym (2004: 91) and is reflected by three types of locales: the participative
locale consists of users who are able to access information in a language they
can understand. These users are then able to act upon the information they have
accessed. The observational locale consists of users for whom it is too late to do
anything with the information they access. They are able to access it in their
own language, but by the time this information is translated, it is obsolete. The
excluded locale consists of users who are never given the chance to gain access to
information in a language they can understand.
Giammarresi (2011: 17) mentions that there are two main reasons for a
company to localize its products: a reactive approach ‘if one international customer
has expressed interest in purchasing a localized version of one of the company’s
products’ or a strategic approach ‘if the company has decided to expand into
one or more new international markets’. While this may be the case in certain
scenarios, two other reasons should also be mentioned: the interest of users (not
necessarily customers) to help localize the product (from an altruistic perspective)
and the laws and regulations that are in operation in certain countries. These
four main factors driving localization-related activities (global user experience,
revenue generation, altruism and regulations) are discussed next.

1.1.3 The need for localization


The first factor concerns the user experience. For instance, the User language
preferences survey mentioned in the previous section found that while some
users feel comfortable reading or watching Web content using a language which
is different from their native language, a majority of users expect to be able to
interact with content (search, write, manipulate) in the language of their choice
(e.g. majority of Europeans). A slim majority (55 per cent) of Internet users in
the EU said that they used at least one language other than their own to read
or watch content on the Internet, while 44 per cent said that they only used
their own language. These numbers are more or less aligned with the survey
carried out by the International Data Corporation in 2000 within the framework
of the Atlas II project. Based on the results obtained from 29,000 Web users,
they had estimated that by 2003, 50 per cent of Web users in Europe would be
likely to favour sites in their native language (Myerson 2001: 14). These findings
suggest that in order to provide a truly comfortable user experience, Web sites
should offer some language support, which may involve some form of content
localization (and possibly internationalization). Regardless of the reasons for not
fully localizing online content (time, cost, lack of resources), the consequences of
having content that is only partly localized should not be underestimated.
The second factor is revenue generation. A Common Sense Advisory report
found that a major driver for any corporate involvement in global markets is
always new revenue and market share opportunities.2 DePalma et al. (2011: 2)
report that ‘high-tech hardware and equipment makers generate more than a
quarter (27.1 per cent) of their revenue account from global markets [and] oil
and gas companies earn 23.6 per cent of their income outside the United States.’
In order to be able to compete in these markets, however, companies often have
to break the language barrier, and localize some of their content, products or
services. Since this requires an upfront investment, these companies must have
some level of confidence that this investment will be justified, or that they will
get some Return On their Investment (ROI). According to Zouncourides-Lull
(2011: 81), it is therefore common in localization projects to calculate costs
‘using parametric estimation, by applying standard rates (e.g. cost per word)
and possible revenue’. The pervasive use of translation memory technology has
greatly influenced this approach, by providing a quick way to calculate how much
it would cost to translate new or legacy content. When these ROI calculations
fail to convince executive sponsors, or when the prospect of having to manage
and support a number of languages is too daunting, language barriers remain,
and opportunities are lost. This is particularly visible with smaller companies,
which do not necessarily have the budgets or expertise to embrace localization.
For instance, a recent nationwide survey of Irish hotels has found that only 18
per cent of those sampled offer languages other than English on their Web site.3
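To make the parametric estimation mentioned above more concrete, here is a minimal sketch in Python; the base rate and the translation memory match-band multipliers are purely hypothetical values chosen for illustration, not figures from this book:

    # Hypothetical base rate and match-band multipliers (assumed values)
    RATE_PER_WORD = 0.12                              # e.g. EUR per source word
    MULTIPLIERS = {'new': 1.0, 'fuzzy': 0.6, 'exact': 0.3}

    def estimate_cost(word_counts):
        # word_counts maps a translation memory match category to a word count
        return sum(words * RATE_PER_WORD * MULTIPLIERS[category]
                   for category, words in word_counts.items())

    print(estimate_cost({'new': 8000, 'fuzzy': 3000, 'exact': 4000}))  # 1320.0

Comparing such an estimate against the projected revenue for a target locale is, in essence, what the ROI calculation described above amounts to.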
The third factor is altruism. When the opportunities described earlier are lost,
volunteers sometimes decide to contribute some of their time to localize content.
This is especially visible in the IT sector with open-source projects such as
LibreOffice.4 Altruism also applies to Non-Governmental Organizations (NGO)
who rely on motivated volunteer translators. A recent survey conducted by
O’Brien and Schäler (2010: 9) found that ‘support for [The Rosetta Foundation]’s
cause and opportunities to increase professional experience emerged as the two
greatest motivating factors’. Also, volunteer-based collaborative translation or
crowdsourcing is becoming common in large for-profit corporations. This is
especially the case when the product does not necessarily need to be released in
a timely fashion or when the ROI for localizing the product is not convincing
(but a lot of enthusiastic users are willing to contribute to the localization effort).
Finally, local laws play a very important role in determining whether and how
content should be translated or localized, including, but not limited to, language
laws, data protection laws and certification laws. In terms of European language
laws, for example, the Toubon law in France, whose full name is Law 94-665 of
4 August 1994 (Article 2), mandates the usage of the French language when a
product or service is presented, offered or described (e.g. in its user manual or
terms and conditions).5 In Ireland, the Official Languages Act of 2003 provides
a number of legal rights to Irish citizens with regard to their interactions with
public bodies using the Irish language.6
Data protection laws also play a very important role when it comes to
the handling of digital content. For instance, in Germany, the Federal Data
Protection Act (Bundesdatenschutzgesetz or BDSG) of 1 January 1978 prohibits
the collection, processing and use of personal data, unless it is explicitly permitted
by law or approved, usually in writing, by the person concerned.7 This means
applications must be localized appropriately from a functionality and location
perspective when adapted for the German market.
Finally certification laws can also have an impact on the localization of an
application. For example, the US Department of Commerce mentions that
before being sold in China, software products need to be registered at the China
Software Industry Association, and the registration approved by the Ministry of
Information and Industry. Besides, American firms cannot register their product
directly since registration must be made by a Chinese entity.8 This process is
even more stringent for the sale of enterprise encryption software since it needs
to comply with the Commercial Cryptography Administration Regulation.9
This example shows that local customs and regulations (including testing and
inspection procedures) can increase the complexity of a localization project.
While this type of example will not be treated in detail, it is a good reminder that
localization is more than just translation.
At the time of writing, the localization industry is also influenced by a number
of mega trends, including the increasing popularity of mobile platforms. These
trends are creating new challenges, which are changing the way traditional
localization is currently being done. Some of these challenges, ‘volume, access
and personalization,’ were identified in Van Genabith (2009: 4) and are briefly
reviewed in the next section.

1.1.4 New challenges affecting the localization industry


The volume challenge is caused by the large amount of content being created
online on a daily basis, not necessarily exclusively by an official application’s
publishers, but by a number of actors interacting with the application’s digital
ecosystem. This volume challenge is exacerbated by the velocity at which this
content is created or updated. Whereas application publishing cycles used to be
regular (involving substantial changes between two versions), content updates
tend to be more incremental, thus leading to a more continuous and prioritized
approach to localization. An example of prioritized localization is reported by
Airbnb’s Jason Katz-Brown, who acknowledges that Airbnb’s ‘websites and mobile
apps have 400,000 words of English content [, so they] couldn’t translate all of
it to Japanese in just a few days. It was important to prioritize so that the most
visible webpages, email templates, and core flows of the site were delightfully
localized before launch.’10
The personalization challenge refers mainly to monolingual content processing
(i.e. content may be adapted or personalized for a given user depending on their
level of expertise rather than their linguistic preferences or expectations). The access
challenge, which was touched on earlier in this chapter, is characterized by how
people get access to and interact with online digital content, increasingly using
mobile devices. For instance, this was exemplified by the fact that more Apple
iPads than Hewlett Packard PCs were sold in the last quarter of 2011.11 This
was also exemplified by the increase of worldwide sales of smartphones (around
250 million units sold in the third quarter of 2013, up 45.8 per cent from the same
quarter the year before).12 Obviously sales increases are not consistent across
all world regions, the increases being the most prominent in Asia/Pacific. This
online digital content used to be referred to as Web content, but with the advent
of mobile applications (or apps), the view that ‘we can no longer make a clear
distinction between software and content when we discuss localization’ (Esselink
2003b: 6) is perhaps more valid now than it was ten years ago. This means current
localization processes must be re-visited, taking into account the impact such
changes have on the actual translation process. Related to this challenge is the
fact that more and more interactions with devices (e.g. computers or mobile
phones) rely on non-textual methods. While it was commonplace
for consumer software applications to be accompanied by printed manuals in the
1990s, these have been largely replaced by electronic formats (e.g. HTML or
PDF) in the last decade. At the time of writing, however, it is no longer clear
whether these text-based electronic formats will still be dominant by the end
of the 2010s. Recent advances in natural language processing (including speech
recognition and speech synthesis) have allowed speech-based applications to
gain in popularity (e.g. Apple’s Siri or Google’s Voice Actions on the Android
platform). From a localization perspective, some of these applications require
new processes, since it is no longer sufficient to simply translate computer strings
in order to help an end-user use an application or read digital content. Rather,
the application must be equipped with the (local) resources (such as text, speech
and graphic) that will allow end-users to interact with content in an effective
manner.
Apart from these three core challenges, other challenges exist, such as the
way the translation process is being conducted. The concept of collaborative
translation and localization is not new since it has been used effectively for a
number of years in IT open-source projects such as Mozilla or Linux (Souphavanh
and Karoonboonyanan 2005) or not-for-profit projects (e.g. Wikipedia). However,
it is now gaining popularity in corporate for-profit environments, as exemplified
by Twitter’s Translation Center.13 Collaborative translation is sometimes difficult
to distinguish from crowdsourced translation, which tends to rely on paid
translations (for small fees) rather than free translations. However, the people
paid using this approach may not all be professional, certified translators. This
presents both a challenge and an opportunity for professional translators. The
challenge is that work which would have been performed in a professional
capacity a few years ago can now be done faster and possibly cheaper by a number
of bilingual amateurs or hobbyists. The opportunity is that the quality of these
translations cannot be guaranteed, so reviewing or management expertise may be
sought from translators. Besides, it is unlikely hobbyists will respond well to strict
deadlines, so jobs requiring well-defined turnaround times will still be allocated
to professional translators.
Finally, machine translation (MT) is becoming more and more mainstream
in the localization industry. In the 1990s translation memory became a de facto
technology, and during the 2000s globalization management systems (GMS)
gained in popularity. Over the last five years or so, the quality of (online) machine
translation systems has improved dramatically (mainly driven by the progress in
statistical machine translation). This has led individuals and organizations to
rely on such technology to provide (basic) localized content in specific scenarios.
One of the challenges for today’s and tomorrow’s translators is to come to terms
with such technology, and be aware of the customization opportunities that can
be brought to such systems to raise the quality bar further. The use of MT is of
course changing the translation process in a more dramatic way than translation
memory changed it 20 years ago. Nowadays a lot of digital content tends to be
pre-translated (using machine translation) with a view to being subsequently
post-edited by translators. Since the post-editing process is sometimes seen as
boring or tedious, a clear opportunity for translators is to become more technical
and gain expertise in areas such as text processing (i.e. the manipulation of
textual data using programmatic means) in order to have more upstream control
on the pre-translation process. By understanding better what can be achieved
through automation, it is the author’s belief that translators can then focus on
what they like doing best: translating, or being involved in a linguistic activity
(such as translation quality assurance, translation memory maintenance or MT
optimization). For this reason, this book will have a quite technical focus so that
readers can gain an insight into a number of text processing techniques. One of
the objectives of this book is to equip its readers with knowledge that may not be
necessary to perform the translation task per se, but that can provide added value
in specific circumstances.

1.2 Why a new book on this topic?


The challenges described in the previous section require new insights and
solutions, some of which have only started to emerge. At the beginning of the
twenty-first century, seminal books on localization were published, including
Esselink (2000). While that book was relevant at the time, some of its content is
no longer up-to-date, since (i) it had a strong bias towards Microsoft Windows,
which no longer reflects the diversity of platforms used in today’s heterogeneous
app-focused world, and (ii) localization processes and strategies have changed
dramatically. For example, the ‘printed documentation’ (Esselink 2000: 12) that
used to accompany a boxed software application is now a distant memory as most
software applications are now made available via digital downloads, which may
be reached using physical cards. When software applications are still distributed
in physical boxes (on a CD, DVD), the documentation is often included as a PDF
file or set of files that Esselink (2000: 12) described as ‘online help’ (where online
meant digital). This content, which contains guidance or reference materials, is
usually accessed from the application itself by triggering a specific command (e.g.
clicking a Help button). In recent times, however, this type of content has become
harder to distinguish from the actual online content that may be made available
by the application publisher via a Web site (e.g. technical support content). For
instance, the latest versions of Microsoft Office offer users the ability to search for
information using multiple content sources, including the default documentation
content present on the user’s hard disk as well as online repositories.14
Another important volume was Savourel (2001), which focused on the
eXtensible Markup Language (XML) format for a technical audience, rather than
focusing on readers with a background in translation studies. More recently, an
edited volume on localization project management was published (Dunne and
Dunne 2011), but it did not cover the translation and text processing techniques
used during the localization process. A recent ‘interdisciplinary overview of web
localization’ can also be found in (Jiménez-Crespo 2013: 1), but the main target
audience is ‘students or scholars interested in doing research in this field’, rather
than (prospective) professional practitioners. There is therefore a need for a new
book on the internationalization and localization of apps and their digital ecosystem,
focusing on new developments, such as mobile devices, and addressing these new
challenges that are impacting the localization industry.

1.3 Conceptual framework and key terminology


A common question is whether another term besides translation is required.
Indeed, why is localization different from translation? The term localization is
often used to describe a process that encompasses more than the translation of
a simple digital document (say, a Microsoft Word document) using a Computer
Assisted Translation (CAT) tool such as Wordfast.15 Other linguistic (and non-
linguistic) activities are often, if not always, required to adapt an application to
the needs of people whose main language is not the same as the one in which
the product or service was originally developed. In order to make the translation
process easier, cheaper, and to some extent more sustainable, some upstream
processes are sometimes required to prepare source content. This set of processes
is often referred to as internationalization, especially in the IT industry, where
software must ideally be internationalized before it gets localized. Without this
process, localization would still be possible, but the associated cost and effort
would increase. Some downstream processes are also often necessary, such as
a linguistic quality assurance process, to ensure that the translated product or
service has not been impacted negatively by the translation process. Esselink
(2003a: 69) also explains that localization differs from translation because of
the nature of the activities (e.g. multilingual project management, terminology
management, software testing), the technology used (e.g. software translation
tools, CAT or MT) and the complexity of the projects (large projects with
multiple file formats). Since all of these activities are related, they are often
encompassed under a more generic term, globalization. Figure 1.2 shows a possible
way to represent this conceptual framework, which is going to be used in the
remainder of this volume.
As shown in Figure 1.2, localization activities are separated into three categories.
The first category concerns the translation activities that can be effectively
enhanced and supported by a myriad of translation tools (such as translation
memory and machine translation). These activities mostly focus on the translation
of textual content, such as user interface strings and user assistance content. The
second category relates to non-translation activities, such as file processing and
testing, which are necessary to glue the output of the translation activities into
target files. The third category, which is labelled as adaptation in Figure 1.2, concerns
activities that do not belong to the other two categories. Adaptation activities
include the localization of non-textual content, such as graphics or videos, and the
translation of content that requires a very high level of transformation, possibly
resulting in ‘trans-creation’. This controversial term is defined by Torresi (2010: 5)
as ‘the rebuilding [of an] entire promotional text so that it sounds and reads both
natural and creative in the target language and culture’. Some will argue that this
definition applies to a step of a standard translation activity, but few will dispute
that it is almost never the case in practice as far as the translation of software strings
and user assistance content is concerned.

Figure 1.2 Conceptual framework for app globalization (app globalization comprises I18N, covering strings, content, formats, input and output, and functionality, and localization, which is divided into translation of strings and content, non-translation operations on strings and content plus testing, and adaptation of content, functionality, location and access)

Additional adaptation activities may be related to the actual functionality
of an application (e.g. locale-specific resources for a spell-checker may have
to be found) or its location (in the case of a Web application). It should be
noted, however, that some of these activities may sometimes be omitted from the
localization process due to budgetary reasons. While replacing source strings with
translated strings is a process that is fairly established and somewhat predictable
from a resourcing perspective, adapting functionality and location requires a
completely different set of skills and resources.
While the categorization presented in Figure 1.2 may be debatable due to the
possible overlap between two categories (e.g. is finding relevant locale-specific
details such as a technical support contact email address a translation task or
an adaptation task?), it seems adequate to present a wide range of localization
activities. While localization is not about creating a full, new product from
scratch, it is sometimes necessary to create some parts from scratch to supplement
or replace existing translated sections, in order to meet local expectations or
comply with local laws and regulations (e.g. when producing an end-user license
agreement).

1.4 Who is this book for?


This book is mostly relevant to teachers and students of written translation
or multilingual computing courses, newly graduated translators, and even
experienced freelancers. In the 2000s, Esselink (2003b: 7) predicted that
‘translators [would] be able and expected to increasingly focus on their linguistic
tasks in localization’. Since this book focuses on some technical aspects of
localization (over which freelance translators typically have little interaction
or control), one may initially wonder how translators will operationalize this
knowledge. Three main reasons can be identified.
First of all, the technology that is used during the translation process can be
quite complex and therefore difficult to master for non-technical persons. For
instance, MT and its myriad implementation details require an ability to learn
a new tool extremely quickly without necessarily impacting productivity and
quality. Translators who are comfortable updating MT systems with well-defined
linguistic assets can therefore add value to an MT-powered localization process
(which may or may not require post-editing).
Second, it is now expected that large volumes of content will be translated
and checked in limited timeframes, sometimes by having reviewers focus on the
most relevant part of this content. For instance, it may be more critical to check
the accuracy of warnings sections in installation guides rather than sections
describing applications’ use cases. Time pressures in translation delivery have
been identified as an important factor by Dunne (2011a: 120), who argues that
‘in the current market for translation and localization services, time is arguably
the most critical constraint’. The increasing use of (semi-)automated data-driven
approaches during the translation quality assurance process suggests that both
technical and linguistic skills are required to identify those content sections that
are worth spending time on. Manually reading a document from beginning to end
is no longer practical so new strategies are called for.
Third, the technical complexity of the translation process in localization
projects is exacerbated by the amount of people involved. This is becoming
particularly apparent when people with no formal translation background
or expertise are involved (e.g. crowdsourcing). If translation consistency is a
requirement, challenges can be expected when harmonizing terminology or style.
Again, effective strategies to quickly check and edit large amounts of content are
desirable.
A recent conversation on a public LinkedIn group, however, suggests that it
is difficult for some translators to find good training on technical topics related
to translation.16 This situation may be worsened by the fact that translators
rarely receive source files to work on. It may be true that in distributed, out-
sourced workflows (where multiple intermediaries exist between the client and
the translators) translators do not generally receive software resource files or
even user assistance source files. Instead, they receive project files containing
translatable text. However, it may be premature to conclude that the era of
versatile, technically savvy translators has passed, especially when translators
work directly with clients or when the content to be translated is of a technical
nature.
This book has therefore two types of audience in mind: professionals and
volunteers. It is written with accessibility in mind so it can be used as a resource
for newly graduated translators who have not received specific training on the
localization of software applications and who wish to specialize in this field when
starting their professional career. It will also be useful for freelance translators
specializing in other fields, and who wish to start translating digital content
(such as software products or Web sites). Finally, other professionals working
in the field of digital content management (such as technical communicators,
app developers or program managers) might also benefit from reading this book.
While these professionals would not be responsible for translating the content
they produce or manage, they would benefit from being aware of the challenges
that have to be resolved downstream. The other audience this work concerns are
translation volunteers, specifically technically or linguistically savvy individuals
who are involved in non-profit work that requires both internationalization and
localization activities (e.g. NGOs, open-source projects). The examples chosen
in this book actually have a very strong bias towards open-source technology as a
way to give back to the overall open-source community.

1.5 Book structure


The rest of the book is divided into five main chapters and a conclusion. As is
customary in this book series, all main chapters include tasks so that readers can
actively practice what they have learnt using hands-on exercises.
Chapter 2 focuses on professional practice in order to give readers an overview
of some of the technical skills required to be well-equipped when working
in the software localization industry. To become a good domain-specialist
translator, knowing at least two natural languages and being able to translate
well between them is not sufficient. Chapter 2 addresses this gap by introducing
basic programming concepts, including text processing ones, so that translators
become more comfortable working with fragments of the programming code
that is used to write apps. Chapter 2 provides a brief description of software
development concepts, programming languages, encodings, strings, files and
regular expressions. In order to illustrate some of these concepts with examples,
the Python programming language is used. Python is a popular language, which
is often described as being easy-to-use, especially for non-programmers.17 An
important characteristic of the Python programming language, however, is that it
is currently going through significant changes. Like other programming languages
(and natural languages to some extent), the language has evolved over the last
number of years to take into account new user requirements. Such requirements
have introduced some compatibility issues that are preventing certain users from
upgrading to the latest version of the Python programming language. This means
that two versions of the language have to co-exist for the foreseeable future. In
this book, I decided to focus on version 2.x (where x corresponds to a minor
version number such as 6 or 7) instead of version 3.x. This choice is mainly
motivated by the fact that several libraries or frameworks (which are collections
of existing code functionality) only work with version 2.x. Even though the older
version is used in this book, it does not mean that the topics covered in this book
will be obsolete any time soon – support for version 2.x has actually recently been
extended to 2020.18
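As a brief illustration of the kind of incompatibility alluded to above (a minimal sketch, not one of the book’s listings), the way text is represented differs between the two versions, which matters a great deal when handling multilingual content:

    # -*- coding: utf-8 -*-
    # Under Python 2.x a Unicode string needs an explicit u prefix; under 3.x
    # every string is Unicode by default, so the prefix is merely redundant.
    text = u'café'
    data = text.encode('utf-8')   # bytes, ready to be written to a file

    print(type(text))   # <type 'unicode'> in 2.x, <class 'str'> in 3.x
    print(repr(data))   # 'caf\xc3\xa9' in 2.x, b'caf\xc3\xa9' in 3.x

Chapter 2 returns to encodings and Unicode strings in more detail.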
Chapter 3 focuses on the internationalization issues and solutions listed under
I18N in Figure 1.2. According to the terminology used earlier (Pym 2004: 91),
the localization process attempts to transform excluded locales into participative
locales rather than observational locales. Since this process might involve the
delivery of content in multiple languages, content owners must plan ahead to
ensure they can quickly cater for all their multilingual customers. This challenge,
which is commonly associated with internationalized design principles, is
addressed from three different angles in Chapter 3. The first part of this chapter
introduces concepts that are related to the creation of a global application,
by using a concrete Web application example. Section 3.2 then presents the
challenges involved during translation and quality assurance when the software
content itself is not internationalized (e.g. problems with text clippings, string
concatenation, etc.). Finally, section 3.3 examines how other content types (e.g.
user assistance content) can be internationalized as well to ease the translation
process (from a time and cost perspective). User assistance is used throughout this
book to refer to the informative, textual content that companies or developers
produce to document their products or services (including release notes, user
guides, tutorials, FAQs and technical support documentation). Poor source
quality (e.g. ungrammatical or ambiguous sentences) generates queries during
the translation process and culturally-specific content may be equally difficult to
translate (e.g. casual style, irony). Some strategies are therefore sometimes put
in place to ensure that user assistance conforms with terminological or stylistic
guidelines (Kohl 2008).
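The string concatenation issue mentioned above can be illustrated with a minimal Python sketch (an assumed example rather than one of the book’s listings): building a sentence from fragments forces translators to work with partial strings, whereas a single string with a substitution marker can be extracted and reordered freely:

    count = 3

    # Problematic: the sentence only exists as fragments, so it cannot be
    # extracted as one translatable string, and word order cannot be changed.
    message = 'You have ' + str(count) + ' new messages'

    # Friendlier to localization: one complete string with a named placeholder.
    message = 'You have %(count)d new messages' % {'count': count}
    print(message)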
Chapter 4 introduces basic localization processes, focusing both on translation
and non-translation activities that are related to textual content (as defined
in Figure 1.2). Some of the translation challenges associated with various
content types are highlighted. For instance, specific usability issues arise when
working on mobile platforms: should abbreviations be used when translating
software strings? Software strings, which are covered first, fall into the category
of presentation content that developers produce to make their apps usable.
The second part of this chapter focuses on the translation of user assistance
content, with an extensive discussion on the role of translation guidelines and
automation.
Chapter 5 provides a break between the two localization-oriented chapters,
Chapter 4 (localization basics) and Chapter 6 (advanced localization). In this
chapter, the main discussion relates to the translation technology that is used
to support some of the activities introduced in Chapter 4, including translation
management systems, translation environments and terminology tools. An
important discussion on machine translation then follows, presenting the
differences between MT resource building (whereby an MT engine is optimized
for a subsequent, possibly indirect translation process) and MT post-editing (as a
direct translation process). Finally, strategies and standards for translation quality
assurance within localization are discussed.
Chapter 6 takes up where Chapter 4 left off, by covering the third category
of localization activities, which were presented in Figure 1.2 under adaptation.
In the first section of this chapter, the adaptation of non-textual elements, such
as graphics and videos, is briefly discussed. The second section of this chapter
provides an overview of various textual transformations, involving advanced
localization or adaptation processes. This section contains a brief account of how
personalization is slightly changing the way localization is being conducted. The
third section focuses on the challenges encountered when the actual functionality
of an application has to be adapted to work in a consistent manner across various
languages (e.g. by adapting natural language resources such as lists of stopwords
required by applications such as grammar checkers or voice commands). Some of
these locale-specific challenges supplement the engineering-related requirements
listed by Giammarresi (2011: 40) (e.g. import/export methods, text wrapping,
searching, etc.). The final section of this chapter focuses on the adaptation of
an application’s location in order to address issues related to user experience and
local regulations.

1.6 What this book does not cover


Many activities can fall within the scope of globalization when a software
application or service is made available in new markets. For instance, setting up
a local support team (to help users) or a finance team (to process revenue) can
be described as global operations. In this book, however, the terms globalization
and localization are strictly restricted to activities that are directly connected to
the application and its digital ecosystem, excluding any discussion of hardware-
related issues.
This book focuses on some aspects of digital content internationalization
and localization, with a strong emphasis on textual content. Both topics are
covered from a content processing perspective rather than a project management
perspective, since the latter is covered extensively in Dunne and Dunne (2011).
Due to space constraints, not all content types (especially proprietary multi-modal
formats such as Adobe Flash) can be covered. Also, it is not clear whether
such proprietary technology (which is currently in use to publish videos) will still
be necessary to publish or consume multi-modal content in the future, especially
with the advent of open technologies such as HTML5.19 Nor can all textual genres
be covered in this book. Specifically, video games will not be discussed,
since this genre is covered in Chandler et al. (2011). The translation of marketing
content is discussed in detail in Torresi (2010) so it will be only briefly mentioned
in Section 6.2. Due to the fragmented nature of the localization industry, as well
as the number of technologies serviced by this industry, it is worth highlighting
that this book alone will not be the key to professional success. It is the author's
hope, however, that some of the concepts introduced in this book can be applied
effectively in some of the aforementioned situations that have not been covered
in detail.
1.7 Conventions
Several conventions have been used throughout this book. Unusual words,
terms or examples are clearly marked with the use of italics. Characters or phrases
that have a specific meaning in a programming context (say, in Python code)
are identified with the use of bold. Links to various resources (such as tools
or specific articles) are provided in endnotes. Due to the large number of links
provided, the last access date (15 July 2014) applies to all links. A basic Web site
was also set up to act as a companion to this book (e.g. to provide a list of errata
and links to various resources, including the code snippets used in this book, so
that unnecessary typing is avoided).20, 21

Notes
1 http://www.gnu.org/software/gettext/manual/gettext.html#Concepts
2 http://www.commonsenseadvisory.com/AbstractView.aspx?ArticleID=1416
3 http://www.cipherion.com/en/news/243-more-irish-hotels-catering-for-non-english-speaking-tourists
4 http://www.libreoffice.org/community/localization/
5 http://bit.ly/x3NmJH
6 http://www.culturalpolicies.net/web/ireland.php?aid=519
7 http://www.culturalpolicies.net/web/germany.php?aid=518
8 http://1.usa.gov/1wzTgsX
9 http://www.oscca.gov.cn/index.htm
10 http://nerds.airbnb.com/launching-airbnb-jp/
11 http://www.telegraph.co.uk/technology/apple/9039008/Apple-iPad-outselling-HP-PCs.html
12 http://www.gartner.com/newsroom/id/2623415
13 http://translate.twttr.com/welcome
14 http://support.microsoft.com/
15 http://www.wordfast.net/
16 http://www.linkedin.com/groups/Why-is-so-difficult-find-44105.S.42456766
17 http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
18 http://hg.python.org/peps/rev/76d43e52d978
19 A recent announcement by Adobe in fact confirmed it is stopping the development of
   its Flash Player plug-in for mobile devices, since the alternative HTML5 technology
   is universally supported: http://blogs.adobe.com/conversations/2011/11/flash-focus.html
20 The code (including commands) is provided 'as is', without warranty of any kind,
   express or implied, including but not limited to the warranties of merchantability,
   fitness for a particular purpose and non-infringement. In no event shall the authors
   or copyright holders be liable for any claim, damages or other liability, whether in an
   action of contract, tort or otherwise, arising from, out of or in connection with the
   code or the use or other dealings in the code.
21 http://localizingapps.com
2 Programming basics

Linguistic skills alone are not sufficient to be a good translator. Domain
expertise is almost as important, which is why successful translators are often
people who have worked in a specific industry before deciding to change career.
As far as software (application) localization is concerned in the Information
and Communications Technology (ICT) sector, excellent technical skills are
required. These skills include both standard computing skills (being able to
operate in multiple environments or being able to adapt very quickly to a new
set of features provided by a translation tool) and automated text processing
skills (being able to manipulate large quantities of digital text rapidly). Without
a passion for technology, it would probably be very difficult to have a successful
career as a professional translator in the software localization industry. This is
due to two main reasons: first of all, the content to translate is often technical,
so some familiarity or expertise with the domain in question is required. Second,
the tools and processes used in this industry change at a frenetic pace. This
means that translators need to stay informed of the latest developments in such
tools and must be ready to switch from one version to another if it is required
for a specific job. Examples of translation-related tasks include being able to
master a new online translation memory application, revising texts that have
been produced by a community of non-professional translators, or customizing
a machine translation system for a specific content type. In short, the modern
translator in the localization industry must be ready to adapt extremely quickly
to new situations.
This chapter is divided into five sections whose objectives are to contribute
towards the acquisition of some of the technical skills mentioned previously.
The first section is a brief overview of the trends that are currently affecting
the development of software applications (or apps). This section is followed by
a brief introduction to programming languages, encodings, software strings and
files in order to equip the reader with the minimal technical knowledge required
to feel at ease when working on the localization of an application’s components.
In these sections the focus is on the Python programming language. While basic
concepts of the Python programming language are introduced in this chapter,
the level of content may not be sufficient to learn the language from scratch. It
is therefore recommended to use an online learning environment (such as the
Python track of Codecademy) to complement the reading here.1 The last section
of this chapter introduces advanced text manipulation techniques using regular
expressions, which can be extremely useful when dealing with large volumes of
digital, textual files. Most of the examples provided will involve working with a
command-line environment, which is very different from a mouse-based graphical
environment. Readers who have never used such an environment before may
want to familiarize themselves with an online tutorial first.2, 3

2.1 Software development trends


One of the current trends in the software industry is the fact that desktop-based
applications are now increasingly being replaced by cloud-based services which
can be accessed by client applications such as Web browsers. These cloud-based
services rely on a software delivery model where software and associated data
are centrally hosted on remote computers (servers), known as the cloud. While
hosted mail services, such as Yahoo! Mail, have been available for many years,
scalable collaborative document editing or sharing services such as Google
Docs or Dropbox are more recent and compete directly with traditional desktop
products such as Microsoft Office which have traditionally been the reason
why people bought PCs in the first place. This trend is influencing the way
applications are being made available to end-users. In the past applications used
to be released on specific dates (say, every year) following a waterfall development
model. This model was based on a sequential process, whereby progress was based
on multiple phases, including requirements gathering, design, implementation,
checking, release and support. This meant that end-users had to install new
versions whenever a major release was made. Increasingly, however, Web-based
applications are being updated much more frequently in an incremental manner,
sometimes several times daily, which means that end-users do not have to worry
about installing new versions. With this iterative and incremental model, they
always benefit from the latest features, which may not have gone through a full
verification process. Software developers can, however, learn from usage and
improve features in the next cycle. Dunne (2011b: 116) defines the iterative model
as ‘a series of iterations, or short development cycles, each of which [building]
on an incomplete solution and [bringing] it a step closer to being a complete
solution’. This iterative model is part of many software development frameworks,
such as Agile Development, which is built on a flexible response to change. Agile
software development encompasses a group of software development methods
based on iterative and incremental development, where requirements and
solutions evolve extremely rapidly. The evolution of such software development
practices has had an impact on localization and translation practices, both from
a planning and resourcing perspective.
Another trend is that software publishers now also have to target a variety
of platforms, whereas in the past they may have focused on one or two. While
Windows-based Personal Computers (PCs) used to be the dominant consumer
computing environment in the 2000s, new environments have recently become
popular, especially on mobile devices (such as tablets or smartphones). The
fragmentation of the market means that new operating systems, such as Android
and Apple’s iOS have changed a landscape that used to be dominated by Microsoft
when it came to commercial software. One solution to this problem is to develop
a Web site (or Web application) which can be accessed by any Web browser.
This means that regardless of the type of operating system used on the desktop
(Windows, Linux or (Mac) OSX) or on a mobile device (Android, Windows
Phone, iOS), users should be able to access this site or application using a Web
browser of their choice. Of course, differences exist between browsers so pages will
not always render in the same way, and some users may not be able to access all
functionalities or information of a given Web application. To work around these
issues, software publishers have to create native applications, which are optimized
for specific platforms. Targeting multiple platforms means development costs will
increase, so a staged approach is often used: one platform is targeted first and
others are added afterwards. In a way, this process is similar to the way localization
is conducted. A program may first be released in a single language, and once it has
proved popular or successful, other languages may be added.
As far as software development is concerned, projects tend to use at least
one programming language to create an application that encapsulates all its
functionalities. Web-based projects also rely heavily on a markup language, such
as the Hypertext Markup Language (HTML), to create the visual interface that
the end-user is presented with to consume content (e.g. Web site of an online
newspaper) or achieve something (e.g. communicate with a friend using a chat
application). More often than not, these applications contain pieces of text
(known as text strings) that allow users to navigate from one page to another.
While software publishers can offer any interface to their end-users, they often
tend to rely on existing concepts and best practices to make sure their users will not
be (too) confused when using the application for the first time. This is obviously
even more important these days, when it is so easy to remove applications and
install another one if the first impression is not positive.

2.2 Programming languages


Existing programming concepts and practices vary from one operating system
to another (say from Windows to Linux) and this variance is reflected in the
way similar names are being given to slightly different concepts. Naming is of
course extremely important, since these textual strings guide users through an
application, so it is worth pausing for a moment to reflect on the origin of some
of these names. As mentioned earlier, programming languages are employed to
create software applications, and these languages are used to create functionality
within the application. The languages are used to write pieces of code that will
be executed – or put into motion – by the computer when the user clicks on
specific items in a graphical environment or enters specific commands in a
command-line environment. Programming languages rely on a set of pre-defined
keywords but programmers (also known as developers) are often free to come up
with any name to keep track of the functionality they create. For example, a set
of programming commands is often grouped together in a function (also known
as a method), which must be named. Programmers tend to pick names that are
self-explanatory for such functions, and these (short) names are sometimes kept
to label the menus, windows or buttons in the interface. Once these names are
used in the interface, they may have to be documented (in a readme file, in a
help file or in a screencast), so their usage becomes more frequent, and they may
end up becoming general words in the long term. For example, one can think of
the verb debug, which may have originated from a function called debug designed
to find problems (or bugs) with a specific piece of code, before becoming a word
commonly used to refer to the process of finding problems with any program.
There are a lot of programming languages in existence, and new ones are
probably being created as you read this. Like natural languages, it is obviously
impossible for any individual to master all of them, but it is possible to learn key
concepts and be able to get some understanding of what the code is supposed
to achieve. From a localization perspective, this myriad of languages (and their
individual characteristics) means that both tools and humans must be able to
handle those elements which require linguistic adaptation. In order to understand
this last statement better, it may be useful to examine the code snippet from
Listing 2.1 to see what a simple program is (from the programmer's
point of view) and clarify along the way which of its elements require linguistic
adaptation. This example is written using the Python programming language,
which was selected based on reasons mentioned in Section 1.5.
In the example provided in Listing 2.1, each line contains an instruction
which will be interpreted in sequence by the Python interpreter. The Python
interpreter is the program that ultimately transforms these human-readable
statements into binary code, which is used to represent computer processor
instructions using the binary number system’s two digits (0 and 1). If we look
at these three lines of Python code, we realize that actually most of these words
are English words. Obviously some of these words are being used with a meaning
that is different from their standard dictionary meaning: for instance, it is highly
unlikely on the first line that re actually refers to a musical note. Instead, it
is an abbreviation for regular expressions, which is a set of text manipulation
techniques that will be introduced in Section 2.6. The first line specifies that the
functionality that allows for regular expressions to be used must be imported into
the current program. The word import is actually a reserved Python keyword,
which means that it will have a special and consistent meaning across all Python
programs. The name name on the second line is not a Python reserved keyword,
but this identifier (also known as a variable) is used here to refer to a data object

1 import re
2 name = "Johann"
3 print "Hello from " + name #print text to standard output

Listing 2.1 Example of Python code from the developer’s perspective


of some type. Gauld (2000) uses the following analogy to define a variable: ‘Data
is stored in the memory of your computer. You can liken this to the big wall full
of boxes used in mail rooms to sort the mail. You can put a letter in any box but
unless the boxes are labelled with the destination address it’s pretty meaningless.
Variables are the labels on the boxes in your computer’s memory.’ In the same
way that functions should be given meaningful names, variable names should be
chosen carefully. Using short names, such as individual letters (e.g. s or t instead
of substitution_name or timing), may save some initial typing time but it is likely to
create understanding problems when the code is read by a different person (or by
the same person a few months later).
In some programming languages, every identifier used has type information
declared for it (indicating whether it contains a string, like “Johann” or a number
like 3). This is not the case with languages such as Python or Perl, which means
that their programs are more compact. As a convention, strings in Python are
defined as sequences of characters that start and end with a single or double
quote. When used in this context these quote characters have a special meaning
since they allow the interpreter to identify where a string begins and ends. Finally
the third line contains a statement that instructs the program executing this piece
of code to print a message. In this context, the reserved keyword print does not
refer to a printer, but to the standard output channel used by the program to write
its output data. In command-line programs, this channel is usually a text terminal
(computer console) or a file.
From a linguistic adaptation perspective, some elements are translatable
(because they are destined to be read by humans who may prefer to read and use
a language other than the source language, e.g. English) while others are non-
translatable (because they are destined to be interpreted by a program, which only
understands a limited set of instructions). For example, the elements in the first
line are non-translatable because this line instructs the program to use specific
resources. On the second line only one element is translatable, the string Johann.
Obviously this is a special case because this string refers to a proper name, which
may be left untranslated during the translation process. But if the name variable
was changed from Johann to world, then this value would most likely have to be
translated. The third line contains a mix of non-translatable elements (print,
+ name) and translatable elements (with the string “Hello from”). The final part
of the line (after the # character) is a comment that explains what the code
does. Such comments are often used by programmers either to (i) inform other
programmers about a certain piece of functionality (especially in large projects
involving multiple programmers) or (ii) leave notes for themselves so that they
can remember what the code does a few months (or weeks, or even days) later.
From a localization perspective, these comments can be either considered as
translatable or non-translatable depending on how the code is going to be used.
For instance, if this piece of code was provided as an example to a customer
to showcase a particular functionality, it might be advantageous to have the
comment translated (depending on the customer’s linguistic preferences). On the
other hand, if this piece of code was released in a compiled form to a customer,
translating the comment would not make much sense because the comment
would never be seen. If all of this sounds a bit complicated, do not worry since the
tasks at the end of this chapter will show you the steps required to run your first
Python program. The same comment applies to some of the examples provided in
the next sections. While they might be hard to follow in places for readers with
limited technical skills, they will be clarified by the end-of-chapter tasks.

2.3 Encodings
This semi-technical section is divided into two parts. The first one provides a
general overview of encodings, including a discussion on popular encoding
formats. The second part provides some hands-on examples on how to deal with
encodings using the Python programming language.

2.3.1 Overview
The previous section touched on key programming concepts, including
statements, variables and strings using as an example a high level programming
language, Python. In order to tackle more complex concepts, such as text
file manipulation, some clarification must be provided around the concept of
encoding. According to Wikipedia, an encoding ‘consists of a code that pairs
each character from a given repertoire with something else, such as a bit pattern
(…) in order to facilitate the transmission of data (generally numbers or text)
through telecommunication networks or for data storage’.4 In the very simple
example provided earlier in Listing 2.1, the data used were already present in the
program itself (e.g. “Johann”). Most of the time, however, the data that should
be manipulated comes from external sources, such as files. In these cases, it is
important to know what the encoding of these files is in order to process the
data accurately. This task may sound trivial because most programs (such as word
processing programs or text editors) often guess the encoding of files when they
open them. But when you are working with a programming language, you often
have to specify which encoding should be used. Before presenting how encoding
works in the Python programming language, the next section provides additional
background information on the concept of encoding, based on content found in
two comprehensive online resources.5, 6
Encodings must be understood in order to avoid character corruption issues
in localization projects. Such issues can occur when the original program does
not accommodate encodings other than the one used in the source language.
In such cases, there is very little a translator can do. However, an issue can also
occur when files are manipulated by a large number of stakeholders, including
people (such as translators) and systems. If a file is saved in an encoding that
differs from what is specified in localization guidelines, problems may occur later
on in the localization workflow. When dealing with multilingual text, problems
related to encodings must be addressed. As mentioned in the previous section,
computers only understand series of bits (1 or 0). These bits are grouped in bytes,
a byte being a group of precisely 8 bits used to encode a single character of text
in a computer. Most humans only understand a few natural languages, which
consist of a number of characters, possibly using a number of alphabets. For
instance, Japanese speakers will be familiar with ideograms (Kanjis), but will also
rely on phonetic syllabaries (such as Katakana and Hiragana) to express certain
words (such as loan words or function words). Foreign words (such as English
words) may also sometimes occur in the middle of Japanese text, so all of these
characters must be representable in a common format so that information can
be smoothly exchanged between a computer in Japan and another computer,
say, in Germany. These days, the Unicode standard allows for the exchange of
such multilingual information using a number of encodings.7 For example, the
Universal Character Set Transformation Format 8-bit (UTF-8) encoding is now
the preferred encoding for Web pages.8
The situation was, however, very different years ago, when computers were not
networked (and thus encoding incompatibilities were far less frequent). In order to
translate bytes (which do not have any meaning by themselves) into characters, a
convention is required. For example, the alphabet used by the English language relies
on a limited number of characters, which, for many years could be encoded using
a small, compact code called ASCII (American Standard Code for Information
Interchange). This code assigns a single byte to a specific character (for example,
66 for the upper case letter B). Similar codes existed for other languages, but each
was only good for representing one small slice of human language. For example,
8859-1 offered full coverage for languages such as German or Swedish but only
partial coverage for French (since characters such as œ were missing).
Besides, while this approach works well for languages that rely on a limited
number of characters (fewer than 256), it fails for those that require thousands
of characters (such as Chinese and Japanese). In Japan and China, this problem
was solved by the DBCS system (the double byte character set) in which some
letters were stored in one byte and others took two. These DBCS encodings then
evolved into multi-byte character sets (such as Shift-JIS, GB2312 and Big5)
which fall outside of the Unicode code page. The latter contains more than
65,536 possible characters.
In Unicode, a letter maps to a code point, which is a theoretical concept. For
every alphabet, the Unicode consortium assigns every letter a special number of
the form U+0639. This special number is called a code point and Unicode has
capacity for 1.1 million code points. While 110,000 of these are already assigned,
there is room to handle future growth. These code points must, however, be
encoded into bytes to be understood by computers. Multiple encodings exist,
including the traditional two-byte method called UCS-2 (because it has two
bytes) or UTF-16 (because it has 16 bits). A third encoding is the popular new
UTF-8 standard, which was mentioned earlier; its popularity is partly due to the
fact that it is backward-compatible with ASCII. Unicode code points can also be
encoded in legacy encoding schemes, but with the following caveat: some of the
letters might disappear. If there is no
equivalent for the Unicode code point in the target encoding, a question mark
? may appear instead.
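To see this behaviour in practice, the following minimal Python 2 sketch (using the French word œuvre purely as an illustration) encodes a Unicode string into ASCII with the replace option, a legacy target encoding that has no equivalent for the œ character:

>>> print u"\u0153uvre".encode("ascii", "replace")
?uvre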

Figure 2.1 Creating a Unicode string in Python

2.3.2 Dealing with encodings using Python


Now that the concept of encoding has been introduced, this section presents
strategies on how to deal with encodings using the Python programming
language. In this section, most examples can be typed (or copied and pasted)
and executed using an environment such as the PythonAnywhere online
environment, which is described in more detail in Section ‘Setting up a remote
working Python environment’.9 If you are interested in reproducing (some of)
the examples provided in this section, you may decide to take a look at this task
before continuing any further. Let’s start with an example using the Japanese text
shown in Figure 2.1.10
In the 2.x series of Python (which is used throughout this book), two different
string data types exist. A plain string literal produces a str object, which stores
bytes. Using the u prefix, however, produces a unicode object, which stores
code points. The construct \u can be used to insert any Unicode code point in a
Unicode string, as shown on the first line of Figure 2.1.
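Since Figure 2.1 reproduces a screenshot of an interactive session, a minimal sketch of what an equivalent session might look like is given below (assuming the Japanese word ありがとう, 'thank you', is the text shown in the figure):

>>> unicode_string = u"\u3042\u308a\u304c\u3068\u3046"
>>> print unicode_string
ありがとう
>>> len(unicode_string)
5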
In this example, a Unicode string containing five code points is created and
and stored in a variable called unicode_string. When this string is printed on
screen with the print statement on the second line, the output is the word
corresponding to Thank you in Japanese. It is also possible to check the length
of this object by using the len function on the following line. This function
produces 5 as expected because there are five actual characters (code points)
in the string. Let’s now take a look at a slightly different example, this time by
reading the contents of a file containing the same text. This time, we create a
text file containing this Japanese word and we save it as myfile in UTF-8 format
(to make sure the Japanese characters are saved properly), as shown in Figure 2.2.
Let’s now turn to the Python code that can be used to read the content of
the file we have just created. Listing 2.2 contains the statements that allow us to
achieve this objective. The format of this code snippet, which is going to be used
extensively throughout this book, is slightly different from the one presented in
Figure 2.1. Listing 2.2 contains a mix of comments, statements (preceded by a
commented line saying “# In”) and output produced by statements (preceded
by a commented line saying “# Out”). This code snippet performs the following
steps: after navigating to the directory where the file is located using the chdir
function of the os module on line 7, the content of the file is read and stored
in a variable on line 9. In this example, the directory (or folder) navigation is

Figure 2.2 Saving a file as UTF-8 in a text editor

1 # ### Reading content from a file
2 # In:
3 file_path = r"/home/johann/Desktop" #Adjust as required (e.g. r"C:\documents")
4 # In:
5 import os
6 # In:
7 os.chdir(file_path)
8 # In:
9 my_string = open("myfile.txt", "rb").read().rstrip()
10 # In:
11 print len(my_string)
12 # Out:
13 # 15

Listing 2.2 Reading the content of a file using Python 2.x

achieved using a command instead of using a graphical file navigator (requiring
multiple clicks).
New language constructs are also introduced on line 9. While the open()
function is quite self-explanatory (i.e. a file called myfile.txt is being opened in
read mode thanks to the rb option), two extra items have been added after the
open function: read() and rstrip(). These extra items also use brackets and
are separated by dots. One difference between these specific items and the open()
function is that they are being used in this example without any argument (or
input). While the open() function expects a file name as input, the read() and
rstrip() are used without any input, which means that their default behaviour
will apply. For instance, the in-built locals() function does not expect any
input to return information about the state of various variables. The read() and
rstrip() items are called methods and they can be used on specific objects, for
example string objects, and combined in a sequential fashion. It would have been
perfectly acceptable to use two separate lines to achieve a similar result, but these
methods allow us to chain the execution of code. In other words, this means that
we first open the file, second that we read its content into a string and then that
we remove any white space character (such as a new line character) that may be
present at the very end of the file.
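For the sake of illustration, a sketch of what such an unchained version might look like is shown below (functionally equivalent, assuming the same myfile.txt):

my_string = open("myfile.txt", "rb").read()   # open the file and read its content
my_string = my_string.rstrip()                # then remove any trailing white space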
Let’s now focus on the output of the program which is generated by issuing
a print statement on line 11 of Listing 2.2. Instead of returning 5 as in the
previous example, the output of the command is now 15! This shows that our
command does not treat the content of the file as code points, but rather as a
series of 15 bytes. 15 is not an intuitive result because UTF-8 is a variable-width
encoding, which means that two words having the same number of characters may
be encoded with a different number of bytes. In order to start working with the
content of the file in a safe manner, some decoding is therefore necessary in order
to avoid making wrong assumptions. For example, if we were tasked to design a
program that extracts the second character of each Japanese word contained in a
text file, we could easily extract wrong or even meaningless information. Instead,
we will use specific methods that are being made available to us by the Python
language. So in order to deal with a byte string object, we can use the decode
method to turn it into a series of Unicode code points as shown in Listing 2.3.
This time the statement on line 5 produces the expected output: 5. This
simple example shows that it is extremely important to know the encoding of a
particular file in order to decode its content accurately. Another way to decode
the content of a file is to use another Python 2 function, contained in the codecs

1 # ### Decoding strings
2 # In:
3 my_unicode_string = my_string.decode("utf-8")
4 # In:
5 print len(my_unicode_string)
6 # Out:
7 # 5
8 # In:
9 import codecs
10 # In:
11 my_new_string = codecs.open("myfile.txt", "rb", "utf-8").read().rstrip()
12 # In:
13 print len(my_new_string)
14 # Out:
15 # 5

Listing 2.3 Decoding the content of a file into a Unicode string using Python 2.x
module. To access this module, it must be first imported, as shown on line 9. The
next statement (on line 11) is very similar to the one used on line 9 in Listing
2.2. This time, however, the file is opened using a specified encoding (UTF-8) so
that the decoding is done at the same time. If we check the length of the resulting
object on line 13, 5 is obtained again, showing that both approaches generate the
same result.
In the example in Listing 2.3, we have assumed that the encoding of the file
was UTF-8 (because this is the encoding we used when saving the file). However,
it would be very easy to come across an encoding problem if we tried to use the
wrong encoding when opening the file (say, UTF-16).
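As a minimal illustration (the output below is abridged and the exact error message varies across Python versions), attempting to decode the UTF-8 bytes read earlier with the UTF-16 codec fails instead of returning usable text:

>>> my_string.decode("utf-16")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf16' codec can't decode ...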

2.4 Software strings


We have introduced the concept of software strings in the previous section, but
we now need to explain in more detail what strings are and what purpose they
serve in programming. In the Python programming language, ‘strings are used to
record textual information as well as arbitrary collections of bytes’ (Lutz 2009:
89) in a sequential manner. For example, the string this is a string is a sequence
of individual character strings, starting with the character t and finishing with
the character g (from left-to-right). Since a string is a sequence of individual
components, these components can be accessed by position. For instance if we
wanted to access the first character of the string, we can use the statement on
line 3, which produces t, as shown in Listing 2.4.
Checking the first character of a string can be extremely useful when processing
a text file that contains one sentence per line. When dealing with such files, it
is sometimes useful to filter lines based on their starting character. For example,
if we were interested in extracting all the lines starting with a t, we could use the
approach presented here. Note that positions start at 0, not 1. In a way this is like
floors, which may differ from one country to another. For example, in European
countries the ground floor is 0 while in North America it is 1. It is also possible to
define longer sequences in order to extract multiple characters. For instance when
we want to extract the first word of the string, we can use the statement shown on
line 7, which produces this. Note that the end character specified (4) is actually
not included in the result, which means that the sub-string will contain characters

1 # ### Manipulating strings
2 # In:
3 print "this is a string"[0]
4 # Out:
5 # t
6 # In:
7 print "this is a string"[0:4]
8 # Out:
9 # this

Listing 2.4 Selecting specific characters from a string using their position
1 #Small game program asking a user to find a random number
2
3 #Tell program to import the "random" module
4 import random
5
6 #Generate a random number between 0 and 5
7 secret_number = random.randint(0,5)
8
9 #Question to the user
10 question = "Guess the number between 0 and 5 and press Enter."
11
12 while int(raw_input(question).strip()) != secret_number:
13     pass
14 #Tell user that they have won the game
15 print "You've found it! Congratulations"

Listing 2.5 Secret game program: the developer’s view

up to the fifth character of the original string (but not including it). This may be a
bit confusing at first, especially since the first character has an index of 0, but this
is something that becomes easier to remember with a bit of practice.
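Returning to the line-filtering idea mentioned above, a minimal sketch (assuming a plain text file called sentences.txt containing one sentence per line) could look as follows:

lines = open("sentences.txt", "rb").read().splitlines()
t_lines = [line for line in lines if line[0:1] == "t"]   # keep lines starting with t
print len(t_lines)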
Strings can be used in a number of contexts, not only to record textual
information for processing, but also to help the users of a program interact with
the program itself. For example, let’s look at the content of a small program called
secret.py, shown in Listing 2.5.
This program is very simple and, as mentioned previously, utilizes programmer’s
comments in the lines starting with the # symbol. For instance the first line
tells us that this is a game program that asks a user to find a random number.
The first line of code (statement) is actually on the fourth line, with the import
of functionality providing mechanisms to generate random numbers. The next
statement is on line 7 where a random number between 0 and 5 is generated. The
subsequent statement is on line 10 where a string is used. This string is going to
be used in the question that will be presented to the user of the program. The
next part of the program (line 12) is the core of the program. Computers are
very good at repetitions, so multiple statements are sometimes grouped together
in a sequence that is specified once but that may be executed several times in
succession. Such a sequence is known as a loop and this program uses a while
loop. This loop performs several steps:

1 It presents the question to the user and collects the answer.
2 It removes any unnecessary character that may have been added when the
user confirmed their answer by pressing the Enter key.
3 It converts the answer to a number using the int function.
4 It compares the provided number with the secret number (that was generated
earlier).

The fourth step has two possible outcomes: if the given answer does not match
the secret number (the comparison being made with the != operator), line 13 is
$ python secret.py
Guess the number between 0 and 5 and press Enter.2
Guess the number between 0 and 5 and press Enter.5
You’ve found it! Congratulations

Listing 2.6 Secret game program: the user’s view

executed and the program passes. This means that the loop will return to the first
step and present the user with the question again. However, if there is a match,
the loop will be exited and the next line will be executed. In this case, success
will be achieved and the user will be notified. The second string of this program
occurs on line 15 as part of the print statement that lets the user know that they
have won the game. An example of what the user will see when playing the game
is shown in Listing 2.6.
Multiple lines are present in Listing 2.6. The first line, which starts with a dollar
sign character ($) corresponds to the command prompt followed by the command
that is used to execute the program. More information on command-line prompts
is provided in Section ‘Setting up a local working Python environment’. The
next two lines correspond to text that was shown to the user (starting with Guess
and finishing with Enter.) and user input (in this case, 2 and 5). In this example,
the user found the secret number after two attempts. When the program was first
run, the user was presented with the question and the answer they typed was
2. This did not match the secret number so the while loop was run again and
the question was posed again. The second time the answer given was 5, which
happened to match the secret number. The loop was therefore exited and the user
was greeted with a congratulatory message.
This simple program works fine, but there are a few modifications that can be
made in order to make it more flexible and easier to maintain in the future. These
modifications will help us introduce an important topic in programming and in
localization: the combination (or concatenation) of strings.

2.4.1 Concatenating strings


In Listing 2.7, a new string is introduced on line 7. This new string is used to
ask the user to select a maximum number instead of using a default, hard-coded
maximum number 5 as in the first version of our program.
According to Wikipedia, ‘hard-coding refers to the software development
practice of embedding what may, perhaps only in retrospect, be regarded as input
or configuration data directly into the source code of a program or other executable
object, or fixed formatting of the data, instead of obtaining that data from external
sources.’11 In short, values that are bound to change over time (possibly frequently)
are sometimes hard-coded into a program, whereas they should be obtained from
the user or placed in a file (such as a configuration file) that can be easily edited
without necessarily having to update the program itself. This problem can have
serious consequences from a maintenance perspective because what seems like
1 #Small game program asking a user to find a random number
2
3 #Tell program to import the "random" module
4 import random
5
6 #Generate a random number between 0 and a number selected by the user
7 number_selection = "Select a maximum number:"
8 max_number = int(raw_input(number_selection).rstrip())
9 secret_number = random.randint(0,max_number)
10
11 #Question to the user
12 question = "Guess the number between 0 and %d and press Enter." % max_number
13
14 while int(raw_input(question).strip()) != secret_number:
15     pass
16 #Tell user that they have won the game
17 print "You've found it! Congratulations"

Listing 2.7 Revised secret game program

an innocuous update (such as changing a word in a string or changing a range of
numbers) may require a lot of development work, especially when these values
are repeated multiple times across a program. To work around this problem a user-
defined number is used in the revised version of our program, which also makes
the program more interactive, and possibly more interesting.
Adding this new feature to the program, however, has an impact on the
second string (contained in the question variable on line 12). Instead of having
between 0 and 5, this string now contains what is called a substitution marker or
placeholder (%d) which is going to be used to change the original string depending
on the user selection. Indeed, the “Guess the number between 0 and %d and press
Enter.” string is followed by another element %max_number, which instructs the
program to format the string by replacing %d with whatever number has been
selected by the user. In this context, % is a substitution operator which can be
used to replace a number of substitution markers (such as %d) with an integer,
such as the one stored in max_number. When looking at the program from the
user’s perspective, Listing 2.8 shows how the substitution marker was replaced by
the number 10 specified by the user.
As you may have guessed already, we could now easily modify the program
to ask the user to also select a minimum number, instead of using 0 as a default.

$ python secret2.py
Select a maximum number:10
Guess the number between 0 and 10 and press Enter.2
Guess the number between 0 and 10 and press Enter.5
Guess the number between 0 and 10 and press Enter.8
Guess the number between 0 and 10 and press Enter.1
You’ve found it! Congratulations

Listing 2.8 Secret game program: the user’s view


After making the necessary changes to record this selection in the program, the
question variable would look something like this:

question = "Guess the number between %d and %d and press Enter."\
    % (min_number, max_number)

This simple example showed that by introducing flexibility in programs (instead
of relying on hard-coded settings), the readability of strings is impacted from a
linguistic perspective because it is not obvious to a person with no programming
background what %d represents. And yet this type of string would have to be
translated during the localization process. Additional guidance on how to
deal with such constructs will be provided in Section 4.2.2. Language-specific
markers are often used in programming languages to concatenate strings at run-
time (i.e. when the program gets executed). When too many markers are used,
or when these markers are not given self-explanatory names, it is sometimes
difficult to understand what the strings are supposed to mean. Take a few
moments to consider the example in Listing 2.9. What do you believe the value
of my_string is?
The answer is Substitution is fun because the two substitution markers (%s)
get replaced by the values of the f and g variables respectively. These values
happen to be Substitution and is, as defined on lines 2 and 1 respectively. For
the sake of completeness, it is worth mentioning that other string formatting
options exist in the Python programming language, so it is worth reviewing them
because such constructs may appear in localization projects. Rather than using
raw markers such as %s, it is possible to use descriptive names in order to increase
a program’s readability, as shown in Listing 2.10. This example differs from the
previous one on line 3 because the markers are more verbose but the value of my_
string remains Substitution is fun. Instead of having %s, %(topic)s and %(copula)
s are used. These markers are then replaced with values contained in the object
following the % character on line 3. This object is known as a dictionary because
it contains any number of key-value pairs. From a translation perspective, this
type of presentation is useful because the string “%(topic)s %(copula)s fun” is
more meaningful than “%s %s fun” (even without having access to the values
of f and g). While the actual words topic and copula would remain untranslated

1 g = "is"
2 f = "Substitution"
3 my_string = "%s %s fun" % (f, g)

Listing 2.9 Example of string formatting

1 g = "is"
2 f = "Substitution"
3 my_string = "%(topic)s %(copula)s fun" % {"topic": f, "copula": g}

Listing 2.10 Second example of string formatting


1 g = "is"
2 f = "Substitution"
3 my_string = "{0} {1} fun".format(f,g)
4 my_string2 = "{topic} {copula} fun".format(topic=f, copula=g)

Listing 2.11 Final example of string formatting

in translated strings, they should allow a translator to determine whether a word
order change is required in the translation.
Another way to format a string is to do away with markers altogether, as
shown in Listing 2.11. In this final example, two options are presented with the
second one being more verbose and descriptive than the first one. As it will be
discussed in further detail in Section 3.2.4, the second option is preferable from a
translation perspective because it provides more information about the meaning
of the words that may have to be substituted in the translation. Relying solely on
0 and 1 can make it difficult to determine how these items should be re-ordered
in the translation.
At this point, it may be useful to clarify some issues relating to marker types
and formatting types. In the first examples we introduced, the markers used were
%d, whereas in the example from Listing 2.9 they are %s. There are a number of
different markers available due to the fact that Python objects can have different
data types. So far we have focused mainly on strings, but other data types such as
floats (i.e. numbers that can be written with a decimal component) also exist. In
the Python language, it is not always possible to combine two objects that are of
different types. As shown in Listing 2.12, the statement on the first line generates
an error (TypeError) whereas the statement on line 5 succeeds. When the error
is encountered, the interpreter explains that string (str) and integer (int)
objects cannot be concatenated (or combined).12 For the combination to work as
expected, the integer 3 must be expressed as a string, with single or double quotes
around it.
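Another option, not shown in Listing 2.12, is to convert the number explicitly with the built-in str() function (the variable name below is purely illustrative):

>>> favourite_number = 3
>>> print "My favorite number is " + str(favourite_number)
My favorite number is 3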
Actually, this is a problem we had to address in our initial program (secret.py
in Listing 2.5) with the following construct:

int(raw_input(question).strip())

On this line, the raw_input() function is used to capture information from
the user. The information entered is captured as a string so we had to perform

>>> print "My favorite number is " + 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects
>>> print "My favorite number is " + "3"
My favorite number is 3

Listing 2.12 Combining objects in Python


two tasks before being able to use the user input: First we had to remove any
potential new line introduced by pressing Enter. This is done by using the
strip() method. Then, we had to convert this information to an integer
with the int() function. Once these two steps are completed, it is possible
to compare the input with the secret number to check whether there is a
match.

2.4.2 Special characters in strings


The example introduced in Listing 2.7 showed that the user input was captured
on the same line as the segment prompting the user to enter a number. From a
display perspective, it is possible to present the information in a different way to
the user. For example, it is possible to capture the user input on the line below the
instruction as shown in Listing 2.13.
In order to accomplish this, the string containing the instruction would
need to be modified so that a new line is inserted. In the Python programming
language (and in other languages), new lines are controlled using the \n escape
sequence:

question = "Guess the number between %d and %d and press Enter.\n"\
    % (min_number, max_number)

An escape sequence contains characters whose meaning changes because they
are preceded by an escape character, which, in the case of Python, is a backslash.
When \n is used in a Python string, it does not mean that the string should display
a backslash character and a letter n after press Enter. Instead this sequence
instructs the interpreter to insert a new line after the last character of the
string. Other common escape sequences include \t for a tab character and \r
for a carriage return. The backslash character is also used to escape characters
to force them to regain their original meaning. In Section 2.2 it was mentioned
that strings in Python must start and end with a single or double quote character.
However, it is sometimes desirable to display such characters in strings. The

$ python secret2a.py
Select a maximum number:10
Guess the number between 0 and 10 and press Enter.
2
Guess the number between 0 and 10 and press Enter.
5
Guess the number between 0 and 10 and press Enter.
8
Guess the number between 0 and 10 and press Enter.
1
You've found it! Congratulations

Listing 2.13 Secret game program: another user view


backslash character can be used for that purpose where actual double quote
characters are inserted around numbers:

question = "Guess the number between \"%d\" and \"%d\" and press Enter.\n"\
    % (min_number, max_number)

If the user specified 0 and 10 as input numbers, the string would appear as follows
when the program executes the statement from the while loop:

Guess the number between "0" and "10" and press Enter.

From a translation perspective, it is obviously important to be aware of such
escape sequences so that they can be preserved (or adapted carefully) during
the translation process. More discussion on this topic will be provided in
Section 4.2.2.
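To round off this section, the short interactive sketch below (purely illustrative) shows the \n and \t escape sequences, as well as escaped double quotes, in action:

>>> print "First line\nSecond line"
First line
Second line
>>> print "Name:\tJohann"
Name:	Johann
>>> print "Press \"Enter\" to continue"
Press "Enter" to continue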

2.5 Files
Translators working in the localization industry must have an advanced
understanding of file formats. For example, the previous section focused on the
files used by the Python programming language, files usually ending with a .py
extension. It is, however, unlikely that translators will be given such files to
translate directly. As we will see in Section 3.2.3, files containing source code
are usually analysed by a program in order to extract translatable resources.
Such resources are then made available to translators in a container, which is
sometimes referred to as the translation kit (or transkit). This translation kit can
be passed by the client to a number of stakeholders, including language service
providers and translators. This transkit should ideally contain translatable strings,
but also any resources to be used during the translation process, such as glossaries,
translation memory matches, possibly machine-translation suggestions, and
translation guidelines. Depending on the amount of information and content
they contain, translation kits can vary in nature: some of them may be made
available to translators via an online application; others may be encapsulated
in a proprietary file format that can only be opened by a proprietary desktop
application. Finally, some projects may be encapsulated in an open format, such
as the Portable Object (PO) or XLIFF formats, which are discussed in the two
following sections.

2.5.1 PO
The PO format originates from the open-source GNU gettext project, which has
been used extensively to localize multiple applications making use of programming
languages such as C, PHP or Python (often in a Linux environment).13 PO files,
which are known as catalog files or catalogs, are text files that can be edited

blank line
# comments-by-translators
#. comments-extracted-from-source-code
#: origin-of-source-code-string
#, options-such-as-fuzzy
#| msgid previous-source-string
msgid "source-string"
msgstr "target-translated-string"

Listing 2.14 Structure of an entry in PO file

using text editors or dedicated programs such as Poedit.14 A PO file contains
multiple items, each item being composed of an original untranslated string and
its corresponding translation. Items in a typical PO file usually belong to a single
project (i.e. an application requiring translation). All translations in a given
file pertain to a single target language, which means that an application being
localized in N languages will require N .po files to store translations. Listing 2.14
shows the structure of a typical item or entry in PO file.
Entries, which are usually separated with blank lines, may start with a number
of comments that are preceded with the # character. Comments can be either
entered by translators or extracted from the source code. Additional information
such as the origin of the source strings (e.g. the file name of the file containing
the strings) may also be included. The most important lines are those starting
with msgid and msgstr, which contain the source string and target (translated)
string respectively.
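As an illustration, a hypothetical entry for the question string from Listing 2.5, translated into French, might look as follows (both the file reference and the translation are invented for this example):

#: secret.py:10
msgid "Guess the number between 0 and 5 and press Enter."
msgstr "Devinez le nombre entre 0 et 5 et appuyez sur Entrée."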
The next section provides more information on the XML markup language on
which the XLIFF standard is based.

2.5.2 XML
XLIFF, the XML Localization Interchange File Format, is a type of XML
document which is used to exchange information during a localization
project.15 In order to understand better what XLIFF is, it is necessary to first
explain what XML is. XML can be described as a markup language, which,
as pointed out by Savourel (2001), is composed of two different components.
The first one is a metalanguage, with syntactic rules allowing the definition
of multiple formats. The second component is an optional document type
definition (DTD) which defines a format for a specific purpose using a pre-
defined number of keywords (known as the vocabulary). XML is one of these
metalanguages, which explains why multiple types of XML exist, serving very
different purposes. For example, XSL (Extensible Stylesheet Language) is a
type of XML which is used to transform XML documents into other formats,
while SVG (Scalable Vector Graphics) can be used to handle text and vector-
based graphics. In the software publishing industry, XML is commonly used to
create source documents (such as How To topics) because once the document
has been created, it can be transformed into multiple output formats, such as
1  <para>
2    <indexterm xml:id="tiger-desc" class="startofrange">
3      <primary>Big Cats</primary>
4      <secondary>Tigers</secondary></indexterm>
5    The tiger is a very large cat indeed...
6  </para>
7  ...
8  <para>
9    So much for tigers<indexterm startref="tiger-desc" class="endofrange"/>.
10   Let&apos;s talk about leopards.
11 </para>

Listing 2.15 Example of a Docbook snippet

an HTML page or a PDF file. This means that it is not necessary to create the
same information twice. Examples of popular XML DTDs used for source text
authoring include DITA (Darwin Information Typing Architecture), Docbook
and oManual.16, 17, 18 Listing 2.15 shows what a Docbook snippet looks like,
with text surrounded by markup.19
In this example provided under the terms of the GNU Free Documentation
License, a number of tags are used.20 A tag starts with a < character and ends
with a > character and contains some of the DTD’s pre-defined keywords. A
tag consists of a name, such as indexterm, and may also have attributes (such
as class). Attributes are additional properties, which provide supplementary
information about the tag or the tag’s contents (which may be textual).
Attributes have values, such as startofrange, which may be used to store
metadata (i.e. information about the data). These values may be pre-defined
or used in a customized manner. Let’s examine each line one at a time to
understand better what each tag does.
The first line contains an opening para tag without any attribute. This tag is
used to create a standard paragraph element (say within a chapter or an article).
The second line contains an opening indexterm tag, which is nested below the
para element (as shown by the indentation). Since the indexterm element
belongs to the para element, this relationship is often described as a parent/child
relationship. In this example, the indexterm element is a child element of the
para element. Such an element is used to identify text that must be placed in
the index of the document. This indexterm element has a couple of attributes:
the first one is xml:id and the second one is class. These attributes have the
tiger-desc and startofrange values respectively. As mentioned earlier, these values
provide additional information (known as metadata) about the actual content
contained in the XML structure. The value of the xml:id attribute and the value
of the class attribute indicate that this indexterm points to a document range
(rather than a single point in the document). The third line contains another
opening tag, this time a primary tag which is a child element of the indexterm
element. This primary element does not have any attribute but contains textual
content (Big Cats), which would appear in the index of the document. Finally, a
closing primary tag is used to indicate the end of the primary element. A closing
tag resembles an opening tag, except that the < character is followed by a forward
slash character. Unlike an opening tag, a closing tag cannot have any attribute.
The fourth line contains another child element of indexterm, a secondary
element. This element comprises an opening tag, textual content (Tigers) and a
closing tag. Finally this line contains the closing tag for the indexterm element.
In XML documents, the syntactic structure is created by the tags rather than the
line breaks or the indentation, which is why multiple elements are sometimes
present on the same line. This closing tag marks the beginning of the range
which is linked to this indexterm. This range starts on line 5, with the textual
content of the para element. Unsurprisingly this content refers to a tiger, with
the text starting with The tiger is a very large cat indeed…. This para element
ends on line 6 with a closing tag. Line 7 contains multiple elliptical dots which
indicate that the document may contain additional content, which still belongs
to the range specified earlier with startofrange. A new paragraph starts on line
8 with an opening para tag, followed by textual content on line 9, which still
refers to tigers. Line 9 finishes with an indexterm element, which happens to
be an empty element. An empty element contains some information (such as
attribute values), but does not contain any textual content. Such elements are
easily identifiable with a forward slash preceding the closing > character. This
indexterm element is used here to indicate the end of the tiger-desc range which
had been created on line 2. This is confirmed by the textual content on line 10,
which mentions leopards. Finally, the second paragraph of this example finishes
on line 11, with the closing para tag. This example shows that XML markup
can be very useful to create (invisible) boundaries which span multiple logical
sections (such as paragraphs).
In the localization industry, XML is also extremely prevalent, with DTDs
such as TMX or XLIFF. TMX is the Translation Memory eXchange format,
which was initially developed by a special interest group of the now defunct
Localization Industry Standards Association (LISA).21 This format can be
used to export the content of a translation memory database into another
application. This scenario is likely to occur when multiple stakeholders are
involved. Some of these stakeholders may have a preference with regard to the
application that should be used during the translation process. In order to reuse
previous work stored in a different application, however, one needs to be able
to export and import translation memory segments. This is when TMX comes
to the rescue, by providing a container (DTD) which is understood by most
modern translation memory applications. Listing 2.16 shows an example of
such a document provided by the Okapi framework under a Creative Commons
3.0 BY-SA license.22, 23
Based on the detailed description provided for Listing 2.15, the XML structure
presented in Listing 2.16 should be quite straightforward to understand. This
example contains two tu elements, which correspond to translation units. Each tu
contains two child tuv elements, which differ based on the value of their xml:lang
attributes. For each translation unit, the first tuv element has an attribute value of
en-us while the second has a de-de value. These values refer to the American English
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header creationtool="oku_alignment" creationtoolversion="1"
 segtype="sentence" o-tmf="okp" adminlang="en" srclang="en-us"
 datatype="x-stringinfo"></header><body>
<tu tuid="APCCalibrateTimeoutAction1_s12">
<prop type="Txt::FileName">file1_en.info</prop>
<prop type="Txt::GroupName">APCCalibrateTimeoutAction1</prop>
<prop type="Att::Test">TestValue</prop>
<tuv xml:lang="en-us"><seg>Follow the instructions on the screen.</seg></tuv>
<tuv xml:lang="de-de"><seg>Den Anweisungen auf dem Bildschirm
folgen.</seg></tuv>
</tu>
<tu tuid="APCControlNotStableAction2_s10">
<prop type="Txt::FileName">file1_en.info</prop>
<prop type="Txt::GroupName">APCControlNotStableAction2</prop>
<prop type="Att::Test">TestValue</prop>
<tuv xml:lang="en-us"><seg>Repeat steps 2. and 3. until the alarm no longer
recurs.</seg></tuv>
<tuv xml:lang="de-de"><seg>Schritte 2 und 3 wiederholen, bis der Alarm nicht
mehr auftritt.</seg></tuv>
</tu>
</body>
</tmx>

Listing 2.16 Part of a TMX file

and German (from Germany) locales, as shown by the textual content present in
the respective tu elements. Each tu element also contains additional information
(metadata) in the value of its tuid attribute and in child prop elements (such as
the name of the file where the segment originated from).
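Because TMX is plain XML, its content can also be processed automatically with a few lines of code. The following sketch, which is not taken from the book's companion material, uses Python's standard xml.etree.ElementTree module and assumes that the document from Listing 2.16 has been saved as sample.tmx:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Parse the TMX document (the file name is an assumption).
tree = ET.parse('sample.tmx')

# Print the language and segment text of each translation variant.
for tu in tree.getroot().find('body').findall('tu'):
    for tuv in tu.findall('tuv'):
        lang = tuv.get('{http://www.w3.org/XML/1998/namespace}lang')
        seg = tuv.find('seg').text
        print lang, seg.encode('utf-8')

Running this sketch would simply list the English and German segments of the two translation units shown in Listing 2.16.
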
As mentioned earlier XLIFF is popular in the localization industry. For example,
a software publisher or language service provider may look after the extraction
of translatable content from source files (including code and documentation).
However, the actual translation may be done by a translator so content must flow
from one stakeholder to another as smoothly as possible (without information
loss). XLIFF may be used in this context to allow the transport of the information
from one system to another. Systems that make use of the XLIFF standard
sometimes need to extend it to add system-specific information. To achieve this,
the namespace mechanism may be used, whereby vocabularies from several DTDs
may be used in a single XML document. This can add complexity in some cases
because to make use of non-XLIFF information, systems must be aware of these
extra DTDs (which may not always be the case). Listing 2.17 shows an example
of an XLIFF document also provided by the Okapi framework under a Creative
Commons 3.0 BY-SA license.24, 25
The example provided in Listing 2.17 should be quite familiar after the
examples provided in Listing 2.15 and Listing 2.16. The first two lines of the
document refer to the version of the XLIFF DTD and namespace being used
(XLIFF 1.2). The third and fourth lines contain project-level information (a
file element with original, source-language and target-language
1  <?xml version="1.0" encoding="UTF-8" ?>
2  <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
3  <file datatype="x-sample" original="sample.data"
4   source-language="EN-US" target-language="FR-FR">
5   <body>
6    <trans-unit id="1" resname="Key1">
7     <source xml:lang="EN-US">Untranslated text.</source>
8    </trans-unit>
9    <trans-unit id="2" resname="Key2">
10    <source xml:lang="EN-US">Translated but un-approved text.</source>
11    <target xml:lang="FR-FR">Texte traduit mais pas encore approuvé.</target>
12   </trans-unit>
13   <trans-unit id="3" resname="Key3" approved="yes">
14    <source xml:lang="EN-US">Translated and <g id='1'>approved</g> text.</source>
15    <target xml:lang="FR-FR">Texte traduit et <g id='1'>approuvé</g>.</target>
16   </trans-unit>
17   <trans-unit id="4" resname="Key4">
18    <source xml:lang="EN-US">Some other text.</source>
19    <alt-trans>
20     <source xml:lang="EN-US">Other text.</source>
21     <target xml:lang="FR-FR">Autre texte.</target>
22    </alt-trans>
23   </trans-unit>
24  </body>
25 </file>
26 </xliff>

Listing 2.17 Example of an XLIFF file

attributes). The rest of the document is included in a body element, which
consists of trans-unit elements.
This section covered a few file formats but more will be introduced in
Section 5.4.4. However, the most important file-related concepts have been
introduced, including encoding, vocabularies (reserved keywords) and translatable/
non-translatable content. While the information contained in XML documents is
supposed to be human-readable, it can sometimes be a little overwhelming. When
changes must be made to such files, some programs are available to hide some of
the files’ complexity (by parsing the file and displaying only relevant information).
However, such programs are not always suitable for all editing tasks (e.g. bulk edits
across a number of files) so it is sometimes necessary to make changes to the file
itself using a text editor or a custom script. Most of the time, such changes will
require the identification of patterns in the file, so the next section will focus on a
powerful tool to achieve this task: regular expressions.

2.6 Regular expressions


Finding strings in the middle of source code or markup documents can be quite
challenging, so advanced tools are often used to extract translatable content
as explained further in Section 3.2.3. Such tools may be based on regular
$ dir
Dropbox README.txt first.py mydocument.docx
$ dir *.py
first.py

Listing 2.18 Use of wildcard to find specific files in a folder

expressions, which offer a way to define complex patterns in order to match
specific sections of text. Actually you may already have used some form of
pattern matching if you have performed searches for text in documents using
wildcards, say in a Microsoft Word application, or looked for specific files in a
folder using a command prompt, as shown in Listing 2.18.
This example contains the output of two dir commands. The dir command
can be used on multiple platforms (including Windows and Linux) to list the
contents of a particular folder (or directory) using the command line. However,
the output may contain results that are not always relevant. This example shows
that it is possible to use a pattern (defined as *.py) to show only those files that
end with .py. In this particular scenario, the asterisk (or star, *) is used to refer to
all combinations of characters, which means that all files whose names end with
.py are returned (including first.py).
Regular expressions offer capabilities that go well beyond the wildcards
that are used to match any character or any number of characters. One of
the disadvantages of regular expressions, however, is that many syntaxes (or
flavours) exist depending on each programming language’s implementation.
For instance, the Python syntax for regular expressions is slightly different from
the one used by the Perl programming language. But once the concepts have
been mastered, it is not difficult to make small adjustments to switch from one
flavour to another. Detailed explanations (including examples) are provided in
an online tutorial that was specifically created for this book.26 While it is not
strictly necessary for all readers to complete this tutorial, references to regular
expressions will be made in other chapters so having a basic understanding of
this tool seems beneficial.
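As a small taste of what regular expressions can do beyond wildcards, the following Python snippet finds all whole words ending in ing; the sentence and the pattern are invented for illustration and are not part of the online tutorial.

import re

text = "Localizing apps means testing, engineering and translating content."
# \b marks a word boundary and \w+ matches one or more word characters
print re.findall(r'\b\w+ing\b', text)
# ['Localizing', 'testing', 'engineering', 'translating']
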

2.7 Tasks
This section contains six basic tasks and two advanced tasks:

1 Setting up a working Python environment
2 Executing a few Python statements using a command prompt
3 Creating a small Python program
4 Running a program from the command line
5 Running Python commands from the command line
6 Completing a tutorial on regular expressions
7 Performing contextual replacements with regular expressions (advanced)
8 Dealing with encodings (advanced)

2.7.1 Setting up a working Python environment


In this task, as well as the remainder of this book, version 2.x of Python will
be used as explained in Section 1.5 (where x corresponds to a major version
number between 0 and 9). There are fundamental differences between versions
2.x and 3.x of Python, so make sure to select a 2.x version (such as 2.7.y, where y
corresponds to a minor version number between 0 and 9) if you want to execute
the examples provided in this book. There are two ways to set up a working
Python environment: the first one involves having Python installed locally
on your system (computer) while the second one involves using a cloud-based
service. There are advantages and disadvantages with each of these methods.
The obvious disadvantage of using a cloud-based solution is that you need to
have a relatively fast Internet connection to use the service. While this may have
been an issue a few years ago (with slow and unreliable connections), modern
connections allow for devices (such as computers) to be connected at all times,
thus making this approach a viable solution. Besides, this solution means that you
do not have to worry about installing or configuring anything. If you choose your
service provider carefully, you may actually find out that using an online service
is easier than having to set things up by yourself. Before selecting any online
service, however, you should obviously make sure that you agree with their terms
and conditions. Depending on whether you would like to use a local or remote
environment, go to one of the next two sections.

Setting up a local working Python environment


If your (desktop) system is running Linux or OS X, Python is likely to be already
installed on your system. To check that Python is available on an OS X system,
start a Terminal window by double-clicking on Applications/Utilities/
Terminal.27 Next, type python and press Enter. To check that Python is
available on a Linux system, start a Terminal window, type python and press
Enter. If you are running Windows, however, it is unlikely to be installed.
To install it, you will need to download it from an online source (such as
https://www.python.org) as mentioned in the official Python documentation.28
Since Python is released under an open-source agreement, it is freely usable
and distributable, even for commercial use. This means that multiple releases
are available online (some of them are free while others must be paid for).
Depending on what you choose, you should be able to execute all of the
examples provided in this chapter. One of the simplest ways to install Python
is to download one of the (version 2.x) installers made available on the Web
site maintained by the Python community.29 Once you have downloaded the
file, open it, install Python like any other program and check that it has been
installed properly. To do this last step, start a command-line window. This can
be achieved on most systems by pressing the Windows key + R, typing cmd or
powershell and pressing Enter. Once this is done, a black window should
appear in which commands may be entered. These steps may vary depending
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Listing 2.19 Starting a Python prompt from the command line

on your version of Windows, but additional information may be found online
(e.g. on the Windows Web site).30 Once the window is open, type python and
press Enter. If you get an error message about python not being recognized as a
command, you should check your environment variable settings to make sure
that the directory containing the python.exe file is in your PATH environment
variable.31 If you have never done this before, you may find a video tutorial
more useful.32
Regardless of your system, success is achieved if a command prompt appears
(displaying information about the version of Python and a blinking cursor
following three greater than signs), as shown in Listing 2.19 with the version 2.7.6.
Remember that version 2.x of Python is used throughout this book. So if the
prompt shows Python 3.x.x, you will have problems following the examples. To
work around this problem, you could try to run python2 from the command line.
If the command does not succeed, install one of the Python 2.x versions.

Setting up a remote working Python environment


If you do not have the rights to install programs on your system or if you think
that your system is not supported, or if you simply do not want to go through the
process of installing anything, you could consider using an online service, such as
PythonAnywhere.33 This type of cloud-based service (for which you will have to
register and accept terms and conditions) allows you to run Python programs in
a Web browser, thus bypassing the need to install any software on your machine.
This service may be a viable alternative if you have a good Internet connection
and feel comfortable using remote third-party hardware and software for your
computing activities. Obviously cloud-based services may change rapidly (or even
go out of business) so the screens provided below may not exactly correspond to
what you may see. Once you have registered with PythonAnywhere, you can start
a Python 2.7 session as shown in Figure 2.3.
Once the session has started, a Python prompt similar to the one shown in
Listing 2.19 should appear.

2.7.2 Executing Python statements using a command prompt


Whether you are working locally or remotely, a Python prompt allows you to enter
statements. The standard prompt is recognizable by a flashing cursor following
three greater than characters (>>>). Simple Python statements are entered one
at a time and executed by pressing Enter.

Figure 2.3 Selecting a console in PythonAnywhere

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "hello world"
hello world
>>> Print "hello world"
  File "<stdin>", line 1
    Print "hello world"
                      ^
SyntaxError: invalid syntax

Listing 2.20 Python syntax error

Programming language interpreters are extremely sensitive to detail, so case,
spacing and punctuation are extremely important. For instance, the example in
Listing 2.20 shows that changing the word print to Print generates a syntax
error.
Thankfully, the interactive interpreter generates some information to help
you identify on which line the problem occurs. However, it is down to you
to realize that the problem is caused by a case error. Being able to recognize
such problems comes with experience, but thankfully a lot of information is
available online as many users have experienced similar problems before you.
If you are using the online environment suggested earlier, you may have
noticed that the PythonAnywhere’s Console selection page shown in Figure 2.3
contained other options. One of these options is to use a slightly modified
version of Python, called IPython (Perez and Granger 2007).34 IPython
offers an interactive environment where lines are numbered, as shown in
Listing 2.21.
Most of the examples used earlier in this chapter were created using this
environment. One of the advantages of this environment is that it allows the
user to annotate and save code snippets in a user-friendly way in order to share
them with other users (e.g. for collaboration). As mentioned earlier, the code
snippets used throughout this chapter can be found online and easily copied
and pasted for experimentation.
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
Type "copyright", "credits" or "license" for more information.

IPython 1.2.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: print "hello world"
hello world

Listing 2.21 Using an IPython environment

Figure 2.4 Writing a Python program using a text editor

2.7.3 Creating a small Python program


The previous task showed you how to enter commands using an interactive
prompt. The problem with this approach is that these commands will disappear
once you close or quit the interactive session. In order to work around this, you
can save some of your commands in a program file. A Python program is actually
a set of text instructions that can be executed by the Python interpreter. In this
step, let’s create a simple Python program by opening an empty file in a text
editor (for example Notepad++ on Windows, Gedit on Linux or an online one
such as the one provided by PythonAnywhere).35 Do not use a word processing
program such as Microsoft Word: since a Python program is a set of plain text
instructions, you do not want to add any formatting to it. In this empty file, type
the lines shown in Figure 2.4 and save your file to a location of your choice
using first.py as a file name. In this example, some of the words are highlighted
because the text editor used understands the Python syntax. There are plenty of
text editors which offer that functionality, so make sure to select one that you
are comfortable using.
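If you are unsure what to type, a minimal version of first.py can consist of a single line; the version below is only a plausible stand-in for the lines shown in Figure 2.4, but it is consistent with the output shown later in Listing 2.23.

# first.py - a minimal example program that simply prints a greeting
print "hello world"
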

2.7.4 Running a Python program from the command line


In this step, let’s try to run the program from the command line by using the
python command followed by the first.py parameter (which is the name of
$ python first.py
python: can’t open file ’first.py’: [Errno 2] No such file or directory

Listing 2.22 Python program not found

$ python /home/j3r/scrap/first.py
hello world
$ cd /home/j3r/scrap/
$ python first.py
hello world

Listing 2.23 Running a Python program from the command line

our program). For this command to succeed, however, we need to make sure
that the Python interpreter can find the first.py program. If this program is not
located in the current working directory, the error shown in Listing 2.22 will
occur.
To solve this problem, two solutions exist. The first solution consists in
providing the absolute name of the file (including the directory where it is
located). If you are using a Windows system, you should include double quotation
characters before and after the file name (e.g. “C:\user\My Documents\first.py”).
The second one consists in changing the working directory to the directory
containing the first.py file (using the cd command). The two solutions are shown
in Listing 2.23.
If you have decided to use PythonAnywhere as your working environment,
you will need to start a different console, a Bash console as shown in Figure 2.3.
When you do so, you will be presented with a command-line environment, in
which you can run your Python program. Note that this environment allows you
to upload files or even create files using a Web interface (using the Files tab from
Figure 2.3.)

2.7.5 Running Python commands from the command line


In the previous step, we have seen how programs (contained in files) could be
run from the command line. For very short programs, however, it is sometimes
quicker to execute commands without having to save them in a file. This
practice is sometimes known as creating one-liner programs. In Python, this can
be achieved by using the -c parameter after python, where -c means that the
actual program is passed as a string delimited with single or double quotes, as
shown in Listing 2.24.
It is also possible to execute more than one command using semi-colons as a
delimiter, as shown in Listing 2.25.
Obviously this technique should only be used when commands are relatively
short or when you are not interested in saving the commands. Take a few minutes
to experiment with commands of your choosing to make sure you familiarize yourself
with this approach to running commands.
$ python -c "print 'hello world'"
hello world

Listing 2.24 Running a Python command from the command line

$ python -c "print 'hello world'; print 'hello world'.count('o')"
hello world
2

Listing 2.25 Running multiple commands from the command line

2.7.6 Completing a tutorial on regular expressions


In this task, you should complete the online tutorial on regular expressions.36 As
much as possible, try to experiment with the examples provided (e.g. changing
values) to get a better understanding of the concepts. These concepts should
be extremely useful when dealing with large volumes of text, especially text
produced by others. Being able to quickly find or count occurrences of specific
words, terms or phrases is a very common task when revising texts. While this
task may be relatively straightforward when the person in charge of the revision
is the same as the person who produced the translation, complications may
arise when the translation process has been conducted by an unknown number
of translators. As presented in Section 5.1.5, collaborative translation (or
crowdsourced translation) is becoming more and more popular, so knowing how
to effectively deal with and edit other contributors’ translations is becoming a
key skill.
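For instance, counting how many times a specific term occurs in a translated text only takes a couple of lines; the French sentence and the term below are invented for illustration.

# -*- coding: utf-8 -*-
import re

translation = u"Le joueur a marqué trente points ; le joueur était imparable."
# \b ensures that only the whole word is counted (so 'joueurs' would be ignored)
print len(re.findall(ur'\bjoueur\b', translation, re.UNICODE))
# prints 2
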

2.7.7 Performing contextual replacements with regular expressions (advanced)

In this task you will create a small Python program that should accomplish the
following steps:

1 Import the modules giving you access to codecs and regular expressions
functionality.
2 Read a TMX file as UTF-8 and store its content in a variable.
3 Define a contextual regular expression to find all occurrences of a target
language word. This expression should be defined in such a way that source
language words will not be found.
4 Replace all occurrences of this word with a word of your choice and print the
resulting content to screen.

Once you have created this program and saved it in a file, you should be able
to run it from the command line.
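A minimal sketch of such a program is given below; it assumes a TMX file named sample.tmx containing en-us source and de-de target segments (as in Listing 2.16), and the word being replaced (as well as its replacement) is an arbitrary example.

# -*- coding: utf-8 -*-
import codecs
import re

# Step 2: read the TMX file as UTF-8 and store its content in a variable.
f = codecs.open('sample.tmx', 'r', encoding='utf-8')
content = f.read()
f.close()

# Step 3: a contextual pattern. 'Alarm' is only matched when it appears in a
# segment that follows the de-de language attribute, so the English word
# 'alarm' in the en-us segments is never affected.
pattern = re.compile(ur'(xml:lang="de-de"><seg>[^<]*?)Alarm')

# Step 4: replace the matched word and print the resulting content.
new_content = pattern.sub(ur'\1Warnton', content)
print new_content.encode('utf-8')
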

2.7.8 Dealing with encodings (advanced)


The purpose of this task is two-fold: to make you download an unusual file from
a remote location and to familiarize you with a character encoding
used in China: GB 18030.37 In order to get access to the file to use in this
exercise, you should open a Web browser and navigate to the URL indicated on
the companion Web site. When you reach this online file, you may notice some
strange characters on the page depending on the Web browser you are using.
You should download the file and save it to a location of your choice (e.g. using
File > Save (Page) As). Once the file is downloaded, you should make
sure that it can be accessed from your Python interpreter. If you are using
Python locally, you should start an interpreter and use the os.chdir function
introduced earlier in this chapter in Listing 2.2. If you are using an online
Python environment, you should first upload the file to this environment and
then start a Python interpreter. You can then use the os.chdir function to
navigate to the folder where you have uploaded the file. Once this preparation
step is completed, you should import the codecs module and read the content
of the file as shown earlier in Listing 2.3. Instead of using the UTF-8 encoding,
you should use the file’s actual encoding (GB18030). If you are curious, you
could try using the UTF-8 encoding and reflect on the resulting error message.
Do you think this error message could be related to the characters you saw on
the page in your Web browser? Since the file contains a number of characters
on each line, you could also try to read the content of the file as a list of lines
using the readlines() method (instead of the read() method). Once you
have stored these lines in a variable (say, lines), try to print each character to the
screen using a for loop after removing any unnecessary white space character
from each line (using the rstrip() method). You could finally print to the
screen the length of each line to make sure that the result corresponds to the
number of characters on each line. If you are interested in going further, you
could try to write these lines to a UTF-8 encoded file. The concept of file writing
in Python was not introduced in this chapter, but it is similar to file reading.
Instead of using the r argument when opening a file, the w argument should be
used. Instead of using the readlines() method, the writelines() method
can be used. If you get stuck you can find more information online, such as in
the official Python documentation.38
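The following sketch summarizes these steps; the file name chinese.txt is only a placeholder for whatever name the downloaded file actually has.

# -*- coding: utf-8 -*-
import codecs

# Read the file using its actual encoding (GB 18030).
f = codecs.open('chinese.txt', 'r', encoding='gb18030')
lines = f.readlines()
f.close()

for line in lines:
    stripped = line.rstrip()            # remove trailing white space characters
    for character in stripped:
        print character.encode('utf-8')
    print len(stripped)                 # number of characters on the line

# Going further: write the lines to a UTF-8 encoded file.
out = codecs.open('chinese_utf8.txt', 'w', encoding='utf-8')
out.writelines(lines)
out.close()
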

2.8 Further reading and resources


Now that you have completed your first programming tasks, you can take the time
to explore further regular expressions in Friedl (2006) or the Python programming
language by following online tutorials (either text-based, video-based or even
interactive).39, 40, 41 McNeil (2010) also contains detailed information on various
text processing tasks using the Python programming language (e.g. dealing with
encoding and strings and using regular expressions). Make sure to come back to
your working Python environment to put into practice what you have learned.
Programming (including regular expressions) may seem daunting at first, but as
long as you try things, pay attention to detail, practice often, make mistakes and
learn from them, it will get easier over time. Very quickly you will even realize
that you can accomplish very powerful things using relatively simple programs.
And perhaps more importantly, you will be much more confident when working
on software localization projects.

Notes
1 http://www.codecademy.com/tracks/python
2 http://www.bleepingcomputer.com/tutorials/windows-command-prompt-introduction
3 http://www.ee.surrey.ac.uk/Teaching/Unix
4 http://en.wikipedia.org/wiki/Character_encoding
5 http://nedbatchelder.com/text/unipain.html
6 http://www.joelonsoftware.com/articles/Unicode.html
7 http://www.unicode.org/standard/standard.html
8 http://www.w3.org/QA/2008/05/utf8-web-growth
9 https://www.pythonanywhere.com
10 This example, like other code snippets from this section, can be found on the book’s companion Web site.
11 http://en.wikipedia.org/wiki/Hard_coding
12 Error messages can be sometimes slightly cryptic, especially when one starts learning a language. Copying and pasting these error messages into a search engine often provides valuable information since it is quite frequent for a problem to have been previously experienced by other users.
13 http://www.gnu.org/software/gettext/manual/gettext.html#PO-Files
14 http://www.poedit.net/
15 At the time of writing, version 1.2 was the official OASIS standard (http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html) but version 2.0 was on the verge of replacing it.
16 http://dita.xml.org
17 http://docbook.org
18 http://www.omanual.org/standard.php
19 http://www.docbook.org/tdg5/en/html/ch02.html#ch02-makefrontback
20 http://www.docbook.org/tdg5/
21 http://www.gala-global.org/oscarStandards/tmx/tmx14b.html
22 https://code.google.com/p/okapi/source/browse/website/sample14b.tmx
23 http://creativecommons.org/licenses/by-sa/3.0/
24 https://code.google.com/p/okapi/source/browse/website/sample12.xlf
25 http://creativecommons.org/licenses/by-sa/3.0/
26 Accessible from the book’s companion site.
27 http://www.python.org/images/terminal-in-finder.png
28 http://docs.python.org/2/using/windows.html#installing-python
29 http://www.python.org/download/releases/
30 http://windows.microsoft.com/en-US/windows7/Command-Prompt-frequently-asked-questions
31 http://docs.python.org/2/using/windows.html#configuring-python
32 http://showmedo.com/videotutorials/video?name=960000&fromSeriesID=96
33 https://www.pythonanywhere.com
34 http://ipython.org
35 http://notepad-plus-plus.org/
36 Accessible from the book’s companion site.
37 http://en.wikipedia.org/wiki/GB_18030
38 https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
39 http://greenteapress.com/thinkpython/html/
40 https://developers.google.com/edu/python/
41 http://www.codecademy.com/tracks/python
3 Internationalization

As discussed in the previous chapter, a software application may be written using
a number of technologies (e.g. programming languages and mark-up languages),
but it is often the case that one natural language is used for its interface and
resources (e.g. English). This is due to the fact that software applications are
often targeting a specific market, in which end-users speak a particular language.
When such a software application is initially designed, making it available in
multiple languages for different markets may not be a priority or requirement.
When this is the case, the source code may not be prepared for future localization
activities. In other words, the source code is not internationalized. When a
program is not internationalized, it may still be possible to have it localized, but
the localization process will be difficult and costly. Indeed, some effort will have
to be made to ensure that all localizable elements from the source
code are identified, and that, once localized, these elements display correctly and
without causing the target application to crash. It is worth mentioning that not
all developers of software applications are necessarily aware of localization-related
issues, so having to translate strings of a non-internationalized application is not
as infrequent an occurrence as it may seem. Also, not all programming languages
are well equipped in terms of internationalization and localization mechanisms.
Some of them have heavily standardized mechanisms, such as Microsoft’s .NET
framework which supports external resources (either in text format or XML
format).1 However, other programming languages do not have a standard way
to handle localization-related activities. One notable example is the JavaScript
language, which, despite its popularity and ubiquitous presence on the Web, lacks
robust internationalization support. This means that developers often have to
come up with their own methods to provide some localization support, instead of
relying on standards or best practices.
The first section of this chapter provides an overview of the various components
of an application that may be subject to internationalization or localization
activities. The second section presents some of the challenges that may arise
when a non-internationalized software application is being localized, as well as
a review of possible internationalization strategies. Finally, the chapter reviews
some of the internationalization techniques that can be applied to application-
related content other than software strings.

3.1 Global apps


In the 1990s and 2000s, traditional software applications used to follow a
well-defined model whereby the software was clearly separated from the
documentation that accompanied it. These days most software is acquired online,
which means that printed manuals are a thing of the past. However, user guides
are still common-place for software with more complex functionality. While a
simple mobile phone application (say a clock application) does not necessarily
require a user guide, an enterprise application (such as a database server system)
will require a well-documented set of guides.
In this chapter, as well as the remainder of this book, a simple Web application
is used as an example to explain some of the concepts and challenges associated
with the creation of a global, multilingual application. The term multilingual is
used here to refer to the fact that this application should speak multiple natural
languages (i.e. its user interface must be displayed in a number of languages)
and understand multiple natural languages (i.e. it must be able to process user
information regardless of the natural language used by the user to provide that
information). The first part of this section describes the components of a typical
global software application from a technical perspective, while the second
section explains further the concept of reuse, which is so prevalent in the software
publishing industry.

3.1.1 Components
The application used as an example in this book is a very simple Web application
that can be accessed by any Web browser.2 These days a lot of applications are
written in such a way in order to reach a wide range of users regardless of the
operating system they are using. In our example, the Web application itself is
written using a combination of technologies, including the Python programming
language that was introduced in Chapter 2, HTML (which is the main markup
language for displaying Web pages), and JavaScript libraries (namely JQuery,
JQuery UI and JQuery mobile).3 JavaScript is another programming language
which can be interpreted by Web browsers in order to create rich user interfaces
and make Web pages more dynamic. In our example the application is
accompanied by a set of additional HTML and PDF pages, which are generated
from XML content. While these pages could easily be generated by the Web
application itself, it seems important to introduce a number of technologies and
file formats to show and discuss multiple internationalization and localization
strategies.
The main component of our Web application is powered by the Python
programming language thanks to functionality made available within the Django
Web framework. The Django framework is an open-source project whose goal is
to make it easier and quicker to build Web applications with less code.4 Without
going into detail of this framework, it is important to present some of its key
components. The Django framework makes it possible to build applications
in a reusable manner based on a clear distinction between content storage,
manipulation and presentation. This approach is very different from earlier Web
sites (say, static HTML pages), which used to mix these three components, making
content maintenance and updates very difficult. Besides providing this modular
approach, the Django framework also offers great support for internationalization
and localization, which allows us to show the differences between an application
that is not internationalized and an application that is internationalized. In order
to develop our Web application, the following steps were required:

1 Decide how to store the data (content) used by the Web application. In our
case, the content is a set of sport (basketball) news items generated by a news
provider. These items are stored in a database for easy retrieval.
2 Decide how to present the content to the user. In our application, this
is done using a list, but other methods could be employed (e.g. a table, a
carousel). Since this presentation layer may change independently of the
data, templates (such as the ones used by the Django framework) are often
used to allow for quick modifications of the final appearance of the HTML
page.
3 Decide which functionality to make available to users. Our simple application
has only limited functionality since the only actions that can be performed
from the page include filtering the news items based on specific words and
going to the news provider’s Web site to read more about a particular news
item or player.
4 Give a name to this application. Since its purpose is to provide news items
related to the National Basketball Association (NBA) to a large audience,
the name NBA4ALL was chosen.

This application has the advantage of using a responsive, mobile-friendly
theme, which means that it displays correctly on mobile devices, even when the
screen is limited to smaller resolutions. The responsive design approach,
whereby an optimal viewing experience is achieved regardless of the device
used, is becoming extremely popular in the software industry. The amount of
information that is immediately visible by the user is obviously smaller when a
mobile Web browser is used but users can scroll down to reach the information
required.
Having the same theme for both a desktop-based application and a mobile-
based application is not always required or even desirable. In order to provide
a rich user experience, some software publishers (and as a result application
developers) tend to develop native applications for specific platforms. Specific
internationalization and localization strategies for a range of platforms and
application types can be found in Sections 3.5 and 4.7. In this volume the focus
is placed on a versatile application, which can be used in desktop and mobile
environments on a number of platforms (e.g. Windows, Linux, OS X). Such
an application is sometimes described as cross-platform because it does not
require platform-specific components to be used. This characteristic not only
applies to the user interface of the application, but also to the documentation
and help resources associated with this application. These resources can also
be accessed from any browser, so there is no need to create platform-specific
output formats.

3.1.2 Reuse
Reusing some (or most) components during the development and publishing
of an app is a core principle of the software publishing industry. Whenever
possible, software developers will reuse existing functionality instead of creating
it from scratch. Existing functionality can be found in previous apps or in
external collections of functionality (libraries or frameworks), which may be
licensed commercially or freely. There are cases when it does make sense to start
from scratch, but the myriad of open-source projects available on sites such as
Github or Bitbucket are a great place to start reusing somebody else’s code (if the
license allows it of course).5, 6 Reuse is such a pervasive element in the software
development lifecycle that it has a major impact on (at least) two aspects of a
global application. First of all, some of the text strings used to create the user
interface may be reused from one place to another in order to save precious time
and lines of code. As explained in the second part of this chapter this strategy can
work well in some cases, but it may have serious consequences when the context
changes. Second, some content (such as a file) can be written once but reused
multiple times in a variety of contexts. The previous section already covered
this scenario since the NBA4ALL application relies on HTML content that is
generated using the same Python code regardless of the target device used to
access the application.
A similar reuse approach can be used to generate a number of documentation
files from a single source file. In the past, some of the documentation of a software
product was created in a word processing application, such as Microsoft Word or
Adobe FrameMaker, without necessarily following a strict template or schema.
The source files were then transformed into an output format such as a PDF file,
whose layout often had to be tweaked by desktop publishers before it could be
published. These days, source file formats based on markup languages such as
XML are regularly used for the creation of documentation sets. Adding structure
to the source content makes it easier to manage (and reuse), tasks that can be
supported with the use of commercial programs.7, 8 A format such as XML also
presents the advantage of being easy to manipulate by (automatic) systems,
which means that adjustments to the generated output files are not as frequent
as they used to be. XML can be used to create several output types,
including HTML and PDF, which are used as examples in this section. The first step in creating
global content is to start with a source file (which can be created using a text
editor or dedicated XML editor), as shown in Listing 3.1.
The format presented in Listing 3.1 should look familiar based on what was
introduced in Section 2.5.2. This document starts with an XML declaration to
refer to a specific version of the DocBook standard. It is then composed of an article
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://docbook.org/xml/4.2/docbookx.dtd">
<article>
<title>NBA4ALL Documentation</title>
<sect1>
<title>Filtering a list of headlines</title>
<para>
By default, ten headlines are shown in the application's main page.
In order to filter this list, a word can be entered in the Search box.
The list will change as soon as you start typing in the text box.
</para>
<note><title>Limitations</title>
<para>It is currently not possible to search for multiple
words.</para>
</note>
<tip><title>Tip</title>
<para>Both titles and descriptions are searched.
Searching for generic words may return more results than originally
thought.</para>
</tip>
</sect1>
</article>

Listing 3.1 Documentation in source XML file

element, which contains a title and a section (sect1). This section comprises a
title, a paragraph, a note and a tip. Both the note and the tip contain a title and
a paragraph. The purpose of the document is to describe the core functionality of
the NBA4ALL application. Even though the application is extremely basic, some
of its characteristics may be worth describing to aid novice users. For example,
the search functionality of the application only supports one word, so a note
element is used to mention this limitation. An additional recommendation is
also provided in the tip element.
It is worth pausing for a moment to reflect on the very narrow focus of this
document, which is about filtering a list of headlines. Creating documents
with a narrow focus on a specific topic is a key characteristic of global content
publishing. Once again, one of the main advantages of such an approach is
that these small chunks of information can be reused in multiple contexts.
For example, it is quite common for a software product to have a short Getting
Started guide, a longer user guide, and possibly an even longer administration
guide. Depending on the target audience(s), parts of these documents may be
common to all documents. Rather than creating monolithic documents, it is
therefore preferable to break these documents into smaller chunks (or topics)
with a view to using them more than once. Obviously, creating a large number of
chunks can lead to information management issues (e.g. is a chunk really suitable
for multiple contexts? Is the chunk management system powerful enough to
ensure that it is more efficient to look for an existing chunk instead of creating
$ xsltproc -o doc.html /usr/share/xml/docbook/stylesheet/nwalsh/xhtml/docbook.xsl doc.xml
$ xsltproc -o doc.fo /usr/share/xml/docbook/stylesheet/nwalsh/fo/docbook.xsl doc.xml
Making portrait pages on USletter paper (8.5inx11in)
$ fop -pdf doc.pdf doc.fo

Listing 3.2 Documentation transformation commands using XSL

it from scratch?). We cannot discuss these questions in detail, but it is worth
highlighting the fact that as a result of this chunk and reuse approach, large
translation projects of software documentation are not as frequent as they used
to be. Even though a localized application may be accompanied by a thousand-
page long documentation set, only a small subset might actually be translated
and will then be reused multiple times.
While the content contained in the document presented in Listing 3.1 is
human-readable, it can be transformed into more user-friendly formats such as
HTML, PDF or EPUB, the latter being a distribution and interchange format
standard for digital publications and documents developed by the International
Digital Publishing Forum.9 Transformations can be achieved using tools such as
xsltproc and fop and existing transformation stylesheets.10, 11, 12 Stylesheets
are documents describing how source documents should be converted into other
documents. The commands shown in Listing 3.2 are specific to those used on a
Linux machine and may vary from one environment to the next, but the most
important thing to keep in mind here is that one single source document can be
used to generate multiple documents that will look different from each other.
The transformation from XML to HTML is achieved using one command (on
the first line of Listing 3.2), while the conversion from XML to PDF requires an
intermediary step (using the XSL Formatting Objects (FO) language), which
generates some information during its execution (about portrait pages being
created). Once the commands are executed, two different-looking files are
generated, as shown in Figures 3.1 and 3.2.
This example, while extremely simple, is useful to present powerful global
publishing concepts. First, the formatting of the source XML document does
not matter. Listing 3.1 shows that line breaks, which were present on lines 8,
9 and 16 in order to separate sentences, are ignored in the final documents (be
it HTML or PDF). Second, extra information, such as a table of contents, can
be introduced by the transformation documents. While our source document
did not contain any table of contents, the final files contain one. Third, some
differences exist between the final documents. For example, the table of
contents of the PDF document contains page numbers while the HTML page
does not. In short, creating reusable content is one of the key characteristics of
global content publishing. Once this content has been created, the next step
is to make it available to a global audience. This topic is discussed in the next
section.
Figure 3.1 Documentation in HTML format

Figure 3.2 Documentation in PDF format

3.2 Internationalization of software


3.2.1 What is internationalization?
LISA, the now defunct Localization Industry Standards Association, used to
define internationalization as ‘the process of generalizing a product so that it
can handle multiple languages and cultural conventions without the need for
redesign. Internationalization takes place at the level of program design and
document development.’13
The previous section of this chapter has already covered some aspects of
program document development (with the concept of reuse). Additional
internationalization characteristics are identified in this section. According
to Esselink (2000: 3), another ‘important aspect of internationalization is the
separation of text from the software source code. Translatable text, i.e. text which
is visible to the user, should be moved to separate strings-only resource files.’ It is
worth reflecting as to why this step is required. After all, one could simply make a
copy of the files that contain the translatable strings and ask a human translator
to manually replace the English strings with their translations in a given target
language. Obviously this approach has severe limitations: first, it would be prone
to errors since it would be possible to break the code or miss strings during the
translation process. Second, creating a full copy of the original source files would
duplicate code unnecessarily. If a source code file only contains ten per cent of
lines requiring some translation, why duplicate 90 per cent of the original file in
N target languages when this would dramatically increase the size of the program?
Finally, performing a manual translation of a source file (in a text editor) may
prevent most human translators from directly leveraging translation aids (such
as glossaries, machine translation or translation memory) unless the text editor
is equipped with plug-ins to access these resources (which is unlikely). For these
reasons, translatable strings are usually extracted into an intermediary file, which
is then made available to translators.
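To give a flavour of what this separation looks like in code, the snippet below uses Python's standard gettext module; the string itself is invented and no message catalog is loaded here, so the call simply falls back to the original English text.

import gettext

# By convention, the translation lookup function is bound to the name _
_ = gettext.gettext

# Hard-coded (not internationalized):
# print "No headlines found"

# Internationalized: the string is looked up in a message catalog (a compiled
# PO file) at run time and replaced with its translation if one is available.
print _("No headlines found")
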
While translatable text strings are important, they are only one component
of the cultural data that makes up a particular locale. Another aspect of
internationalization therefore concerns the additional functions that help
programmers access and manipulate locale-specific data. Once components have
been prepared for multiple locales, they can be made available to users via a
global gateway, as discussed in the latter part of the next section, whose focus is
on engineering-related internationalization tasks.

3.2.2 Engineering tasks


Functions that are subject to internationalization include those required to parse
various input types (e.g. character sets as explained in Section 2.3.1) and support
locale-specific functionality (e.g. sorting text information, use and display of
date/time information, currency and/or numbers).

User input and output


The NBA4ALL application contains a search box where characters can be
entered. In order to make sure that this feature will be available to users who
speak and write a language other than English, the page must be coded in such a
way that non-ASCII characters can be entered into the search box (and matched
against the page’s text). This can be achieved in our application by setting the
encoding of the HTML page to UTF-8, as follows:14

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Depending on the language a user wants to see displayed on-screen using a
given application, complex user input methods sometimes have to be used by
the operating system. Some languages have limited character sets, which can be
easily covered by a standard keyboard with 101 keys. Not all languages, however,
have such small character sets so other input methods have to be considered.
For instance, it is very common for languages such as Chinese and Japanese to
rely on Input Method Editors (IME) which attempt to ‘guess which ideographic
character or characters the keystrokes should be converted into. Because many
ideographs have identical pronunciation, the IME engine’s first guess isn’t always
correct. When the suggestion is incorrect, the user can choose from a list of
homophones’ (Dr. International 2003: 175). Some keyboards allow users to enter
phonetic syllables directly (such as Kana in Japanese), but other keyboards with
Latin characters can also be used to form such syllables.
Another approach that can be used to allow users to input data in a given
application is to use handwriting. For instance, the OS X system allows users
to draw Simplified and Traditional Chinese characters using dedicated Apple
hardware.15 Obviously the adaptation work that is required to support additional
languages besides the default language(s) should not be underestimated. But
it is a good reminder that having a localized user interface may not always be
sufficient. If users want to interact with such an interface, data input must be
done in an intuitive manner, without forcing users to use an approach they are
not familiar with.

Formats
It is also very important to make sure that locale-specific information (such as
dates and times) is handled correctly by an application. The Django framework
provides such functionality given that internationalization and localization are at
the core of this project’s philosophy.16 By activating a feature in the application’s
configuration settings, dates and times are subsequently displayed using the
format specified for the current locale. In our scenario, translations have yet to be
performed for most locales, but date-related strings appear in all languages when
the user selects a language from the gateway’s language list.17
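
As an illustration, the excerpt below is a minimal sketch of the configuration switches that ask Django to format dates, times and numbers according to the active locale; it is not the actual NBA4ALL configuration, and the language list shown here is an assumption.

# settings.py (excerpt)
USE_I18N = True    # activate Django's translation machinery
USE_L10N = True    # display dates, times and numbers using the active locale's format
LANGUAGE_CODE = 'en'
LANGUAGES = [
    ('en', 'English'),
    ('fr', 'French'),
    ('ja', 'Japanese'),
]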
When applications cannot rely on a framework, such as Django, to
provide internationalization support, they sometimes have to rely on the
internationalization of the platform on which the application is executed. For
example, a desktop application can leverage some of the settings offered by an
operating system (such as Windows or Linux). Dedicated resources also exist for
programming languages that do not handle standards such as Unicode by default
(e.g. those provided through the ICU project).18
Manipulating data in a range of languages is no trivial task. Most programming
languages provide core functionality to perform basic text manipulation tasks
regardless of the language being manipulated (e.g. extracting the first character
of a text string as demonstrated in Listing 2.4 in Section 2.4). However, more
advanced functionality will sometimes be limited to certain languages. For
example, let us consider sorting a list of text strings in alphabetical order based on
the first character of each string. If the function sort is limited to characters from
the English alphabet (a to z), this function will fail or return incorrect results for
languages that use accented characters or do not use any English character. Dealing
with this type of issue falls under the remit of internationalization engineering or
functional adaptation (rather than translation tasks), but it is worthwhile being
aware of them. In some cases, adding support for additional languages may require
some translation or adaptation tasks in which translators or language engineers
may be involved. This point will be covered in greater detail in Section 6.3.3.
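
By way of illustration, the short Python sketch below contrasts naive sorting with locale-aware collation; the locale name used is an assumption and must be available on the system for the call to succeed.

import locale

words = ['zèbre', 'Éclair', 'banane', 'avocat']

# Naive sorting compares code points, so 'Éclair' ends up after 'zèbre'.
print(sorted(words))

# Locale-aware collation: strings are compared according to the rules of the
# selected locale (assumed to be installed on the system).
locale.setlocale(locale.LC_COLLATE, 'fr_FR.UTF-8')
print(sorted(words, key=locale.strxfrm))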

Access via the global gateway


The term global gateway was coined by Yunker (2003: 168) to initially refer to
the language drop-down list present on multilingual Web sites. To some extent,
this concept can be extended to most software applications that support multiple
languages. For example, a mobile phone sold in a particular country may have a
default language set up for its user interface. However, this default language may
not be the first choice of all users. So being able to provide an intuitive way to
select an alternative language is a key component of a successful global application.
Having a language list is crucial in giving users the possibility to change the
language (or even appearance) of a Web page (or application) based on their
preferences. In order to do so, the global gateway must be easy to find in a given
user interface. For example, if the list is located at the bottom of a long page, it
may not be found by some users, who may decide to leave the page because their
understanding of the default language is insufficient. Figure 3.3 shows an example
of such a list added to the NBA4ALL application.

Figure 3.3 Global gateway


The language list appears when the button containing an icon and three dots
is clicked. Users can then select one of the languages available from the list. It is
always important to have the name of the actual languages in their own language,
otherwise users may easily get lost if the list displays language names in a language
they do not know. The icon comes from a Web initiative whose objective is to
standardize the way users can select or change languages on a Web site.19 The
author’s aim was to create a global icon that everybody can quickly understand.
Whether this icon becomes mainstream remains to be seen as globe or world
map icons are extremely popular on multilingual Web sites because they are easy
to recognize. Sometimes flags are also used to indicate language selection, but
this practice is not encouraged by organizations such as the World Wide Web
Consortium because a language can span multiple countries.20 In specific cases,
however, it might be conceivable to use a flag to indicate that the content of a
Web site only applies to certain countries. An example would be an e-commerce
site that only supports shipping products to a limited set of countries. In this case,
using a flag might be a good way to inform users that they will be able to purchase
something from the Web site. For this reason, the global gateway sometimes starts
by asking users to select their region (or locale) and then their preferred language.
Other gateways may ask users to select both at the same time. Whatever the
strategy used, it should always be possible for users to change this selection later
on, since mistakes can be made and circumstances can change.
While a language or region selection list is sometimes available, it is worth
explaining how the default language may be chosen. One approach is to rely
on the language of a user’s environment. In the case of a Web application, the
language used by the Web browser may be interrogated by the Web application
in order to select a default language. Yunker (2010: 65) refers to this type of
language detection as ‘language negotiation’. Such a strategy is not limited to
Web applications since desktop applications will sometimes rely on the language
of the operating system in order to pick a language for their own user interface.
This strategy is sometimes extended to the location of a given user based on the
IP address of their computer. However, this method is not 100 per cent reliable
since Internet Service Providers may be located in a country where the user
is not physically present at a particular time. For this reason, successful global
applications tend to retain the user’s preferences (after user validation) so that
language or region adjustments are not made based on incorrect assumptions.
Further discussion on this topic will take place in Section 6.2.2.
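
To make the idea of language negotiation more concrete, here is a minimal, hypothetical sketch for a Web application: the list of supported languages and the header value are assumptions, and real frameworks (including Django) perform a more sophisticated matching.

SUPPORTED_LANGUAGES = ['en', 'fr', 'de', 'ja']

def negotiate_language(accept_language_header, default='en'):
    # The Accept-Language header lists codes by order of preference,
    # e.g. 'fr-CA,fr;q=0.9,en;q=0.8'.
    for part in accept_language_header.split(','):
        code = part.split(';')[0].strip().lower()   # drop quality values such as ;q=0.9
        primary = code.split('-')[0]                # 'fr-CA' falls back to 'fr'
        if code in SUPPORTED_LANGUAGES:
            return code
        if primary in SUPPORTED_LANGUAGES:
            return primary
    return default

print(negotiate_language('fr-CA,fr;q=0.9,en;q=0.8'))   # fr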
Another approach to pick a default language is to display the language that
is the most commonly used in a country code Internet top-level domain (e.g. de
in http://www.mydummydomainname.de). Such domain names are based on the
ISO 3166-1 standard and contain two letters such as de (for Germany) or jp
(for Japan).21 Very often, large multinational organizations will buy all of the
domain names associated with their brand. In this case, each domain name is
mapped to a particular default language (e.g. http://www.mydummydomainname.fr
with French and http://www.mydummydomainname.de with German). Identifying
which domain names users are most likely to use to access Web applications is not
always straightforward. As Yunker (2003: 369) asks, ‘which address should French
speakers enter when they want to find your French-language web site: .com or
.fr?’ The answer to this question lies in the way addresses can be constructed.
While the situation was quite straightforward in the early 2000s, it has become
slightly more complex with the possibility to use non-ASCII characters for both
top level domains (TLDs) (including country code domains, abbreviated as
ccTLD, such as .de and global domains such as .com) and actual domain names
(such as example in example.com). As Wass (2003: 3) mentions,

when the domain names were developed, they were seen as a tool to enable
the navigation of the network – to facilitate communication among the
network’s connected computers. They were not intended to communicate
anything in themselves. In the past fifteen years, however, TLDs and ccTLDs,
in particular, have, by their use and governance, constructed a space that
outwardly communicates cultural identities and values.

This is why the list of top-level domain names is now more than twice as
long as it was in the 1980s (with meaningful TLDs such as .works or .yokohama
being sponsored by specific entities).22 Even though some people will argue that
the ability to register internationalized top-level names is motivated by financial
considerations (i.e. in 2012, the initial price to apply for a new gTLD was
$185,000), one must admit that allowing non-ASCII characters in addresses is
long overdue.23 Since registering multiple domain names can be expensive, the
ISO codes are sometimes used as a prefix (e.g. http://de.mydummydomainname.com
or http://fr.mydummydomainname.com).
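
The mapping itself can be very simple, as in the hypothetical sketch below, where both a language prefix and a country-code top-level domain are checked against a small lookup table (the host names and the table are assumptions made for illustration only).

DOMAIN_LANGUAGES = {'de': 'de', 'fr': 'fr', 'jp': 'ja'}

def default_language(host, fallback='en'):
    labels = host.lower().split('.')
    prefix, tld = labels[0], labels[-1]
    if prefix in DOMAIN_LANGUAGES:       # e.g. de.mydummydomainname.com
        return DOMAIN_LANGUAGES[prefix]
    if tld in DOMAIN_LANGUAGES:          # e.g. www.mydummydomainname.de
        return DOMAIN_LANGUAGES[tld]
    return fallback

print(default_language('de.mydummydomainname.com'))   # de
print(default_language('www.mydummydomainname.jp'))   # ja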

3.2.3 Traditional approach to the i18n and l10n of software strings


In the previous section, some internationalization issues were described, but
as far as translation is concerned, the main task concerns the processing of an
application’s text strings. Before delving into the techniques that are used to
make the localization process more efficient, let’s take a look at what happens
when source code is not internationalized using the Python programming
language. Section 2.4 explained how strings work in the Python programming
language, so one may be tempted to think that it is relatively easy to extract
such strings from the code (say, using regular expressions or dedicated tools) in
order to translate them. After all, strings are clearly identified with opening and
closing single or double quotes. However, this approach over-generates because
it will extract strings that are not meant to be translated. This is confirmed when
using a dedicated tool such as xgettext which scans source code files, such as
the Python source code used to generate the main NBA4ALL page, for strings
to translate. By default, this tool writes extracted strings in a file called
messages.po, which uses the Portable Object format introduced in Section 2.5.1.24 Figure
3.4 shows such a messages.po file opened in a dedicated translation tool, Virtaal,
which is able to display .po files in a user-friendly manner.25
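
A minimal sketch of the naive extraction strategy mentioned above is shown below; the file name is an assumption. It pulls out every quoted string it finds, which is precisely why it over-generates.

import re

# Match any single- or double-quoted string in the source code.
QUOTED_STRING = re.compile(r"""(?P<quote>['"])(?P<text>.*?)(?P=quote)""")

with open('views.py') as source_file:              # file name is an assumption
    for match in QUOTED_STRING.finditer(source_file.read()):
        # Internal values such as 'headlines' or 'published' are printed
        # alongside genuinely translatable strings.
        print(match.group('text'))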

Figure 3.4 Output of xgettext viewed in Virtaal

When looking at the strings in Figure 3.4, we can see that most of the
extracted strings, such as published or subheading do not appear in the actual
NBA4ALL application so they should not have been extracted (because they are
not translatable). To work around this problem, it is possible to look at the code
to check whether strings are translatable or ask the author of the application.26
Both solutions are time consuming compared to the one described below.
A typical software internationalization and localization workflow therefore
involves a number of steps:
1 marking translation strings in the source code
2 extracting them into a translation-ready format
3 translating them
4 compiling the resource containing the translated strings
5 loading the translated resources into the application.
Since the focus of this chapter is on internationalization, we will concentrate
on the first two steps in the present section and the next two sections. The
last three steps will be covered in Chapter 4. As mentioned earlier, the Django
framework makes it easy for Web developers to internationalize their applications
by marking text strings that require translations. Such marking is required in at
least two types of files: the Python code itself and the templates that are being
used to generate the final HTML pages.
In order to identify or mark translatable strings in the Python code of a Django
application, a special function is imported, translatable strings are prefixed using
an underscore character _ and wrapped within brackets, as shown in Listing 3.3.

1 from django.utils.translation import ugettext as _
2
3 def home(request, collection="headlines", direction=-1, key_sport="published"):
4
5     #Translators: This title is followed by a list of basketball news stories
6     subheading = _("Your latest NBA headlines.")

Listing 3.3 Internationalization of Python code in a Django application

This code snippet should look familiar to you by now. A special function
ugettext is imported on line 1 from the django.utils.translation
module. In order to avoid repeating the typing of ugettext in front of every
string (which would increase the size of the program), it is mapped to the
underscore character as a shortcut. The underscore character is used on line 6 to
wrap the text string assigned to the subheading variable (Your latest NBA headlines).
You may have noticed in this example, however, that two strings are not marked
with the underscore characters (headlines and published on line 3). These strings
are not translatable because they are used internally by the application to perform
specific tasks (that are invisible to the end-user of the application). Using the
ugettext function is therefore crucial in order to identify with confidence
strings that are translatable from strings that are not translatable (even though
they might look like they are).
What is therefore interesting to point out is that some extra lines of code are
required to internationalize the application. By default (and based on what was
presented in the previous chapter), the variable subheading would be defined by
assigning the text string as follows:

subheading = "Your latest NBA headlines."

This explains why many applications are often written in a non-internationalized
manner, simply because it is simpler and quicker not to write these extra lines of
code when developing the initial application. Obviously, this approach can lead
to a huge amount of work afterwards if it is decided that the application should
be localized.
The source Python code file from Listing 3.3 contained only one of the strings
that appear in the application’s main page. Other strings reside in a different
type of file, called an HTML template, which is used to generate HTML markup.
Listing 3.4 shows what some of the template used in the NBA4ALL application
looks like.
First of all, the template starts with an import of the internationalization
functionality on line 1 with the {% load i18n %} declaration. Translatable
strings are then inserted within a {% trans %} block, as shown for example
on line 17, or a {% blocktrans %} block (e.g. on line 25). These blocks
contain strings that appear in the actual application’s page (e.g. Choose Language
or See what your team has been up…). Note that this second string is not complete
because it actually contains a link to an image. You may notice, however, that the
1 {% load i18n %}
2 <!DOCTYPE html>
3 <html>
4 <head>
5 <title>NBA4ALL</title>
6 <!-- Links to CSS and JS files omitted -->
7 </head>
8 <body>
9 <div data-role="page">
10 <div data-role="header">
11 <h1 align="center" style="font-size:20px">NBA4ALL</h1>
12 </div>
13 <div role="main" class="ui-content" data-theme="a">
14 <a href="#popupMenu" data-rel="popup" data-transition="slideup" class="ui-btn
   ui-corner-all ui-shadow ui-btn-inline ui-icon-myicon ui-btn-icon-left
   ui-btn-a">...</a>
15 <div data-role="popup" id="popupMenu" data-theme="a">
16 <ul data-role="listview" data-inset="true" style="min-width:210px;">
17   <li data-role="list-divider">{% trans 'Choose Language' %}</li>
18   {% for code, language in langs.items %}
19   <li><a href="http://{{ url }}/{{ code }}/">{{ language|title }}</a></li>
20   {% endfor %}
21 </ul>
22 </div>
23 <h2 align="center">{{ subheading }}</h2>
24 {% comment %}Translators: The next sentence should be quite catchy.{% endcomment %}
25 <p align="center">{% blocktrans %}See what your team has been up to thanks to
   <a target="_blank" href="http://dummysource.com"><img
   src="http://dummysource.com/logo.png"></a>!{% endblocktrans %}</p>

Listing 3.4 Internationalization of an HTML template in a Django application

contents of some of the lines are not included in {% trans %} blocks. The string
on lines 5 and 11 is the name of the application (NBA4ALL), so in this scenario
it is deemed to be non-translatable. Obviously this decision is debatable because
application (or even brand) names are sometimes translated or adapted during
localization. Various approaches to handle this issue, which is specific to digital
content, are discussed by Baruch (2012). A special code block is also present on
line 23: {{ subheading }}. This block is used to insert the content of the
subheading variable that was defined in the Python code itself in Listing 3.3.
This example illustrates how Django’s templating system works. Variables present
in the HTML document (from Listing 3.4) can be replaced with content (e.g.
strings) defined in the Python code (Listing 3.3). This approach is extremely
popular in modern Web applications because it allows back-end developers (e.g.
Python developers) and front-end developers (e.g. HTML designers) to focus on
what they know and do best. When text gets created by two (or more) different
individuals, however, consistency issues may arise, which is why additional
internationalization techniques are presented in Section 3.2.4.
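
Assuming a configured Django project, the extraction of strings marked in both types of files can be triggered programmatically, as in the sketch below; the same step is more commonly run from the command line with django-admin makemessages.

import django
from django.core.management import call_command

# Requires the DJANGO_SETTINGS_MODULE environment variable to point to the
# project's settings before django.setup() is called.
django.setup()

# Collect strings marked with _() in Python code and with {% trans %} or
# {% blocktrans %} in templates, and write them to locale/fr/LC_MESSAGES/django.po
call_command('makemessages', locale=['fr'])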
Variations of such an i18n and l10n workflow are possible depending on
the programming language or framework that is used to develop the source
1 # The source_strings.py file contains: subheading = "Your latest NBA headlines."
2 import source_strings
3
4 def home(request, collection="headlines", direction=-1, key_sport="published"):
5
6     print source_strings.subheading

Listing 3.5 Externalization of source strings

application. For instance, rather than marking translation strings in the source
code and extracting them into a translation ready format in a separate step, one
may decide to externalize translation strings directly into a strings-only file.
Listing 3.3 showed that the developer of an internationalized Django application
could still define source strings in the middle of source code. Other programming
languages and frameworks rely on separate files to completely isolate source
strings. Listing 3.5 shows how this could be achieved in the Python programming
language by adapting slightly the code shown from Listing 3.3.
In this adapted example, an external file (source_strings.py) is used to store all
strings, which can then be used by other parts of the program by (i) referencing
the external file (in this case, by importing the module on line 2 in Listing 3.5)
and (ii) accessing specific strings using arbitrary names (e.g. on line 6 with
source_strings.subheading).
This approach is commonly used in Windows applications that rely on the
.NET framework. In this framework, the external files containing translatable
source strings are called .RESX files (because the XML format is used to store
these resources). Similarly Java programs rely on properties files.27 It is also
possible to come across proprietary formats used by software publishers who have
decided not to rely on existing formats or could not do otherwise because the
language or framework did not provide a standard internationalization method.
Deciding whether two steps should be used instead of one will largely depend
on the framework or programming language used during development. From a
translation perspective, there should be minimal impact on the actual translation
work, but it does not do any harm to know which upstream steps were used to
generate the file requiring some translation.

3.2.4 Additional internationalization techniques


While the core internationalization step is to clearly set off translatable strings
from the rest of the code, other techniques can be used to make the localization
process more efficient. For example, you may have noticed that Listing 3.3 and
Listing 3.4 contained comments (on line 5 and line 24 respectively) to inform
translators about the context of a particular string or on how the text should be
rendered in a target language. Once again these comments are purely optional so
the easiest approach for a developer is to do nothing and omit such comments.
Omission can be due to two reasons: the developer (or string writer or string
editor) may not have the time to write comments for translators or may not
feel qualified to make comments or recommendations to translators. Comments,
however, can be extremely useful during the localization process, especially
when translators do not have access to the context (i.e. access to the page of the
application containing a particular string). This lack of access to context can be
caused by a number of factors:

• Expectations: an application publisher may expect the third-party translation
provider to be already familiar with a given application.
• Confidentiality: even if a non-disclosure agreement is in place between
publishers and translators, an application publisher may be reluctant to give
access to a running version of their application (or screenshots of the same)
to third-parties (to prevent leaks).
• Complexity: making each page of the application available to translators
(even in screenshot format) may require additional work that is not deemed
necessary by the application publisher.

To some extent, these factors are not specific to the ICT sector, since similar
issues are often reported in the film industry (e.g. confidentiality). Regardless
of the motivation for not providing any context or comments to translators,
localization-specific issues are likely to occur, especially when the product and
number of target languages are large. Such issues include mistranslations of
ambiguous source strings or truncated translated strings because of length issues.
These issues often have to be resolved during a localization quality assurance
step, but they could easily be avoided if more time was spent preparing the source
strings in the first place. Besides providing comments, other examples of source
string preparation include avoiding string concatenation, using meaningful
variable names (as discussed in Section 2.2), and paying special attention to the
way the plural form is generated.
The topic of string concatenation was introduced in Section 2.4.1. This
technique can be very appealing to application developers because it means
that they have less code to type. It is easy to see why this approach can lead to
significant translation problems where languages whose word order differs from
the source language are concerned. In the example in Listing 3.6, three short
strings on line 1 are concatenated in a single string on line 2.
This approach may be tempting from a reuse perspective, because if the
topic of the application was changed from basketball to American football
(i.e. from NBA to NFL), two strings might be reusable in English (Your latest
and headlines). Similarly, if the application also contained a section on tweets,
it might be possible to reuse Your latest and NBA to form the string Your latest
NBA tweets. However, major problems are bound to happen either during the

1 first, second, third = _("Your latest"), _("NBA"), _("headlines.")
2 subheading = first + " " + second + " " + third

Listing 3.6 Bad use of string concatenation


translation process or during the publishing process. During translation, it might
be impossible for a translator to provide a unique translation for a given string.
For example, Your latest may require a translation that takes gender into account,
so having to translate without context is likely to create poor translations. The
second problem concerns word order. The subheading variable from Listing 3.6
concatenates the short strings in the order in which they appear in English.
However, word order will vary from language to language. For example, in
French, the result would be Vos derniers NBA titres instead of Vos derniers titres
NBA. In order to avoid these problems (which would be extremely costly to fix),
this type of string concatenation is discouraged. One way to work around the
word order issue presented here is to rely on substitution markers, which were also
introduced in Section 2.4.1, as shown in Listing 3.7.
The first example used between lines 1 and 16 does not solve the word order
problem. Relying on identical substitution markers (such as three %s on lines
4 and 12) is useless because it is not possible to express the fact that one %s
should be moved to a different location in the final string. Instead, name-specific
substitution markers are required, as shown on line 18. On this line, it is possible
to define a French word order (%(first)s %(third)s %(second)s), which is different
from the English word order (%(first)s %(second)s %(third)s). The final output of
the statement on line 22 returns the expected word order even though the three
components have been translated independently. It is worth mentioning again
that the use of substitution markers has limitations, which may lead to translation
situations that cannot be solved. When they have to be used, however, it is always

1  # In:
2  first, second, third = "Your latest", "NBA", "headlines"
3  # In:
4  subheading = "%s %s %s." % (first, second, third)
5  # In:
6  print subheading
7  # Out:
8  # Your latest NBA headlines.
9  # In:
10 first, second, third = "Vos derniers", "NBA", "titres"
11 # In:
12 subheading = "%s %s %s." % (first, second, third)
13 # In:
14 print subheading
15 # Out:
16 # Vos derniers NBA titres.
17 # In:
18 subheading = "%(first)s %(third)s %(second)s." % {"first": first,
   "second": second, "third": third}
19 # In:
20 print subheading
21 # Out:
22 # Vos derniers titres NBA.

Listing 3.7 Using substitution markers


recommended to make sure that they are given meaningful names to give a clue
to translators. Our example may be improved if %(first)s %(second)s %(third)s
were replaced by %(intro)s %(organization_name)s %(content_type)s. Another
problem associated with the use of substitution markers concerns pluralization,
especially in languages where there might be multiple plural forms depending on
the amount of entities being counted. Further discussions and solutions regarding
this engineering-related issue are available online.28, 29, 30
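
As a simple illustration of the pluralization problem, the standard gettext module provides ngettext, which lets a translation catalogue supply as many plural forms as the target language needs; the sketch below uses the default (untranslated) catalogue, so only the English singular/plural pair is shown.

import gettext

count = 3
# ngettext(singular, plural, n) picks the right plural form for the active
# catalogue; translated .po files can define more than two forms if needed.
message = gettext.ngettext('%(count)d headline was found.',
                           '%(count)d headlines were found.',
                           count) % {'count': count}
print(message)   # 3 headlines were found.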
Additional examples of string-related issues, which are specific to the Microsoft
.NET framework, are also available online.31
Since some of these internationalization issues are well-defined, developers
often try to simulate the localization process before it actually takes place. This
technique, which is referred to as pseudo-localization, can be implemented with
Django or Python applications using a tool called fakelion.32 This tool is based
on an iterative process that allows developers to spot strings that have not been
marked using the _ mechanism or identify strings that may become longer when
translated. The idea is quite simple and based on conversions of source text
strings. Such conversions include reversing the order of the strings or expanding
their width to identify space issues, as shown in Figure 3.5.
Figure 3.5 shows a page that somewhat resembles the NBA4ALL app’s main
page. The title is fully recognizable (NBA4ALL) because we saw that this
string had not been marked for translation. The other two strings show pseudo-
localization in action, with the strings Your latest NBA headlines and See what your
team has been up to! being reversed. Besides, the length of these two strings was
increased by using full-width Unicode character equivalents instead of traditional
A to Z and a to z characters. Space limitation is indeed a source of concern
for application developers because a translated string may not necessarily fit in
the space allocated for a given source string. By using longer pseudo-localized
strings, developers can quickly identify what may happen during the localization
and fix the problem during the development cycle (instead of having to wait
until the problem has been reported during a localization quality assurance step,
when it might be too late to fix anything). Space issues can also be avoided
by using a flexible, responsive display system, which may arrange or truncate
strings in a clean manner. In the mobile version of the NBA4ALL application,
story descriptions span multiple lines whereas they only span one in the desktop
version. This is due to the fact that no specific dimensions have been given to

NBA4ALL

senildaeh ABN tsetal ruoY

!ot pu neeb sah maet ruoy tahw eeS

Figure 3.5 Pseudo-localized application


the element containing the description. Using layouts that can be rearranged
automatically based on the size of the text can be extremely useful when dealing
with multiple languages. In contrast, story titles may be too long to fit in the box
so three elliptical dots are sometimes automatically added by the jQuery Mobile
framework, thus truncating the text. Text truncations can be extremely frustrating
when they make text incomprehensible. In our application, however, most of the
content is displayed below the title and the user has the possibility to read the
full story by going to the original news provider’s Web site. From a translation
perspective, space constraints often have to be taken into account, which is why
it is important to keep an eye open for comments from developers. Additional
information on text size-related issues is provided by the Internationalization
group of the W3C.33
This section has covered a lot of ground, focusing on some of the challenges
that can be encountered when a Web application is not internationalized
properly. The next section will focus on some of the issues that can arise when
the content itself is not internationalized or prepared for translation.

3.3 Internationalization of content


As seen in the previous section, Web application publishers must plan ahead to
ensure they can quickly cater for all their multilingual customers. This challenge
is commonly regarded as a Web globalization issue, whereby Web content should
be designed and maintained with localization in mind (Esselink 2003a: 68).
While the previous section focused mostly on the user interface of a global Web
application, the present section introduces techniques that relate to the actual
content present on Web pages. Structure-related techniques will be described
first, before stylistic issues and techniques are addressed in great detail.

3.3.1 Global content from a structural perspective


Efforts are being made by working groups within the World Wide Web
Consortium (W3C) to standardize the creation of specific content within XML
and HTML documents, in order to simplify the access to (multilingual) Web
content. The W3C is the organization that works on creating and maintaining
Web standards.34 So far most of the examples presented in this chapter have
focused on content present in text files. However, content is sometimes present
in other media types, such as pictures. Such pictures may contain text, which
may be difficult to read, and possibly difficult to extract and translate. For this
reason, the W3C recommends the use of a technique known as text alternative
when dealing with images that are not purely decorative.35 This technique
may be used to make sure that people with low vision (who may have trouble
reading the text with the authored font family, size and/or colour) are not at a
disadvantage.36 Other users can also benefit from such text when placing their
mouse over graphics or logos to display tooltips, as in the example NBA4ALL
application.
Similar internationalization choices must be made when creating Web video
content, which may contain textual information (such as scene text or captions).
Selecting the right type of captions can be regarded as an internationalization task.
Pfeiffer (2010: 251) describes various types of captions. According to her, ‘captions
that are mixed into the main video track of the video are also called burnt-in
captions or open captions because they are always active and open for everyone to
see’. Like hard-coded strings, these captions are not flexible because they cannot
be separated from the core video content. Instead, Pfeiffer recommends using
either in-band (provided as a separate track in the media resource) or external
captions (provided as a separate resource and linked to the media resource through
HTML markup). Even though the HTML5 specification has recently become an
official standard, the track element is not widely supported by Web browsers.37
This element is supposed to provide a way for authors to specify external timed
tracks (as shown in one of Pfeiffer’s online examples).38
Another example of standardization from the W3C concerns the
Internationalization Tag Set (ITS), which provides translatability rules so that
translators (be they human or automatic systems) know whether an element must
be translated or not.39 According to the creators of ITS, such rules may be added
by content authors or information architects.40 The example in Listing 3.8 shows
how one of our earlier XML examples has been adapted to include an ITS rule.
In this example, the article element refers to a couple of namespaces:
the Docbook namespace and the ITS namespace. As explained in the previous
chapter, namespaces allow tag sets from multiple vocabularies to be mixed
in the same document. On the last line, the term NBA4ALL is included in a
phrase element. This element is assigned a translate ITS data category with
a value of no so that tools or people who will be manipulating this content are
made aware that this particular phrase must not be translated. The ITS data
categories go beyond specifying what is translatable or what is not translatable.
For example, the latest set of data categories include identifiers dealing with
terminology, notes (e.g. for translators), language information, provenance, or
the type of characters allowed in certain elements.41 Listing 3.9, which is taken
from the Recommendation version of the ITS document, presents an example
of an XML document that makes use of the allowedCharactersRule
element.42 This example shows how this element is used to specify that the *
and + characters must not be used in any of the content elements present in
the XML document. This global rule is defined in the head of the document,
using a regular expression contained in the value of the allowedCharacters
attribute of the allowedCharactersRule element. This regular expression

<?xml version="1.0" encoding="UTF-8"?>
<article xmlns="http://docbook.org/ns/docbook"
    xmlns:its="http://www.w3.org/2005/11/its"
    its:version="2.0" version="5.0" xml:lang="en">
  <title><phrase its:translate="no">NBA4ALL</phrase> Documentation</title>

Listing 3.8 Using the ITS translate data category


<?xml version="1.0" encoding="UTF-8"?>
<myRes xmlns:its="http://www.w3.org/2005/11/its">
<head>
<its:rules version="2.0">
<its:allowedCharactersRule allowedCharacters="[^*+]"
selector="//content"/>
</its:rules>
</head>
<body>
<content>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed
diam voluptua.</content>
</body>
</myRes>

Listing 3.9 Using an ITS data category to specify excluded characters

makes use of a character class. The character class (defined with the opening
and closing brackets) contains an initial caret character (^) which negates the
following characters (i.e. the * and + characters) to express the fact that any
character but the * and + character is allowed. While this rule is human-readable,
its target user is likely to be a program (e.g. a translation program) configured to
check that this rule is adhered to by entities manipulating this document (e.g. a
human translator during a translation step in a localization workflow).
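
A checking program could enforce such a rule with a few lines of Python, as in the sketch below, which reuses the [^*+] pattern from Listing 3.9; the function name is an assumption.

import re

ALLOWED_CHARACTER = re.compile(r'[^*+]')   # pattern taken from Listing 3.9

def disallowed_characters(content):
    # Return the characters that are not covered by the allowed-characters
    # pattern, i.e. those an ITS-aware checker should flag.
    return sorted({char for char in content if not ALLOWED_CHARACTER.match(char)})

print(disallowed_characters('Lorem ipsum * dolor + sit amet'))   # ['*', '+']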
While the present section focused on the structure of documents, the following
section presents some content internationalization principles from a stylistic
perspective.

3.3.2 Global content from a stylistic perspective


This section focuses on the authoring principles that are sometimes used during
the development process of content that belongs to an application’s ecosystem.
It is worth mentioning that some of these principles only apply to a subset of
the content types presented in Figure 1.1. For example, the use of controlled
language proved popular for the authoring of technical documentation since this
content type tends to favour technical accuracy over engaging content.

Overview of Web and technical content writing principles


Numerous style guides exist in the ICT sector to provide recommendations on the
language that should be used when writing technical documentation (be it Web-
based or print-based). Some of these style guides tend to be language-specific, such
as Kohl (2008) for English, while others provide guidelines that are language-
neutral. For example, some writing guidelines to produce comprehensible and
translatable Web content are presented by Spyridakis (2000: 376), such as the
recommendation to use ‘simple sentence structures and internationalized words
and phrases’. Such guidelines are based on principles such as text conciseness
and content scannability, the latter principle referring to the process of reading
specific sections or fragments of a given document (Nielsen 1999: 101). Text
conciseness is a general principle of technical communication, whereby writers
are often asked to avoid verbosity as suggested by D’Agenais and Carruthers
(1985: 100), Gerson and Gerson (2000: 31) or Raman and Sharma (2004:
187). However, excessive conciseness may sometimes have a negative impact
on the clarity of messages, due to ambiguities introduced by the compression of
information (Byrne 2004: 24). By removing words such as articles or prepositions
from the source text, the clarity of a message may be affected. This issue is likely
to occur if guidelines are too stringent. For instance, Gerson and Gerson (2000:
243) suggest that Web content sentences should contain between 10 and 12
words. If this guideline were to be enforced in a systematic manner, essential
syntactic components would be removed from certain sentences. According to
Nielsen (1999: 104) the scannability of Web content is motivated by two factors.
First, it is believed that users take 20–30 per cent longer to read from a screen
than from a page, and second, they rarely read a source text in its entirety. They
tend to focus on the part of the text that most interests them. Pym (2004: 187)
even goes as far as saying that users are ‘no longer readers due to the loss of
discursive linearity of texts’. However, the consumer of (Web-based) content is
not always a human, so while a reader may scan text, most automatic systems
(including machine translation systems) are likely to process the same text in
its entirety. This means that machine-specific writing guidelines (or rules) are
sometimes required to supplement human-specific guidelines.

From content writing guidelines to controlled language rules


A possible solution to the language barrier lies in a complete automation of
the translation process by using Machine Translation (MT). While MT will be
covered in further detail in Section 5.5 from a translation perspective, the present
section focuses on the controlled language rules that are often used to simplify the
text that may be machine-translated during a localization process. A Controlled
Language (CL) is a ‘subset of a natural language that has been specifically restricted
with respect to its grammar and its lexicon’ (Schwitter 2002: 1). The lexical and
grammatical restrictions that define a CL are therefore the results of well-thought-
out choices. The origins of CL can be traced back to as early as the 1930s with
Charles K. Ogden’s Basic English (Ogden 1930) which contained a restricted
lexicon of 850 words. It is worth noting that Ogden’s Basic English was never
designed with translation in mind, but rather to solve ambiguity problems such
as synonymy or polysemy for readers of English texts. Incidentally, these readers
were intended to be both native speakers of English and non-native speakers of
English. Ogden’s ideas were then emulated a few decades later in the automotive
industry with the introduction of Caterpillar Fundamental English (CFE). This CL
was ‘intended for use by non-English speakers, who would be able to read service
manuals written in CFE after some basic training’ (Nyberg et al. 2003: 261). A
survey of CLs by Adriaens and Schreurs (1992: 595) indicates that this CL was
quickly followed by Smart’s Plain English Program (PEP) and White’s International
Language for Service and Maintenance (ILSAM). The latter gave birth to the
Simplified English (SE) rule set developed by the Association Européenne des
Constructeurs de Matériel Aérospatial (AECMA). The SE rule set went on to
become a standard in the authoring of aircraft maintenance documentation. All of
these projects had one characteristic in common: they used CLs that were designed
to improve the consistency, readability and comprehensibility of their source text
for human readers. These CLs are therefore often regarded as Human-Oriented
CLs, and, according to Lux and Dauphin (1996: 194), are not adequate for Natural
Language Processing (NLP) due to their lack of formalization and explicitness.
This can be seen by some of the vagueness associated with AECMA SE’s rules for
descriptive writing, such as rule 6.2, ‘Try to vary sentence lengths and constructions
to keep the text interesting’.

Controlled language rules and (machine) translatability


The collaboration between the Carnegie Group/Logica and Diebold Inc. described
by Hayes et al. (1996: 89) and Moore (2000) extended the use of a CL beyond
seeking improvements to the consistency, comprehensibility and readability
of source documentation; Diebold Inc. was interested in introducing a CL to
optimize its translation workflow, which was using translators and translation
memories. The optimization of this workflow was to be done by ‘reducing word
count, increasing leverageable sentences, and reducing the amount of expensive
terminology’ (Moore 2000: 51). Despite reporting savings of 25 per cent in
translation costs with the introduction of CL, Moore mentions that other benefits
were harder to quantify, such as customer satisfaction or fewer support calls. The
first companies that used a CL to reduce their translation costs were Rank Xerox
using a SYSTRAN MT system (Adams et al. 1999: 250) and Perkins Engines
(Pym 1990) in the 1980s. Various companies soon imitated them, and one of the
most successful projects to combine CL with MT was the collaboration between
Caterpillar and Carnegie Mellon University throughout the 1990s. Whereas
Perkins Approved Clear English (Pym 1990) used a small number of rules (ten)
and a small lexicon, the CL developed at Caterpillar was characterized by its
strictness. As a revamped version of CFE, Caterpillar Technical English (CTE)
was specifically designed to improve the clarity of the source text so as to remove
ambiguities during the automatic translation process (Kamprath et al. 1998).
Despite a significant productivity hit in source authoring (Hayes et al. 1996:
86), which may be explained by interactive disambiguations that authors had
to perform, the introduction of CTE’s 140 CL rules and controlled terminology
enabled the heavy machinery manufacturer to significantly reduce translation
costs by publishing machine-translated documentation in multiple languages. The
particularly high and cumbersome number of rules can be explained by the fact
that the MT system used, which was based on the KANT MT system (Mitamura
et al. 1991), involved an interlingual process. The abstract representation of the
source text, obtained after the parsing of the English sentences, had to be universal,
so as to generate sentences in multiple target languages. This implementation of
CL for MT showed that the accuracy of the MT output depended heavily on
the level of control present in the source. The fact that such large companies
committed to such a paradigm may be explained by improved communication
between development groups and localization groups. Amant (2003: 56) explains
that for a long time, ‘members of both fields (translation and technical writing)
perceived their professional activities as separate from one another’.
Around the same time, similar CL and MT projects at General Motors
(Godden 1998) and IBM (Bernth 1998) showed that CL rules could significantly
improve the quality of MT output in various language pairs. Other benefits were,
however, more difficult to quantify.
This was mentioned by Godden and Means (1996: 109), who reported that
benefits such as higher customer satisfaction could not be measured but argued
strongly for the implementation of the Controlled Automotive Service Language
(CASL) rule set. Besides MT-oriented CL rules, general guidelines on machine
translatability are also present in the literature as those provided by Bernth and
Gdaniec (2002). Past projects by Gdaniec (1994), Bernth and McCord (2000)
and Underwood and Jongejan (2001) have shown that there are ways to measure
the machine translatability of a source text. One way to describe translatability
concerns the generation of ‘gross measures of sentence complexity’ (Hayes et al.
1996: 90). This process involves counting some of the following phenomena and
attributing some penalties: sentence length, numbers of commas, prepositions,
and conjunctions, supplemented by restrictions on some locally checkable
grammatical phenomena, such as passive and –ing verbs. A similar approach was
used in the Logos Translatability Index proposed by Gdaniec (1994).
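
The sketch below gives a flavour of such a 'gross measure': it attributes crude penalties for length, commas, prepositions, conjunctions and -ing forms. The word lists and weights are assumptions for illustration only; real translatability indexes rely on much richer linguistic analysis.

import re

PREPOSITIONS = {'of', 'in', 'on', 'at', 'to', 'for', 'with', 'by', 'from'}
CONJUNCTIONS = {'and', 'or', 'but', 'because', 'although', 'while'}

def complexity_score(sentence):
    words = re.findall(r"[A-Za-z'-]+", sentence.lower())
    score = 0
    if len(words) > 25:                                   # long sentence penalty
        score += 2
    score += sentence.count(',')                          # clause density
    score += sum(1 for w in words if w in PREPOSITIONS)
    score += sum(1 for w in words if w in CONJUNCTIONS)
    score += sum(1 for w in words if w.endswith('ing'))   # potential -ing forms
    return score

print(complexity_score('The list will change as soon as you start typing in the text box.'))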

Challenges with controlled language rules


The proliferation of projects described in the previous two sections suggests that
CL rules vary from one language pair to another or from one MT system to the
next. This has been confirmed by a study performed by O’Brien (2003: 111),
which found that eight English CLs shared only one common rule, ‘the rule that
promotes short sentences’. Besides, the impact of CL rules on machine translation
systems that are built using a statistical approach does not seem to be as clear-
cut as the one experienced when working with rules-based machine translation
systems. This has been shown in various studies, including Aikawa et al. (2007).
Another complication stems from the fact that some CL rules are sometimes
described in generic terms, which may make writers or content developers apply
more changes than they are required to. If developers perform undesirable and
unexpected changes, some writing rules ‘may even do more harm than good’
(Nyberg et al. 2003: 105). The lack of granularity in the definition of CL rules
may also be explained by the insufficient linguistic background of writers and
content developers in general. If the linguistic phenomenon addressed by a CL
rule is described in minute detail using linguistic terminology, content developers
may not be able to implement the required change.
Since technical writers are often subject matter experts and not linguists, their
potential lack of linguistic knowledge may prevent them from understanding
which word or phrase should be altered, replaced or removed. This issue is
especially relevant when the CL rule is only prescriptive, because writers are not
told what they are allowed to write. This issue has been raised by CL checker
developers who state that some of the AECMA SE’s examples ‘do not always
represent the best advice’ (Wojcik and Holmback 1996: 26). Uncertainties
surrounding reformulations can then also arise if a CL checking application does
not provide any alternative rewriting of problematic sentences. Applications
providing (controlled) language checking support are discussed in the next
section.

Language checkers
A CL checker may be defined as an application designed to flag linguistic
structures that do not comply with a predefined list of formalized CL rules.
Traditionally, most checkers have operated at a sentence level. For instance,
Clémencin (1996: 34) states that ‘the EUROCASTLE checker works at the
sentence level and has very little knowledge of the context.’ This can obviously
be problematic if some of the structures to identify include phenomena such as
anaphora (which may require resolution at the paragraph or document level).
Simpler programs, also known as proofreading or style checking programs, can
also be extended to perform some controlled language rule checks. The open-
source LanguageTool program falls into this category.43 It is defined by its author
as proofreading software that ‘finds many errors that a simple spell checker
cannot detect and several grammar problems.’ This tool is available in multiple
forms, ranging from an extension for open-source word processing programs
such as OpenOffice.org and LibreOffice to a standalone application. It checks
plain text content in a number of languages and detects text patterns (using a
number of techniques including regular expressions). Most rules are defined
in an XML format so that they can be easily edited and refined by end-users.
More complex rules can also be created using the Java programming language.
Figure 3.6 shows both input text and the results of the check in LanguageTool’s
graphical interface.
The results of the check (i.e. the rule violations) are reported by LanguageTool
in the bottom part of the user interface. Once the text present in the top part of
the interface is checked by the program, results are reported to the user including:

1 The position of the character that violates a particular rule. For example, the
first problem appears on line 8 in column 1 (where column means the first
character of the line).
2 The description of the rule (e.g. Sentence is over 40 words long, consider
revising).
3 Some context, including all the words that match this particular rule,
previous characters and following characters.

Figure 3.6 Rule violations detected by LanguageTool

This example shows that the detection rules have descriptions that read like
suggestions. It is therefore down to the user to decide whether implementing the
change will improve the overall quality of their text. It must be said, however,
that some of the rules can sometimes over-generate by triggering in contexts that
are perfectly legitimate (these are known as false alarms). When the precision of
the rule is too low, the rule can even become a source of frustration, which is why
it is sometimes possible to disable (or deactivate) a particular rule. The opposite
scenario is also possible. When a rule does not trigger in a context where it would
be expected to trigger, it is because the rule does not have a perfect recall. This
can be explained by a number of reasons: since rules tend to be created by people,
it is possible that these people have not thought of all possible combinations a
rule should cover. Another reason concerns the tools and resources that are being
used to power the checking procedure. In the example above, some of the rules
are more complex than others. For instance, the rule that detects a series of three
nouns has to rely on an external tool to determine what a noun is. Such a tool is
known as a part-of-speech tagger since it assigns a part-of-speech (POS) to each
word (or token) in a particular segment. LanguageTool allows users to assign POS
tags to their input text and to see the results of this process in the bottom part of
the interface, as shown in Figure 3.7.

Figure 3.7 Tagging a text to better understand checking results

Each word from the input text is followed by bracketed information (including
possible dictionary forms and part-of-speech tags separated with a / character).
In the example above, we can see that the sequence basketball news application
is detected as a series of three nouns in Figure 3.6 while the sequence basketball
news list is not. This is due to the fact that the word list is ambiguous and can
be assigned a verb part-of-speech in certain contexts. This is confirmed by the
output shown in Figure 3.7, where list was tagged as a noun (with the NN tag) but
also as a verb (with the tags VB VBP). Because of this ambiguity the rule did not
trigger in this particular context. This example shows that the right balance must
be found between false alarms and silence in order to make sure that the expected
benefits of using a language checker are obtained (e.g. improving the text’s
readability or machine translatability). Some of the tasks presented in the next
section focus on addressing specific problems associated with language checking
rules (e.g. evaluating the impact of source modifications on translation quality).
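
The interplay between false alarms and silence can be illustrated with a toy version of the three-noun rule discussed above; the hand-made noun list is an assumption, and it is precisely because the ambiguous word list is missing from it that the second phrase goes unflagged.

NOUNS = {'basketball', 'news', 'application', 'team', 'headline'}

def three_noun_sequences(tokens):
    hits = []
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        if all(token.lower() in NOUNS for token in window):
            hits.append(' '.join(window))
    return hits

print(three_noun_sequences('this basketball news application'.split()))   # flagged
print(three_noun_sequences('this basketball news list'.split()))          # silence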

3.4 Tasks
This section contains three basic tasks and three advanced tasks:

1 Evaluating the effectiveness of global gateways
2 Internationalizing source Python code (advanced)
3 Extracting text from an XML file (advanced)
4 Checking text with LanguageTool
5 Assessing the impact of source characteristics on machine translation
6 Creating a new language checking rule (advanced)
The third task is actually optional but should be attempted if you want to
practice file and text manipulation from a programmatic perspective.

3.4.1 Evaluating the effectiveness of global gateways


In this task, you should open your favourite Web browser and navigate to Web
sites of multinational companies or global products (e.g. Nivea.com, Honda.
com or Ikea.com) to evaluate how these sites have implemented their global
gateways. The first step is to locate the global gateway if it is not immediately
obvious from the landing page. Once you have identified it, you should try to
select the configuration that corresponds best to your usual location (e.g. if
you normally reside in Austria, you should try to find a section of the site that
pertains to the Austrian locale). You should spend some time to reflect whether
these gateways follow the internationalization principles that were outlined in
Chapter 2 in the ‘Access via the global gateway’ section.

3.4.2 Internationalizing source Python code


In this task, a modified version of the simple number guessing game introduced
in Section 2.4 is used. Like other code snippets, its online location can be
accessed from the book’s companion site. While this version of the game
has been improved to introduce additional error checking (e.g. the program
informs the user when their input is not a number), the code has not been fully
internationalized. Only the library that allows for internationalization has been
imported. Indeed, some warning messages are generated by the xgettext tool, as
shown in Listing 3.10.
The goal of this task is therefore to make modifications to the source code so
that:

•	Each translatable string is marked up with the _() construct used in Listing 3.3.
• Values of function arguments such as yes are ignored.
• Strings documenting functions are ignored (i.e. strings starting with DOC:)
since these are comments for developers rather than end-users.
•	Strings are formatted using meaningful names instead of %s or %d to avoid
warnings when the xgettext tool is run (a short sketch of this change is shown
after this list).
• Lines containing ambiguous or obscure translatable strings are preceded
with a comment giving some context about the string (as shown in
Listing 3.3).
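
As a minimal sketch of the named-placeholder change mentioned above (the values are hypothetical and the official solution file may differ), a positional format string can be rewritten with a mapping so that translators are free to reorder the placeholders:

from gettext import gettext as _

secret_number, attempts = 42, 5  # hypothetical values for illustration

# Positional arguments: xgettext warns because the translator cannot reorder them.
print(_("You've found '%d' in %d attempts! Congratulations!") % (secret_number, attempts))

# Named arguments with a mapping: the placeholders can be moved around freely
# in the translated string without breaking the program.
print(_("You've found '%(secret_number)d' in %(attempts)d attempts! Congratulations!")
      % {"secret_number": secret_number, "attempts": attempts})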

Once you have made your modifications, you could run the xgettext tool
using the following command in a Linux environment:

xgettext -c secret3.py

This command should create a messages.po file in your working directory (where
the -c parameter instructs the program to extract comments preceding lines with
translatable strings). You can then open this file using a text or dedicated editor
to check that all strings have been extracted alongside their comments. Ideally it
should look more or less like the solution file from Listing 3.11.
$ xgettext -a secret3.py
secret3.py:36: warning: ’msgid’ format string with unnamed arguments cannot be
properly localized:
The translator cannot reorder the arguments.
Please consider using a format string with named arguments,
and a mapping instead of a tuple for the arguments.
secret3.py:47: warning: ’msgid’ format string with unnamed arguments cannot be
properly localized:
The translator cannot reorder the arguments.
Please consider using a format string with named arguments,
and a mapping instead of a tuple for the arguments.

Listing 3.10 Output of xgettext

#. TRANSLATORS: This message invites the user to pick a positive number.
#: secret3.py:16
msgid "Select a maximum positive number:"
msgstr ""

#. TRANSLATORS: %s is _("Select a maximum positive number:")
#: secret3.py:21
#, python-format
msgid "Wrong input. %s"
msgstr ""

#. TRANSLATORS: This refers to the "Enter" key from the keyboard
#: secret3.py:35
msgid "Enter"
msgstr ""

#. TRANSLATORS: This question asks the user to pick a number and press a key
#: secret3.py:39
#, python-format
msgid "Guess the number between 0 and %(max_number)s and press %(key)s."
msgstr ""

#. TRANSLATORS: This string tells the user that they have found the number
#. after a certain number of attempts
#: secret3.py:50
#, python-format
msgid ""
"You've found '%(secret_number)d' in %(attempts)d attempts! Congratulations!"
msgstr ""

Listing 3.11 Expected output



Figure 3.8 False alarms reported on XML content


3.4.3 Extracting text from an XML file


One of the goals of this task is to reinforce the file and text processing skills that
you developed in the previous chapter. The first step of this task is therefore
to download an XML file and save it to a location that you can access from
Python commands. The file, which is available online, may be accessed and
saved using a Web browser following the steps outlined in Section 2.7.8. If
you would rather avoid clicking on graphical applications, you could also look
back at the urllib module to access the file programmatically. Once you have
downloaded and saved this XML file, the next step is to extract text from it.
This is required because LanguageTool only supports text files as input files. So
if we were to check an XML file directly, incorrect flags (false alarms) would be
reported, as shown in Figure 3.8 where valid XML constructs (such as version=)
are highlighted.
In order to extract text from this XML file, replacements based on regular
expressions can be used. You should now try to write a small Python program that
reads the XML file, removes any XML tag and writes the text to a file, which
should more or less look like the answer that is available online.
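
A minimal sketch of such a program is shown below; the file names are hypothetical and the regular expression is deliberately naive (it simply removes anything between angle brackets), so the answer available online may well differ:

import re

# Hypothetical file names; adjust them to wherever you saved the XML file.
with open("udoc.xml") as xml_file:
    content = xml_file.read()

# Remove every XML tag (anything between < and >) and tidy up the spacing.
text = re.sub(r"<[^>]+>", " ", content)
text = re.sub(r"[ \t]+", " ", text)

with open("udoc.out", "w") as text_file:
    text_file.write(text)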

3.4.4 Checking text with LanguageTool


The goal of the next task is to interpret checking results provided by
LanguageTool. The first step is therefore to familiarize yourself with the program,
by looking at the rules it contains for a language of your choice.44 You then
have two options: you can either download LanguageTool to your system or
use an online version (which does not contain all of the rules and options of
the standalone version).45 Note that the standalone version of LanguageTool
requires Java to be installed on your system. If you have decided to use the local
version and you have satisfied the Java dependency, you can start the Graphical
User Interface presented in Figure 3.6 by running the following command:

java -jar LanguageToolGUI.jar

To check some text, you can type your own text in the top window of the
standalone program or in the demo window of the online application. If you
use the standalone program, you can check text files such as the solution file
provided for the previous exercise: udoc.out.46 If you use the online version,
you can copy and paste the content of this solution file into the demo window.
Take some time to edit your input text based on the problems identified by
LanguageTool. While doing so, you should ask yourself whether some of your
changes may lead to new problems if you triggered another check. If that was
the case, what should you do?

3.4.5 Assessing the impact of source characteristics on machine translation


In this task, a text containing mistakes that have been introduced on purpose
will be used. This text can be accessed programmatically or manually (either
by saving the content in a text file or by using copy-paste).47 Once you have
obtained this ill-formed text, you have two options. The first one consists in
using an online tool that provides an integrated environment to perform pre-
editing, machine-translation and post-editing.48 The second one is more manual,
as you should check the text with LanguageTool (either by using the standalone
or online version) and fix those problems that have been correctly identified by
the tool. You should also check the rest of the text to make sure that all problems
have been identified. For those problems that were not found by the tool, can
you think of reasons as to what may have happened? You should now have two
versions of the text: the original ill-formed text and your modified version. You
should now use online machine translation services to automatically translate
both versions into a language of your choice.49, 50, 51 If you find it difficult to
visualize what may have changed between the translations of the original text
and the source text (or between the translations provided by two different
systems for a given text), you could use an online tool to highlight differences.52
You should now spend some time to analyse the results by considering the
following:

• Did your modifications improve the overall quality of the translation?


• Are these improvements consistent among translation services?
• Are there problems from the original text that were corrected by some of the
machine translation services? If so, does that surprise you and why do you
think that might be the case?
• Do you think some of your source edits may have led to degradations in the
translation?
• If you are able to evaluate another target language and you repeat
the translation step using another language, do you still see the same
degradations or improvements? If so, what does this tell you about the
machine translatability of a source text?
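
If you prefer to work offline rather than using the online difference checker mentioned above, a small sketch based on Python's standard difflib module can serve the same purpose (the file names are hypothetical):

import difflib

# Hypothetical file names containing two translations of the same source text.
with open("mt_original.txt") as f1, open("mt_preedited.txt") as f2:
    original = f1.readlines()
    preedited = f2.readlines()

# Print a unified diff so that changed lines are easy to spot.
for line in difflib.unified_diff(original, preedited,
                                 fromfile="original", tofile="pre-edited"):
    print(line.rstrip())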

3.4.6 Creating a new checking rule


The last task is an advanced task. While checking some text, you may have
noticed some problems that are not identified by LanguageTool. There are a few
possible explanations for this behaviour: the first one is that a rule exists but
its coverage is not sufficient to identify the problem that was present in your
text. The second explanation is that no rule has been created, and the third one
is that the problem cannot be detected with the technology that is currently
used by LanguageTool. If you feel the problem you are trying to solve falls in the
second category, you may try to create a rule. Simple rules can be created using
LanguageTool’s online editor.53 For example, you may decide that the term Search
box is no longer relevant in your documentation set. Instead you would like to
use the term Find box. Figure 3.9 shows how such a rule can be created using
a regular expression (to make sure that both box and boxes are detected). The
online system will even perform a test check on Wikipedia content so that the
rule’s impact can be assessed.
Once the Create XML button is clicked, instructions on how to use this rule
are provided by the system. Rules are expressed in a simple, human-readable
XML format, which can be copied and pasted into the XML file that is used
to check a particular language. Once this is done, the application should be
restarted and the rule selected so that it triggers as expected in multiple contexts,
as shown in Figure 3.10.

Figure 3.9 Creating a simple rule using regular expressions

Figure 3.10 Results provided by a newly created rule

3.5 Further reading

Key internationalization concepts have been introduced here, with a focus
on specific technologies and document formats (such as Python, Django and
DocBook). These concepts apply to a wide range of technologies, which are likely
to surface when working with digital content in a global context. Some of these
topics are regularly discussed at industrial events such as the Internationalization
and Unicode Conferences (though these tend to have a strong engineering
focus).54 Internationalization can be a very technical topic, which has already
been covered at length in the past for a number of languages or platforms such
as Windows (Dr. International 2003), Java (Deitsch and Czarnecki 2001), .NET
(Hall 2009 and Smith-Ferrier 2006), and Visual Basic (Kaplan 2000). Since
technology evolves very quickly, some of these resources are likely to be outdated.
Additional resources can be found online for specific technologies or
standards, including (but not limited to):

• DITA translatability best practices55


• Pseudo-localization tool for the Java programming language56
• Internationalization resources for the PHP programming language57
• Internationalization for a wider range of programming languages58
• Internationalization (and localization) of a Windows Phone 7 application.59

Towards the end of the chapter, some natural language processing concepts
(such as part-of-speech tagging) were briefly mentioned. Further information on
this topic can be found in Bird et al. (2009) or Perkins (2010) with examples
using the Python programming language.

Notes
1 https://msdn.microsoft.com/en-us/library/ekyft91f(v=vs.90).aspx
2 http://app1.localizingapps.com
3 http://jquerymobile.com/
4 https://www.djangoproject.com/
5 https://github.com/
6 https://bitbucket.org/
7 http://www.madcapsoftware.com
8 http://www.adobe.com/ie/products/robohelp.html
9 http://idpf.org/epub
10 http://xmlsoft.org/XSLT/xsltproc2.html
11 https://help.ubuntu.com/community/DocBook#DocBook_to_PDF
12 http://sourceforge.net/projects/docbook/
13 http://www.w3.org/wiki/Its0504ReqKeyDefinitions
14 http://www.w3.org/International/questions/qa-choosing-encodings#useunicode
15 http://support.apple.com/kb/HT4288
16 https://docs.djangoproject.com/en/1.7/topics/i18n/formatting#overview
17 Similar benefits can be achieved for traditional Python programs using the Babel
internationalization library: http://babel.pocoo.org/
18 http://site.icu-project.org
19 A’ Design Award & Competition, Onur Müştak Çobanlı and Farhat Datta: http://www.languageicon.org/
20 http://www.w3.org/TR/i18n-html-tech-lang#ri20040808.173208643
21 https://www.iso.org/obp/ui/#search
22 https://www.iana.org/domains/root/db
23 http://en.wikipedia.org/wiki/Generic_top-level_domain#Expansion_of_gTLDs
24 https://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html
25 http://virtaal.translatehouse.org/
26 http://www.framasoft.net/IMG/pdf/tutoriel_python_i18n.pdf
27 http://docs.oracle.com/javase/tutorial/i18n/intro/steps.html
28 http://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms
29 https://docs.djangoproject.com/en/1.7/topics/i18n/translation/#pluralization
30 http://translate.sourceforge.net/wiki/l10n/pluralforms
31 http://msdn.microsoft.com/en-us/library/aa292178(v=vs.71).aspx
32 https://launchpad.net/fakelion
33 http://www.w3.org/International/articles/article-text-size
34 http://www.w3.org/
35 http://www.w3.org/TR/html-alt-techniques#sec4
36 http://www.w3.org/TR/UNDERSTANDING-WCAG20/visual-audio-contrast-text-
presentation.html
37 http://www.whatwg.org/specs/web-apps/current-work/multipage/the-video-element.
html#the-track-element
38 http://html5videoguide.net/code_c9_3.html
39 http://www.w3.org/TR/its20/
40 http://www.w3.org/TR/its20#potential-users
41 http://www.w3.org/TR/its20#datacategory-description
42 http://www.w3.org/TR/2013/REC-its20-20131029/examples/xml/EX-allowedCharacters-global-1.xml
Copyright © [20131029] World Wide Web Consortium, (Massachusetts Institute of Technology, European Research Consortium
for Informatics and Mathematics, Keio University, Beihang). All Rights Reserved.
http://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231
43 http://languagetool.org/
44 http://languagetool.org/languages/
45 https://www.languagetool.org/
46 Accessible from the book’s companion site.
47 Accessible from the book’s companion site.
48 http://www.accept-portal.unige.ch
49 http://itranslate4.eu
50 https://translate.google.com/
51 http://www.bing.com/translator/
52 http://www.diffchecker.com
53 http://languagetool.org/ruleeditor/
54 http://www.unicode.org/conference/about-conf.html
55 http://www.slideshare.net/YamagataEurope/dita-translatability-best-practices
56 http://code.google.com/p/pseudolocalization-tool
57 http://onlamp.com/pub/a/php/2002/11/28/php_i18n.html
58 http://help.transifex.com/features/formats.html
59 http://www.localisation.ie/resources/courses/summerschools/2012/WindowsPhone
Localisation.pdf
4 Localization basics

4.1 Introduction
Various steps in traditional globalization workflows were introduced in the
previous chapter, specifically in Section 3.2.3. The word traditional is used here
to refer to proven and scalable workflows, which have been used extensively by
multiple companies for the publishing of localized products in multiple languages.
An example of such a workflow is shown in Figure 4.1.
The steps of this workflow include the internationalized creation
(including the marking) of source content (be it software strings or structured
documentation), the possible extraction of this content into a format that can
be easily translated into one or multiple target languages, the actual translation
of the content, the merging of the translated content back into the original
file(s) and finally some post-processing (including quality assurance testing) to
make sure that no problems were introduced during any of the previous steps.
Since internationalization was covered in the previous chapter, the present
chapter focuses on all of the localization-related steps: extraction, translation,
merging, building and testing, when applied to various content types pertaining
to an application’s ecosystem, including software content, user assistance and
information content. This chapter focuses on localization steps and processes
rather than on the translation technology tools that may be used to perform or
support the actual translation task, which will be the focus of Chapter 5.

[Figure 4.1: the globalization of strings and content is shown as a six-step sequence (1 Create, 2 Extract, 3 Translate, 4 Merge, 5 Build, 6 Test), where the first step corresponds to internationalization (I18N) and the remaining steps to localization.]

Figure 4.1 Steps in traditional globalization workflow



4.2 Localization of software content


In Chapter 3 a simple Web application (NBA4ALL) was presented from an
internationalization perspective. This application has a default user interface
in English. In this section, the steps required to transform this user interface
into other languages are presented. As will be discussed briefly in Section 4.2.5,
most of these steps may be abstracted away by dedicated commercial programs (such
as Alchemy Catalyst or SDL Passolo).1, 2 However, understanding in detail what
each step does seems beneficial from a translator’s perspective. Questions arising
during the translation process may result from decisions that were taken upstream
(say, during the creation or extraction of software strings), so it seems important
to provide low-level examples to explain software localization concepts.
Similarly, problems arising during a quality assurance testing step may be caused
by translation choices so understanding how to avoid them seems beneficial.

4.2.1 Extraction
By default the xgettext string extraction tool introduced in Section 3.2.3
generates a catalog file using the PO format for each file containing source code
or templates. This approach can be quite cumbersome for projects containing a
large number of files so it is often preferable to group all translatable strings into
a single package. This grouping can be achieved very easily with the Django
framework thanks to the makemessages tool, which examines every file from
a project and extracts any string that is marked for translation.3 While doing so,
it creates or updates a catalog file in a specific directory, specifically the locale\
language_code\LC_MESSAGES directory where language_code corresponds to
the language code of a particular locale (say de for German). Once catalog files
have been created, they are ready to be translated as discussed in the next section.
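
As a brief, hedged illustration (the exact invocation depends on how the project is configured, and a locale directory must already exist), the German catalog file could be created or refreshed from the project's root directory with a command along the following lines:

django-admin makemessages --locale de

Running this once per target language, or with the --all option, produces or updates the locale/de/LC_MESSAGES/django.po file and its counterparts for the other locales.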

4.2.2 Translation and translation guidelines


The translation step can be done in a number of different ways, the most common
one involving a human translator who needs to access or download the catalog
file (say, a .po file) and use a program of their choice to provide translations
(e.g. by populating the lines starting with msgstr as explained in Section 2.5.1).
Various online and desktop tools exist to do this, including simple text editors or
advanced Translation Environment Tools (TEnT).
Using a dedicated TEnT provides the advantage of having access to
productivity features such as a translation memory or a glossary of terms, which
may have been provided alongside the catalog file as part of the translation kit.
A more detailed discussion of translation memory and terminology glossaries will
take place in Chapter 5. Very often translation conventions or rules must also
be observed when localizing software strings since these strings are at the heart
of the user experience. Translation guidelines may therefore also be included in
the transkit. Guidelines can be extremely concise, such as those provided for the
translation of Evernote or Mozilla applications.4, 5 Guidelines can be much more
detailed, as those provided for the translation of Microsoft applications.6 While
these companies have decided to make their translation guidelines available
publicly, some companies decide to keep them private because they regard them
as being valuable intellectual property material. When guidelines are public,
they tend to become well-known and used extensively in the industry, which
means experienced translators do not necessarily have to spend too much time
assimilating them at the start of a project. Proprietary guidelines, however, can be
a challenge for new translators who may have to adjust their translation habits to
comply with them. As far as software strings are concerned, translation guidelines
tend to cover the following categories:

• placeholders
• markers for hotkeys
• HTML fragments
• tone
• abbreviations
• terminology.

Placeholders correspond to the programming language-specific substitution
markers, such as %s or {0}, that were introduced in Section 2.4.1. Since these
placeholders are replaced by specific values when the application is executed, it is
essential to ensure that they are preserved in the translated text. For instance, the
Evernote translation guidelines mention that hints are provided to translators
when they come across placeholders such as @{0}, @{parameterName},
%1$i, nf#.7 The best way for translators to become comfortable with such
placeholders is to become familiar with the underlying programming or markup
language used by the target application.
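
As a purely illustrative example (the French wording below is invented for this sketch), named placeholders can be reordered in the target segment as long as they are reproduced character for character:

#, python-format
msgid "You've found '%(secret_number)d' in %(attempts)d attempts!"
msgstr "Il vous a fallu %(attempts)d tentatives pour trouver « %(secret_number)d » !"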
The second category of phenomena to pay attention to concerns the use
of markers to identify hotkeys (or accelerators). A full discussion of this topic is
provided in the next section, but specific guidelines such as avoiding the use of
‘hotkeys on letters with down-strokes like q and g’ exist to maximize the user
experience (Microsoft 2011: 54).
The third category is HTML fragments. As explained in Section 3.2.3, Web
applications often rely on HTML templates to generate the final pages that are
shown to users. Such templates contain a mix of HTML elements and template
strings, a subset of which is made available to translators in catalog files. For
instance, the example in Listing 3.4 showed that translatable content could
be marked using trans or blocktrans constructs. These constructs usually
contain textual strings, but they may also contain HTML markup, as shown in
the following example which could appear in a string catalog file.

See what your team has been up to thanks to <a target="_blank"
href="http://dummysource.com"><img src="http://dummysource.com/logo.png"></a>!
One characteristic of HTML markup is that it is not as strict as XML. While
an XML document must be well-formed to be manipulated by dedicated systems
(such as Web browsers), Web browsers will attempt to display HTML pages that
do not comply with the HTML standard. Having said that, loss of functionality or
visual degradations will emerge if specific HTML markup is poorly handled during
the translation process.8 In the above example, the HTML part starts with <a and
finishes with </a>. This code allows the import of a clickable image in the final
HTML page. If parts of this code were to be removed or changed, it is extremely
unlikely that the image would be displayed properly in the localized page. In this
particular example, the main operation that the translator would need to do is
move this code sequence to the relevant position in the target language as word
order may be different. In some cases, however, translators may have a legitimate
reason to remove or add HTML markup if the translation environment or guidelines
allow them to do that. For instance, some HTML elements such as strong are
used to emphasize parts of a text (e.g. <strong>Warning:</strong> Do not change
the settings.).9 Depending on how emphasis is expressed in the target language, one
could envisage removing this element and using words rather than formatting.
Finally, it is worth mentioning that specific values of HTML attributes are
translatable, such as the alt attribute of img elements used to display images, as
shown below.10

See what your team has been up to thanks to <a target="_blank"
href="http://dummysource.com"><img src="http://dummysource.com/logo.png"
alt="DUMMYSOURCE, sports news provider"></a>!

This slightly modified example shows that the image import has been enhanced
for accessibility reasons by adding alternative text as suggested in Section 3.3.1.
In this example, the value of the alt attribute must be translated to make sense
in the target language. A similar approach has to be adopted for the values of
href attributes in hyperlink or a elements.11 In the example above, should this
text be translated into Spanish, one could consider replacing the href=“http://
dummysource.com” part with href=“http://dummysource.es” so that users are
directly brought to a relevant section of the target site (i.e. without requiring
them to make an extra selection in the global gateway of the target site). Whether
such a replacement is necessary or desirable should be clearly indicated in the
translation guidelines.
Tone is also a common area of focus in software strings’ translation guidelines
since the target user must be addressed in a consistent manner that corresponds
to their expectations. Formality levels may vary from one language to another (or
from one application to another). For instance, the Spanish translation guidelines
used at Twitter advocate referring to ‘the user as tú, not vos or usted, (…) [k]
eep[ing] the tone informal, but [without] us[ing] local or regional slang or words
that may not mean the same in all countries.’12 On the other hand, the German
translation guidelines for the Microsoft Windows phone platform recommend a
style that is both direct and personal: ‘For German, the formal second person is
to be used (Sie instead of du), as the target audience prefers to be addressed in a
formal, professional way and is not likely to want to see du all over their mobile
phone.’13
Abbreviations are especially relevant as far as mobile applications are
concerned as space constraints may require specific strings to be shortened
during the translation process. Official abbreviated forms may therefore be
provided in translation guidelines. Finally, specific guidance is likely to be
provided for application- or domain-specific terminology, be it in a specific
section of the guidelines or as a terminology glossary. While further discussion
will be provided on this topic in Section 5.4, it is worth keeping in mind that
for specific applications, technical accuracy is one, if not the most, important
characteristic of the translation process so that the user is able to navigate and use
the application as it was originally intended by the developer(s). For this reason,
Dr. International (2003: 325) reminds us that ‘without some in-depth knowledge
of the product, a localizer won’t be able to make sense of the source text, and thus
won’t be able to translate the text accurately in the target language.’ Obviously
some applications are much more technical than others, so advanced technical
skills are not always critical.

4.2.3 Merging and compilation


Once changes have been made to the messages in the catalog file, it must be
transformed into a compiled format so that it can be used more efficiently by
the (Web) application. This can be easily achieved once again using the Django
framework, this time using the compilemessages tool. Once this tool is
successfully executed, a .mo file is created for each .po file in each of the LC_
MESSAGES directories, as shown in Listing 4.1.
Once this step is completed, these resources become available for the Web
application to be viewed in the corresponding language. This language can then
be made available to users by updating the language list or global gateway as
explained in Chapter 2 in the ‘Access via the global gateway’ section. The merging
and compilation process is not always straightforward as specific technical issues
can appear once the translation has been performed. One example concerns

$ Is locale
de/LC_MESSAGES:
dj ango.mo dj ango.po

es/LC_MESSAGES:
dj ango.mo dj ango.po

f r/LC_MESSAGES:
dj ango.mo dj ango.po

Listing 4.1 Viewing the contents of a locale directory


90 Localization basics

Figure 4.2 Hotkeys in Python application using the TkInter toolkit

hotkeys or shortcut keys that are often used in desktop applications (rather than
mobile applications where touch input is favoured). Such keys are associated with
specific word letters (e.g. F from File so that a file menu can be accessed using a
mnemonic key combination such as ALT + F instead of using the mouse to click
the menu). Hotkeys can be expressed in various different ways depending on
the programming language and graphical user interface (GUI) framework used.
For instance, some applications tend to rely on the & or _ character in a string
to indicate that the following letter is a hotkey (e.g. &File). From a translation
perspective, this character needs to be preserved in the target language by making
sure that conflicts do not occur. The developer who creates strings in the source
language has to make sure that hotkey letters do not get duplicated. For instance,
if an application contains an Actions menu and an About menu, two hotkey
letters must be identified, as shown in Figure 4.2.
The minimalistic application presented in Figure 4.2 is written in Python using
the TkInter graphical user interface toolkit (which is one of the toolkits used to
build portable Python desktop applications). Even though it is very simple, this
example shows how hotkeys can be associated with different word letters in each of
the menus. To accomplish this disambiguation and avoid conflicts, the position of
specific word letters has to be specified in the source code as shown in Listing 4.2.
The code shown in Listing 4.2 is more complex than any of the examples
provided up to now in this book so some parts may be difficult to understand. This
code is presented, however, so that issues can be avoided during the translation
process itself. The lines of interest are lines 11, 12, 13 and 14. Two of these lines
are comments that indicate to translators the positions of the hotkeys in the
menu strings. Even though these positions are determined by the values of the
underline parameters on lines 12 and 14 (i.e. 0 and 1), these positions would not
be accessible to a translator who does not have access to the source code. This is
confirmed when extracting translatable strings and generating a messages.po file
as shown in Listing 4.3.
1  import Tkinter
2  import sys
3  from gettext import gettext as _
4
5  class App(Tkinter.Tk):
6      def __init__(self):
7          Tkinter.Tk.__init__(self)
8          menu_bar = Tkinter.Menu(self)
9          file_menu = Tkinter.Menu(menu_bar, tearoff=False)
10         file_menu2 = Tkinter.Menu(menu_bar, tearoff=False)
11         #Translators: hotkey is on first letter
12         menu_bar.add_cascade(label=_("Actions"), underline=0, menu=file_menu)
13         #Translators: hotkey is on second letter
14         menu_bar.add_cascade(label=_("About"), underline=1, menu=file_menu2)
15         file_menu.add_command(label="Quit", command=quit, accelerator="Ctrl+Q")
16         file_menu2.add_command(label="Exit", command=quit, accelerator="Ctrl+E")
17         self.config(menu=menu_bar)
18
19         self.bind_all("<Control-q>", self.quit)
20         self.bind_all("<Control-e>", self.quit)
21
22     def quit(self, event):
23         print "See you soon!"
24         sys.exit(0)
25
26 if __name__ == "__main__":
27     app = App()
28     app.title("Hotkeys")
29     app.mainloop()

Listing 4.2 Use of hotkeys in a TkInter application

$ xgettext -c tk.py
$ tail messages.po
#. Translators: hotkey is on first letter
#: tk.py:12
msgid "Actions"
msgstr ""

#. Translators: hotkey is on second letter
#: tk.py:14
msgid "About"
msgstr ""

Listing 4.3 Extracting the hotkey strings with xgettext

The first two lines in Listing 4.3 show the commands that are used to (i)
generate the messages.po file and (ii) view its content using the Linux tail
command so that only the last ten lines of the file are displayed. This file
contains two strings to translate with the position constraint clearly indicated
in the comments. In this particular example, a translator would have to come
up with two translations by taking into account the fixed position of the hotkey.
Finding an acceptable translation could, of course, become challenging if the
hotkey was in a position that could not be reached by the target string. This is
specifically the type of problem that would appear when testing the application,
as described in the next section. In conclusion, best practices would suggest that
it is the responsibility of the developer to clearly indicate how hotkeys should
be handled by the translators, who should in turn make sure that they follow
the recommendations that are provided either in comments or guidelines. More
complex situations can emerge as shown in lines 15, 16, 19 and 20 in Listing 4.2.
These lines are currently not marked for translation, which is why they do not
appear in Listing 4.3. A close inspection, however, reveals that they do contain
translatable strings on lines 15 and 16 (Quit, Exit, Ctrl+Q and Ctrl+E). Such
strings are associated with a different type of key combination. Whereas the
first example used the position of specific characters, these strings rely on the
values passed on lines 19 and 20, specifically “<Control-q>” and “<Control-e>”.
When these key combinations are used, the program is exited. In this example,
the translation process is slightly more challenging. As in the first example the
mnemonic association should ideally be preserved in the target language by
avoiding conflicts. However, there is an additional constraint since the chosen
key combination must be supported by the GUI toolkit and the environment
of target end-users. If a special key is chosen during the translation process (e.g.
a key corresponding to an accented character), problems may arise if the GUI
toolkit cannot process such keys (because it has not been fully internationalized)
or if one of the target end-users uses a different keyboard from the one used by the
translator. Again, this type of problem can be detected during a quality assurance
testing step, which is the focus of the next section.

4.2.4 Testing
When the architecture of an application does not follow internationalization
principles or best practices, unexpected problems are likely to arise during the
quality assurance process (assuming a quality assurance process is in place in
the global delivery workflow). A localization quality assurance step can also
be referred to as localization testing because it may not be sufficient to check
that translated text displays correctly in a localized application. The quality
assurance process can be broken down into several areas, including functional
testing, compliance testing, compatibility testing, load testing and localization
testing. Actually separating localization testing from other testing types can
be misleading because every aspect of an application (be it its functionality,
compliance with norms or standards, or integration with other applications) may
be impacted by the localization process. For instance, some core functionality
may be adapted during the localization process, as explained in the next chapter
in Section 6.3.3. Whenever an application undergoes such adaptation, additional
testing is required. Compliance with norms or standards may also be impacted by
the localization process because some norms may be locale-specific. Finally, the
integration with third-party services requires specialized testing when third-party
services exhibit specific characteristics. For example, an application integrating
with online banking systems may require various testing configurations depending
on the countries where the banking systems are located. Examples of tests to be
performed on a localized application may include the following:

• Making sure that the application runs as expected on target platforms.


• Making sure that translated strings are properly displayed in the application.
• If applicable, making sure that target user input can be captured and processed
by the application.
• If applicable, making sure that target language output is displayed properly
by the application.

In the example used earlier in this chapter, the NBA4ALL application would
have to be tested using a combination of operating systems and Web browsers to
ensure that the core functionality is working regardless of the combination used.
For instance, the language list should appear whenever a user clicks or touches
the language icon. The fact that the user is using a localized operating system or
Web browser should not affect this core functionality. Other types of checks are
related to user input and output. For instance the NBA4ALL application allows
users to filter items based on keywords so that any character provided by the
user should be handled correctly regardless of the language used. Testing for all
of these potential issues can be time-consuming, especially if the application is
being localized into multiple languages and if multiple updates to the source code
happen during a project as explained in Section 4.2.6.
While these aspects are crucial in releasing truly global applications,
translation-related testing often focuses on checking on the display of translated
text. As mentioned in Section 3.2.4, problems resulting from string concatenation
and expansion (or worse, lack of translation) have an immediate negative visual
impact, so it is easy to fix these first. But very often, more fundamental problems
may exist (e.g. wrong text direction for languages such as Arabic or Hebrew)
and such problems can truly affect the end-user’s experience. Assigning severity
levels to all of these issues is therefore an integral part of the localization quality
assurance process. In order to solve such problems, various types of testing can
be used, ranging from manual to fully automated. Manual testing involves going
through the various screens or pages of an application to check that translated
text displays correctly and that it is not misleading for an end-user. After all, the
translation process may have occurred in a context-free environment, so it is not
unusual to find mistranslations in localized applications, especially when strings are
short and ambiguous (e.g. does the string Share drive refer to the sharing of a drive
or to a drive containing a share?). To work around translation issues originating
from a lack of context, an alternative localization approach will be presented
in Section 4.2.8. A manual testing step may also include functionality-related
checks to ensure that the application behaves according to local conventions.
For instance, if one of the application’s screens allows the user to sort some
information (e.g. in a tabular format), then the sorted results should correspond
to what’s expected in the target locale (i.e. the order should not necessarily be the
same as the one used in the source locale). Obviously such manual tests are prone
to error and extremely tedious (especially when the application changes very
often), so it is common to resort to semi-automated or fully automated testing
procedures to verify the functionality and display of localized applications. An
example of a tool that can be used to automate this process is Huxley.14 This
tool can be used to automatically monitor browser activity, taking screenshots for
each visited page and informing the user when these pages change. This means
testing can be performed on subsets of an application instead of re-testing an
application from scratch every time a new build is available. Another cloud-
based service that can be used to automate testing on multiple combinations of
platforms and Web browsers is the service offered by Saucelabs.15
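
The locale-specific sorting check mentioned above can be illustrated with a short Python sketch (it assumes that the de_DE.UTF-8 locale is installed on the test machine, and the exact output depends on the platform's collation data):

# -*- coding: utf-8 -*-
import locale

words = ["Zitrone", "Äpfel", "Banane"]

# Code-point ordering places 'Äpfel' last, which would look wrong to a German user.
print(sorted(words))

# German collation places 'Äpfel' first, matching the target locale's expectations.
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
print(sorted(words, key=locale.strxfrm))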
Solving clipped text problems related to string expansion can be achieved
in a number of ways. As mentioned in Section 3.2.4, the best way to avoid this
type of problem is to use a responsive format that does not use fixed dimensions
in the source. If this is not possible, translations may have to be shortened by
possibly using abbreviated forms. Another option is to use custom layouts for
the target languages that require longer or shorter strings. While some European
languages will be prone to string expansion (e.g. French and German when the
source language is English), some Asian languages (e.g. Chinese) tend to be more
compact so using a one-size-fits-all approach is often sub-optimal. Custom layouts
can be created by resizing some of the User Interface elements, an approach which
was made very popular by dedicated localization tools, namely Alchemy Catalyst
and SDL Passolo, when access to the entire source code or binary is possible as
explained in the next section.

4.2.5 Binary localization


The approach presented so far in this chapter assumes that internationalization
principles have been followed to create the source application so that most of the
localization steps can be automated as discussed in Section 4.2.7, thus reducing
the number of manual fixes caused by translation-related problems. In some cases,
however, manual fixes (including layout resizing) will be required so having a
localization environment that allows for in-context translation can be beneficial.
For translators or testers to be able to view elements of the user interface that
contain translatable strings (e.g. menus or buttons), such environments must
be able to access extra source code files rather than relying solely on text-only
string catalog files, such as a messages.po file. Depending on the programming
language used, all source code files may be compiled in a single executable file.
While this is not the case for the Python code that has been presented in this
book, it is common for Windows-based applications making use of the .NET
framework. This executable file may also be referred to as a binary file, which is
why the process involving dedicated software localization tools can be described
as binary localization. This process assumes that a source binary file can be made
available to the person (or team) responsible for managing the localization process so that
target, localized binary files can be generated once strings have been translated
and user interface components possibly resized. Software localization tools such
as Alchemy Catalyst or SDL Passolo also contain productivity-oriented features,
such as translation memory leveraging, spell-checking and hotkey checking in
order to speed up the overall localization process.

4.2.6 Project updates


Of course code changes can happen throughout the lifetime of a project, which
means that some strings may be updated or added during the translation of the
first batch of strings or even after the translation has taken place. This set of new
or updated strings is often known as a delta. If these string updates are frequent
and substantial, however, the localization process may be inefficient because some
translations may have been produced in vain. The agile nature of this process does
not offer the same guarantees as traditional bulk localization, a challenge that has
been highlighted by Ultan Ó Broin, who believes that ‘parts of the traditional
localization process are immediately pressured by this agility’.16 Such pressure is
due to the fact that releasing early and often to obtain user feedback will lead
to parts of a product changing dramatically (or even being dropped altogether).
Having these parts localized in a number of languages before they are validated can
therefore lead to a waste of resources. From a software publishing perspective, this
can be an issue because translation costs may be higher than originally anticipated.
From a translation provider’s perspective, this may be a good thing because these
updates may result in multiple translation projects. Obviously cost is not the only
factor that publishers should take into account when deciding when strings should
be localized. If publishers wait for all strings to be completely finalized before they
are sent for translation, the word count will be substantial, resulting in a longer
translation turnaround. Optimizing for both time and cost (while preserving
quality) is therefore often one of the responsibilities of a localization (project)
manager, who will decide when and how frequently strings should be translated.
In order to avoid having to deal with too many deltas, a string freeze approach may
sometimes be used in some projects. This approach is very strict because it means
source strings cannot be altered after a certain time. While this approach can give
localization teams an opportunity to plan the localization of the strings, it prevents
the source development teams from making changes which may be required to fix
linguistic issues (say, based on user experience feedback). Since this approach goes
against the principles advocated by the agile movement described in Section 2.1,
negotiations are often necessary to make sure that content developers either in the
source or target languages can produce quality work within a given time-frame.
Even when software strings are updated, previous translations are not always
lost. It is very often possible to recover translations from a previous project in
order to save time when translating a delta. This recovery or leverage process can
be performed in a number of ways, usually using translation memory technology
in order to find strings that have undergone little change between two translation
rounds. Segments that look similar to a new or updated segment may be retrieved
along with existing translations. These existing translations can then be used by
a translator as a starting point in order to avoid re-translating segments. Using
such previous translations will obviously only be effective for a translator if the
quality of the translations is sufficient. If this is not the case, it may be more
effective to re-translate. Deciding which resources to use for the leverage process
may be the primary responsibility of a localization (project or asset) manager, but
translators may also use their own resources to make the translation process even
more effective.

4.2.7 Automation
To conclude this section, it is worth highlighting that most of these steps are
often automated in localization workflows. Having to manually create a set of
catalog files by running a given command or having to merge translated files
into master files by running another command can be a tedious, error-prone
process. For these reasons, these steps are often automated using programs or
scripts, which can be either scheduled on a regular basis (say every day at a given
time) or triggered when a specific action occurs. For instance, it is very common
to link the activities taking place within a version control system (used by
developers) to those of an online translation management system. One possible
way to fully automate this sequence of actions would be to set up the execution
of a script (for example, Django’s makemessages tool) every time changes are
validated (or committed) in the version control system used to manage the source
files of a global application. This script could also validate that all files have been
successfully generated and upload them to an online translation management
system. Another script could then monitor this translation management system
at regular intervals to check whether new translations are available, and if they
are, download the translated files and execute the compilemessages tool to
make them available to the application. Multiple variations of this set-up are
possible, but the key point is that manual touch points can easily be avoided
to speed up the localization process. A variation of this approach is to have the
translation management system monitor the version control system to detect any
file change. When a file change is detected, the translation management system
can automatically update the translation projects containing those source strings
that have been modified. Translators who usually work on these projects can then
be automatically notified that new translations are required.
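
A minimal sketch of such a script is shown below; the makemessages invocation is real, but the upload step is only a placeholder since every translation management system exposes its own API:

import glob
import subprocess

# Regenerate the catalog files for all locales configured in the project.
subprocess.check_call(["django-admin", "makemessages", "--all"])

# Validate that catalog files were actually produced before uploading them.
catalogs = glob.glob("locale/*/LC_MESSAGES/django.po")
if not catalogs:
    raise SystemExit("No catalog files were generated")

for catalog in catalogs:
    # Placeholder: replace this with the upload call expected by the
    # translation management system in use.
    print("Would upload %s" % catalog)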
A final possibility is to manage the localization workflow directly from the
environment where code is developed. This is the approach taken by localization
providers such as Get Localization or Microsoft, which offer tools to keep the
files containing translatable strings synchronized with the translated files.17, 18
Such tools give developers the ability to automatically upload resources requiring
translation to an online localization project repository. This approach presents the
advantage of eliminating a number of steps and stakeholders between developers
and translators, but may introduce unnecessary translations if the source strings
are changed regularly before an actual product release.
4.2.8 In-context localization
The previous sections have focused on a rather sequential localization model,
whereby source strings are extracted, translated and then merged back into target
resources. While this model offers certain benefits, such as scale, it also has flaws.
The first one is that many stakeholders are involved in the process, which means
that problems can occur at various points, especially if a strong quality assurance
component is not in place. The second flaw, possibly the most serious one, is that
translation often occurs out-of-context, which means that the final linguistic
quality of the content may not match customers’ expectations. Obviously some
of these linguistic problems can be resolved by having a quality assurance process
in place as well as flexible translators (who may have to re-translate strings that
have been mistranslated), but this is not as efficient as getting good translations
from the start. To work around this problem specifically for Web applications,
a new model has recently emerged whereby translatable source strings are
extracted from the rendered pages of an application using techniques such as CSS
selectors or XPath expressions (Alabau and Leiva 2014: 153). These extraction
techniques, which can be described as surface techniques, may then be coupled
with just-in-time or in-context translation tools.
For example, the Mozilla Foundation launched a project called Pontoon,
which is a Web-based, What-You-See-Is-What-You-Get (WYSIWYG)
localization (l10n) tool that can be used to localize Web content. This project is
based on open-source tools such as gettext and offers translators the possibility
to translate strings by looking at the Web page containing these strings. The
project offers an online demo site, where users can provide test translations for a
simple page.19 Figure 4.3 shows how a Web page can be split into two parts: the
content part at the top and the translation toolbar at the bottom.

Figure 4.3 Translating strings in Pontoon

Figure 4.4 Translating strings interactively in Pontoon

The translation toolbar offers several features, including the ability to use
machine translation and translation memory tools as well as the suggestions
from other users. This toolbar can be easily minimized to navigate to parts of
the page that have yet to be translated. While the toolbar can be useful to use
external tools, it does not inform translators about potential layout problems
that may result from their translations. This is where the interactive translation
functionality of Pontoon comes in. Pontoon leverages the power of HTML5,
which can easily transform any read-only element into an editable one using the
contenteditable content attribute.20 Figure 4.4 shows how a textual page element
can be clicked, edited and saved. Once the text is saved, the page displays the
updated text, which may reveal some layout issues, e.g. the text does not fit
into the original element. At this point, the translator may decide to find an
alternative, shorter translation or report the problem to the developer in order to
try to have them increase the element’s size.
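
As a simplified illustration of the underlying mechanism (this is not Pontoon's actual markup), adding the contenteditable attribute to an element is enough to let the browser user edit its text in place:

<p contenteditable="true">Welcome to NBA4ALL</p>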
More information on how to accomplish several tasks (such as publishing the
translation results) can be found on the project’s Web site.21 It remains to be
seen whether this in-context localization model will prove as successful in the
long term as it was for the localization of desktop applications as discussed in
Section 4.2.5.
To conclude this section, it is worth highlighting that it is now more and
more difficult to differentiate software strings from user assistance content since
user assistance content is sometimes embedded in the application itself. This is
specifically the case for Web applications, which can rely on graphical elements
such as tooltips or pop-ups to provide context-sensitive help. It is also possible to
include getting started information the first time an application is started so that the
user is given a quick tour of the main features of an application.

4.3 Localization of user assistance content


User assistance content includes content that can be used to better understand
the capability of an application, such as user guides, which may be defined as
‘the interface between computer systems and human users’ (Byrne 2004: 3). As
described in Section 3.1.2, user guides are often composed of multiple information
units known as topics, which can be represented as XML documents. XML is
indeed a popular format to create structured documentation and it is even used
by multiple WYSIWYG word processing applications such as Microsoft Office or
LibreOffice to save documents in a compressed manner using the Office Open
XML format or Open Document Format for Office Applications (ODF).22, 23
XML, however, is almost never used as the final documentation format that is
consumed by the user; instead it is typically transformed into final formats such as
HTML or PDF. While XML is a popular choice,
it is worth mentioning that Web-based markup languages are also gaining in
popularity. For example, the MediaWiki markup, which is used to power sites
like Wikipedia, offers a way to format text in a semi-structured manner so that
it can be parsed and transformed into HTML pages.24 While this type of markup
is not as strict as XML, it offers a quick way to write documents by focusing
on the content (rather than the formatting and layout). Another example is
the reStructuredText format, which is a popular documentation format for
Python applications.25 Such a plain-text format can not only be transformed into
HTML pages, but also into PDF or MS Word files thanks to offline tools (e.g.
Pandoc) or online ones (e.g. Read the Docs).26, 27 Source content making use
of the reStructuredText format must be annotated in a specific manner using
characters and constructs that have been assigned a specific meaning in the
format’s specifications. For instance, two lines of * characters above and below
some text are used to set off the title of a topic.
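
For instance, a minimal topic in this format could look as follows (the title and sentence are invented for illustration):

***************
Getting started
***************

This topic explains how to install the application.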
The localization process for user assistance topics shares many steps with the
process used to localize software content. The main difference between the two
processes, however, lies in the marking of translatable content. This step is clearly
defined as an internationalization step for the localization of software content,
but is somewhat fuzzier when it comes to the localization of documentation
contained in files that do not belong to the source code structure. Very often user
assistance content gets authored without any clear indication as to what should
be translated during the localization process.
Some standards and frameworks, such as OAXAL (Open Architecture for
XML Authoring and Localization), attempted in the past to provide a way to
author user assistance content with localization in mind.28 However, their usage
is not as frequent (or popular) as the systematic marking of translatable strings in
source code. One reason for this is that it is very often quite difficult for a source
content author to know what should be translated for every single targeted locale.
At best they know (perhaps based on usability studies involving focus groups)
what the users of the source locale need in terms of user assistance content. But
there is no guarantee that these needs will be matched in target locales. This may
be due to the fact that source and target markets may be structured differently,
so while an advanced user guide may be relevant in a source locale, it may be
less relevant in a target locale. As a result, the selection of relevant translatable
content often happens downstream, when the actual content gets analysed
by localization specialists. These localization specialists can rely on standards,
such as the W3C Internationalization Tag Set (ITS) introduced in Section 3.3.1,
and tools, such as the ITS Tool, to determine what to translate in
XML documents and how to separate the translatable content into container
files such as PO file messages.29 While it is possible to localize XML documents
using this software string localization approach, special care must be paid to the
segmentation of the XML content. Software strings tend to be short whereas XML
topics tend to comprise several paragraphs, each of which can contain multiple
sentences. Segmentation rules may operate at the paragraph level, which means
that the smallest translatable unit may comprise multiple sentences by default
when using localization tools either from the command-line or from a graphical
user interface.30 One such application that provides a graphical user interface to
perform localization-related tasks is Rainbow, an open-source program belonging
to the Okapi Localization Toolbox suite.31 This program can be used to automate
the creation of translation kits for user assistance content.

4.3.1 Translation kit creation


Rainbow can be configured to define file processing pipelines, including
translation kit creation pipelines. In a translation kit creation pipeline, a number
of processing steps can be defined to transform an original file or set of files
into a package that may subsequently be used to obtain new translations from
translators. Such steps may be required for a number of reasons. For instance,
it may be necessary to convert specific characters present in a file (e.g. the >
character may need to be replaced by the &gt; character entity) or to convert
a file from one format to another. Such conversion may be required because of
downstream requests (e.g. a system or translator cannot handle certain characters
or file formats). Some steps may be required to address downstream preferences
(e.g. a translator may be more productive in a given translation environment, so
the resulting translation package should be specific to this environment). Other
steps may be performed to help translation providers who are working for the
first time for the content owner, therefore requiring specific assistance material
to complete the job. For instance, a list of common terms may be extracted from
the source content and included with the final translation kit. Doing this step
once during the translation kit preparation may be useful, especially if multiple
translation providers are concerned. Rather than duplicating work during the
translation process, a list can be created at the start of a project and shared among
translators throughout the duration of a project. Practical information on how to
create such a list will be provided in the next chapter in Section 5.4. Finally, some
steps may be implemented to reuse (or leverage) existing translations instead
of translating content from scratch. Once these steps have been performed,
the translation kit can be made available to translation providers, by sending
it directly to translators or by uploading it to a translation management system.
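As a small illustration of the character-conversion step mentioned at the start of this section, the following Python snippet (with an invented example segment) escapes XML-reserved characters such as > before a segment is packaged for translation:

from xml.sax.saxutils import escape

segment = 'Select File > Save As to export the report.'
print(escape(segment))
# Select File &gt; Save As to export the report.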

4.3.2 Segmentation
Regardless of the ultimate goal sought when creating a translation kit, the
segmentation step is extremely important. The segmentation step is used to break
the original content into smaller chunks so that the reuse from a translation
memory becomes more effective. Another role of the segmentation step is to
ensure that translatable elements are identified, by possibly relying on pre-defined
or custom filters.32 Depending on the final translation kit format used, however,
segmentation may not be consistent. For instance, at the time of writing Rainbow
did not support the segmentation of source content into PO packages.33
As explained in the next two sections, the splitting or segmenting of the
original source content can have a profound impact both on the translation
process and translation leveraging process. This means that special attention
must be paid to the way the segmentation is performed. While a naive approach
to segmentation for languages such as English would consist of using punctuation
marks such as full stops or exclamation marks followed by a space, problems
arise with abbreviations (such as Dr.) or unusual product names containing
punctuation marks (such as Yahoo!). Sentence segmentation is therefore often
dependent on the type of text that is being translated, and custom rules are often
required to adapt existing rules to new text types.
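The following Python sketch (with an invented example text) makes the naive approach, and the problems just described, concrete:

import re

text = "Ask Dr. Smith. If you use the Yahoo! search engine, your results may differ."
# Naive rule: break after ., ! or ? when followed by whitespace.
print(re.split(r"(?<=[.!?])\s+", text))
# ['Ask Dr.', 'Smith.', 'If you use the Yahoo!',
#  'search engine, your results may differ.']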
Segmentation rules can be created in a number of ways, either by using a data-
driven approach or a rules-based approach. An example of a data-driven
approach is presented by Bird et al. (2009), whereby sentence segmentation is
handled as a classification task for punctuation. Whenever a character that could
possibly end a sentence is encountered, such as a full stop or a question mark,
a decision is made as to whether it terminates the preceding sentence.34 This
approach relies on a corpus of already segmented texts, from which characteristics
(or features) are extracted. Such characteristics may include information such as
the token preceding a given token in a sentence, whether the token following
a given token in a sentence is capitalized or not, or whether the given token is
actually a punctuation character. These features are then used to label all of the
characters that could act as sentence delimiters. Once a labelled feature set is
available, a classifier can be created to determine whether a character is likely
to be a sentence delimiter in a given context. This classifier can then be used to
segment new texts.
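Although building such a classifier requires a labelled corpus, a quick way to experiment with data-driven segmentation is to use the pre-trained Punkt model available through NLTK, the toolkit described by Bird et al. (2009). The sketch below assumes that the nltk package and its punkt model have been installed; how well problematic cases such as abbreviations are handled depends entirely on the data the model was trained on.

import nltk  # the pre-trained model is obtained with nltk.download('punkt')

text = ("Dr. Smith opened the file. She then searched the Web "
        "for more information.")

for segment in nltk.sent_tokenize(text):
    print(segment)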
Segmentation rules can also be manually defined using the SRX Segmentation
Rule eXchange Standard, which is an XML-based standard that allows for the
exchange of rules from one system to another.35 SRX is defined in two parts:
a specification of the segmentation rules that are applicable for each language,
represented by the languagerules element, and a specification of how the
segmentation rules are applied to each language, represented by the maprules
element. Using this standard, two types of rules can be defined: rules that identify
characters that indicate a segmentation break and rules that indicate exceptions.
For example, one breaking rule can be defined to identify a full stop followed by
any number of spaces and a non-breaking rule can be defined to list a number of
abbreviated words. Examples of such rules are presented in Figure 4.5, where the
Okapi Ratel program is used to create and test rules.36
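For illustration purposes only, a skeletal SRX document containing one exception rule and one breaking rule might look as follows (the element names follow the SRX 2.0 specification, but the regular expressions are simplified examples rather than production-ready rules):

<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <!-- Exception rule: do not break after the abbreviation "Dr." -->
        <rule break="no">
          <beforebreak>\bDr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <!-- Breaking rule: break after ., ! or ? followed by whitespace -->
        <rule break="yes">
          <beforebreak>[\.!\?]+</beforebreak>
          <afterbreak>\s+</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <!-- Apply the English rules to any en-* locale -->
      <languagemap languagepattern="en.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>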
SRX rules can be created using regular expressions conforming to the ICU
(International Components for Unicode) syntax.37 Rules will obviously vary from
one natural language to another. While a common set of rules may be reused for
certain languages, language-specific exception rules will be required. In Figure 4.5,
the English example shows how the default segmentation rules provided by the
Okapi Framework segment a piece of text (using square brackets as segment
delimiters in the bottom part of the screen).

Figure 4.5 Editing segmentation rules in Ratel

The Ratel environment makes it
easy to check the impact of specific rules since rules can be disabled using the
graphical interface and segmentation changes are reloaded automatically. Being
able to identify and solve segmentation issues can be of paramount importance
in a given translation project. Segmentation issues can be of two types: the first
type is related to the example provided earlier, whereby a paragraph does not
get properly segmented in the final translation package. This problem can be
serious because it may result in the absence of leverage of existing translations
(which in turn may result in time being wasted re-translating content from
scratch). The second type of problem happens when segmentation occurs in
contexts where there should not be any segmentation. For example, if there is
no segmentation exception for the Yahoo! product name, a segmentation rule
that segments text when an exclamation mark is followed by at least one space
will over-generate in a sentence such as If you use the Yahoo! search engine, your
search results may be different., probably resulting in two segments (If you use the
Yahoo! and search engine, your search results may be different.). Again, this problem
may affect the leverage of existing translations because a sentence such as If you
use the Google search engine, your search results may be different. may be ignored.
Another consequence of this unwanted segmentation is the creation of small text
units that are not grammatically correct. When these units must be subsequently
translated, they are bound to confuse translators or machine translation systems
because they are not well-formed sentences. This problem can be exacerbated if
the translation task is performed by more than one translator. A translator may
have to translate the first part of the segment while another translator may have
to translate the second part of the segment. Such a scenario, which is similar to
the string concatenation situation that was explained in Section 3.2.4, can be
difficult to resolve since it cannot be assumed that the two resulting translations
can be merged into a grammatically correct target sentence.

4.3.3 Content reuse


Obviously user assistance content tends to be more voluminous and repetitive than
software content. This means that content reuse is a key step in the localization
process in order to avoid re-translating content. Traditionally desktop programs
tended to be updated every year or couple of years. For example, a popular office
suite such as Microsoft Office had releases such as Microsoft Office 2007, 2010
and 2013. While new functionalities were added in each of these releases, some
components, which already existed, probably underwent few changes. From a
documentation perspective, this meant that a large proportion of the existing
content could be reused in some way instead of being authored ex nihilo. From a
translation perspective, this meant that existing translations could also be reused.
The question, however, is what level of granularity should be used when reusing
content, and more particularly what size the content chunks should be.
Specifically, should the reuse happen at the document level, at the topic
level, at the paragraph level, at the sentence level, or at the sub-sentential level?
Some argue that sentence-level reuse may affect the coherence and consistency
of a document, especially if reused sentences originate from multiple, diverse text
types produced by a number of authors (Bowker 2005). To address this issue,
context can be taken into account to make sure that the segments immediately
preceding and following a given segment are used to identify the best matching
segment for reuse. This type of segment is also referred to as an In-Context (Exact)
Match.
Others will argue that coherence and consistency issues are only problematic
when content is read in its entirety (which is not always the case with user
assistance content, as suggested in Chapter 3 in the section 'Overview of Web and
technical content writing principles'). Determining the optimal reuse
method in a given scenario requires taking multiple variables into consideration.
These variables fall under three well-known dimensions: cost, time and quality.
From a cost perspective, variables such as the cost of the technology providing
the reuse functionality must be factored in. If one of the objectives of using
such a technology is also to reduce translation costs, then potential cost savings
should be considered. From a time perspective, the performance and scalability
of the reuse technology should be evaluated. As more content gets added to the
repository system that is queried for chunk matches, the time required to find the
matches may increase significantly. Finally from a quality perspective, the types
of checks required to ensure that the content repository does not degrade over
time will have to be defined. Depending on the size of the content to manage,
this multi-variate analysis can become extremely complex. A detailed discussion
on the various techniques and strategies for managing (multilingual) content
chunks (either documents, paragraphs or sentences) may be found in Rockley
et al. (2002). What is important to understand from a translation perspective,
however, is how various reuse approaches will affect the translation process. The
following section focuses on the most common approach for reuse, which is done
at the segment level.

4.3.4 Segment-level reuse


Segment-level reuse is a concept that can be found during the authoring process,
the localization process and the translation process. During the authoring process,
source content writers may leverage previous work by getting suggestions from
authoring technology, sometimes known as authoring memory (Allen 1999).
When the text is being written, a segment database is interrogated to check
whether similar or identical content already exists. If this is the case, suggestions
are presented to the user who may decide to select them instead of creating a
completely new segment. This approach, which has been, and still is, extremely
popular in the localization world, can be performed by a number of stakeholders,
including software publishers, language service providers and translators. All of
these stakeholders may decide to analyse source content before it gets translated.
The purpose of this analysis is to determine whether existing translations are
available for certain segments (especially those that are repeated often in the
content) in order to avoid having to re-translate. The main driver behind this
approach is therefore to reduce the amount of time required to translate content.
While this approach works well when the translation memory is carefully
managed, previous research has shown that inconsistencies can easily creep into
the translation memory over time (Moorkens 2011) and the productivity of
translators could be affected.
Segment-level reuse can be performed in two ways: in batch mode and in
interactive mode. In batch mode, the purpose of the analysis is to find out how
much content is reused for a given selection of source content. This step can be
useful to estimate how much time might be required to complete a translation
task. As mentioned earlier, this estimate will be based on a number of assumptions.
For instance, one will assume that the quality of the segments contained in the
repository used for the analysis corresponds to the quality expectations defined
for the new project. Using the Rainbow program, this reuse analysis step can be
performed by generating a scoping report.38 It is then possible to include reused
segments in the translation kit so that they can be validated or edited during the
translation process.
In interactive mode, the reuse repository may be queried for every segment.
This query may take place every time a segment becomes active, usually when
it is being worked on by a translator. Rather than producing a translation from
scratch, a number of already-produced translations will be presented to the
translator based on the similarity of the new segment with one or more segments
from the reuse repository. All translation memory systems offer the ability to set
a similarity threshold level so that segments below a given score can be ignored
and new translations produced.
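The short Python sketch below gives a simplified idea of how such a threshold might be applied; real translation memory systems rely on more sophisticated similarity measures than the character-based ratio used here, and the segments and translations are invented for the example.

from difflib import SequenceMatcher

# A toy translation memory: source segments mapped to stored translations.
translation_memory = {
    "If you use the Google search engine, your search results may be different.":
        "Si vous utilisez le moteur de recherche Google, "
        "vos résultats de recherche peuvent être différents.",
}

def best_match(segment, memory, threshold=0.75):
    """Return (score, suggestion) for the best match above the threshold, or None."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, target)
    return best

new_segment = ("If you use the Yahoo! search engine, "
               "your search results may be different.")
match = best_match(new_segment, translation_memory)
if match:
    print("Fuzzy match ({:.0%}): {}".format(*match))
else:
    print("No match above the threshold: translate from scratch or machine-translate.")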
A variation of this approach is to use a machine translation system to suggest a
possible translation (which again can be validated or edited). This approach was
described by Muegge (2001), who suggested using the TMX standard to export
unmatched segments, machine-translate them and re-import them into a translation
memory system. In this case, a match threshold may have to be identified to
decide when to use a translation memory suggestion and when to use a machine-
translated suggestion. According to Bruckner and Plitt (2001), the use of
translation memory technology in a localization scenario leaves little room for
the deployment of MT above a threshold of 75 per cent. Using a combination
of translation memory and machine-translated (followed by post-editing)
appears to yield productivity gains. For example, Flournoy and Duran (2009: 2)
reported ‘preliminary results indicat[ing] that the MT post-editing was performed
approximately 40 per cent to 45 per cent faster than human translation’ in a large-
scale documentation localization project. Rather than setting a hard threshold
value, recent systems have tried to recommend either a machine translation
suggestion or a translation memory suggestion based on the suitability of each
suggestion from the perspective of how much editing might be required. Based
on an experimental study involving professional translators, it would appear this
approach could lead to reduced editing workloads (He et al. 2010).

4.3.5 Translation guidelines


Translation guidelines that were introduced in Section 4.2.2 for software
strings may also apply to user assistance content (e.g. handling of HTML/XML
fragments). However, specific translation guidelines exist for user assistance
content. For example, the Microsoft language style guide for French mentions
that ‘When localizing elements (…), keep in mind the fact that software and help
documents, for example, shouldn’t be handled in the same way’ (Microsoft 2011:
41).39 Obviously, guidelines are specific to the target language as some linguistic
phenomena do not exist in all languages (e.g. the Microsoft style guide for
German contains a section on the use of the genitive, which does not apply to
other languages such as French or Italian). It is therefore impossible to cover all
types of guidelines that might be provided to translators for the translation of
user assistance content. Guidelines, however, tend to fall under the following
categories:

• capitalization
• spacing
• punctuation
• tone (formal versus familiar)
• voice (active versus passive; direct versus indirect)
• gender and articles (especially for loan words)
• compounding
• terminology.

Most of these categories are self-explanatory, so the best advice for translators
is to get familiar with the guidelines that have been defined by the translation
requester or buyer. In some situations, these guidelines are also referred to as best-
practices that may have been shaped by a community of translators. For instance,
specific conventions for the translation of Mozilla support documentation into
French include choices for using the imperative in lists of steps and the infinitive
in headings.40 It should be stressed, however, that the amount of reference
materials provided to translators can sometimes be overwhelming, especially
if some conflicts occur. For instance, Microsoft (2011: 7) lists four normative
sources to consult from a spelling and grammar perspective, advising that ‘[w]hen
more than one solution is allowed in these sources, [translators should] look for
the recommended one in other parts of the Style Guide.’ This problem can be
compounded by the fact that additional translation materials (such as translation
memory segments) provided to translators may be at odds with such guidelines.
Knowing what usage to adopt in a specific project can therefore be challenging, so
checking with the translation requester is usually recommended. Locale-specific
guidelines can also be used to tone down the meaning of the source text. For
instance, Microsoft (2011: 40) warns translators that ‘[a]bsolute expressions
leaving no room for exceptions or failure, like solves all issues, fully secure, at any
time are a serious legal risk on the French, Canadian, and Belgian markets.’

4.3.6 Testing
In the same way that translated software strings require functional and visual
testing once they are merged back into an application’s code base, translated
user assistance content must be validated and tested to ensure that no problems
were introduced during the translation process (e.g. deleting important XML
elements). Examples of validation include checking that the translated files
are properly encoded, that they are well-formed and can be used to render the
final documents. Tools such as Rainbow can assist with such checks, for instance
with the validation of XML files.41 Additional testing will also be required if the
final documentation contains special components, such as an index or a search
functionality as described in the next section.
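As a quick sanity check before files are returned, a translator or localization engineer could, for instance, verify that a batch of translated XML topics is still well-formed with a few lines of Python (dedicated steps such as Rainbow's XML validation offer more thorough checks); the folder name below is purely illustrative.

import glob
import xml.etree.ElementTree as ET

for path in glob.glob("translated/fr-FR/*.xml"):
    try:
        ET.parse(path)  # raises ParseError if the file is not well-formed
    except ET.ParseError as error:
        print(f"Not well-formed: {path}: {error}")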

4.3.7 Other documentation components


The localization of user assistance content is not limited to the translation of the
content itself. Very often, additional documentation sections (such as glossaries or
indexes) are added to a set of topics forming a book. Locale-specific requirements
may dictate that such sections are removed, added, or modified based on cultural
expectations. While an important aspect of this work is visual and concerns
the layout of the final document (e.g. where the table of contents should be
positioned, how many layers it should contain), other aspects are related to the
language used in the document. For instance, an index in a specific locale may
require more depth than the original. In Section 2.5.2, Docbook’s indexterm
element was presented. While the original document may only have one depth
level, a localized document may require two, which means that language content
will have to be added during the localization process. Adding content means that
the size of the localized document may increase, which may result in formatting
issues depending on how much thought was given to the internationalization of
the original document. Obviously locale-specific formatting rules will be required
if locale-specific elements are used to create content.
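For instance, a localized topic might need a two-level index entry where the original only had one; in DocBook this is expressed by nesting a secondary element inside indexterm (the terms below are invented for illustration):

<!-- Original, single-level index entry -->
<indexterm>
  <primary>email</primary>
</indexterm>

<!-- Localized, two-level index entry -->
<indexterm>
  <primary>courrier électronique</primary>
  <secondary>configuration du compte</secondary>
</indexterm>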
Another important characteristic of user assistance content is that it needs to
be searchable. Even when documentation sets are broken down into individual
topics, users rarely go over a list of topics to find the one they are interested in.
Instead, users tend to rely on search engines or on the search functionality that is
sometimes made available to them in applications, allowing them to view compiled
documentation sets. Compiled documentation sets can often be found in desktop
applications by clicking a question mark icon. A search box is then presented to
the user so that queries can be entered and (hopefully) relevant document sections
returned. Without going into the details of how a search engine works, language-
specific issues are worth mentioning here since a (localized) documentation set is
only useful if the information can be found. Translating content that is not going to
be found or read seems a futile exercise. From a linguistic perspective, the following
dimensions should be taken into account:

• encoding and language detection


• word segmentation
• normalization.

First of all, the search functionality of final documentation sets should
support the encoding corresponding to an end-user's environment. While this
requirement falls into the area of internationalization rather than localization,
language detection is a clear localization-related issue. Indeed, there is no
guarantee that the user of a localized application will perform searches in the
language used to display the user interface strings. Being able to detect that a
search query has been performed in a given language and that the user expects
results in this particular language is key. However, detecting the language of
short strings is not an easy task, since there is often not enough information
to disambiguate between various languages that share common characters and
words (Vatanen et al. 2010). For instance the search phrase email confusion is as
valid in French as it is in English.
The next dimension is the segmentation of the user query into words. Search
engines rely on indexes to match user queries with documents. Since indexes are
usually built using words, user queries must be segmented into words so that these
words can be matched with existing (relevant) documents. The final dimension
is normalization whereby word variants (either from a case or hyphenation
perspective) may be normalized to a common form to maximize the relevance
of search results. For example, if a user performs a query using the word e-mail, it
might be advantageous to return documents containing the word email since both
variants refer to the same concept. Breaking a sentence (or text segment) into
words and normalizing words are natural language processing tasks that can be
very complex depending on the language considered. These tasks will be covered
in more detail in Section 6.3.3.
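Without anticipating that fuller treatment, the very small Python sketch below illustrates the normalization idea (the variant list is invented and far from exhaustive):

def normalize(token):
    """Lower-case a token and collapse a known hyphenated variant."""
    token = token.lower()
    if token in {"e-mail", "e-mails"}:  # hypothetical variant list
        token = token.replace("-", "")
    return token

print(normalize("E-mail"))  # email
print(normalize("email"))   # email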

4.4 Localization of information content


Information content differs from offline documentation content in a number of
ways, including the length of its life cycle and its close relationship with online
machine translation systems.

4.4.1 Characteristics of online information content


While software content (and to some extent offline documentation content)
can be relatively stable for a period of time, online information content is more
perishable. For example, information chunks such as news items or seasonal blog
posts have a shorter digital lifetime. From a translation perspective, this means
this type of content must be translated as quickly as possible before it becomes
obsolete. Using a traditional localization workflow is not always feasible because
obsolescence is reached before the translated content has been merged into target
assets. An example of this type of content is the news items that are displayed
by the NBA4ALL application. As described in Section 3.1.1, the news items are
provided by a news publisher and stored in a database for easy retrieval by the
actual NBA4ALL application. If this content were to be translated, it would have
to be made quickly available to translators. Multiple approaches, such as those
presented in Sections 4.2.2 and 4.2.8, could be used to achieve this objective.
Another approach would be to rely on machine translation, as presented in the
next section.
Another textual feature of online content lies in its use of hyperlinks to refer
users to other resources. In a technical support context, related resources may
prove invaluable in the resolution of a problem, especially if users realize that
they have been looking at the wrong document. Hyperlinks deconstruct the
traditional monolithic structure of a document, by offering users the possibility
of accessing information from several documents. As long as users can find
all the information they require, it does not matter whether they have read some
or all of one or more documents. What is, however, crucial in this approach, is
to have hyperlinks referring to target content which is in the same language as
the original content. Legitimate frustration can sometimes arise when the
localization chain is broken in the middle of a quest for information.

4.4.2 Online machine translation


When no translation is available, users may resort to using online, generic
machine-translation systems to get a rough translation of the original content.
This scenario can be described as a zero-localization approach because the onus
to trigger the translation process may be on the user rather than on the content
provider. Informal reports suggest that some users show a pragmatic attitude
towards the capabilities of generic online systems (Somers 2003), which may
help them overcome the barriers of a communication deadlock. This is confirmed
by the increasing popularity of some systems, such as Google Translate.42 It
is, however, difficult to evaluate the usefulness of MT output for these users,
as Yang and Lange (2003) indicate, especially when the type of content published
is domain-specific and requires a certain level of quality. To work around this
problem, custom machine translation systems are increasingly used to pre-
translate content which may then be reviewed (or post-edited) in order to reach
acceptable quality standards. Certain software publishers, such as Microsoft
(Richardson 2004), or Dell have been providing refined machine-translated
content to the users of some sections of their Web sites.43 In such a scenario, the
use of machine translation is limited to the translation of textual content, which
means technical support documents containing embedded videos will only be partly
localized using this approach.44, 45
In those cases where the machine-translated content is actually published
(instead of being translated on-the-fly), this content may become findable by
speakers of a given target language. While presenting users with an option to
machine-translate content into a language of their choice works for users who
can find the source content in the first place and judge its relevance for their
needs, it does not work for those who cannot. In some cases machine-translated
information content can be indexed by search engines, thus giving certain users
an opportunity to find this content. Once this content is found and accessed,
publishers may decide to have it reviewed or post-edited if user feedback indicates
that the content quality was not adequate. In some cases, content consumers
may also be given an opportunity to modify the content themselves in order to
improve it. Another approach to improve the quality of the machine-translated
content is to try to author the source content in a special way as mentioned in
Chapter 3 in Section ‘Controlled language rules and (machine) translatability’.
Studies, such as Roturier (2006), have shown that the comprehensibility of
machine-translated technical support content can be significantly improved
when specific authoring rules are used in conjunction with a customized rules-
based machine translation system.

4.5 Conclusions
This chapter covered a lot of ground, focusing on the localization of various
digital content types. With the advent of modern Web applications, the
distinction between software strings, documentation content and information
content is no longer always clear-cut. These types of content were reviewed in
order to detail key localization processes and introduce some of the tools that
can be used to facilitate such processes. Regardless of the content type, typical
localization processes involve three fundamental steps: extraction, translation
and merging. As shown in this chapter, modern localization processes try to
abstract most of the complexity that was characteristic of large localization
projects in the 1990s and 2000s. These modern processes tend to rely on
flexible workflows where content updates are handled continuously. In-context
localization techniques are also popular in order to minimize the amount of
quality assurance effort required to develop quality products. Such techniques
benefit the translation community, who, instead of relying on isolated chunks
of words to translate, can focus on maximizing the end-user’s experience by
producing translations that fully match the context in which source strings
occur.
Documentation and information content, however, may not be limited to
textual content. As mentioned by Hammerich and Harrison (2002: 2), the term
content refers to the ‘written material on a Web site’, whereas the ‘visuals refer to
all forms of design and graphics’. This type of content will be covered in detail
in Section 6.1.

4.6 Tasks
This section is divided into three tasks:

• Localizing software strings using an online localization environment


• Translating user assistance content
• Evaluating the effectiveness of translation guidelines

4.6.1 Localizing software strings using an online localization environment


The aim of this particular task is that you become familiar with the translation of
software strings using an online localization environment. This task is composed
of four steps:

1 Finding a suitable software project


2 Getting the localizable resources
3 Creating an account for an online translation environment and creating an
online localization project
4 Translating the resources

Finding a suitable software project


There are several software code repositories available online, such as Github
or Bitbucket where open-source code may be found.46, 47 As in the previous
chapter, you should browse or search one of these repositories to find an
interesting, ideally internationalized, project.48 Obviously finding a project that
uses English as its user interface language is going to be easier than finding a
project in any other language. Ideally the software project should
contain localizable resources (preferably in PO or XLIFF format since these
formats have been covered in the present chapter). While you should be most
familiar with Python code by now, it is possible to complete this task using
another programming language (such as PHP). In order to locate a suitable
project, browse the online repository of your choice and look for a project
directory structure containing a locale directory. If you find a .po file in this
directory, this means that the strings have been externalized by the project’s
owner. You should then be able to localize this application in the next steps.
In order to find an internationalized project, you may need to rely on keywords
and wildcards when searching for projects on Github, as shown in the query
below for instance, where all repository files would be searched for the words
finance and gettext.

finance gettext repo:*/*

If you cannot find a .po file for a project that looks interesting, you should proceed
to the next step.

Getting the localizable resources


The second step consists in obtaining the localizable resource (or resources if the
project contains more than one PO or XLIFF file). There are two ways to achieve
this. You can either create an account with the online repository of your choice
and then make a copy of this project in your account. Public projects hosted on
online code repositories tend to be open-source projects so making a copy of the
code (also known as a fork) is often permitted as long as you respect the licence’s
terms and conditions. If you find this process cumbersome, you can alternatively
download a copy of the code to your computer. If you were unable to find such
a .po file in the previous task, you will need to generate it yourself using a tool
such as xgettext. Additional tips besides those provided in Section 3.4.2 can
be found online.49
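For example, assuming a Python module named app.py whose translatable strings are wrapped in _() calls, the following commands (file and directory names are illustrative, and the target directories must already exist) would extract the strings into a template and create a French catalogue from it:

xgettext --language=Python --keyword=_ --output=locale/messages.pot app.py
msginit --input=locale/messages.pot --locale=fr \
        --output-file=locale/fr/LC_MESSAGES/messages.po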

Creating an account for an online translation environment and creating a
localization project

The next step involves creating an environment where the software strings can
be easily localized. There are online services that are well-suited to this task,
including Transifex, which is an online localization management service. A
free account can be easily created using their sign-up page.50 Once you have an
account on Transifex, you need to create a project and import the resources that
you would like to localize. This can be achieved by uploading your .po or .xliff
file into your project.

Translating the resources


Once the resource has been uploaded, it is possible to translate its content into
multiple languages as long as locale-specific sub-projects have been defined. For
example, once an Italian sub-project has been created, the localizable resources
can be translated into Italian. You should try to translate as many strings as you
can or want, while observing the two following rules: first, you should make
sure that you do not break source code formatting by paying special attention to
placeholders as explained in Section 4.2.2. The second rule is to pay attention to
any comment that may have been left by the source code developer (providing
information about context or about possible length limitations). As shown
earlier in Listing 4.3, comments can be extremely useful to deal with hotkeys in
an effective manner. During this subtask, you should analyse whether the project
owner provided any information for you to possibly check your translations in
context. If there is no information available, can you refer to the source code itself
to check the context of a particular string? After this analysis, do you think all
strings can be safely translated? Or do you think some strings are too ambiguous
for you to provide translations in a confident manner?

4.6.2 Translating user assistance content


This task, which focuses on the localization of online technical support
documentation, is composed of two steps:

1 Finding a suitable documentation localization project


2 Translating the documentation using translation guidelines.

Finding a suitable documentation localization project


The first step is to find a user assistance localization project, which will not
require you to wait several weeks to be approved as a translation contributor.
Some organizations make their technical documentation available to online
(registered) users so that they can suggest translations in a collaborative manner.
These user-contributed submissions may then be reviewed by in-house staff or
a language service provider before they get published. An example of such a
set-up is provided by Evernote for the translation of their technical support
documentation from English into multiple target languages.51 At the time of
writing, Evernote’s online translation management environment, which was
powered by the Pootle community localization server,52 required users to be
registered before they could submit translations.53 If you do not want to register
an account with them (e.g. if you do not agree with their terms and conditions),
you can find alternative live Pootle servers that have been set up to host various
types of localization project.54 Once you are registered and logged in, you should
select an article to translate into the language of your choice (ideally one that
has not been translated before).

Figure 4.6 Online Pootle translation environment

Translating the documentation using localization guidelines and Pootle


The second step is the translation step itself. In order to translate this document,
you should review and use the translation guidelines that may be made available
by the content provider (e.g. Evernote provides a small list of guidelines).55
During this task, you should decide whether these guidelines could be extended
or modified (and if so, how). Also, you should reflect on the effectiveness of the
translation environment, which should look more or less like the one presented
in Figure 4.6.
During this task, you should identify the environment’s features that are most
useful to you as well as any missing features you think might help you translate
more effectively. Since Pootle is open-source software, you should check its
project page, which may include a list of future features.56 Would you consider
joining this project to try to get your wish list taken on-board?

4.6.3 Evaluating the effectiveness of translation guidelines


In this task you must first identify translation or localization guidelines that
have been made available online by an application publisher. Examples of such
guidelines have been provided throughout this chapter, but you may have to
perform additional searches to identify guidelines that pertain to a language
pair that you are comfortable with. If you cannot find any, you may check those
provided by Twitter.57 Once you have identified such guidelines, take some time
to review them so that you become familiar with them. As you do so, you should
reflect on whether they are consistent with your own linguistic expectations.
The second part of this task is to browse through content that was translated
into a specific target language using these translation guidelines. Based on the
suggestion to evaluate Twitter’s guidelines, you could go to their support centre
and select a relevant target language from the language list.58 You should now
take some time to scan through some of the translated documents and determine
whether the guidelines were adhered to during the translation process. Based
on your analysis of the translated content, do you think some of the guidelines
should be refined or supplemented with additional examples?

4.7 Further reading and resources


This chapter could not cover all technologies that are currently in use to
develop modern applications. Notable omissions include the Android and iOS
platforms which are extremely popular in the mobile world and for which official
localization resources can be found online.59, 60 This chapter also did not go into
detail about content management systems, which are increasingly used to power
various types of Web applications, ranging from media portal to online stores.
Examples of such systems are Drupal, Joomla or Microsoft SharePoint.61, 62, 63
Most of these systems offer a rich ecosystem of extensions, some of which are
used to create multilingual content (e.g. Lingotek Translation for Drupal).64

Notes
1 http://www.alchemysoftware.com/products/alchemy_catalyst.html
2 http://www.sdl.com/products/sdl-passolo/
3 https://docs.djangoproject.com/en/1.7/topics/i18n/translation#localization-how-to-create-language-files
4 https://translate.evernote.com/pootle/pages/guidelines/
5 https://www.mozilla.org/en-US/styleguide/communications/translation/
6 http://msdn.microsoft.com/library/aa511258.aspx
7 https://translate.evernote.com/pootle/pages/guidelines/
8 A less permissive version of HTML, known as XHTML (Extensible HyperText Markup Language), exists. This version will be parsed by XML processors, so syntax errors will matter.
9 http://www.w3.org/TR/html-markup/strong.html
10 http://www.w3.org/TR/html-markup/img.html
11 http://www.w3.org/TR/html-markup/a.html
12 https://translate.twitter.com/forum/forums/spanish/topics/3337
13 http://www.microsoft.com/Language/en-US/StyleGuides.aspx
14 https://github.com/facebook/huxley
15 https://saucelabs.com
16 https://blogs.oracle.com/translation/entry/agile_localization_more_questions_than
17 http://blog.getlocalization.com/2012/05/07/get-localization-sync-for-eclipse/
18 http://msdn.microsoft.com/en-us/library/windows/apps/jj569303.aspx
19 https://pontoon-dev.mozillalabs.com/en-US
20 http://www.whatwg.org/specs/web-apps/current-work#contenteditable
21 https://developer.mozilla.org/en-US/docs/Localizing_with_Pontoon
22 http://officeopenxml.com/
23 http://opendocument.xml.org/
24 http://www.mediawiki.org/wiki/Help:Formatting
25 http://docutils.sourceforge.net/rst.html
26 http://johnmacfarlane.net/pandoc/
27 https://readthedocs.org/
28 http://www.xml.com/pub/a/2007/02/21/oaxal-open-architecture-for-xml-authoring-and-localization.html
29 http://itstool.org
30 http://manpages.ubuntu.com/manpages/gutsy/man1/xml2pot.1.html
31 http://www.opentag.com/okapi/wiki/index.php?title=Rainbow
32 http://www.opentag.com/okapi/wiki/index.php?title=HTML_Filter
33 http://www.opentag.com/okapi/wiki/index.php?title=Rainbow_TKit_-_PO_Package
34 http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html#sec-further-examples-of-supervised-classification
35 http://www.ttt.org/oscarstandards/srx/srx20.html
36 http://www.opentag.com/okapi/wiki/index.php?title=ratel
37 http://userguide.icu-project.org/strings/regexp
38 http://www.opentag.com/okapi/wiki/index.php?title=Scoping_Report_Step
39 All Microsoft style guides are available from: http://www.microsoft.com/Language/en-US/StyleGuides.aspx
40 https://support.mozilla.org/fr/kb/bonnes-pratiques-traduction-francophone-sumo
41 http://www.opentag.com/okapi/wiki/index.php?title=XML_Validation_Step
42 http://news.cnet.com/8301-1023_3-57422613-93/google-translate-boasts-64-languages-and-200m-users/
43 http://www.welocalize.com/dell-welocalize-the-biggest-machine-translation-program-ever
44 http://bit.ly/dell-alienware-us
45 http://bit.ly/dell-alienware-fr
46 https://github.com
47 https://bitbucket.org
48 The word ‘project’ is used instead of ‘product’ because of the uneven maturity level of the code posted on these platforms.
49 http://wiki.maemo.org/Internationalize_a_Python_application#With_poEdit
50 https://www.transifex.com/signup/
51 http://translate.evernote.com/pootle/projects/kb_evernote
52 http://pootle.translatehouse.org/
53 https://translate.evernote.com/pootle/pages/getting-started/
54 http://translate.sourceforge.net/wiki/pootle/live_servers#public_pootle_servers Note, however, that some of these projects may contain software strings projects rather than user assistance projects.
55 https://translate.evernote.com/pootle/pages/guidelines/
56 http://docs.translatehouse.org/projects/pootle/en/latest/developers/contributing.html
57 https://translate.twitter.com/forum/categories/language-discussion At the time of writing, specific English to target language guidelines could be obtained from this URL by clicking a language and then a link starting with Style guidelines for translating Twitter into
58 https://support.twitter.com/
59 http://developer.android.com/resources/tutorials/localization/index.html
60 http://developer.apple.com/library/ios#referencelibrary/GettingStarted/RoadMapiOS/chapters/InternationalizeYourApp/InternationalizeYourApp/InternationalizeYourApp.html
61 https://www.drupal.org/
62 http://www.joomla.org/
63 http://office.microsoft.com/sharepoint/
64 https://www.drupal.org/project/lingotek
5 Translation technology

The goal of this chapter is to focus on the technology that is linked to content
translation from one language into another. Translation management systems and
translation environments are the focus of the first two sections of this chapter
since they provide most of the infrastructure required for the translation step in
localization workflows. However, it is difficult to introduce translation management
systems without presenting specific translation workflows. Very often translation
management systems provide a workflow engine used to define a series of steps that
allow content to flow up and down the localization chain. Without such systems,
translation processes tend to be inefficient. This does not mean, however, that using
such a system will guarantee smooth localization projects. If a system is chosen for
the wrong reasons or is deployed in a hasty manner without providing appropriate
support to its users, its adoption and subsequent use may lead to inefficiencies.
Understanding the main characteristics of such systems is therefore crucial for
anybody who is in charge of using or managing localization workflows.
In the third and fourth sections of this chapter, tools that are used to reuse
previous translations and handle terminology during localization processes are
covered. While terminology is at the core of most translation tasks, it is particularly
crucial in the localization of Web and mobile applications, since users tend to
interact with applications through translated strings. The fifth section of this
chapter focuses on machine translation, which is used increasingly to support,
enhance, and in some cases replace the translation step in localization workflows.
When used correctly, this controversial technology can boost translation
productivity and increase translation consistency. When used incorrectly, this
technology can have serious consequences (ranging from generating humorous
translations to producing life-threatening inaccurate translations). From a
translation buyer and translator’s perspective, it is therefore essential to know
when and how this technology should be used. The sixth section of this chapter is
dedicated to a workflow step that is closely related to machine translation: post-
editing. With the growing popularity of machine translation, post-editing is also
becoming more and more mainstream. This topic is discussed in a separate section
because it somewhat differs from the traditional act of generating a translation. A
review of post-editing tasks and tools is provided in this section. The last section
extends the section on post-editing by covering quality assurance tasks that are
performed in localization workflows, especially during the translation process.
While the concept of translation verification is not specific to localization,
localization-specific characteristics require the use of dedicated tools to ensure
that quality standards are used and adhered to throughout a localization project –
the ultimate goal being the release of quality localized applications.

5.1 Translation management systems and workflows


Translation workflows can be extremely simple. For instance, a bilingual or
trilingual application developer would be able to create a multilingual application
based on their own contributions. In this very simple scenario, one person is
responsible for releasing a multilingual application. Obviously this scenario does
not scale very well, so translation workflows can become extremely complex when
various stakeholders are involved. For instance, multiple content owners and
authors may be working on the same project, thus requiring multiple translation
providers and in-country reviewers for multiple language pairs, content formats,
volumes, translation quality requirements and delivery dates. All localization
projects fall within this spectrum, which is why numerous translation management
systems are available, each of which offers a set of features that will suit projects
of a certain type (e.g. repetitive, low volume).
In the late 1990s and 2000s, the distinction between translation management
systems (used to manage the overall translation process) and translation
environments (used to perform the actual translation process) was clear-cut.
For instance, a translation environment such as Trados Translator’s Workbench
was used primarily to translate any document that could be opened by Microsoft
Word using pre-defined templates. From a translator’s perspective, however,
this application was not used to receive or manage translation jobs received
from translation buyers. Instead, a translation management system was used
for that purpose, and in the 1990s and early 2000s that probably meant using
a combination of email and FTP servers to transfer files between stakeholders.
Gradually these systems have been replaced by large-scale dedicated (online)
systems, whose main purpose is to centralize translation management activities.
These systems proved popular with large corporations since they offered
concurrent connections and flexible workflows to a large number of users (such
as content owners, project managers, translators, reviewers and DTP specialists).
Some of these systems were then enhanced with functionality that used to
reside on standalone (desktop-based) translation environments. The following
section presents some of the high-level characteristics that translation buyers and
translation providers may be looking for in such systems.

5.1.1 High-level characteristics


Nowadays, many systems allow translation buyers and translation providers
to perform both activities within a single system. Most of these systems allow
translation buyers to do (some of) the following:
• Generate quotes to estimate how much money they will spend on a given
project (consisting of a number of files) and how long the process will take
based on the level of quality they want to obtain (and possibly based on the
amount of translation suggestions they want to make available to speed up
the translation process).
• Upload content that should be translated, as well as any supporting materials
that may need to be used by the translator(s) during the translation
process (e.g. guidelines, translation suggestions contained in glossaries or
translation memories). This upload process may be done using an online
form, which may be manually filled in by somebody requesting a translation,
or programmatically using an Application Programming Interface (API)
call.1
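A minimal sketch of such an API call is shown after this list.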
• Get notifications (i) that the content has been successfully routed for
translation, (ii) when the translated content is available.
• Download and pay for the translated content if it meets the pre-defined
quality requirements.
• Provide feedback on the translations provided so that sub-standard content
can be re-routed for translation, possibly resulting in the (temporary) black-
listing of specific translation providers.
• Access some reporting or analytics functionality in order to keep track of
translation activity in terms of volumes, speed, cost and quality. Even
if reports cannot be generated from the system itself, some data export
functionality should be available so that another application can be used
for the purpose of tracking key performance indicators using specific
metrics.
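Since every system exposes its own API, the snippet below is a purely hypothetical sketch of the upload scenario mentioned in the list above: the endpoint, parameters and token are all invented, and the actual calls would have to follow the documentation of the system being used.

import requests

API_URL = "https://tms.example.com/api/v2/projects/nba4all/resources"  # invented
HEADERS = {"Authorization": "Bearer <api-token>"}  # invented credentials

# Submit a PO template for translation, together with basic project metadata.
with open("locale/messages.pot", "rb") as resource:
    response = requests.post(
        API_URL,
        headers=HEADERS,
        files={"content": resource},
        data={"source_language": "en", "file_format": "PO"},
    )

response.raise_for_status()
print("Resource uploaded, job id:", response.json().get("id"))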

An example of the visual progress of a localization project using Transifex is
shown in Figure 5.1.
From a translation provider perspective, the system should support (some of)
the following functionality:

• Send notifications for well-defined translation jobs to translators that are
matched based on their expertise, quality standard, cost and availability.
• Indicate when payment will be available once the task is completed.
• Allow translation providers to complete the translation job online or offline
by making all source content and assistance materials available in a well-
defined format.
• Pay translators on time if the translation job has been approved by the
customer or allow translators to (temporarily) black-list translation buyers
that do not pay for completed jobs.
• Offer some reporting or analytics functionality in order to keep track of
translation activity in terms of volumes, speed, cost and quality. Even if
reports cannot be generated from the system itself, some data export
functionality should be available so that another application can be used for
the purpose of tracking key performance indicators using specific metrics.

Figure 5.1 Visualizing project progress in Transifex

As shown by these two non-exhaustive lists, it is often necessary to connect
such systems to other systems (e.g. in an automation scenario requiring a tight
system integration) or to make sure that they can receive data that was generated
offline. In such cases, these systems will rely on data exchange formats, which
may or may not be based on official standards such as XLIFF. Because complex
localization projects can include a large number of files in a translation kit
(including translatable files, reference files and guidelines), some recent efforts
have been made to standardize the containers that may be used in translation
and localization projects. For instance, the Linport (Language Interoperability
Portfolio) project (Melby and Snow 2013), which is at the time of writing still
under development, is trying to create an open, vendor-independent format ‘that
can be used by many different translation tools to package translation materials’.2
This project is closely related to the Structured Translation Specifications (STS)
(Melby and Snow 2013), which is a structured set of 21 parameters developed to
help describe project-level metadata such as the target audience of the translated
content as well as the intended use of the translation.3 Time will tell whether
such formats and specifications become widely used in the localization industry.
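
To make the notion of a data exchange format more concrete, the following minimal sketch shows how the translation units stored in an XLIFF 1.2 file could be inspected with Python's standard library. The file name (strings.xlf) is purely hypothetical and the sketch assumes a well-formed XLIFF 1.2 document; real projects would normally rely on a dedicated tool or library rather than a few lines of ad hoc code.

import xml.etree.ElementTree as ET

# XLIFF 1.2 elements belong to this namespace
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical file name used for illustration purposes only
tree = ET.parse("strings.xlf")

for unit in tree.getroot().iterfind(".//x:trans-unit", NS):
    source = unit.find("x:source", NS)
    target = unit.find("x:target", NS)
    # A missing <target> element simply means the unit is still untranslated
    print(unit.get("id"),
          source.text,
          target.text if target is not None else "(untranslated)")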
Some translation management systems have already been presented in Chapter
4 (e.g. Transifex and Pootle), but others are available as commercial or open-source
offerings with hosted or self-hosted options. Some of these systems are suitable for
small projects requiring a very small number of translators using a simple workflow
(e.g. mobile applications) while others focus on large-scale projects requiring
large volumes of words (e.g. many thousands) to be translated into a large
number of languages, possibly using many translators and editors (i.e. a
complex workflow). These systems have additional characteristics depending on
the type or ultimate goal of the project(s) for which they are used. For instance,
some systems should be as invisible as possible, limiting the number of human
interactions to a minimum. Also, a mobile application developer/publisher
that needs to localize their application may rely on a translation management
infrastructure that is tightly integrated with their development environment
(for convenience’s sake). Finally, an open-source or social application developer/
publisher may require a collaborative or crowdsourced translation system in order
to obtain translation or translation feedback from a large user base (instead of
relying purely on translations provided by professionals). These four use-cases,
which can sometimes overlap, are presented and discussed in the next four
sections.

5.1.2 API-driven translation


One example of an API-driven translation management system is Gengo.4 While
this online system offers a traditional Web interface where content owners
can manually upload files for translation, its originality lies in its public API
which allows application developers and technically-oriented individuals to use
the Gengo translation services in a programmatic manner.5 For instance, it is
possible to request translation quotes and translation jobs using a few lines of
code. This automation-oriented approach redefines the translation process as an
asynchronous call, which allows application and content developers to focus on
the development process instead of having to manage the translation process. As
soon as the translation has been completed by a number of preferred translators,
requesters are notified that the translation is available for use.
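
As an illustration, the short sketch below shows what such a programmatic request might look like in Python (using the third-party requests library). The endpoint, payload fields and response fields are hypothetical placeholders rather than Gengo's actual API, which has its own registration and authentication requirements.

import requests

# Hypothetical endpoint and fields: each service defines its own URLs,
# parameters and authentication scheme
API_URL = "https://api.example-translation-service.com/v2/jobs"
payload = {
    "source_language": "en",
    "target_language": "fr",
    "tier": "standard",
    "text": "Your file could not be saved.",
}

response = requests.post(API_URL, json=payload,
                         headers={"Authorization": "Bearer <api-token>"})
response.raise_for_status()
job = response.json()
print(job["job_id"], job["estimated_cost"])  # hypothetical response fields

Once the job is completed, such a service would typically notify the requester via a callback or a polling endpoint, which is what makes the overall process asynchronous.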

5.1.3 Integrated translation


One example of an integrated translation management system is the App
Translation Service from Google Play.6 This service is available to Android
developers who use the Google Play Developer Console to build and distribute
their mobile applications.7 This online system guides application developers who
have agreed to the terms and conditions and registered with the service in:

• Finding out in which countries the app is used even though it has not been
localized into the languages that are primarily used in those countries.
• Identifying similar apps that are popular in countries where their application
is not yet available.
• Selecting which target languages to translate into.
• Identifying and placing an order with a professional translation vendor who
will be able to complete the translation of their application’s strings.
• Communicating with the translator(s) to clarify any questions that may arise
during the translation process.
• Downloading the file(s) containing translated strings.

This infrastructure clearly simplifies the management of the translation
process from the application developer/publisher’s perspective. While certain
questions from translators may have to be answered, most of the other steps
will happen in an asynchronous manner. The application developer is able to
leverage the translated strings to build their multilingual application whenever
they become available. Once the translations are available, however, the
application developer/publisher still has to perform some localization testing to
ensure that strings are not truncated or concatenated in an incorrect way. This
system is also equipped with advanced analytics functionality to help application
developers understand how users discovered their application, which devices
they downloaded the application from and possibly how they have been using
the application.8 This section focused on a specific platform and translation
management environment (Android using Google services), because it is one of
the best integration examples. Other platform providers (e.g. Apple with iOS/
OSX and Mozilla with Firefox OS) are less particular about the way application
developers should leverage localization services.9, 10

5.1.4 Collaborative translation


Collaborative translation management systems are systems that allow multiple
translators to work on a single project, usually for a specific language pair (say
English to Italian). These systems allow source content owners or maintainers
to upload content so that it can be made available to translators that match
specific criteria. The simplest criterion is that existing translators are already
attached to a specific project, so every time content needs to be translated, they
are notified accordingly. In some cases, it is possible to allow new translators
to collaborate on a given project, so that translations can be made available
even if the usual, core translators are not available. The words usual and core
imply some form of hierarchy, which may exist in large projects. Large projects
may have one or more maintainers for a given language pair, who decide who
can provide translations, and whether these translations must be reviewed
by another translator before they can be considered final. Regardless of the
number of translators involved in such a project, the term collaborative is
appropriate when all of the parties want to achieve the same goal: produce
the best translation quality possible for the project. This type of system is
particularly popular with open-source projects since these projects often rely
on communities of volunteer translators who share a common goal: making a
given application available in as many languages as possible in order to extend
its user base. Launchpad and Transifex are good examples of such collaborative
systems since they are used to manage the translation of the Ubuntu
Linux operating system and the Disqus commenting system respectively.11, 12 This
concept of collaborative, community translation has also been embraced and
customized by for-profit organizations, such as Twitter or Facebook, who have
built their own translation centres where users can contribute translations on a
pro bono basis.13, 14 In a for-profit scenario, the term crowdsourcing is often used
due to the large size of the translation communities (e.g. Twitter boasts more
than 350,000 translators).15

5.1.5 Crowdsourcing-based translation


The Facebook quality evaluation model described by Kelly et al. (2011) includes
several steps, but only two of them involve professional translators. The first
step concerns the extraction of translatable strings from the source code, and
the second step focuses on the actual translation of these strings by volunteers
who are users of the application. Since multiple variants may be generated for a
single string, the third step involves a voting process whereby the best translation
is chosen by a group of users. Once a consensus has been achieved, professional
translators are eventually invited to make sure that these translations are globally
coherent and consistent, confirming that this hybrid model operates at two
levels: first with volunteers operating ‘at the segmental or microtextual level,
while the macrotextual level is mostly controlled by experts’ (Jiménez-Crespo
2011: 136). In this scenario, volunteers have been found more likely to focus on
small pieces of text rather than long paragraphs. As far as professional translators
are concerned, it is therefore necessary to develop effective and creative strategies
to deal with such user contributions. If the objective of the task they have been
commissioned to do is to harmonize multiple translations, finding variants and
replacing them with their preferred forms is going to be a frequent task. In this
light, the text processing techniques that have been introduced in Chapter 2
should be extremely useful. But perhaps more importantly, translators may in
some cases have to accept some of the translation choices that have been
made by the community. While these choices may not always correspond to
what translators would have produced themselves or may not fit into their own
quality framework, they are the manifestations of the community’s expectations.
Translators must therefore learn how to accept and respect such choices, which
can then be harmonized thanks to their linguistic and textual expertise. Such
harmonization will not always be straightforward since ethical dilemmas and
loyalty conflicts occur ‘between the different parties involved in translation
(revision) projects: source-text author, commissioner, translation agency, target-
text reader, translator, reviser’ (Künzli 2007: 53).
There are of course some implications with regard to the public perception of
translation if the skills required to produce translations are not highlighted in this
type of scenario. For this reason, McDonough Dolmaya (2011: 107) warns that
‘many crowdsourcing models are likely to leave only revision and consulting as
areas of paid translation-related work, which may lead to this kind of work being
seen as higher status activities than translation.’ While such risks exist, they
are unlikely to affect all content types. The Facebook model is indeed unlikely
to be suitable when a product or service must be released at a given date. As
mentioned in the previous section, longer paragraphs are likely to be neglected
by community volunteers. A recent study (Dombek 2014: 235) also found that
some volunteer translators can be annoyed ‘with the fact there [are] not many
strings which [they can] actually attempt translating’ because many software
strings contain variables (such as %s) that can be intimidating or confusing to
somebody who does not know how the string might be rendered in the final
application. So if a product component (e.g. some user assistance content)
has to be localized before the release of a product, relying on a community of
volunteers may prove problematic. However, the translation task may still be
shared among several professional translators. In this case, translation choices
would have to be harmonized again to make sure that the final content meets
the quality expectations defined at the start of the project. In order to do this,
translators may have to communicate at some stage (either with each other or
with the entity that commissioned the work) in order to check their progress.
Various communication channels may be used for this purpose, such as email,
instant messaging, or if available, in-application commenting. Being able to
clearly describe one’s problems is of course a prerequisite, which will be helped by
a translator’s knowledge of other cultures. Regardless of the type of environment
that is used to manage the translation aspect of a localization process, the actual
translation step must be conducted in a translation environment as discussed in
the next section.

5.2 Translation environment


A translation environment is an environment where the actual translation
process is performed by a translator. Lagoudaki (2009) identifies four types
of translation environments:

• Standard text processing environments (such as Microsoft Word) for which
plug-ins have been created to bring translation-assisting functionality into
these programs
• Dedicated text processing environments, which often display the source and
target segments in a vertical or horizontal tabular way
• Translator-friendly word processors, into which text can be copied and
pasted from any file
• Native applications where source content resides.

As far as the localization of apps is concerned, the first and third environments
are unlikely to be used for reasons that have been detailed in the previous chapter.
The choice of a translation environment (or a combination of translation
environments) depends on multiple factors, including:

• Customer requirements. Whether a translator is dealing directly with a
content publisher or a language service provider, these customers may insist
on the translators using a specific translation environment. Even if the
translator is free to pick the environment of their choice, they may have to
make sure that their use of a particular application does not violate any non-
disclosure agreement they may have signed with their customer.
• Translators’ preferences. When a translator knows that they are extremely
productive in a given environment, they may be unwilling to use another
environment, especially if the translation job is small. In such a scenario, the
time (and possible cost) invested in learning and using a new environment
may not always be justified. Productivity is not the only preference that
comes into play. The terms and conditions associated with the use of a
particular application (be it Web-based or desktop-based) may conflict
with a translator’s views on how the data generated will be handled by the
system.
• Location and Internet connection speed. Using an online environment often
requires having a fast and reliable Internet connection, which makes the
translation experience as smooth as possible. Ideally translators should not
notice that they are working online. In some cases, however, some latency
may be experienced, especially when working away from one’s traditional
work environment (e.g. from a hotel room while travelling). Being in a
country that is different from one’s usual country of residence may also have
an impact on the online experience. Some online systems are run on servers
that are located in specific countries so the ease of access to these systems
will vary from one country to another, depending on how close one is to one
of the servers.

The distinction between online and offline translation environments is not
the same as that between Web-based and desktop-based applications. Some desktop-based
applications have some functionality that allows them to connect to online
services such as translation memory servers or MT systems. These connections
are often required to centralize the translation work in a given repository, in order
to possibly ensure that a team of translators will benefit from each other’s work.
With the advent of hosted and cloud-based services, however, some of the tasks
that used to be exclusively performed in a desktop-based environment are now
moving to Web-based environments, which means that translators can work on
translation assignments using a Web browser instead of a standalone application.

5.2.1 Web-based
Some of the translation management systems that have already been covered
in Chapter 4 (e.g. Transifex and Pootle) have their own Web-based translation
environment.16, 17
These systems allow translators to accomplish (some of) the following tasks:

• Translate segments that have been extracted from a source content set (be it
a set of software strings or a set of help content).
• Connect to third-party systems that will provide translation suggestions, such
as dictionaries, translation memory systems or machine translation systems. If
these systems have been correctly configured, they should help make the
translation process more efficient.
• Download a translation package containing both source content and
translation suggestions to work offline.
• Upload a translation package once the work has been completed offline.

Figure 5.2 Online translation environment in Transifex

• Check their translations to help produce the quality level that meets
customers’ requirements. Checks may include the detection of spelling,
grammar or style mistakes, as well as the identification of problems that
would affect the build process (e.g. missing or broken tags, duplicated hotkey
markers).
• Get paid for the work produced.

Transifex’s online translation environment is shown in Figure 5.2, revealing
an uncluttered tabular layout that offers translators access to powerful features
such as concordance search, machine translation suggestions, revision history
and source comments.
While these systems are becoming more and more sophisticated, it would
appear they are not (yet) as popular as traditional desktop-based programs if
one is to believe the results of an informal survey conducted on the blog of a
professional translator.18 This survey was conducted to determine which translation
environment was most used by translators and most of the answers given point
to desktop-based environments (e.g. SDL Trados, memoQ, Déjà Vu, WordFast
and OmegaT) even though most of them are restricted to specific platforms (i.e.
mainly Windows and sometimes Mac). Obviously these results must be interpreted
carefully as they are not specific to the software localization industry.

5.2.2 Desktop-based
A large number of translation environment tools exist, ranging from free open-
source programs (such as Virtaal or OmegaT) to large, commercial suites such
as SDL Trados Studio.19 Some of these programs are based on a client/server
architecture, which means that the translations and translation resources that
are generated and used can be synchronized across a network. Some of the
functionality of these programs can sometimes be made available in standard word
processing environments (such as MS Word), which are favoured by a number of
translators for productivity reasons (Lagoudaki 2009). One of the most common
translation features used in this manner is that of translation memory lookup,
which allows translators to translate a document in the environment of their
choice while leveraging a translation memory database, as briefly described in
the next section.

5.3 Translation memory


A translation environment is useful when its features allow translators to be
productive and effective in delivering quality translations. Translation memory
technology, which is not specific to the localization of applications, is one of
these features. Translation memory functionality is a key feature of most (if
not all) translation environment systems since it allows for the reuse of legacy
translations. As explained in Section 3.1.2, reusing content often results in
time (and cost) savings compared with creating content from scratch. Translation memory
technology is a long-established technology based on the concept of sequence
matching: it compares how similar or different two segments (or strings) are
and provides a matching score when the two strings are not exactly identical.
While issues exist with this technology, especially when translation memories are
not maintained over time (Moorkens 2011), it is unlikely professional technical
translators could do without them in today’s competitive translation landscape.
Since good translation memories allow translators to be more productive, it
would be unthinkable to tackle repetitive texts without them. When choosing
a translation memory tool, it is important to decide which type of matching
should be obtained. Should the matching be based on characters, words,
structure or meaning of the segments? While it might be mechanically easy to
edit translations originating from segments that differ in terms of punctuation
characters, case or function words, the cognitive load associated with the editing
process may increase when the segments differ semantically. Let’s consider the
following segments:

• From the “file” menu click “save as”.
• In the File menu, click on ‘Save as’.
• Do a left click on ‘Save as’ in the File menu.
• The “file” menu and “save as” are accessible from the program.

The first three segments are semantically identical while differing in terms
of punctuation, case, lexical choice and word order. The meaning of the fourth
segment, however, is completely different from that of the other segments but
it shares many words (and sequences of words) with the first segment. From
a translation productivity perspective, leveraging the translation of the first
segment when translating the second segment seems beneficial since no (or little)
editing would be required. Leveraging the translation of the first segment when
translating the fourth segment, however, may not be as effective because of the
semantic differences that exist between the two source segments. This cognitive
challenge is likely to be greater when translating into morphologically rich
languages with case-based inflection. For instance, two word sequences may be
identical in a given language but different in another language depending on the
role of this word sequence (e.g. subject vs. object). Another cognitive challenge
may arise if the translation memory tool does not include any word alignment
visualization between the source and target segments. In the example above, one
of the differences between the first and fourth segments is the word click. It might
be useful for a translator to know that this word is missing from the fourth segment
when leveraging the first segment. Having access to this information (possibly
through some colour-coding visualization scheme) may help the translator decide
whether this word should be removed from the translation suggestion. However,
it might be equally (or even more) useful to know where the translation of this
word is in the translation suggestion. Having to read (or scan) the translation
suggestion to find (and possibly delete) the translation of the word click does not
seem the most efficient use of technology.
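
To see how a purely character-based matching score behaves on the segments above, the following minimal sketch uses the SequenceMatcher class from Python's difflib module. Commercial translation memory tools use their own, usually more sophisticated, matching algorithms, so the scores shown here are only indicative.

from difflib import SequenceMatcher

segments = [
    'From the "file" menu click "save as".',
    "In the File menu, click on 'Save as'.",
    "Do a left click on 'Save as' in the File menu.",
    'The "file" menu and "save as" are accessible from the program.',
]

new_segment = segments[0]
for candidate in segments[1:]:
    # ratio() returns a similarity score between 0 and 1 based on
    # character sequences, not on the meaning of the segments
    score = SequenceMatcher(None, new_segment.lower(), candidate.lower()).ratio()
    print(round(score, 2), candidate)

A character-based score can rank the fourth segment relatively high despite its different meaning, which illustrates why the matching strategy used by a translation memory tool deserves careful consideration.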
Another aspect to keep in mind when selecting a translation memory tool is
its ability to export the contents of a project so that they can be used in another
environment. While the most important parts of a translation unit are the source
and target segments, it can sometimes be useful to export additional metadata
as well (e.g. creation date of the translation unit, author of the source segment,
author of the target segment, number of times the translation unit has been
leveraged in translation projects). Exporting translation memory data is often
performed using the TMX standard that was covered in Section 2.5.2.
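
As a minimal sketch of what such an export could look like, the following lines build a small TMX 1.4 document with Python's standard library. A production export would normally be generated by the translation environment tool itself and would carry richer metadata (creation dates, authors, usage counts and so on).

import xml.etree.ElementTree as ET

# This qualified name is serialized as the xml:lang attribute required by TMX
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def write_tmx(pairs, path, srclang="en", tgtlang="fr"):
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", srclang=srclang, segtype="sentence",
                  adminlang="en", datatype="plaintext",
                  creationtool="export-sketch", creationtoolversion="0.1",
                  **{"o-tmf": "none"})
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((srclang, source), (tgtlang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(tmx).write(path, encoding="utf-8", xml_declaration=True)

write_tmx([("In the File menu, click on 'Save as'.",
            "Dans le menu Fichier, cliquez sur « Enregistrer sous ».")],
          "export.tmx")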

5.4 Terminology
This section focuses on terminology, which is at the heart of the translation
process in the localization of applications. It is divided into the
following sections: first, the importance of terminology is discussed from a
localization perspective. The second and third sections focus on the extraction
of terminology, or more precisely the extraction of candidate terms. The fourth
section covers various ways in which translations can be acquired once candidate
terms have been validated. The final section explains how extracted terms and
their translations can be made available in terminology glossaries, which can
then be used during the translation or quality assurance process.

5.4.1 Why terminology matters


As mentioned in Section 2.5, translation kits that are made available to
translators often contain some terminology-related resources. These resources are
made available to guide the translation of specific phrases or terms with a view
to maximizing the final end-user experience. This is confirmed by the translation
guidelines that are made available to Android developers.20 These guidelines
advise developers to employ Android standard terminology as much as possible
and to create a key terminology list that can be distributed to translators.
Various issues can occur when such phrases or terms are not translated properly
during the localization process. These issues include translation inconsistencies,
inaccurate translations or inappropriate translations. Translation inconsistencies
can occur in any project of any size, when key phrases or terms (such as product
names or feature names) are translated using a number of translation variants
in a given target language. Inconsistencies tend to occur in projects involving
multiple translators who do not work in a collaborative manner. If a quality
assurance step is not performed to identify and resolve such inconsistencies once
the translations have been produced, the cohesion of the final target text may be
impacted and some readers may be confused. It should be mentioned, however,
that these inconsistencies may be useful in specific situations. For instance, users
sometimes search for documents (e.g. online help) using queries based on words
or phrases. If a document consistently uses a phrase that is not expected by a
user, this user’s queries will not return any match. If the document contained
terminological inconsistencies, the user may be able to have some of their queries
matched with specific document sections.
Another type of translation issue concerns inaccurate translations, which
may occur because translators are not given enough context or guidelines.
Examples of inaccurate translations include terms or phrases that should have
remained in the source language because of legal or compatibility implications.
Some terms, such as brand names or product names, are rarely translated since
they are protected by copyright or trademark laws. Some translators, however,
may feel that brand names may have to be translated to retain some of the
connotation associated with the syllables or components that make up a given
name (such as Microsoft for example). Inaccurate translations can also be found
in projects that refer to other products or tools that have yet to be localized.
For example, Linux-based operating systems contain command-line tools that
can be executed using specific commands corresponding to common, lower-case
words, such as cut or paste. These commands cannot be executed by using a
translated word such as couper or coller in French, but inaccurate translations
may creep in if source documents referred to these commands in an ambiguous
manner (e.g. You can merge two files with paste). To some extent, this problem can
also affect those projects for which the documentation set is translated while the
software itself (the user interface) is not. In such a case, the documentation set
may refer to actions a user should conduct by interacting with the user interface.
For example, the sentence You must remove the file by clicking the Delete button in
the File action dialog. contains two UI labels (Delete and File action) that should
remain untranslated if the software is not localized. Otherwise, the user would
have to translate these phrases back into the source language when trying to
locate such labels in the user interface.
Finally, inappropriate translations can also occur in localization projects when
users expect certain terms that have not been selected by translators during the
translation process. Taking the example used earlier whereby the word email
could be translated either as e-mail or courriel in French, one can see that personal
preferences will have an impact on how appropriate a translation is judged to be by
final end-users. Global software projects often use English as the source language
to develop software and documentation content. Software professionals are often
expected to have some advanced English skills to work effectively in this industry,
and for those whose native language is not English, it is not uncommon to be
tolerant to having English phrases left in the target language. This means that a
certain disconnect can occur between software professionals and translators since
translators (who have been asked to translate content) may translate terms that
end-users would prefer to read in English. An extreme example is provided by the
Adobe Globalization team, who reveal that some Russian customers preferred
reading API documentation in English rather than Russian.21
In order to avoid situations whereby terms or phrases are mistranslated or
translated inappropriately or inconsistently, some terminological work may be
performed before, during or after the translation process. Such work focuses on
the identification of terms in the source language (extraction) and possibly in the
target language (via translation or extraction). The next sections focus on the
extraction of terms from monolingual and bilingual documents.

5.4.2 Monolingual extraction of candidate terms


Before a final list of terms can be created for a content set, candidate terms must
first be extracted from the source content and possibly be reviewed by a knowledge
expert (e.g. a content developer or ideally a terminologist). In this section and
subsequent sections, an example source documentation file is taken from the
documentation guide pertaining to the JBoss application server.22 Extracting
candidate terms or phrases from source content can be performed using a number
of approaches, ranging from statistical to rule-based, and possibly including
hybrid methods. Statistical methods are used to try to identify sequences of words
that appear frequently in a given content set. The length of these sequences can
usually be defined by a user in order to limit the number of candidate sequences
to review. Statistical methods very often do not take into account linguistic
information, such as the part-of-speech of a given word (i.e. whether the word is
a noun or a verb), so lists of candidate terms may be noisy because some of these
candidate terms are not actual terms. To illustrate this approach, terms can be
extracted using Rainbow’s terms extraction feature, as shown in Figure 5.3.
As shown in Listing 5.1, however, many of these candidates would not necessarily
qualify as terms in a localization project.
Listing 5.1 shows that some candidate terms actually occur in other candidate
terms. For example, Application Platform also occurs in the longer string Enterprise
Application Platform. While Rainbow provides an option to try to ignore terms
that appear in longer strings, it is not always easy to decide whether it is preferable

Figure 5.3 Extracting candidate terms with Rainbow

16 sentence
15 The
12 documentation
11 user
10 installation
9 Before
8 After
8 Application Platform
8 Enterprise Application Platform
8 If
8 It
8 JBoss Enterprise Application Platform
8 Notes
8 Platform
8 developers

Listing 5.1 Extracted candidate terms with Rainbow

to keep shorter strings instead of longer strings. This decision is often influenced
by the way in which terms are going to be translated in various target languages.
In order to avoid undesirable concatenation issues (whereby the translation of
term A and the translation of term B cannot be glued together to produce a
correct translation for term AB), it is sometimes preferable to keep longer terms
when validating term candidates.
Another issue can be seen in Listing 5.1, whereby the list includes some
unlikely candidates, such as Before. This is due to the fact that Rainbow
does not use any linguistic knowledge to extract candidate terms, so the output
tends to be noisy especially if stopwords, which are (undesirable) words filtered
out before or after processing text, are not used to refine the results. Examples of
stopwords typically include function words (such as the or during) and common
content words that are not domain specific. To work around this problem more
sophisticated tools can be used in order to label each word with a part-of-speech
tag before performing the actual extraction. This is the approach that is used
by LanguageTool, which was introduced in Chapter 3 in Section ‘Language
checkers’. The extraction of candidate terms based on part-of-speech tags may
be available in commercial or open-source tools. As mentioned in Chapter 2, the
Python programming ecosystem is rich in terms of additional, focused tools that
supplement the core language. These tools are often known as libraries since they
provide specific functionality, which would take a significant amount of time
to develop from scratch. One of these libraries is the Natural Language Toolkit
(NLTK) (Bird et al. 2009). This library allows users to perform in sequence
some of the tasks that are required to extract candidate terms, including text
segmentation, sentence tokenization, part-of-speech tagging and chunking.23
These techniques allow the creation of chunk patterns to extract only those
substrings that correspond to specific sequences of tags, such as sequences of
nouns (e.g. at least one common or proper noun, either in singular or plural
form). Once these strings are extracted, a final step is required to group them in
such a way that term variants (e.g. singular and plural) are merged together before
displaying frequency information. Merging singular and plural forms of strings
can be described as a normalization process, whereby the canonical, dictionary
form of a word is used to identify variants. This process, which is known as
lemmatization, can be achieved with NLTK using the WordNet resource.24 As
shown in Listing 5.2, the candidate terms and frequencies obtained using this
approach differ substantially from those presented in Listing 5.1.
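
One possible shape for the custom script behind Listing 5.2 is sketched below. It assumes that the relevant NLTK data packages (the sentence tokenizer, the part-of-speech tagger and WordNet) have been downloaded and that the documentation extract has been saved as plain text in a file called jboss_doc.txt, a hypothetical name used here for illustration; it also simplifies matters by lower-casing words before lemmatization.

from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

# Chunk pattern: one or more common or proper nouns, singular or plural
parser = nltk.RegexpParser("TERM: {<NN.*>+}")
lemmatizer = WordNetLemmatizer()
counts = Counter()

with open("jboss_doc.txt", encoding="utf-8") as f:  # hypothetical file name
    text = f.read()

for sentence in nltk.sent_tokenize(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for subtree in parser.parse(tagged).subtrees(lambda t: t.label() == "TERM"):
        # Lemmatize each word so that singular and plural variants are merged
        lemmas = [lemmatizer.lemmatize(word.lower()) for word, tag in subtree.leaves()]
        counts[" ".join(lemmas)] += 1

for term, frequency in counts.most_common(15):
    print(term, frequency)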
Once candidate source terms are extracted, translations must be identified if
the objective of the extraction is to provide translators with a glossary of preferred
translations. This step is described in the next section.

sentence 16
developer 8
user 8
server 7
Notes 6
documentation 6
directory 6
chapter 5
JBoss Enterprise Application Platform 5
something 5
information 5
CDs 4
test lab 4
voice 4
installation 4

Listing 5.2 Extracted candidate terms using a custom script



5.4.3 Acquisition of term translations


The acquisition of translations can occur in two complementary ways: by
asking translators to translate validated candidate terms or by mining previous
translations. The first option has the advantage of introducing translators to
the domain if these translators are also going to be responsible for translating
the rest of the content. When this option is used, some context sentences must
be provided to translators so that accurate translations can be provided. The
disadvantage of this approach is that it may be quite slow if extensive research is
required to identify translations. Typically, translators will try to identify accurate
translations from existing resources (such as terminology databases, translation
memories or monolingual corpora in the target language). If no suitable
translation can be found, new terms may have to be coined. In order to speed
up the research process, tools can be used to extract bilingual phrase pairs from
existing sentence-aligned resources, such as translation memories. One such tool
is Anymalign, which is a self-contained Python script that can be used to extract
phrase pairs in any number of languages (Lardilleux and Lepage 2009).25
This tool uses an iterative algorithm to detect and refine the alignment of
phrase pairs. Anymalign does not perform any linguistic processing of the
bilingual or multilingual texts (e.g. part-of-speech tagging). Instead, it relies on
sub-sentential alignment probabilities to identify phrases that seem to be aligned
from one language to another based on the number of times they occur in the
input text. The longer the script runs, the better the alignment becomes since
probabilities are constantly updated. The main script may be used to create a list
of aligned phrase pairs using aligned text data files such as the documentation
set pertaining to the Linux-based KDE Software Compilation (previously known
as the K Desktop Environment, KDE), which is made available via the OPUS
corpus (Tiedemann 2012).26 The script can be run as follows:

$ python anymalign.py -t 10 KDE4.en-fr.en KDE4.en-fr.fr > any.out

For the sake of simplicity the script is used in this example with the -t option so
that the script stops after ten seconds. It is possible to let the script run for much
longer but this may not be advisable in a low-resource computing environment.
Also, the files used in this example had not been tokenized. Once the script is
run, the results can be found in a text file called any.out. This file can then be
searched to look for term translations as shown in Listing 5.3.
By default the output of Anymalign contains three values when two input files
are selected. The first and second values are translation probabilities, where the
first value is the probability of the target given the source and the second value
the probability of the source given the target. The third value (which is used to
sort the results) corresponds to an absolute frequency.
In Listing 5.3 the grep tool is used with the -P option to look for patterns
in the output file using regular expressions. Since the output may contain
phrases containing multiple words, the ^ and \t delimiters are used to narrow
$ grep -P "^developer\t" any.out
developer développeur - 1.000000 0.800000 4
$ grep -P "^server\t" any.out
server serveur - 0.941176 0.592593 16
server Serveur - 0.058824 0.041667 1
$ grep -P "^user\t" any.out
user utilisateur - 0.454545 0.277778 5
user l’utilisateur - 0.363636 0.800000 4
user user - 0.090909 1.000000 1
user nom d’utilisateur. - 0.090909 1.000000 1
$ grep -P "^production use\t" any.out

Listing 5.3 Searching term translations in Anymalign output

down the results. The three commands used for the terms developer, server and
user return up to three phrase pairs, with the most frequent one looking like a
good translation in the first three cases. The fourth command used for the term
production use does not return any result but this is not too surprising considering
(i) the script was run for a very short time and (ii) the data files used for the
bilingual extraction (i.e. KDE documentation) do not fully match the topic of
the file used for the monolingual extraction (i.e. JBoss). This second issue is very
common in localization projects (and more generally in translation projects)
since new terms that have never been translated before will keep appearing.
In this case, it will be the responsibility of a translator to provide a translation
(using traditional translation techniques such as borrowing or equivalence).
Defining the translation of a new, frequent term early on in a project is often
effective in order to avoid having to resolve translation inconsistencies at a later
stage.
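
When many terms need to be checked, it can be more convenient to filter the Anymalign output programmatically rather than running one grep command per term. The sketch below assumes the column layout shown in Listing 5.3 (source phrase, target phrase, a placeholder for lexical weights, the two translation probabilities and the absolute frequency); the exact layout may differ between Anymalign versions, so the parsing logic may have to be adapted.

def load_phrase_table(path, min_prob=0.3, min_freq=2):
    """Keep only phrase pairs that look reliable enough to review."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            source, target = fields[0], fields[1]
            # Collect the numeric values (probabilities and frequency),
            # skipping the "-" placeholder used for lexical weights
            values = [float(v) for v in " ".join(fields[2:]).split() if v != "-"]
            if len(values) < 3:
                continue
            p_target_given_source, p_source_given_target, freq = values[:3]
            if p_target_given_source >= min_prob and freq >= min_freq:
                pairs.append((source, target, p_target_given_source, freq))
    return pairs

for source, target, prob, freq in load_phrase_table("any.out"):
    print(source, target, round(prob, 2), int(freq))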
As an alternative to the detailed process presented here, one may consider a
tool such as poterminology for the extraction of terminology from PO files.27
Ultimately one has to decide whether they are looking for a one-click solution
(which may or may not be customizable and extensible) or a framework to refine
existing approaches. While the latter is more demanding than the former in
terms of initial investment, it may pay off in the long term. Regardless of the
method chosen, it does not do any harm to have a detailed understanding of how
things work once a button is clicked.

5.4.4 Terminology repositories and glossaries


Once source terms have been extracted and validated for a given project, target
translations have been defined, and any potential usage notes or comments
have been created, terminology resources must be stored in a way that will make
them available for future use and/or update. Two main strategies exist for making
terminology available to people involved in the translation process. The first
consists in giving translators access to a terminology system of record so that
terms can be consulted via a Web interface or an API lookup. The second consists
in making terminology available via an export file so that terms can be consulted
offline. As far as publishers are concerned, two options exist to build a system
of record: using a dedicated system or using a terminology platform that already
contains domain-specific terminology (e.g. EuroTermBank or TermWiki).28, 29
Regardless of the approach chosen, terminology updates must be handled
carefully. In a localization project, it is not uncommon for source or target
terminology to evolve on the publisher’s side based on marketing decisions,
trademark disputes, user research or personal preferences. Such changes must
be reflected as quickly as possible in the terminology system of record so that
translators or reviewers can be informed before their translation or review
assignments are complete. If a terminology export process is used to make
terminology available, then new files should be generated to let translators know
that new terms should be considered or that some new translations should be used.
These approaches are reflected in the way Microsoft makes terminology available
to application developers and localizers: via a Web-based portal, through an API,
and via file download for both product terms and UI strings.30, 31, 32, 33
For terminology to be of any use during the translation process, it must be
presented in a standard format. If translators had to deal with new terminology
formats from one project to another, their productivity would be affected because
a lot of time would be spent mining terminology files or entries to extract relevant
information. In order to standardize terminology transfer, various formats have
been proposed over the years. One of these standards is the XML-based Term
Base eXchange (TBX) standard that had been defined by the now defunct
Localization Industry Standards Association (LISA). While LISA has now
stopped its activity, the TBX specifications are still available online.34 The format
is also still supported by numerous applications and used by various entities to
export terminology data. For example, this is the format used by Microsoft to
export the following information for each of their terms: a terminology concept
ID, a definition, a source term, a source language identifier, a target term and a
target language identifier.
While TBX has proved a popular standard in the localization industry,
other standards and formats also exist. Some of these are more oriented toward
terminology exchange between systems (e.g. exporting terminology from one
machine translation system to another). An example
of such standards is the XML-based Open Lexicon Interchange Format (OLIF)
that was developed in the 1990s.35 A more recent format falling into this category
is the text-based, tab-delimited UTX format, which originated from the Asia-
Pacific Association for Machine Translation.36 The goal of this format is for users
to create, reuse and share glossaries to improve translation quality. As far as human
translators are concerned, UTX is a compact, easy-to-build glossary format that reduces
the time and cost required to check terminology. As far as terminology-based
machine translation and terminology tools are concerned, UTX can be used as
glossary data that do not require any modification. With so many terminology
formats available, it is sometimes necessary to convert data from one format to
another. While custom scripts can be designed for such purpose, some tools are
already available.37
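
As a simple illustration of how a tab-delimited glossary (such as a UTX file) might be consulted programmatically, the sketch below loads source and target terms into a Python dictionary. It deliberately ignores most of the UTX specification (version header, additional columns such as part of speech or status) and only assumes that comment and header lines start with a # character and that the first two columns hold the source and target terms; the file name is hypothetical.

import csv

def load_glossary(path):
    glossary = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Skip empty rows and header/comment lines (which start with '#')
            if not row or len(row) < 2 or row[0].startswith("#"):
                continue
            source, target = row[0].strip(), row[1].strip()
            glossary[source.lower()] = target
    return glossary

glossary = load_glossary("glossary.utx")  # hypothetical file name
print(glossary.get("server"))             # e.g. 'serveur'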
Once glossaries have been created they can be used during the translation or
quality assurance process. A basic usage consists in having a glossary open for
reference in an application (say, a text editor, a spreadsheet or a Web browser)
while translating using another application (i.e. a translation environment
tool). An advanced usage consists in importing a terminology glossary file
into a translation environment tool so that glossary terms can be detected in
source segments during the actual translation or review process. Once again, the
selection of one or the other approach will depend on numerous factors, including
the quality of the glossary, and the usability of the terminology detection feature
provided by the translation environment tool, both from a visualization and
linguistic perspective. From a visualization perspective, terms that match glossary
terms should be highlighted in an intuitive and non-intrusive manner. From a
linguistic perspective, terms should be detected using morphological information
so that dictionary-form term entries match inflected forms. The actual process of
detecting terminology omissions can be regarded as a quality assurance task, so it
will be treated in Section 5.7.3.
This section has provided a comprehensive overview of some of the tasks
involved in handling terminology during a localization project, as well as a
detailed insight into terminology extraction. Extracting terminology and creating
terminology resources can be useful not only for human translators, but also for
MT systems. This technology is the focus of the next section.

5.5 Machine translation


Machine translation is becoming mainstream, with more and more language
service providers using the technology as part of their process (Choudhury and
McConnell 2013). As mentioned in Section 4.4.2, machine translation is also
being used to pre-translate online content in order to provide users with a gist
of the original information. One of the main criteria for the use of machine
translation technology in any given localization project is the volume of content
that should be translated. When volumes are low, the effort involved in deploying
or using the technology is often hard to justify, given the loss in translation
quality compared with a traditional translation process that relies on professional
translators. Large content volumes tend to appear when an application’s content
set is translated for the first time into a target language.
This trend presents both threats and opportunities to translators. On the one
hand, translation tasks that used to be performed by human translators can now
be partially or fully automated as long as the machine translation system used
generates output of acceptable quality. Achieving an acceptable level of quality,
however, can be extremely difficult since multiple factors have to be taken into
account. These include user expectations and how much effort was put into
preparing the MT system. On the other hand, the preparation (or customization)
of an MT system is a new activity that can be performed by translators who are
willing to embrace the technology. The preparation of an MT system therefore
forms the main part of this section.
The preparation of an MT system is a task that can (and possibly should) be
performed by translators with some computational linguistics expertise. While
some of this preparation can in theory be performed in a language-agnostic way using data-
driven approaches, language experts with a good knowledge of the source and
target languages are able to identify and address linguistic issues. This is especially
true as far as rules-based machine translation systems are concerned since these
systems tend to generate translation output in a predictable manner.

5.5.1 Rules-based machine translation


The steps involved in creating a rules-based system have already been described
in detail in Arnold et al. (1994) and Barreiro et al. (2011). This
section focuses on the tasks that can be performed by translation specialists to
customize an existing machine translation system. It deals mainly with the
customization of proprietary systems that do not expose all of their rules
to end-users. Obviously, open-source systems such as Apertium are well suited to
full customization (Forcada et al. 2011).
The traditional architecture underlying rules-based machine translation is
often based on a three-step approach, which consists in analysing the source
text, transferring the structure and/or meaning of the source text into a target
structure, and finally generating a target text that complies with target language
conventions. The analysis step is crucial in avoiding misinterpretations of the
source text since these misinterpretations would be propagated through to the
next two steps. For this reason, the code base of commercial MT systems tends to
be analysis-heavy – 80 per cent for the SYSTRAN system (Surcin et al. 2007).
The analysis step builds on some of the techniques that were introduced earlier
in this chapter, including sentence segmentation, tokenization, part-of-speech
tagging, chunking or syntactic parsing. The transfer phase mainly relies on
dictionary entries, which map source terms and phrases with equivalent terms
and phrases in the target language. The acquisition of such resources can be
automated based on the principles outlined in the previous section. Finally the
generation module ensures the correct inflection agreement between related
words (e.g. a verb and its auxiliary) and the correct handling of specific words
(e.g. inserting words when translating from a language that does not use personal
pronouns).
The customization of a rules-based machine translation system may be necessary
when the quality of the baseline system does not meet pre-defined quality levels
(e.g. the translation output is very often incomprehensible or the translation
output always requires a lot of post-editing effort). In order to customize a rules-
based machine translation system, several options are available besides creating
new rules, including the creation of pre-processing modules, custom dictionaries
or post-processing modules. The main objective of a pre-processing module is to
prepare the input text in order to maximize the use of other modules (such as
the analysis or transfer modules). Some of the techniques covered in Chapter 3
in Section ‘Language checkers’ may be used to normalize spelling or simplify the
grammar used in the source text. When input text containing spelling mistakes
is submitted to a machine translation system, translation issues are likely to occur
unless the system is able to correct these mistakes before analysing the source
sentence. In this case a normalization dictionary can be extremely useful to correct
common spelling mistakes that would affect all target language pairs (such as teh
> the) instead of creating an incorrect entry in a language-specific dictionary (e.g.
teh > le in an English-French dictionary would not be useful because the could
also be translated as la or les depending on the context). Grammatical corrections
can also be attempted when creating a pre-processing module for a rules-based
system. For instance, it is possible to automatically change the inflection of a verb
(say in terms of person or tense) using morphological resources.
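
A very simple pre-processing module could take the form of a normalization dictionary applied with regular expressions before the text is submitted to the MT system, as sketched below. The entries shown are of course only examples; a real normalization dictionary would be built from an analysis of the errors actually found in the source content.

import re

# Example entries of a normalization dictionary: source-language fixes
# that benefit every target language pair
NORMALIZATION = {
    r"\bteh\b": "the",
    r"\brecieve\b": "receive",
}

def normalize(source_text):
    for pattern, replacement in NORMALIZATION.items():
        source_text = re.sub(pattern, replacement, source_text, flags=re.IGNORECASE)
    return source_text

print(normalize("Click OK to recieve teh file."))
# Click OK to receive the file.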
Dictionary creation is often regarded as the most important step in the
customization process of an existing general purpose MT system. While this
step can be long and demanding if extensive linguistic information is required
to create dictionary entries (e.g. providing morphological information about the
source term and/or target term), it can be sped up when clues are provided to the
system assuming the system is able to handle these clues properly (Senellart et al.
2003). Allen (2001) suggests a dictionary coding workflow that is based on the
following steps:

1 Translating the source content with a baseline machine translation system.
2 Identifying various problematic terms.
3 Creating dictionary entries for unknown words, words to preserve,
mistranslated words and short expressions.
4 Re-translating the source content with the machine translation system,
customized with the entries from step 3.

While some time will be spent on these steps, Allen (2001) argues that it is
time well spent before post-editing is actually started if translation productivity
gains are to be achieved.
Finally, a post-processing module may be used to automatically correct (or
post-edit) the output of a machine-translation system. Several approaches exist
to accomplish this goal, ranging from rules-based to statistical. The concept of
automated post-editing was first presented by Knight and Chander (1994) and
further explored by Allen (1999) with a view to fixing systematic errors committed
by an MT system. When these MT errors cannot be fixed with dictionary entries,
they may be fixed using global search and replace patterns and regular expressions
(Roturier 2009).
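
A rules-based automated post-editing step can be sketched in a similar way: an ordered list of search-and-replace patterns is applied to the raw MT output to correct recurring errors. The two rules below are purely illustrative examples for English-to-French output; real rule sets are derived from an analysis of the errors a specific system actually makes.

import re

# Illustrative post-editing rules, applied in order to the raw MT output
POSTEDITING_RULES = [
    (r"\b[Ll]a API\b", "l'API"),  # elision is required before a vowel
    (r"\s+([.,])", r"\1"),        # remove spurious spaces before . and ,
]

def postedit(mt_output):
    for pattern, replacement in POSTEDITING_RULES:
        mt_output = re.sub(pattern, replacement, mt_output)
    return mt_output

print(postedit("Consultez la API pour plus de détails ."))
# Consultez l'API pour plus de détails.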
The statistical methods used for machine translation are briefly covered in the
next section.

5.5.2 Statistical machine translation


Several data-driven (or corpus-based) approaches to machine translation exist,
including example-based and statistical machine translation. Most current data-
driven MT systems, however, are based on the statistical paradigm whereby a
translation is generated using a probabilistic approach relying on information
extracted from existing parallel texts. Very often, if not always, these parallel
texts originate from translation memories that have been created over time by
human translators. So while the final translation step performed by a
statistical machine translation (SMT) system is automatic, it could not happen
if human translations were unavailable. Additional details on how this approach
works can be found in Hearne and Way (2011) and Koehn (2010b).
This section provides a high-level overview of one of the SMT approaches:
the phrase-based statistical machine translation paradigm proposed by Koehn et
al. (2003). In phrase-based machine translation, the source and target alignments
are between contiguous sequences of words, while in hierarchical phrase-based
machine translation or syntax-based translation, more structure is added to
the alignment. For example, a hierarchical MT system is able to learn that the
German phrase hat X gegeben corresponds to gave X in English, where the Xs
can be substituted with any German-English word pair (Chiang 2005). The
extra structure used in these systems may or may not be derived from a linguistic
analysis of the parallel data.
The phrase-based SMT paradigm relies on two separate processes: training and
decoding. In order to determine which translation (hypothesis) is the most likely
given a specific source sentence, a module of the statistical machine translation
system, known as the decoder, tries to find an optimal path among thousands of
alternative possibilities. In this paradigm the translation task becomes a search
task whereby alternative possibilities are made of individual words or groups of
words (known as phrases). These phrases are associated with probabilities that
were computed during the training process using large volumes of parallel texts.
In order to join phrases to build a candidate translation, several features are taken
into account, such as the probability of a source phrase given a target phrase or the
probability of a phrase in the target language. These two features model various
aspects of a translation: the first one tries to model the adequacy of a translation
while the second one tries to model how fluent the resulting translation might be.
These two features are the ones used in the Noisy Channel Model (Brown et al.
1993). Another feature may concern the cost of joining words or phrases that are
separated by a specific distance. From a mathematical point of view, all of these
features can be combined using a log-linear approach, by summing weighted log
probabilities (Och and Ney 2002). Weights are useful to model the importance
of a given feature. For example, in a given language pair and/or domain, the
language model used to make the translation fluent may not be as important as the
translation model. These weights are usually set by using a set of sentences that
should correspond to the final translation task. For example, if the ultimate goal
is to translate a user guide pertaining to a mobile banking application, the tuning
should be performed on a set of sentences that closely match the content of such a
guide. The following sections provide more information on the various steps that
are necessary to create a phrase-based statistical machine translation system using
a framework such as the open-source Moses system (Koehn et al. 2007).
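
The way these weighted features are combined can be illustrated with a small
sketch. The feature probabilities and weights below are invented; a real decoder
scores thousands of competing hypotheses in exactly this fashion and keeps the
highest-scoring one.

import math

# Hypothetical feature values for one candidate translation.
features = {
    "translation_model": 0.02,  # p(source phrase | target phrase): adequacy
    "language_model": 0.001,    # p(target phrase): fluency
    "reordering": 0.5,          # cost of moving phrases around
}

# Hypothetical weights obtained during the tuning step.
weights = {
    "translation_model": 1.0,
    "language_model": 0.7,
    "reordering": 0.3,
}

# Log-linear combination: the sum of weighted log probabilities.
score = sum(weights[name] * math.log(value) for name, value in features.items())
print(round(score, 2))  # the decoder keeps the hypothesis with the highest score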

Data acquisition
The first step in building an SMT system is to determine which data to use
to create models that will be used for subsequent translations. For example, a
translator working in the pharmaceutical industry may be interested in creating
a system that will specialize in translating instruction leaflets for medicines.
As a general requirement, such a system must be able to deal with the lexical,
grammatical, stylistic and textual characteristics of this technical text type.
Using parallel data originating from a completely different domain or text type
(e.g. sports news) would therefore be almost useless since sports news terms
(and their associated translations) would be unlikely to appear in instruction
leaflets. There can be exceptions, of course, when sports news materials refer
to medicines used by athletes in certain contexts (e.g. in a doping scandal),
but in general the two domains and text types would be too different to
provide sufficient overlap. One should not forget that the phrase-based SMT
approach relies on phrases rather than structures, so phrases must have been
seen at least once if they are to be translated in the target language. Once a
precise translation scenario has been identified, it is possible to start looking for
relevant training materials.
Most (if not all) statistical machine translation systems expect a set of parallel
sentence pairs in order to compute alignment probabilities between source and
target segments. Translation memories are of course good sources to find such
sentence pairs but having access to translation memories that are large enough
to be useful can be a challenge. For example, the LetsMT! service recommends
at least 1 million parallel sentences for training a translation model and at least
5 million sentences for training a language model.38 These recommendations
are based on productivity tests showing productivity increases when larger
training sets are used (Vasiļjevs et al. 2012). Even for a freelance translator
who has worked for a number of years in a specific field of specialization, these
are quite large numbers if they only take into account the translation memories
they have built over the years. For many, including larger language service
providers or corporate users, it is therefore necessary to leverage other data
sources to supplement a default set of translation memories.
Various types of data sources exist, ranging from open to closed. Open data
sources include the aforementioned OPUS corpus. The SMT community also
organizes some translation competitions from time to time and they often make
data sets available in an open manner.39 Some of these data sets may be useful
in bulk or in parts to supplement existing translation memories. Other data
repositories operate in a closed approach, whereby data is only made available
to members (who may or may not have to pay subscription or download fees
to make use of the data). One such repository is hosted by the Translation
Automation User Society.40 This system requires members to upload data
in order to be able to download specific data sets. Some services operate
in a hybrid manner whereby public and private translation memories are made
available (e.g. MyMemory).41 As mentioned in Section 5.3, the assumption
that translation memories contain high quality translations is not always true,
especially if translation memories are not maintained over time.

Data processing
Once relevant data has been identified, it must be converted into a format that
is compatible with the tools that will be used to build the models. In some
cases, parallel data is not available at the segment level, but at the document
level. For instance, some Web sites may contain relevant document pairs when
they have been localized into at least one target language. Some data processing
tools specialize in the acquisition of such Web sources, using some heuristics to
transform such documents into smaller parallel units as described in Smith et al.
(2013) and Bel et al. (2013).
Parallel data need to be prepared before they are used in training. This
involves tokenizing the text and converting tokens to a standard case. Some
heuristics may also be used to remove sentence pairs that seem to be
misaligned, as well as overly long sentences. All of these steps are necessary to ensure that
reliable alignment probabilities are extracted from the training data. For
instance, case standardization is used to ensure that word or term variants
do not dilute the probability of an alignment. If the training data contained
multiple variants of a source word (e.g. email and Email), probabilities
would be shared among these variants, thus possibly resulting in less reliable
translations.
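
A crude version of such a preparation step might look like the following sketch.
The whitespace-based tokenization is deliberately naive and the thresholds are
arbitrary; toolkits such as Moses ship with more robust cleaning scripts.

def clean_pair(source, target, max_length=80, max_ratio=3.0):
    """Lowercase, roughly tokenize and filter one sentence pair."""
    source_tokens = source.lower().split()
    target_tokens = target.lower().split()
    if not source_tokens or not target_tokens:
        return None  # discard empty segments
    if len(source_tokens) > max_length or len(target_tokens) > max_length:
        return None  # discard overly long sentences
    ratio = len(source_tokens) / len(target_tokens)
    if ratio > max_ratio or ratio < 1 / max_ratio:
        return None  # discard probable misalignments
    return source_tokens, target_tokens

print(clean_pair("Send us an Email today.", "Envoyez-nous un email aujourd'hui."))
print(clean_pair("Click Save.", "Une très longue phrase qui ne correspond pas du tout au segment source."))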
Obviously some of these steps are language-dependent. For example, some
rough tokenization can be achieved for languages such as English by relying on
a small number of rules (using word spaces, punctuation marks and a small list
of abbreviations). For languages such as Chinese or Japanese, however, these
techniques will not work since these languages do not use spaces to separate
words. Instead, advanced dictionary-based word segmenters are required, which
may have an impact on the performance of the system (in terms of speed). For
languages that make heavy use of compounds (e.g. German), it is also often
preferable to use decomposition rules to make sure that good word alignment
probabilities are extracted. This is due to the fact that long, complex words tend
to appear less frequently than shorter words.

Training
The training is divided into two main parts: the training of the translation model
and the training of the language model. In order to train a translation model,
word alignment information must be extracted from sentence pairs. Once
these parallel sentences have been pre-processed, they can be word-aligned,
using a tool such as GIZA++ (Och and Ney 2003), which implements a set
of statistical models developed at IBM (Brown et al. 1993). Within the SMT
framework all possible alignments between each sentence pair are considered
and the most likely alignments are identified. These word alignments are then
used to extract phrase translations, before probabilities can be estimated for
these phrase translations using corpus-wide statistics.
The next step consists in training a language model, which is a statistical
model built using monolingual data in the target language. Since a language
model provides the likelihood that a target string is actually a valid sentence
in a given language, it offers a model of the monolingual training corpus and
a method for calculating the probability of a new string using that model. This
model is used by the SMT decoder to ensure the fluency of the translation output.
Moses relies on external toolkits for language model building, such as IRSTLM
(Federico et al. 2008) or SRILM (Stolcke 2002). One important factor to take
into account when building a language model is the maximum length of the
substrings (in terms of number of words or tokens) that should be used when
estimating probabilities. Such sequences of words (or tokens) are known as
n-grams, where n corresponds to the length of the sequence (e.g. two for a bigram). In
order to be able to differentiate between fluent and disfluent sentences, it is often
necessary to build models that rely on longer substrings from the training corpus.
While sequences of two or three words tend to be more useful than sequences
of one word, longer sequences suffer from a major problem: their low frequency in the training data. It
is, however, possible to combine multiple language models built using different
string lengths in order to balance the need for flexibility and context sensitivity
(Hearne and Way 2011).
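
The following sketch shows how bigram probabilities could be estimated from a
tiny, invented monolingual corpus. Real toolkits such as IRSTLM or SRILM work
with much longer n-grams and add smoothing techniques to deal with sequences
that were never observed during training.

from collections import Counter

corpus = [
    "click the save button",
    "click the cancel button",
    "press the save button",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_probability(previous_word, word):
    """Maximum likelihood estimate of p(word | previous_word), no smoothing."""
    return bigram_counts[(previous_word, word)] / unigram_counts[previous_word]

print(bigram_probability("the", "save"))    # seen twice after 'the'
print(bigram_probability("the", "cancel"))  # seen only once after 'the'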

Tuning
Tuning is the slowest part of the process of building an SMT system even
though it only requires a small amount of parallel data (e.g. 2000 sentences).
This step is used to refine the weights that should be used to combine the
various features of an SMT system. In the previous section, the focus was on
two of these features: the translation model and the language model, but other
features are often used, such as a word penalty to control the length of the
target sentence (Hearne and Way 2011). The tuning process tries to solve an
optimization problem by using a set of sentences corresponding to an ideal
scenario. In this scenario, each sentence is associated with a good translation
(or possibly a set of good translations) so various weight combinations can be
tried and evaluated in order to determine the one that will produce translations
that are the closest to the reference translations. Such a technique is known
as the Minimum Error Rate Training (MERT) which was proposed by Och
(2003). The reliability of this technique is highly dependent on the method
that is used to determine whether two translations are close to one another.
As mentioned in Section 5.3, translations that are semantically equivalent or
related are not always close at a lexical level. Finding a reliable metric that
captures both meaning and structure acceptability is therefore an open research
question. This challenge is due to the fact that human evaluation itself is often
not 100 per cent reliable due to the many possible translations (Arnold et al.
1994). A number of metrics have been proposed over the years to try to address
the problem of evaluating machine translation in an automatic manner, as
discussed in the next section.

Evaluation
In order to bypass the alleged issues that are inherent to human evaluations
(i.e. cost, time), several automatic evaluation methods have been developed in
the last number of years. Most of these automatic evaluation methods focus on
the similarity or divergence existing between an MT output and one or several
reference translations. Generally the scores produced by these MT metrics are
meaningful at the corpus level (i.e. by generating a global score for a tuning
or evaluation set), rather than at the segment level. Examples of automatic
metrics include BLEU (Papineni et al. 2002), Meteor (Denkowski and Lavie
2011), HTER (Snover et al. 2006) or MEANT (Lo and Wu 2011). While
all of these metrics try to provide an assessment of the quality of translations
produced by MT systems, they differ in the aspects of translation quality that
they actually capture. For instance, BLEU focuses on
the overlap of n-grams (i.e. sequences of words) between the MT output and
the reference translations, thus being more informative about the fluency of a
translation than about its adequacy. Meteor is a tuneable metric that, by using
external resources, tries to address some of the weaknesses of BLEU (which
relies on surface forms). These resources include synonyms, paraphrases and
stemming, which are used to avoid penalizing good translations that are not
close to the reference translation from an edit distance perspective. HTER’s goal is
different since it measures the amount of editing that a human translator would
have to perform to transform an MT output into a valid reference translation
(by counting edit types such as insertions, deletions, substitutions and shifts).
Finally, MEANT evaluates the utility of a translation by matching semantic role
fillers associated with the MT output and reference translations, with a view to
capturing the semantic fidelity of a translation (instead of its lexical proximity
with a reference translation). Automatic evaluation metrics are often said to be
an inexpensive alternative to human evaluation (Papineni et al. 2002). However,
new sets of data require reference translations, which might be more expensive
to produce than performing a manual evaluation of the MT output, especially if
several reference translations are required to make the results more reliable. The
approach suggested by MEANT somewhat alleviates this requirement since it
relies on annotations provided by untrained monolingual participants. Despite
all of the research work that has been done in the area of machine translation
evaluation, no solution can provide a perfect way to gauge the quality of
individual translations. These approximations, however, can be used during the
tuning process to attribute weights to components of an SMT system and give
SMT developers a way to check whether their changes are bringing about some
improvements. While most of these tools do not have their own graphical user
interface, the Asiya Online toolkit provides an easy way to generate multiple
scores once files have been uploaded.42
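
For readers who prefer a script to an online service, libraries exist that
implement some of these metrics. The sketch below assumes that the NLTK library
is installed and uses its BLEU implementation on a single invented segment pair,
although BLEU scores are really only meaningful when computed over a whole test
set.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["click the save button to store your changes".split()]
mt_output = "click on the save button to save your changes".split()

smoothing = SmoothingFunction().method1
score = sentence_bleu(reference, mt_output, smoothing_function=smoothing)
print(round(score, 3))  # 1.0 would indicate a perfect n-gram match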
In some cases, however, relying on a corpus-level score is not sufficient to
understand why an MT system generated a given sentence, or whether some
source modifications can have an impact on the MT output. In these situations,
different tools are required to visualize aligned sentences at the word level.
The X-ray tool is one of these tools, since it leverages some word alignment
information generated by the Meteor metric to identify differences between two
strings (Denkowski and Lavie 2011).

Tools
Some of these steps may seem daunting to people who are new to machine
translation. The good news is that it is now much simpler to build an SMT
system than it was at the beginning of the 2000s, thanks to the huge amount
of work that has been done by the SMT community. For instance, the Moses
framework is equipped with an automated pipeline that allows users to build and
evaluate SMT systems by running a very small number of commands.43 Some
detailed video tutorials are also available to guide users through each of the
steps that may be required to run or re-run specific commands.44 New graphical
tools, such as DoMT, have also recently emerged to hide some of the complexity
associated with some of the training, tuning and evaluation steps, at the desktop
or server level.45
Finally, cloud-based services, such as LetsMT!, KantanMT or Microsoft
Translator Hub, are now also available, almost turning the building of SMT
systems into a one-click process.46, 47, 48 The approach offered by Microsoft
Translator Hub differs from the one proposed by KantanMT and LetsMT! since
the former offers the customization of an existing, generic system while the other
two offer the creation of brand new systems. While the first approach offers
translations for generic words or phrases out-of-the-box, it is unclear how much
additional training data is required to force the translation of specific phrases
or terms. In specialized domains, it is common for some words to take on new
meanings. Occurrences of this new meaning may appear in the additional data
set that is used to customize an existing, generic system, but these occurrences
may not be sufficiently frequent to outweigh the occurrences that were used to
compute the original models. The second approach may offer more control for
the translation of specific domain terms but it is likely to suffer from a coverage
issue if the training data do not fully match the data that should be translated
with newly-built models.

5.5.3 Hybrid machine translation


To conclude this section on machine translation, it is worth mentioning that
numerous combinations of rules-based machine translation and statistical
machine translation are possible. While a true hybrid system would rely on both
rules and statistics to perform specific tasks (e.g. analysis or generation), various
system combinations offer hybrid capability with a view to improving translation
quality. Examples include using two systems in a serial manner, for example using
a rules-based machine translation system to translate between a source and a target
language and then using a statistical machine translation system to refine the
output in the target language. In this case the raw target-language output is translated
into a refined target-language version (Simard et al. 2007). Another approach consists in translating
the same input text using two different systems (e.g. a rules-based system and a
statistical system) in order to replace phrases from the output generated by the
first system with phrases from the other output (Federmann et al. 2010).
System combination, and hybrid machine translation in general, is an active
area of research and one should not underestimate the complexity associated
with such combinations. While such systems have been shown to yield translation
quality gains, their deployment and use involves some investment (time, cost,
effort) that may or may not outweigh these quality gains. In most translation
scenarios using machine translation, it is therefore necessary to rely on a post-
editing step to raise the translation quality to acceptable levels. Post-editing is
the focus of the next section.

5.6 Post-editing
In a machine translation context, the term post-editing (or postediting or postedition)
is used to refer to the ‘correction of a pre-translated text rather than translation
from scratch’ (Wagner 1985: 1). This definition is complemented by that of Allen
(2003: 207), who explains that the ‘task of the post-editor is to edit, modify and/
or correct pre-translated text that has been processed by a machine translation
system from a source language into (a) target language(s)’. As mentioned in the
previous section, the translation that is produced by machine translation systems
(even customized ones) is often not of sufficient quality to be published. Some
editing is therefore required to fix some of the errors that may have been generated
or introduced by a machine translation system. While some of the translation
suggestions generated by an MT system may be perfectly acceptable translations
(i.e. preserving the meaning of the original sentence and using a fluent style in
the target language), many suggestions contain errors that would be noticed by a
native speaker of the target language. The post-editing task differs from the task
of editing translation memory matches, because the target segments proposed
by a translation memory system tend to be fluent. Reading such segments to
identify missing or extra information is therefore not too demanding from a
cognitive point of view. On the other hand, machine translation output can be
extremely disfluent, which increases the cognitive load since post-editors have
to be able to: (i) identify whether some parts of the machine translation output
are worth preserving, (ii) decide how to best transform an incorrect translation
into a correct one. This task becomes even more challenging when ‘post-editors
become so accustomed to the phrasing produced by the MT output that they will
no longer notice when something is wrong with it’ (Krings 2001: 11). In order
to guide post-editors in making editing choices, various post-editing models and
guidelines have been proposed over the years, as discussed in the next section.

5.6.1 Types of post-editing


The concept of rapid post-editing was developed in the European Commission in the
1980s in order to allow translators to perform a minimum amount of corrections
to a text. This decision was made in the context of having a large amount of
documents to make available to readers for gisting purposes. The term gisting is
used to refer to the process of being able to extract some high-level information
from a piece of text. This approach seems to be successful as long as readers agree
to accept low quality translations if those maintain ‘reasonable comprehensibility
and accuracy’ (Wagner 1985: 1). The issue with this approach, however, lies in
the fact that two readers may have a different opinion on whether a text (or text
fragment) is comprehensible. The comprehensibility of a text is obviously linked
to the background knowledge of a given reader. A reader who is familiar with a
given topic, having read numerous articles on this topic, would probably find it
easier to extract the gist of a poorly machine-translated document compared to
a reader who is reading an article on the same topic for the first time. While it
might be possible to define a target reader profile, one can see that the concept of
rapid post-editing is not as straightforward as it first seems.
In order to work around this issue, the term minimal post-editing was introduced
in the 1990s in the industrial sector to address a publishing scenario requiring
translated documentation to be disseminated to readers who may happen to be
customers (Allen 2003: 304). In this scenario, understanding the gist of a document
is no longer sufficient, so the machine-translated text must be thoroughly
corrected to ensure that no translation accuracy issues remain while making
the least amount of modifications. However, this model suffers from the same
flaws as the ones encountered with rapid post-editing. Being able to consistently
determine which edits are critical, necessary, preferential or superfluous is not
straightforward, especially when the expertise of those performing the post-
editing task varies. Allen (2003) mentions that diverging interpretations of these
criteria can lead to situations where too much or too little has been done. The
concept of minimal post-editing is supported by guidelines produced by TAUS
and CNGL in order to achieve good enough quality.49 These guidelines focus on
translation accuracy from a semantic perspective, instructing post-editors to keep
as much of the raw MT output as possible. Such guidelines may seem unnatural
to professional translators, who have been trained to always produce high quality
translations. For this reason, Allen (2001: 26) warns that: ‘[I]t is not uncommon
that post-editors who are asked to conduct Rapid or Minimal PE actually end up
making changes to a pre-translated document which are actually closer to the
Full post-editing side of the spectrum. The simple reason for this is that everyone
wants to produce the best translated document possible, and to retain translation
jobs with the same client in the future, including PE jobs.’
In order to fully avoid the confusion that is inherent with the previous
models and guidelines, the concept of full post-editing was introduced to allow
post-editors to transform MT output into high quality translations. This concept
is also supported by guidelines produced by TAUS and CNGL, based on the
principle that the spelling, grammar, punctuation and syntax should be correct
and that the text should read fine from a stylistic perspective. These guidelines
also mention that as much of the raw MT output as possible should be used. In
some cases, poorly machine-translated segments will take a long time to read,
interpret and modify before they can be turned into human quality translations.
From a post-editing perspective, it can sometimes be challenging to preserve
isolated fragments of segments (around which new text has to be inserted under
constraint) when the whole segment can be deleted using a couple of keyboard
shortcuts in order to create a new segment without any constraints.
Regardless of the model that is in use in a given post-editing project, the post-
editing task involves repetitive and tedious corrections of small mistakes (Wagner
1985). To avoid having to fix repetitive issues from one project to another (or
from one segment to another), a number of approaches for minimizing PE effort
have been proposed. Some of these approaches were presented in the previous
section, including pre-processing, system customization or combination, and
automated post-processing/post-editing. As mentioned by O’Brien (2002), these
tasks require specific skills that are not usually demanded of a translator (i.e.
ability to use macros, to code dictionaries for MT, and a positive attitude towards
MT). As far as translation professionals are concerned, this point is worth
keeping in mind when accepting post-editing jobs. For translation buyers, this is
also worth considering when selecting a post-editing vendor. The next section
focuses on the tools that can be used to perform post-editing tasks.

5.6.2 Post-editing tools


As mentioned in Chapter 4, localization can be performed in context or out
of context. The same approaches apply to post-editing: post-editing can be
performed either in the environment that is used to publish machine-translated
content or using post-editing tools within a localization workflow.
In-context post-editing can be achieved using Web-based dedicated tools that
are tightly or loosely integrated with the platform where the source and target
content is created and published. An example of such an environment is the
wikiBABEL platform (Kumaran et al. 2008), which provides a user interface and
linguistic tools for collaborative correction of the rough Wikipedia content by a
community of users. These tools assist in the creation of improved content in the
target language. A similar, but more generic, approach is used in the ACCEPT
post-editing environment (Roturier et al. 2013) and the Microsoft Collaborative
Translation Framework.50 These tools allow specific users to submit corrections
when they come across ill-formed machine-translated text on Web sites. These
corrections may then be used to display an improved version of a document to
new visitors of these Web sites.
Concerning post-editing tools within a localization workflow, the situation is
slightly more complex. Even though post-editing was introduced as a translational
activity in the 1980s, it is still an active area of research since many questions
remain unanswered (e.g. how can post-editing work be used to improve an MT
system consistently?). For this reason, many post-editing prototype environments
have been created over the years. The main goal of these dedicated post-editing
tools is to study the work of post-editors, for example by recording post-editing
actions with key-logging or eye-tracking software. Examples of such tools include
Translog II (Carl 2012), PET (Aziz et al. 2012), CASMACAT (Elming and Bonk
2012) or MateCat.51 While these tools could in theory be used to perform actual
post-editing tasks, they are often not equipped with all of the functionality that
makes professional translators productive (e.g. spell-checker, translation memory
lookup, dictionary search, concordancer, predictive translation).
According to Moorkens and O’Brien (2013), professional post-editing still
tends to be performed using desktop-based tools designed for editing human-
generated translations, such as translation memory or translation environment
tools. This is particularly true in translation workflows that integrate work from
a human post-editor in a serial process, whereby the MT system provides draft
translations which must be validated and/or edited by a post-editor. In this
scenario, there is no direct interaction between the MT system and the human
post-editor, which means that the MT system cannot immediately benefit from
human translation skills. In this scenario, the post-editor may also be prevented
from getting the best out of the MT system because they can only see one draft
translation (which may not be the best translation the MT system could have
generated if some of the post-editor’s insight had been taken into account).
An alternative to the serial workflow is the interactive machine translation
(IMT) paradigm described in Langlais and Lapalme (2002), Casacuberta et
al. (2009) and Barrachina et al. (2009). In this paradigm, an MT engine is
tightly integrated into a post-editing environment, allowing this engine to
look for alternative translations each time the post-editor modifies the MT
output. MT technology is thus used to generate target sentences, which can be
interactively accepted or edited by a human translator. The MT engine leverages
the modifications made by the post-editor to produce improved translations,
providing candidate completions of the sentence being translated. Interactive
machine translation builds on the statistical MT framework (Koehn 2010a) by
using an SMT system that automatically generates an initial translation for each
source sentence. A post-editor verifies this translation from start to finish, fixing
the first error. The SMT system then proposes a new sentence ending, after taking
the correct sentence start into account. These steps are repeated until the whole
sentence has been correctly translated. This paradigm is somewhat related to the
technique of predictive typing that can be found in some translation memory
tools. While a translation memory’s predictive typing technology leverages a
translator’s input (based on a number of characters typed) to suggest likely word
completions using reference words found in terminology or non-translatable
lists, IMT suggests sentence completions by generating alternative sentence
translations. Both approaches must take usability factors into consideration to
be useful, since providing a post-editor with a new prediction whenever a key is
pressed has been shown to be demanding from a cognitive perspective (Alabau
et al. 2012).
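
The interaction loop can be approximated with a toy example: given an n-best
list of candidate translations for one source sentence (invented below), the
environment proposes the completion of the best candidate that is compatible
with the prefix already validated or corrected by the post-editor. A real IMT
engine would instead search its full hypothesis space for a new best completion.

# Hypothetical n-best list, best candidate first.
n_best = [
    "cliquez sur le bouton enregistrer pour continuer",
    "cliquez sur le bouton sauvegarder pour continuer",
    "cliquer le bouton enregistrer pour continuer",
]

def suggest_completion(validated_prefix, candidates):
    """Return the ending of the best candidate that matches the validated prefix."""
    for candidate in candidates:
        if candidate.startswith(validated_prefix):
            return candidate[len(validated_prefix):]
    return None  # no compatible candidate in this simplistic sketch

# The post-editor has validated the sentence up to (and including) a correction.
print(suggest_completion("cliquez sur le bouton sauvegarder", n_best))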
5.6.3 Post-editing analysis
Once post-editing has been performed, it can be useful to analyse at a surface level
what has been modified to turn the original MT suggestion into a final translation.
This analysis step can be useful whether the edits have been contributed by a
third-party or by the translators themselves. In order to analyse edits between
two documents or two sets of segments, a comparison approach must be used.
Comparing two texts can be done in a number of ways. Without delving into
the details of the algorithms available to perform such comparisons, it is worth
mentioning that a number of machine translation evaluation metrics may be
used for analysis purposes. For example, the TER metric that was introduced
in the section on machine translation can output details about the possible
number of shifts, substitutions and insertions that may have been performed to
transform one segment into another. When analysing text transformations that
may have happened during the post-editing process at a surface level, one of
the objectives is to visualize what is different between two segments. One way
to visualize this information is to rely on colours to highlight either words or
characters that appear in the final translation but that were not present in the
original suggestion. For example, the SymEval tool allows users to compare two
or three files that have been used during a translation process.52 Figure 5.4 shows
the interface that can be used to select input files (such as text files, TMX files or
XLIFF files), containing two sets of translations that should be compared. These
two sets of translations may correspond to a set of machine-translated segments
and a set of post-edited segments.

Figure 5.4 The SymEval interface
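
Before turning to the reports produced by such tools, the underlying word-level
comparison can be sketched with Python's standard difflib module. The MT
suggestion and its post-edited version below are invented for illustration.

import difflib

mt_suggestion = "Cliquer le bouton pour sauvegarder les changements".split()
post_edited = "Cliquez sur le bouton pour enregistrer les modifications".split()

for token in difflib.ndiff(mt_suggestion, post_edited):
    # Lines starting with '- ' mark words removed from the MT output,
    # '+ ' marks inserted words, '  ' marks words that were kept, and
    # '? ' lines provide character-level hints for similar words.
    print(token)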
SymEval can generate an XML report that contains segment-level differences
and scores generated using the General Text Matcher evaluation metric (Turian
et al. 2003). The report highlights differences in a colour-coded manner between
two sets of segments. While this approach can be used to analyse changes performed
between a set of MT segments and their corresponding final translations, it can
also be used to compare two translations generated by two different (machine)
translators. This type of tool can therefore be used for quality assurance purposes,
which is the topic of the next section.

5.7 Translation quality assurance


Translation quality assurance in localization is a complex topic because it is not
always straightforward to determine where the difference lies between linguistic
issues and functional issues that may result from the introduction of translated strings
in a multilingual application. As mentioned in Chapter 3, internationalization
techniques can be used to ensure that the amount of quality assurance work is
minimized (or even eliminated) once an application has been localized. For
instance, the use of flexible layout dimensions in source design and code reduces the
risk of clipping UI textual strings whose length may have been increased during the
translation process. When internationalization principles are not adhered to during
the initial development process, however, the risk of having to perform additional
quality assurance work increases. Localization quality assurance work differs from
translation quality assurance work because translation-related issues (such as string
expansion) cannot always be anticipated by a translator. While it is the translator
or translation reviewer’s responsibility to ensure that produced translations meet
the quality expectations expressed by the translation buyer, it is not necessarily
the translator’s responsibility to anticipate string concatenation or clipping issues,
especially when no context has been provided. Before delving into the techniques
that can be used during the translation quality assurance process(es), it is worth
pausing for a moment to identify the actors within these processes.

5.7.1 Actors
The following actor types are among the most common ones in translation
processes:

• translation buyers
• language service providers
• translators
• translation revisers
• in-country reviewers
• translation users.

For example, the EN15038 Translation Services Standard published in 2006
by the European Committee for Standardization (CEN) lists translators and
revisers as essential actors in a certified translation process, whereas reviewers
and proofreaders are optional actors depending on what was agreed with the
translation buyer.53, 54
Obviously, professional translators who have been assigned to work on a
translation project by a customer (e.g. an application publisher or a language
service provider) are (or should be) responsible for checking the quality of their
translations. If translators are paid (fairly) to deliver high quality translations
based on specific guidelines, their translation work should be divided into one
or more translation passes, followed by a quality assurance review. The focus of
this reviewing task will be quite broad since mistranslations, typos, terminology
omissions, or file corruptions may have happened during the translation
phase(s).
If a language service provider has been commissioned for a translation task,
which they have assigned to a third-party (freelance) translator or smaller language
service provider, they are likely to conduct a translation quality assurance pass
before delivering the translations to the customer. This pass is sometimes referred
to as an editing and proofreading pass in a TEP (translation, editing, proofreading)
workflow. This pass is essential in large projects that may involve more than
one translator, especially if the translators have not communicated throughout
the project to discuss and agree on how to interpret any translation guidelines
that may have been provided. The work involved in harmonizing translations
should not be underestimated, especially if new translators are being used to work
alongside more experienced translators. As pointed out by Nataly Kelly in an
article entitled ‘Ten Common Myths about Translation’, more translators will not
result in better quality.55 The focus of the task is on content harmonization in
order to make sure that the final text is both coherent and consistent. Obviously
surface-level errors and mistranslations can still be found, but these should have
been eliminated by the translators. File validation is also a key element of this
task if multiple files have had to be merged to deliver the translated assets back
to the translation buyer.
Translation buyers may be in a position to perform some translation quality
checks, especially if they have invested time and energy in creating style guides
and terminology glossaries. The focus of this task is to ensure that the target files
are valid and not corrupted, and that the instructions have been followed, by
possibly sampling some of the translated material (instead of reviewing all of it).
Some translation accuracy checks may also be in order in this step, ideally using
domain experts who can verify the accuracy of translated content (especially if
the content is of a technical nature).
In-country reviewers may not be the ones who requested the translations in
the first place. This is especially the case in large organizations, where a global
project manager may order translations in target languages for which they have
very little or no expertise. In-country reviews tend to focus on style and language
fluency since most mistranslations should have been caught in earlier steps of a
localization workflow. Instead of working on intermediary files or tools that have
been used in these steps (e.g. XLIFF files, translation management systems), in-
country reviewers tend to check final documents or applications. The main goal
of this task is to ensure that the user experience (or look and feel) that is supported
by a number of translated strings or documents corresponds to local expectations.
Ideally, users or readers should not notice that they are consuming translated
content, so the role of the in-country reviewer is to ensure that the style of the
content will please (or even delight) the target demographic(s) for which the
content was commissioned.
Finally, translation users may be involved at some level to conduct quality
assurance tasks. It is more and more frequent for application publishers to run
beta programs during which they ask users to report any issue that may affect
the quality of a given application. While the focus of these programs is often on
functionality, user experience issues due to mistranslation can be reported. As far
as Web content is concerned, it is not uncommon for Web pages to be equipped
with feedback forms that allow users to leave comments on translated content.
While the focus of these feedback forms is usually centred around usefulness and
relevance, language issues may be reported through this channel.
The following sections focus on translation quality assurance techniques that
can be used to check the quality of translated text (rather than validating file
formats). These techniques, which will be of interest to most actors apart from
translation users, fall into four main categories: manual, rules-based, statistical,
and machine learning-based. Some checks can be based on rules to detect
translated segments whose length is at odds with the length of source segments
(e.g. a two-word segment may be extremely unlikely to be translated by a 20-word
segment given a specific language pair). Other checks can be based on statistics,
for instance to identify translated sentences whose style differs widely from the
style used in previous translations. Finally, recent work using machine learning
techniques has focused on estimating the quality of a translation (or at least
trying to predict some of its characteristics, such as how fluent it is or how much
editing would be required to make it acceptable).

5.7.2 Manual checks


While the automated testing presented in Section 4.2.4 can go a long way
in identifying functionality-related bugs, it is not adequate for identifying
mistranslation issues. To work around these limitations, the in-context localization
approach presented in Section 4.2.8 may be used. Another approach consists
in using style guides in order to define the characteristics of a final translated
document. This approach is similar to the approach that was described in Section
4.3.5 focusing on translation guidelines. Such an approach can be effective in
defining a thorough set of instructions reviewers should follow to identify (and
possibly fix) issues that may have happened during the translation process. Very
often reviewers tend to be immersed in the target locale in order to fine-tune
translated text to match the expectations of a pre-defined audience. When
reviewers are asked to validate or evaluate translations against style guides,
they often rely on translation samples and error typologies so that errors can be
classified per category (e.g. grammar, spelling, fluency) and severity (e.g. minor
or major). A recent survey of common error typologies used in the IT domain
found that the LISA QA Model (version 3.1) was still extremely popular
(O’Brien 2014).56 For a full discussion of quality in professional translation,
readers should refer to Drugan (2014). Counting errors manually, however,
can become tedious in practice, even when a sampling strategy is used.57 This
is why various automated approaches, using either rules, statistics or machine
learning, are often used to speed up the translation quality assurance or
evaluation process.

5.7.3 Rules-based checks


Experienced translators or translation reviewers can quickly notice incorrect or
inadequate translations based on various characteristics of the target (translated)
text. For instance, an unusual number of spelling and grammar mistakes may
indicate that the translator was not a native speaker of the target language or that
the work was completed in a hurry. More advanced checks by a domain expert
may reveal that mistranslations are present in the target text, suggesting that
the translator did not have sufficient domain knowledge. Such characteristics of
the target text can be considered as violations of certain norms or conventions.
Whenever these norms are regular, rules can be defined to check that the target
text does not display specific characteristics, by possibly taking the source text
into account. Examples of such characteristics include spelling, grammar, style,
terminology and format.
Numerous free and commercial (graphical) tools exist to perform these
checks, including (but not limited to) ErrorSpy, QA Distiller, ApSIC Xbench
and CheckMate.58, 59, 60, 61 Quality checking functionality can also be integrated
in online tools, such as Transifex, as shown in Figure 5.5.
Most of these tools follow a similar number of steps:

• Define checks to perform (e.g. spelling, consistency, custom checks).
• Select files to check (e.g. a bilingual file such as a TMX file).
• Analyse the files to find violations based on pre-defined checks.
• Generate a list of violations (if any) found in the file(s).
• Give the user the possibility to modify the file(s) and/or adjust the check
settings before repeating the analysis.

As shown in Figure 5.5, some of the checks are specific to user interface
strings. For instance, some checks look for the presence of new line characters
or variable sequences in the translation. Other checks relate to characteristics of
markup content, such as the presence of URLs or HTML tags. These checks can
be crucial because the absence of such entities in the translation may result in
reduced functionality, or worse, in a broken application.
This process can be extremely useful to identify those files that contain high
priority violations. These tools usually give users the possibility to define the
severity of the problems based on their requirements, making it easy to select only
those checks that are relevant for a given project. Examples of checks include
repetitions of words or spaces, corrupted characters, differences in terms of inline
codes or tags, or missing translations. Figure 5.6 shows a list of violations obtained
with the CheckMate tool.62

Figure 5.5 Selecting check options in Transifex

Very often rules must be tweaked to deal with domain or project
characteristics in order to avoid false positives. The process used to adjust rules is
similar to the one described in Section 3.4.6. For instance, CheckMate can be
configured to leverage the rules offered by LanguageTool. Instead of checking
text in a monolingual context, translated texts can be checked using bilingual
rules, such as rules detecting false friends only when both the source and the
target contain the false friends terms.63 Such bilingual checks can be extremely
powerful in order to detect violations in a context-sensitive manner (e.g. a
translated sentence must not contain the phrase XYZ if the source sentence
contains W). CheckMate also gives the user the possibility to remove pre-
defined patterns and to create new ones using regular expressions, as shown in
Figure 5.7.

Figure 5.6 Checking a TMX file with CheckMate

Figure 5.7 Configuring CheckMate patterns using regular expressions

This approach can be extremely useful to check file formats that may be using
specific patterns. For example, the reStructuredText format in Section 4.3 uses a
notation that may not be covered by existing checking tools out-of-the-box.64
Source content written using this format, however, may have to be translated
using an internationalization mechanism such as gettext.65 During the
translation process, it can be easy to break inline tags so having an automated
way to check for their presence in the target text during a revision phase can be
useful.
While some graphical tools can be extended using custom patterns, complex
checks sometimes require the creation of small scripts. This is especially the case
when reference files or systems must be consulted to perform the checks in batch
mode (i.e. when checking a large number of files). For instance, it is common
to check that the translation of User Interface strings is consistent across an
application (whether the UI strings are present in the source code or referenced
in the accompanying documentation content). Validating the consistency of
UI strings in documentation is obviously much easier when these strings have
been clearly marked in the source content, but this is not always the case. In
this situation, some heuristics may be required to extract and validate them as
described in Roturier and Lehmann (2009).
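
A minimal custom check of this kind could be scripted as follows. The segments
and the notion of what counts as a placeholder are invented for illustration; a
real script would read TMX or XLIFF files and cover many more patterns.

import re

# Variables such as %count or {0} and inline tags such as <b> are treated as
# placeholders that must survive the translation process.
PLACEHOLDER = re.compile(r"%\w+|\{\d+\}|<[^>]+>")

def check_placeholders(source, target):
    """Return placeholders present in the source but missing from the target."""
    return sorted(set(PLACEHOLDER.findall(source)) - set(PLACEHOLDER.findall(target)))

segments = [
    ("Click <b>Save</b> to keep %count changes.",
     "Cliquez sur Enregistrer pour conserver les modifications."),
    ("Press {0} to continue.", "Appuyez sur {0} pour continuer."),
]

for source, target in segments:
    problems = check_placeholders(source, target)
    if problems:
        print("Missing in target:", problems)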
While rules-based checks can be powerful, they have to be hand-crafted,
which can be quite tedious. An alternative to this approach is to rely on statistics.

5.7.4 Statistical checks


Another translation quality assurance approach consists in using statistics to
determine whether the style of a translated text is similar or consistent with
the style of previous translations. As mentioned in the section on machine
translation evaluation metrics, various methods exist to define the stylistic
similarity or consistency between two texts. These methods include
counting the occurrences (and overlap) of n-grams (either at the word or
character level) in the newly translated text and the previously translated
texts. Other approaches compute similarity by transforming texts into word
vectors and using distance metrics such as cosine. Such methods tend to be
computationally expensive, which is why they tend to be provided as an online
service, such as Review Sentinel by Digital Linguistics.66 This type of service
allows reviewers to focus on sections of translated documents that look the least
like existing (translated) documents, based on the assumption that the stylistic
differences are probably good indicators of translations that do not conform to
stylistic conventions.
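
The underlying idea can be illustrated with a small sketch that turns two
invented segments into word-count vectors and computes their cosine similarity.
Production services obviously work at a much larger scale and with more
sophisticated text representations.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    vector_a = Counter(text_a.lower().split())
    vector_b = Counter(text_b.lower().split())
    shared_words = set(vector_a) & set(vector_b)
    dot_product = sum(vector_a[word] * vector_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count ** 2 for count in vector_a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in vector_b.values()))
    return dot_product / (norm_a * norm_b)

previous = "cliquez sur le bouton enregistrer pour conserver vos modifications"
candidate = "il faut cliquer sur le bouton sauvegarder afin de garder vos changements"
print(round(cosine_similarity(previous, candidate), 2))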

5.7.5 Machine learning-based checks


A third method that can be used during a translation quality assurance process
is one that is based on supervised machine learning techniques, namely quality
estimation or confidence estimation (Blatz et al. 2004). In natural language
processing, quality estimation involves predicting a nominal or numerical
value without having access to a reference translation. From a translation
perspective, this means trying to predict a specific characteristic of a translation
(for example, how comprehensible it might be for a given audience, how fluent
it might be, or how much post-editing effort might be required to turn it into
an acceptable version), by taking into account various elements, known as
features. The general process is similar to the one that was described in Section
4.3.2 focusing on identifying sentence boundaries in text.
The features used in translation quality estimation tend to fall into multiple
categories, ranging from linguistic or statistical characteristics of the source text
and the target text (e.g. number of source and target words, probability of a source
text given a language model), to translator-dependent information. Examples
of features are available from the Quest quality estimation framework.67, 68, 69, 70
Translator-dependent information varies depending on whether the translation
was generated by a machine-translation system or a human translator. For
example, the time taken to produce a translation may not be as relevant for a
machine-translation system as it is for a human translator. These features are
perceived as predictive parameters that can then be combined with machine
learning methods to estimate binary values (e.g. true or false to answer a question such
as is this a good translation?), multi-class values (e.g. to answer a question such as how
comprehensible is this translation on a scale of 1 to 5?) or continuous scores (e.g.
to answer the question what is the BLEU score of this translation?). Translation
quality estimation has mostly focused on translations generated by MT systems,
first at the word level and then at the sentence level.
Once features have been identified, they need to be extracted or computed
from the training data, which include the actual translation pairs, potential
additional metadata, as well as the values or labels that should be predicted
by the estimation system. Feature extraction can be a slow and complex
process depending on how difficult it is to compute a particular value. While
counting the number of punctuation characters in source and target texts is
reasonably straightforward, calculating perplexity scores may involve creating
language models in the first place. Perplexity, which is a term originating from
information theory, refers to how well a probability model is able to predict a
sample. When language models are evaluated on test samples, a high perplexity
score indicates that the language model is ‘surprised’ by the sample (e.g. possibly because of
the presence of unknown or out-of-vocabulary words). Once all features have
been extracted (and possibly normalized to avoid range inconsistencies), they
can be passed to a machine-learning algorithm to build a prediction system. As
far as learning algorithms are concerned, several have been tried, with support
vector machine and decision tree learning proving popular (Callison-Burch et
al. 2012). Once a model is built, it can be used on new data to predict a label,
class or score once the new data have been transformed into feature values that
correspond to what the model expects.
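
A drastically simplified sketch of this pipeline is shown below, assuming that
the scikit-learn library is installed. Two invented features (length ratio and
punctuation difference) are extracted from a handful of invented segment pairs
labelled as acceptable (1) or not (0), and a decision tree is trained to predict
the label of a new pair.

from sklearn.tree import DecisionTreeClassifier

def extract_features(source, target):
    """Two toy features: token length ratio and punctuation count difference."""
    length_ratio = len(target.split()) / len(source.split())
    punctuation_difference = abs(sum(c in ".,;:!?" for c in source) -
                                 sum(c in ".,;:!?" for c in target))
    return [length_ratio, punctuation_difference]

# Invented training data: (source, machine translation, acceptable?).
training_data = [
    ("Click Save.", "Cliquez sur Enregistrer.", 1),
    ("Click Save.", "Cliquez.", 0),
    ("Restart the application to apply the changes.",
     "Redémarrez l'application pour appliquer les modifications.", 1),
    ("Restart the application to apply the changes.",
     "Redémarrer application appliquer", 0),
]

features = [extract_features(src, tgt) for src, tgt, _ in training_data]
labels = [label for _, _, label in training_data]
model = DecisionTreeClassifier(random_state=0).fit(features, labels)

new_pair = ("Select a folder.", "Sélectionnez un dossier.")
print(model.predict([extract_features(*new_pair)]))  # e.g. [1] for acceptable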
While toolkits such as Quest exist to make the feature extraction and
prediction steps as transparent as possible, these tools are not as mature as other
open-source translation tools, such as OmegaT, Moses or Apertium. This is
partly due to the fact that quality estimation tools may rely on external tools
for the feature extraction step (e.g. a language model tool such as IRSTLM)
or the learning and prediction step (e.g. scikit-learn). With the increasing
interest generated by machine translation quality estimation (as shown by the
availability of shared tasks from 2012 to 2015) and the emergence of cross-
industry evaluation frameworks such as the Dynamic Quality Framework, more
user-friendly and robust tools and standards are likely to appear in the near
future.71

5.7.6 Quality standards


Standards in translation quality have always been difficult to define due to
the amount of subjectivity associated with the task. Esselink (2000: 456)
explains that ‘everybody agrees that a quality translation has to be accurate
and consistent as far as terminology, writing style and format is concerned.
However, not many standard metrics for assessing translation quality have been
developed and applied as a standard world-wide.’ Whereas the automotive
industry can rely on the industry-standard J2450 metric, quality standards, or
rather frameworks, are more fragmented in the software localization industry.72
For a long time the LISA QA model was used as mentioned in Section
5.7.2. This model was based on a basic error classification taxonomy that
was associated with different severity levels. With the recent demise of the
LISA organization, however, it is not clear how useful this model will be going
forward. Besides the aforementioned Dynamic Quality Framework defined by
TAUS members, additional recent efforts, which are not fully specific to the
software localization industry, include the Multidimensional Quality Metrics
(MQM) structure defined within the QTLaunchpad project.73 This structure
includes several quality dimensions, such as accuracy, fluency and verity. This
last dimension is of particular interest when adaptation of the source text is
required in the target text as it will be discussed in Section 6.2.
Finally an interesting development in the area of quality standards concerns
a new feature of version 2 of the Internationalization Tag Set.74 This new
feature allows for the annotation of XML and HTML content using specific
quality-related markup, as shown in the example in Listing 5.4.75
In this example, the mrk element delimits the content to annotate.
This element holds a locQualityIssuesRef attribute that refers to the
locQualityIssues element where quality issues are listed. This version of
ITS also introduced two new data categories: the Localization Quality Rating
and the MT Confidence data categories.76, 77 The first is used to express an
overall measurement of the localization quality of a document or an item in a
document whereas the second indicates the confidence score from a machine
translation system for the accuracy of one of its translations (between 0 and
1). These new features should significantly help standardize the exchange of
quality annotations between the various actors involved in the translation
quality processes of localization projects. For instance, the localization quality
rating could be provided by a translation reviewer who is tasked to judge the
quality of a translated document or segment based on pre-defined criteria (such
as those provided by the LISA QA model or Dynamic Quality Framework).

<?xml version="l.0" encoding=uUTF-8u?>


<xliff version=Ml.2" xmlns=Murn:oasis :names :tc:xliff:document :1.2"
xmlns:its="http://www.w3.org/2005/ll/its" its:version="2.0">
<file original="example.doc" source-language="en" datatype="plaintext">
<body>
<trans-unit id="l">
<source xml:lang=Men">This is the content</source>
<target xml:lang="fr"><mrk mtype=Mx-itslq"
its :locQualityIssuesRef="#lqlM>c’es</mrk> le contenu</target>
<its:locQualityIssues xml :id="lqlM>
<its:locQualitylssue locQualityIssueType="misspelling"
locQualityIssueComment="’c’es’ is unknown. Could be ,c,est,M
locQualityIssueSeverity="50"/>
<its:locQualitylssue locQualityIssueType="typographicalM
locQualityIssueComment=,,Sentence without capitalization"
locQualityIssueSeverity="30"/>
</its :locQualityIssues>
</trans-unit>
</body>
</file>
</xliff>

Listing 5.4 Annotating an issue in XML with ITS local standoff markup
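
To give a concrete idea of how such annotations might be consumed downstream, the following sketch (which is not part of the ITS specification itself) uses Python's standard xml.etree.ElementTree module to list the quality issues recorded in an XLIFF file such as the one shown in Listing 5.4. The file name is purely illustrative.

import xml.etree.ElementTree as ET

# Namespace used by ITS 2.0 local standoff markup.
ITS_NS = "{http://www.w3.org/2005/11/its}"

# Hypothetical file containing the content of Listing 5.4.
tree = ET.parse("listing_5_4.xlf")
for issue in tree.iter(ITS_NS + "locQualityIssue"):
    print(issue.get("locQualityIssueType"),
          issue.get("locQualityIssueSeverity"),
          issue.get("locQualityIssueComment"))

A reviewer or a quality assurance tool could aggregate issues listed in this way to derive an overall localization quality rating for a document or segment.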

5.8 Conclusions
This chapter has covered many aspects of one of the core localization processes:
translation. While localization is not limited to translation, localization would
not be possible without it. This chapter reviewed some of the tools and standards
that are commonly used in localization-based translation workflows, including
translation management systems, translation environments, terminology
extractors, machine translation and quality assurance tools. While these tools
can often speed up the translation process (and the overall localization process),
they must be carefully selected depending on the workflow that is being used.
Once again, it must be emphasized that localization workflows can range from
simple operations involving a handful of stakeholders to extremely complex ones
where responsibilities are shared among multiple actors. Regardless of the size
of these operations, the common objective of localization workflows is to adapt
digital content for a number of locales that differ from the one for which the
original content was created.
So far the discussion of adaptation has been extremely limited in this book.
Yes, some adaptation is sometimes required to generate effective translations in
a target language (e.g. using equivalent idiomatic phrases). Yes, some adaptation
is required to ensure that time and currencies display properly based on the
conventions of the target locale. But perhaps more importantly, adaptation often
needs to go beyond the act of translating software strings or documentation
content. While an application that allows users to select their preferred language
to display a graphical interface can be useful, it is not necessarily as useful as
having the features that are expected by those users. In other words, translated
strings are only one aspect of a truly multilingual application. Other aspects,
which include the ability to manipulate and process content in any language,
will be discussed in Section 6.3.3.

5.9 Tasks
This section is divided into four tasks, covering the topics of translation
management systems, translation environments, machine translation and post-
editing, and translation quality assurance.

5.9.1 Reviewing the terms and conditions of an online translation management system

In this task, you should identify an online translation management system.
Some pointers have been provided in Section 5.1, but feel free to extend your
search using your preferred search engine. Once you have identified such a
system, you should locate the terms and conditions governing the use of the
system. If you cannot find such text, please use another system. Once you have
identified this text, review it carefully to understand how the content that you
may upload to this system may be handled or used by the system’s owner. Do
the terms and conditions (and the associated data privacy measures) mention
anything about translation copyright? If so, do you think these terms are fair?
Do they correspond to the expectations you had before using the Web site?

5.9.2 Becoming familiar with a new translation environment


The purpose of this task is to introduce you to a translation environment that
you have never used before. The second section of this chapter provided many
pointers on where to find such an environment (either Web-based or desktop-
based), but again feel free to use your preferred search engine to widen your
options. Imagine that you have agreed to work on a translation project
without realizing that your client wanted you to use this specific translation
environment. It is now too late for you to refuse the job, so work your way
through this new environment to translate an HTML file into the language of
your choice.78 This file can be saved to your computer by going to its online
source and clicking on Save Page As (or similar option) in your Web browser.
As you are working through this assignment, take some notes about the
following aspects of the new translation environment:

• Usability: is this environment suitable for the task? Is it enjoyable to use?


• Documentation (and possibly support): is this environment well documented?
Or sufficiently intuitive to be used without much practice and/or training?
• Productivity: do you think it would have been quicker to translate this file
using your preferred environment? Is this due to the features of the new
environment or perhaps due to a lack of familiarity, or both?
• Functionality: are some of the features you are used to missing from the
environment? Are there new features you wish were present in your preferred
environment?

5.9.3 Building a machine translation system and doing some post-editing


In this task, you should experiment with the LetsMT! platform to create a
machine translation system for the language pair of your choice as shown in
Figure 5.8.79
Note that the training and tuning steps can take a long time so you may have
to start the creation process and wait until you are notified that your system is
ready before you can use it. Once your system is ready, use it to translate some
sentences that are related to the training data you selected when creating the
system. Take some time to analyse some of the translations to identify frequent
errors that are made by the system. If you see that some phrases are consistently
mistranslated, can you think of additional data sources you could use to re-build
and enhance the system? You should also spend some time post-editing some
sentences translated by the system you have just built, using the TAUS guidelines
for achieving quality similar or equal to human translation.80 Which guideline do
you think is the most difficult to adhere to given the characteristics of this
newly built MT system? Which guideline is the easiest to follow?

Figure 5.8 Specifying data sets for the training phase


If you feel confident using the command line and you have access to a
powerful computing environment, you may try to complete the steps that are
required to build a baseline SMT system using Moses instead of LetsMT!81

5.9.4 Checking text and making global replacements


The final task of this chapter deals with the creation of detection and replacement
patterns to automatically correct some machine-translated text. In order to get
started with this task, select a translation memory of your choice and use an MT
system of your choice to translate some source segments from the translation
units (say 1,000 segments). Once the segments have been machine-translated,
compare them with the reference translations, either manually or using a tool
such as SymEval, Meteor X-ray or Rainbow. You should then try to identify
frequent errors that were made by the machine translation system. Using regular
expressions, you should then try to identify these errors automatically and define
replacements with a view to correcting future machine-translated texts in an
automatic manner.
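
As a hint for getting started, the following minimal sketch shows what such detection and replacement patterns could look like once frequent errors have been identified. The patterns shown here are purely illustrative and would obviously depend on your language pair and MT system.

import re

# Each tuple contains a compiled detection pattern and its replacement.
FIXES = [
    (re.compile(r"\bc'es\b"), "c'est"),   # frequent misspelling in the MT output
    (re.compile(r" {2,}"), " "),          # collapse accidental double spaces
]

def post_correct(segment):
    """Apply all replacement patterns to a machine-translated segment."""
    for pattern, replacement in FIXES:
        segment = pattern.sub(replacement, segment)
    return segment

print(post_correct("c'es  le contenu"))   # -> c'est le contenu
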

5.10 Further reading and resources


Obviously not all tools could be covered in this chapter, especially as new tools
are being introduced on a very regular basis. Some sources, such as the TAUS
and GALA directories or the Translator’s Toolbox, should be consulted regularly
to discover new tools or learn about new functionality.82, 83, 84 The discussion
on translation memories was short in this chapter as this technology is not
specific to app localization. Additional information on this topic can be found in
Austermühl (2014). Specific terminology-related issues are also covered in more
detail in Lombard (2006) and Karsch (2006).

Notes
1 An API is a specification indicating how software components should interact with
each other. For instance, a collection of public functions included in a software
library can be described as an API. In other situations, an API corresponds to the
remote function calls that can be made by client applications to remote systems.
2 http://www.linport.org/
3 http://wwww.ttt.org/specs/
4 http://gengo.com/
5 http://developers.gengo.com/
6 http://android-developers.blogspot.ie/2013/11/app-translation-service-now-
available.html
7 https://play.google.com/apps/publish/
8 http://android-developers.blogspot.co.uk/2013/10/improved-app-insight-by-linking-
google.html
9 https://developer.apple.com/internationalization/
10 https://developer.mozilla.org/en-US/Apps/Build/Localization/Getting_started_with_
app_localization
11 https://translations.launchpad.net/ubuntu/+translations
12 https://www.transifex.com/projects/p/disqus/
13 https://translate.twitter.com/welcome
14 https://www.facebook.com/?sk=translations
15 https://about.twitter.com/company/translation
16 http://support.transifex.com/customer/portal/articles/972120-introduction-to-the-
web-editor
17 http://docs.translatehouse.org/projects/pootle/en/stable-2.5.1/features/index.
html#online-translation-editor
18 http://www.translationtribulations.com/2014/01/the-2013-translation-environment-
tools.html
19 http://www.translationzone.com/products/sdl-trados-studio/
20 http://developer.android.com/distribute/googleplay/publish/localizing.html
21 http://blogs.adobe.com/globalization/2013/06/28/five-golden-rules-to-achieve-agile-
localization/
22 http://www.jboss.org/ The source files for this guide are provided under a
Creative Commons CC-BY-SA license: https://github.com/pressgang/pressgang-
documentation-guide/blob/master/en-US/fallback_content/section-Share_and_
Share_Alike.xml
23 http://www.nltk.org/book/ch07.html
24 http://wordnet.princeton.edu/
25 http://anymalign.limsi.fr#download
26 http://opus.lingfil.uu.se/KDE4.php
27 http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/
poterminology.html#poterminology
28 http://www.eurotermbank.com/
29 http://www.termwiki.com/
30 https://www.microsoft.com/Language/en-US/Default.aspx
31 http://blogs.technet.com/b/terminology/archive/2013/10/01/announcing-the-
microsoft-terminology-service-api.aspx
32 https://www.microsoft.com/Language/en-US/Terminology.aspx
33 https://www.microsoft.com/Language/en-US/Translations.aspx
34 http://www.ttt.org/oscarStandards/tbx/tbx_oscar.pdf
35 http://www.olif.net/
36 http://www.aamt.info/english/utx/
37 http://www.tbxconvert.gevterm.net/
38 https://www.letsmt.eu/Start.aspx
39 http://www.statmt.org/wmt09/translation-task.html
40 https://www.tausdata.org/index.php/data
41 http://mymemory.translated.net
42 http://asiya.cs.upc.edu/demo/asiya_online.php
43 http://www.statmt.org/moses/?n=FactoredTraining.EMS
44 https://labs.taus.net/mt/mosestutorial
45 http://www.precisiontranslationtools.com/products/
46 https://www.letsmt.eu
47 http://www.kantanmt.com/
48 https://hub.microsofttranslator.com
49 https://evaluation.taus.net/resources/guidelines/post-editing/machine-translation-
post-editing-guidelines
50 http://msdn.microsoft.com/en-us/library/hh847650.aspx
51 http://www.matecat.com/wp-content/uploads/2013/01/MateCat-D4.1-V1.1_final.
pdf
52 http://symeval.sourceforge.net
53 www.cen.eu/
54 http://www.lics-certification.org/downloads/04_CertScheme-LICS-EN15038v40_2011-09-01-EN.pdf
55 http://www.huffingtonpost.com/nataly-kelly/ten-common-myths-about-
tr_b_3599644.html
56 The LISA QA Model was initially developed by the now defunct Localization
Industry Standards Association (LISA). Since this model was not a standard, it is no
longer officially maintained.
57 https://evaluation.taus.net/resources-c/guidelines-c/best-practices-on-sampling
58 http://www.dog-gmbh.de/software-produkte/errorspy.html?L=1
59 http://www.qa-distiller.com/
60 http://www.xbench.net/
61 http://www.opentag.com/okapi/wiki/index.php?title=CheckMate
62 http://opus.lingfil.uu.se/KDE4.php
63 http://wiki.languagetool.org/checking-translations-bilingual-texts
64 http://docutils.sourceforge.net/rst.html
65 http://sphinx.readthedocs.org/en/latest/intl.html
66 http://www.digitallinguistics.com/ReviewSentinel.pdf
67 https://github.com/lspecia/quest
68 http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox_baseline_17
69 http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox
70 http://www.quest.dcs.shef.ac.uk/quest_files/features_glassbox
71 https://evaluation.taus.net/
72 http://standards.sae.org/j2450_200508/
73 http://www.qt21.eu/launchpad/content/multidimensional-quality-metrics
74 http://www.w3.org/TR/its20
75 http://www.w3.org/TR/its20/examples/xml/EX-locQualityIssue-global-2.xml
Copyright © [29 October 2013] World Wide Web Consortium, (Massachusetts
Institute of Technology, European Research Consortium for Informatics and
Mathematics, Keio University, Beihang). All Rights Reserved. http://www.w3.org/
Consortium/Legal/2002/copyright-documents-20021231
76 http://www.w3.org/TR/its20#lqrating
77 http://www.w3.org/TR/its20/#mtconfidence
78 http://okapi.googlecode.com/git/okapi/examples/java/myFile.html
79 https://www.letsmt.eu/Register.aspx
80 https://evaluation.taus.net/resources/guidelines/post-editing/machine-translation-
post-editing-guidelines
81 http://www.statmt.org/moses/?n=Moses.Baseline
82 http://www.gala-global.org/LTAdvisor/
83 https://directories.taus.net/
84 http://www.internationalwriters.com/toolbox/
6 Advanced localization

Chapter 4 introduced basic localization concepts, focusing on a somewhat
traditional approach to textual content localization, whereby the source is
created in a specific, internationalized manner so that it can be easily extracted,
translated, and merged back into a localized version. While this approach can
be very effective for the visual, textual components that make up a software
application (e.g. user interface strings or frequently asked question content), it
does not address the situations when more complex transformations are required.
From a user’s perspective, these situations may occur at any step of an application’s
life cycle, as shown in Figure 6.1.
An end-user application life cycle can be divided into three main phases:
the phase when a user discovers an application (either by coming across some
advertising material or searching for a type of application using a search engine);
the phase when a user acquires an application (by possibly purchasing it,
downloading it and installing it); and the phase when the user actually interacts

[Diagram: Discovery (Search, Experience); Acquisition (Purchase, Download, Install); Usage (Use, Get Help, Learn)]

Figure 6.1 Application life cycle: a user perspective


with the application in order to perform some task. During these phases, an end-
user is exposed to content and functionality that are either part of the application
itself or that belong to its digital ecosystem (e.g. marketing material, training
video). This exposure is framed by user expectations and requirements, some of
which can be culturally motivated. As far as application publishers are concerned,
a major goal in capturing and retaining a user is to meet or even exceed these
expectations by identifying situations which require content to be adapted. The
first section of this chapter focuses on those adaptation scenarios that are related
to non-textual content types, including images, audio and video.
Other adaptation situations may arise when the main element to keep from
the source content is the intended impact on the target audience. However, the
content format, structure and words must be completely changed to achieve a
similar effect on a target locale. While these considerations are most likely to
apply in the context of marketing content (where the main goal behind content
creation is to convince users to make a purchasing decision), they may be useful
for other content types as well (e.g. e-commerce content which users interact
with to purchase applications or informative content that should truly engage
the users). The second section of this chapter provides more detail on such
adaptation processes and their implications for multilingual applications.
Also, the traditional approach to content localization does not address the
functional dimension of a given application. While translating an application’s
user interface may go a long way in competing with native market applications,
additional functionality adaptation may be required to win market share. For
instance, a tax calculation application can have its user interface and associated
documentation content translated into multiple languages but this is likely to be
of limited value to target locale users if this application is not adapted (or localized
from a functional perspective) to support multiple tax regimes (which may be in
place in various target locales). From a translation perspective, one may decide
that this type of work is beyond the scope of traditional translation workflows.
However, translators with advanced text processing or language engineering skills
may be instrumental in helping identify the shortcomings of a given application
in a given locale. For instance, if translators are being asked to provide or check
translations in-context by using a localized application, they may realize that
this application is behaving in an unexpected manner because of a lack of
functional adaptation (i.e. while the interface appears in the target language, its
functionality is inadequate for the target locale). Once such shortcomings have
been identified, adaptation requirements can be defined. Once again, language
engineers can play a fundamental role in this phase, especially if the application’s
functionality that requires adaptation is concerned with content processing. The
third section of this chapter will provide examples of functionality adaptation
that are required in order to offer full multilingual support to specific applications.
Finally, another factor should be taken into account for an application to be
considered truly localized and multilingual. Even though this factor may be of
limited interest to translators because it cannot be influenced through traditional
translation work, their linguistic work may be compromised if this factor is
neglected. This factor is the ability to provide multilingual applications that will
not compromise the user experience regardless of the user’s location. In such a
scenario, the system infrastructure that supports a given multilingual application
must be architected in a user-aware manner so that content and services are
located as close as possible to their user base in order to minimize slow response
times (and frustration). Adapting the location of an application forms the final
section of this chapter.

6.1 Adaptation of non-textual content


Multiple non-textual content or media types may exist in an app’s digital
ecosystem. Some of these elements may be associated with textual elements in
specific components. For instance, it is very common for user assistance content
(e.g. user manuals) to contain both text and graphics, such as screenshots or
diagrams. Another example of media combination concerns video-based tutorials
or training videos (also known as screencasts) since these can sometimes be
equipped with text subtitles. This section discusses the following content types
in sequence: screenshots, other graphic types, audio and video.

6.1.1 Screenshots
Some of the graphics present in user assistance content are screenshots or screen
captures, showing specific parts of an environment in which something happens
or needs to be done. The term environment refers here to the Graphical User
Interface (GUI) of a program or set of programs. In user assistance content, some
sections are often illustrated with screenshots whose purpose is to guide users
in step-by-step procedures, such as activating a particular function, modifying
certain settings, or removing an application. Since screenshots sometimes
perform the same function as text instructions, one may wonder why one is
used instead of the other, or why both are sometimes used together. Elements
of an answer to this question may be found in a study (Fukuoka et al. 1999), which
found that American and Japanese users believe that more graphics, rather than
fewer, make instructions easier to follow. This study also revealed that users
prefer a combination of text and graphics, which they believe would be more
effective than text-only instructions. From a semiotic perspective, screenshots
play an iconic role (Dirven and Verspoor 1998), because they provide users with
a replication of the environment with which they are interacting.
Screenshots may also provide an illustration of some of the steps users should
follow to fix a problem. Technical support screenshots can sometimes be edited
by content developers to provide extra information to users. Information can
be added using text or graphical drawings, such as arrows or circles, to draw the
attention of the user to a certain part of the replicated GUI. These elements are
examples of an indexing principle (Dirven and Verspoor 1998: 5) because they
draw the attention of the user to a particular action that should be performed, or
to the result of an action. This principle allows users to isolate the component
of the GUI which requires action. As a result of this quick link between
form and meaning, screenshots may replace procedural sentences containing
instructions to find the location of a graphical item, be it a button, a tab, a
pane, a window, a menu bar, a menu item, or a radio button. These elements
may also have an iconic function by replacing the action that the user should
perform on one of these items: to click, to check, to uncheck, or to enter a
word. From a multilingual communicative perspective, those screenshots should
of course be in the language of the users so that their primary iconic function
can be fully performed. However, this is not always possible, because third-party
English applications are not always localized. The handling of screenshots is
therefore a complex localization process. It is sometimes difficult or impossible
for a human translator or quality assurance specialist to find the corresponding
screenshot in his or her own language. The time required to perform such a
search during the translation process should therefore not be underestimated.
This is not the only drawback of screenshots when they are included in technical
support documents. Screenshots can also create accessibility issues for users with
eyesight-related difficulties. If screenshots are not accompanied by alternative
text as discussed in Section 3.3.1, they may be ignored by accessibility tools
such as screen narrators. Besides, a screenshot may impact on the reliability
of a document over time, or at least baffle users running older versions of the
product for which the document was originally intended. This situation can
happen when the GUI changes over time. For instance, a document may apply to
several versions of an operating system as long as the text used in the document does
not focus on any particular version. If a screenshot is introduced, the document
may be perceived as version-specific by certain users. If the screenshot does not
exactly match their environment, certain users may come to the conclusion that
the document does not apply to them.
Screenshots may also be included in other document types. For example,
they are increasingly used in pages associated with the description of mobile or
platform-specific applications that can be downloaded from specific Web sites
(often referred to as app stores), as shown in Figure 6.2.
These descriptions, which are consulted by prospective users during the
discovery phase, may contain a mix of text and graphics, so having content that
seems relevant to potential users is essential. When screenshots are used in this
context, their main function is to promote an application by giving users a quick
view of the application’s main functionality. Since users’ decisions to select a
particular application in a given application category (e.g. a calendar application)
are made quickly based on the increasingly large number of applications
available, screenshots should be both engaging and relevant. For example, if the
application has been localized from English into French and German, providing
screenshots with English text (either in the User Interface or in input fields) may
be detrimental to future uptake in French- and German-speaking locales. It is also
not always sufficient to provide localized screenshots if the examples contained
in the screenshots are not relevant. In the case of a restaurant recommendation
application targeting Japanese-speaking users, providing an example of a search

Figure 6.2 Add-on repository for the Firefox Web browser

for restaurants in San Francisco may not be as powerful as a search for restaurants
in Tokyo.1
To some extent, this characteristic also applies to video clips (or videos) that
are sometimes linked to technical support documents. These video clips contain
step-by-step tutorials designed to help users find an answer to their question.
From a localization perspective, this type of element is even more complex than
static screenshots, while having the same pragmatic function as plain text. Some
aspects of the localization of this type of content are covered in Section 6.1.3
once other graphic types have been discussed.

6.1.2 Other graphic types


Screenshots are not the only types of graphics that may have to be localized in
an application’s interface or documentation. For instance, it is possible for some
applications to rely on graphics to create some of the interface’s elements, such
as menus or toolbars, instead of relying on software strings. This approach can,
however, be problematic from a localization perspective as the editing of text
embedded in graphic files is time-consuming and requires the use of dedicated,
image-editing software. Besides, while Internet connections are now faster than
they were at the beginning of the 2000s, the observation that ‘images take much
longer to load in web browsers than text’ is still valid (Esselink 2000: 357).
Web applications that make heavy use of such graphics may therefore feel less
responsive than those that rely solely on textual strings.
Other common graphic types are icons that may be used instead of strings
to help users navigate an application. The localization of such elements can be
described as an adaptation task if some icons are deemed to be unsuitable for a
specific locale. Adaptation may be required if alternative icons or images are
deemed more popular and intuitive with users from specific countries.
Geo-political considerations must sometimes be taken into account when
localizing specific graphic types for target locales. For instance, political
sensitivities can be affected by particular views of the physical world. Such views
can be exemplified by the use of flags or world maps, even though borders or
countries are not always recognized in a consistent manner in all of the world’s
locales. Selecting a fixed representation of the world may lead to the rejection
of an application, so adaptation is sometimes required. While this work focuses
more on graphics than text, text can obviously be affected if a name has to be
removed or introduced during the translation process (translation by ablation
or creation). In the same vein, religious or gender-biased references or views
that may offend cultural sensitivities must be handled with care during the
localization process. Such issues are most likely to affect creative content subject
to transcreation (e.g. marketing content). For this reason, the Microsoft Style
Guide suggests that ‘a thorough understanding of the culture of the target market
is required for checking the appropriateness of cultural content, clip art and other
visual representations of religious symbols, body and hand gestures’ (Microsoft
2011: 27).

6.1.3 Audio and video


Localizing rich media content, such as training videos used to explain some
product functionality, is obviously much more complex than localizing textual
content. This is due to a number of factors: first of all, videos are not consumed in
the same manner as textual content. While textual content can be read multiple
times to clarify things, it is more cumbersome to pause and rewind a video to view
or listen to one of its sections again. This means that the text or audio used in a
video should be as clear and fluent as possible in order to avoid confusing users
with unusual words or phrases. Second, the visual content present in a video may
be referred to by the voice-over, which means that it may not always be possible
to have a perfect user experience through localization. This challenge is similar
to the one that was described in Section 6.1.1 when a video contains a recording
of a specific user environment (e.g. showing an English operating system and
applications). Even if the video is localized with target language subtitles, the
user environment will remain in the source language (e.g. English), which is
bound to confuse users from other locales.
In some cases, creating a video in another language from scratch may be
the only way to ensure that both the visual content and the voice-over are in
the same language. Finally, actors are sometimes used in training videos. If the
localized material is going to be as effective as the source material, then issues
such as lip synchronization and professional delivery must be addressed during
the localization process. Some of these issues (such as the hiring of professional
native actors or the editing of animations to accommodate translations) cannot
be covered in this book. Instead, the focus of this section is placed on video
subtitling and voice-over script localization.

Video subtitling
While it is not necessary to have access to a voice-over transcript to create
localized subtitles, its presence can simplify the translation process. For example,
a translation memory could be used to leverage previous translations based on
an analysis of the source transcript. Three steps are required to generate video
subtitles: the actual creation of the subtitles, the synchronization of the subtitles
with the audio track and a final review to refine the translation. All of these steps
can be performed using dedicated software, such as the online Amara service.2
This service is maintained by the Participatory Culture Foundation, which is a
‘non-profit organization building free and open tools for a more democratic and
decentralized media’.3 This online service allows users to generate subtitles in the
language of their choice using the interface shown in Figure 6.3.
The goal of the first step is to type translations for the words that correspond
to the words spoken in the audio track. In the case of a product tutorial, these
words are spoken by an instructor who may be describing steps to achieve a
particular objective (such as installing a product or using a particular product
feature to perform a task). The Amara software automatically stops every eight
seconds to make sure that the narrated text is broken down into manageable,
easy-to-remember chunks. During this first step, typing mistakes can be made

Figure 6.3 Amara subtitling environment


and words can be skipped since these mistakes can be fixed at a later stage. The
second step is to synchronize the words typed in the first step with the audio and
the visual background (by specifying how long a subtitle should stay on screen).
In this step, it may be necessary to remove a few words to shorten some of the
subtitles. Information that is deemed redundant can be easily skipped to improve
the final user experience. Shorter subtitles are easier to read and understand. The
final step is a review step, during which further quality checks can be made. The
following guidelines are provided by Amara:

• Include important sounds in [brackets].


• Include text that appears in the video (signs, etc.).
• It’s best to split subtitles at the end of a sentence or a long phrase.

While the first guideline applies to subtitles that are aimed at hard-of-hearing
users, the second guideline is extremely important because this text cannot
be localized without re-shooting the video. While this guideline can be easily
applied when the number of signs is small, it becomes much more challenging
when a user interface is being recorded in the context of a tutorial (screencast).
Having to add subtitles for all GUI labels that are being clicked by a user may
be impossible, especially when the user is also describing the actions they are
performing. In such an extreme case, it would seem preferable to record the video
in the target language with a target GUI. The objective of the third guideline is
to improve the final user experience, by making sure that sentences are not split
in an unusual way. Unless short sentences are used, however, the implementation
of this guideline might result in having the user read a substantial amount of
information on screen in one go. From a comprehensibility perspective, it seems
preferable to avoid having to remember what was said in previous subtitles.
Having standalone subtitles does not only improve comprehensibility, it also
improves translatability as explained in the following section.

Voice-over script localization


One of the translation challenges to address when localizing voice-over scripts
(or subtitles) concerns incomplete sentences. If a sentence is broken into two
(or three) separate subtitles, it may be very difficult (or even impossible) to keep
the same structure in the target language without introducing comprehensibility
problems. This is particularly true for languages whose word order differs from
the source language. For example, when the source language is English and the
target language German, word order differences exist, especially with regard to
the position of the main verb of a given sentence. While this verb may appear
early in an English sentence, its translation may have to be pushed to the end
of the German sentence. If German-speaking viewers have to wait six seconds
(corresponding to two frames of subtitles) to establish a link between the subject
and the verb of a sentence, their cognitive load will increase due to the amount
of concentration required.
Like software localization, voice-over script localization must also handle
space and time constraints. According to Pedersen (2009), no more than 12
characters per second should be used when creating subtitles. This means that
translations cannot be too verbose, especially when the source content contains
many references to the source culture. In his article, Pedersen reviews a number
of translation strategies that can be used to deal with words or phrases that
are culturally loaded. While such phrases may not be as frequent in software
product tutorials as in creative films, they may appear if the video author wants
to educate their audience in a personal or engaging way. In such a case, it would
not be unusual for references to people or places to be made, which may lead to
tough translation choices (possibly requiring some extensive research). Pedersen
presents two types of strategy: minimum change and intervention. A minimum
change can result in using an official translation or a word-for-word translation,
while an intervention requires additional work. Interventions can sometimes
result in adding words to clarify what a place is or who a person might be. For
example, a screencast recorded for users of a Windows program may refer to a
Linux operating system that is popular in their region, as in The program I am
going to show you is much better than its CentOS counterpart. When localizing the
voice-over script, it may be necessary to add an explanatory phrase to clarify what
CentOS is (as in The program I am going to show you is much better than the program
running on the CentOS Linux distribution). As shown by this example, this type of
translation technique can result in longer texts, which is why generalizations are
often used. Using such a technique, the specific nature of the original sentence
disappears in order to favour a more concise description, such as The program I
am going to show you is much better than its Linux counterpart. The final type of
intervention is a substitution, which occurs whenever a concept does not exist
in a target locale. If for some reason, CentOS or Linux were completely unknown
in a target locale, using an equivalent term may be a last-resort solution, such
as The program I am going to show you is much better than its Mac OS counterpart.
While this type of transformation obviously distorts the meaning of the original
sentence, it tries to avoid confusing users with concepts and terms that may be
too obscure.
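
A simple automatic check can help enforce the reading-speed constraint mentioned above. The following minimal sketch assumes that subtitle timings are available in seconds; it is not tied to any particular subtitle file format.

def exceeds_reading_speed(text, start, end, max_cps=12):
    """Return True if a subtitle requires more than max_cps characters per second."""
    duration = end - start
    if duration <= 0:
        return True
    return len(text) / duration > max_cps

# 55 characters displayed for 3.5 seconds is roughly 16 characters per second.
print(exceeds_reading_speed(
    "Le programme que je vais vous montrer est bien meilleur", 10.0, 13.5))
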
This section has focused on some (simple) aspects of video localization, where
online platforms such as Amara are being used by multilingual users to provide
(translated) subtitles, often based on volunteering. The approach presented
in this section is proving extremely popular with content creation frameworks
such as TED videos or massive open online courses (where zealous students
provide free translations to their peers).4 It should be noted, however, that
professional services can also be used to supplement volunteer translations. For
instance, subtitled YouTube videos can now also be translated using professional
translation services such as Gengo and Translated.net.5 This example shows that
the selection of a video format that supports an easy way to integrate subtitling
(be it in the original language of the video or in a target language) is extremely
important. Other video formats may require additional steps. For example,
localization guidelines are available for the processing of Flash files.6

6.2 Adaptation of textual content


Textual adaptation takes place when extensive local research is required to find
words or phrases in a target language that map to words or phrases in the source
language in a non-obvious manner. An area where such textual adaptation is
needed is Search Engine Optimization (SEO). SEO means trying to match
the popular keywords users rely on to search for content using a search engine
with the words that should be present in the content that publishers want to
see accessed by users. This phenomenon is relevant for Web content, such as
marketing content used in e-commerce sites or technical content used in support
sites. The use of this technique is not 100 per cent accurate since search engine
providers adjust the way content gets ranked from time to time. However, it can
help publishers promote content, which would otherwise have been translated
and published but possibly not found by users. This situation is due to the fact
that translators cannot always guess which terms (or term variants) users use
most (especially if the most preferred term form contains a spelling mistake). As
mentioned in the previous section, translators have to make hard choices during
the translation process, but such choices may have to be edited based on search
usage data. Such an approach can lead to linguistic challenges if little attention is
paid to the way terms are replaced in the translated content. Morphologically rich
languages have multiple term forms for a given term so basic string replacements
are likely to create issues. Similarly, compounds are likely to be affected if string
replacements are performed on shorter strings (e.g. the word killer can be found
in killer feature but it has a completely different meaning).
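
The following minimal sketch illustrates the problem with an English example (the terms are purely illustrative): a naive string replacement also rewrites the compound, whereas a pattern using word boundaries and a negative lookahead can protect it.

import re

text = "This killer feature makes the app a killer in its category."

# Naive replacement also rewrites the compound 'killer feature':
print(text.replace("killer", "best-in-class"))

# Safer: match the standalone term only and protect the known compound.
pattern = re.compile(r"\bkiller\b(?! feature)")
print(pattern.sub("best-in-class", text))

For morphologically rich languages, matching all relevant term forms would require additional patterns or, better, some form of lemmatization.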
Search engine optimization relies on finding keywords that are popular
among a particular market segment. This technique is also relevant as far as
application (or app) stores are concerned. These stores, which are growing in
popularity to find and download applications, can be searched using keywords.
From a localization perspective, translating such keywords (often out-of-context)
is often not satisfactory. Instead, keyword localization involves detailed research
which, according to an online report, can really pay off in increasing application
downloads.7 Such research may involve:

• Finding a list of keywords and search terms that users use in order to find
applications.
• Understanding the keywords that are used by competing applications.
• Identifying popular search terms that are currently matched by a small
number of applications (or even better, by no application).

Once these steps are completed, the identified terms can be used in the
application’s description or in any field that is indexed by the application
repository’s search engine. As shown by the nature of these steps, such an activity
is very different in nature from a traditional approach to translation, hence the
need to categorize it under adaptation. Two other types of textual adaptation are
discussed in the remainder of this section, transcreation and personalization.

6.2.1 Transcreation
Transcreation was briefly introduced in Section 1.3. The challenge with this
concept is that it sometimes overlaps with translation. After all, adaptation is
one of the translation techniques that translators rely on to transfer elements of
a source text into a target text. Examples of such elements include ‘prices [that]
should be in local currency and phone numbers [that] should reflect national
conventions’ in translated texts (DePalma 2002: 67). The adaptation of such
elements can be challenging: changing one currency symbol to another and
using a standard conversion rate is unlikely to be sufficient because of specific,
local pricing strategies. Changing a phone number by adding a prefix is also
unlikely to be sufficient because a phone number in Germany is not going to
be useful for a customer based in Japan who is looking for technical support in
Japanese during the Usage phase. Determining whether such equivalent phone
numbers or addresses are available may not always be straightforward because
some locales may not have dedicated local support teams, especially if support
teams are shared across multiple locales. Obviously, this type of adaptation should
be identified early in the source content creation process, so that specific measures
can be taken. For instance, specific supporting materials may be provided to
translators as part of the translation guidelines or this type of content can be
excluded from the translation process altogether. So if adaptation is part of the
translation process, why is a new term such as transcreation required?
When a few adaptation issues are scattered across an informative document
(say, a user guide), a standard translation process such as the one presented in
Section 4.3 can be used. When such adaptation issues appear throughout a
document that is trying to trigger a reaction from the user, however, another
strategy may have to be considered. Rather than being specific about how the
content should be translated (e.g. using tools, guidelines and reference assets),
translators may be given complete carte blanche to create a document in the
target language that matches the intent of the source text. For instance, the
Mozilla foundation provides the following adaptation guidelines for Web site
content, campaigns and other communications intended for a general audience:
‘Localized content should [not] be a literal translation, but it should capture the
same meaning and sentiment. So feel free to pull it apart and put it back together;
replace an English expression with one from your native language; Mozilla-fy it
for your region.’ 8
Ray and Kelly (2010: 3) indicate that ‘typical projects that require
transcreation include Web campaigns that do not attract customers in other
markets, ads that are based on wordplay, humour that is directly related to just
one language or culture, or products and services that need to be marketed to
diverse demographics within the same market’. Obviously this creative process
requires more time than standard translations because multiple variants may have
to be considered before an acceptable wording is found in the target language.9
In these situations, translation is often going to be inadequate, which is why
transcreation is required. While translation endeavours to somehow reuse some
aspects of the source text (e.g. information structure), transcreation seeks to have
the target text achieve the same high-level goal as the source text or brief (e.g.
convincing users to buy a product). In such a scenario, the source words, phrases,
sounds and structure no longer matter: it is all about leveraging target cultural
norms and expectations.
It is worth mentioning that the use of transcreation may create some challenges,
in terms of cost and brand protection, as mentioned by the head of Marketing
Localization at Adobe: ‘The challenge here is the balance between giving more
flexibility and freedom of expression to the regions and the use of productivity
tools such as Translation Memory. If we want to leverage the savings that TMs
and other tools offer to localization (and we do), we can offer some flexibility in
the target content but not as much as sometimes the regions would like to have.
[…] We are very protective when it comes to the Adobe brand and although the
regional offices are given some flexibility in terms of creating some of their own
marketing materials (in their original language), Adobe’s Brand team normally
is involved to make sure the materials follow the established international brand
guidelines.’10 A more detailed discussion of the techniques that can be used when
dealing with text types such as marketing or advertising documents can be found
in Torresi (2010).

6.2.2 Personalization
Another type of textual adaptation is personalization. Personalization can be
achieved using a couple of approaches, but only one of them is relevant to the
present discussion. The first approach consists in focusing on linking content to
other content based on specific attributes. For instance, some content may be
recommended to a user who has previously consumed specific content. This type
of personalization does not need to take into account local knowledge.
The second type of personalization, which is the focus of this section, refers
to the adaptation process that is required to meet the expectations or needs of
specific individuals (or even of a single individual). Such individuals can be
grouped into personas whose specific characteristics guide the content creation
or personalization processes.11 The advantage of this approach is that it is not
constrained by pre-defined characteristics, such as the location of a user. While it
might be tempting to assume that users from a specific region are likely to behave
in a similar manner, one should not forget that other factors can come into play,
such as age or fields of interests. For instance, a user based in Germany (who
happens to study English in college) may have more in common with an American
user of the same age group than with another German person from a different
age group. This means that targeting users by focusing exclusively on location
can be sub-optimal. Rather than assuming that a user should be presented with
content in a language based on its geographical location (e.g. German if the IP
address of the system making a Web request is associated with Germany), content
publishers can take into account users’ linguistic preferences. Such preferences,
which are captured in the language preference settings of a Web browser, are

Figure 6.4 Language preferences

usually sent to Web servers in the Accept-Language HTTP header.12 By default,
this value should match the installation language of the Web browser, but users
can add other languages if they want to inform Web servers that they prefer or
are capable of consuming content in other languages, as shown in Figure 6.4.13, 14
The example with the tick from Figure 6.4 shows that the preferred way to
specify language variants is to indicate a generic fall-back value (fr for French)
in case no resources exist for the French variant from Switzerland (fr-ch). In the
example with the cross, the fall-back value would be de (German).
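
The fall-back mechanism described above can be sketched as follows. For simplicity, the quality values (q=) that may accompany each language tag are ignored, and both the header value and the list of available locales are illustrative.

AVAILABLE = ["fr", "de", "en"]      # locales for which localized content exists

def pick_locale(accept_language, available=AVAILABLE, default="en"):
    """Return the first requested locale that is available, falling back to the base language."""
    requested = [part.split(";")[0].strip().lower()
                 for part in accept_language.split(",") if part.strip()]
    for lang in requested:
        if lang in available:
            return lang
        base = lang.split("-")[0]   # fr-ch falls back to fr
        if base in available:
            return base
    return default

print(pick_locale("fr-CH, fr;q=0.9, de;q=0.7"))   # -> fr
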
As DePalma (2002: 71) puts it, ‘each consumer and every corporate buyer is
driven by a complex set of psychographic motivators that determine how [(s)]
he reacts to a marketing message, brand, or online selling process.’ Software
applications have always offered users various ways to customize the look and feel
of the interface that is used to perform certain tasks. For instance, users can often
change the colour scheme, font size or menu location of an application in order to
be more productive or simply to pass the time. As far as text and translation are
concerned, examples of personalization are limited in the display of user interface
strings or documentation. For instance, it is currently not common for users to be
able to change an application in such a way that a Folder menu item is renamed as
Directory based on their preferences. However, marketing specialists have realized
that (prospective) customers are extremely sensitive to small term changes. Using
techniques such as A/B testing (and tools), it is possible to determine whether
users are more likely to make a decision (such as purchasing an item) if the product
is described to them using a specific set of terms (Kohavi et al. 2009).15, 16 Since
more and more data is now collected online, it is therefore possible to draw user
profiles and personalize a message almost on a user-per-user basis. From a translation
perspective, this leads to an interesting challenge, because translators often (if not
always) have to make terminological choices based on what they think a target user
might look like. While such choices can be acceptable for a large number of users,
they can sometimes alienate other groups of users. In the technology sector, it is
quite common to borrow technical terms from the English language in the target
language instead of translating them. While this may appeal to a certain (younger?)
audience, it may leave certain users with a bad impression. Rather than having to
make such hard choices during the translation process, one could envisage making
a soft choice during the translation, a choice which could be user-overwritten
when the content is displayed. As far as Web content is concerned, this can for
instance be achieved using HTML5 by marking specific terms or phrases using a
span element so that the content of this element can be replaced with the user
choice.17 The use of this replacement technique may not be currently mainstream
in content publishing circles, but it may become prevalent in years to come.
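
The following minimal sketch shows how such a replacement could be performed on the server side. The markup convention (a span element carrying a data-term attribute) and the terminology preferences are assumptions made purely for illustration, and the same substitution could equally be performed in the browser.

import re

USER_PREFERENCES = {"folder": "Répertoire"}    # this user prefers 'Répertoire' over 'Dossier'

html = 'Ouvrez le menu <span class="term" data-term="folder">Dossier</span>.'

def personalize(markup, prefs):
    """Replace the displayed form of each marked term with the user's preferred form."""
    def substitute(match):
        term_id, original = match.group(1), match.group(2)
        display = prefs.get(term_id, original)
        return '<span class="term" data-term="{}">{}</span>'.format(term_id, display)
    return re.sub(r'<span class="term" data-term="([^"]+)">([^<]+)</span>',
                  substitute, markup)

print(personalize(html, USER_PREFERENCES))
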

6.3 Adaptation of functionality


Sometimes the modifications that are made to an application to meet the
expectations of a target market go beyond the standard feature set offered by the
base application, even when this application has been internationalized to the
extent offered by the programming language or framework used to develop the
application. Such modifications fall into three categories: local regulations, local
services and core functionality.

6.3.1 Regulatory compliance


As mentioned in Section 1.1.3, regulatory localization is sometimes required
when an application (or Web service) must deal with specific laws or regulations,
such as tax regimes. To address this challenge, Collins and Pahl (2013) suggest
using standards-based mappings to achieve regulatory compliance with regionally
varying laws, standards and regulations. Dealing with such challenges is not
new since desktop-based applications have had to rely on such mappings in the
past. Web services, however, are very different in nature because they can in
theory be consumed from any geographical location by any user. The approach
presented by the authors suggests having an intermediary system (known as the
mediator), which can ensure that the request submitted by a user (or system) will
be answered using the appropriate format (e.g. in terms of currency, VAT rate,
etc.) An example of this approach would be a user based in France who decides
to order a product using an e-commerce site located in China. The e-commerce
site would have to take into account certain user characteristics (such as the
preferred currency to use to display prices during the purchase transaction,
any tax that may have to be added to the original price in order to comply
with French or European regulations, and the generation of translated product
documentation in order to comply with French language laws). As much as
possible, regulatory adaptation should be automated thanks to the availability
of pre-defined mappings (e.g. if the user locale is fr-FR, then the following
services must be used during a transaction). Unfortunately, no central mappings
repository currently exists, so it is easy for Web service providers to ignore or
bypass local regulations.
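
The kind of pre-defined mappings mentioned above could be as simple as the following sketch, in which the locale of the requesting user determines the currency, VAT rate and documentation language that a (hypothetical) mediator should apply. The values shown are purely illustrative.

REGULATORY_MAPPINGS = {
    "fr-FR": {"currency": "EUR", "vat_rate": 0.20, "documentation_language": "fr"},
    "de-DE": {"currency": "EUR", "vat_rate": 0.19, "documentation_language": "de"},
    "ja-JP": {"currency": "JPY", "vat_rate": 0.10, "documentation_language": "ja"},
}

def transaction_settings(user_locale):
    """Return the settings a transaction should comply with for a given locale."""
    if user_locale not in REGULATORY_MAPPINGS:
        raise ValueError("No regulatory mapping defined for locale " + user_locale)
    return REGULATORY_MAPPINGS[user_locale]

print(transaction_settings("fr-FR"))
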

6.3.2 Services
More and more applications (whether they are Web, mobile or desktop
applications) are connected to Web services in order to provide functionality
that may not be practical or desirable to provide in the application itself. For
instance, it is currently not practical to access a generic search engine on a mobile
device without being connected to the Internet. The computing power required
to perform a standard Web search is well beyond the capability of most modern
mobile devices. It is also convenient for software publishers to make some of their
functionality available as Web services instead of packaging them in standalone
applications. Even if standalone applications are published in closed, proprietary
formats, it is always possible to reverse engineer them and access their source
code. Making key functionality available as a Web service therefore allows
software publishers to keep their code away from curious eyes.
Examples of such Web services include services providing weather forecast
predictions, news information, stock market values, text or speech translation,
information search results, etc. Some Web services are obviously more popular
than others depending on the locale where they are available. According to
an online report, while most worldwide users tend to favour the Google search
engine, most Chinese users tend to rely on the Baidu search engine while most
Russian users tend to rely on the Yandex search engine.18 For an application to
have the expected impact in any given locale, such local preferences have to be
taken into account. For instance, if an application (such as a word processing
application or a reference management application) allows users to perform
searches using a specific search engine service, this functionality may have to
be adapted to either support additional search engine services or replace the
existing one with a local service. Online services such as eBay or Google News
may not be as popular (or even available) in other locales, so the list of supported
services may have to be tailored accordingly. Such adaptation work may be
labour-intensive, especially if such services do not rely on industry standards to
receive and respond to requests. The work can also be further complicated if some
services are not documented in the language of the developer who is responsible
for the adaptation. A good example of service adaptation was provided by Apple
in 2012 with one of the releases of their OS X operating system. This complex
piece of software was specifically adapted for the Chinese market, so that its users
could select Baidu search in the Web browser, set up their contacts, mail and
calendar with service providers such as QQ, 126.com and 163.com, or upload
videos to the Youku and Tudou Web sites.19
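In many cases, this type of adaptation boils down to a configuration problem. The following sketch (in which the mapping and the fallback logic are merely illustrative) shows how a default search service could be selected per locale, falling back from a full locale to a language code and finally to a global default:

    # Illustrative mapping of locales to default search services.
    DEFAULT_SEARCH_SERVICE = {
        "zh-CN": "https://www.baidu.com/s",
        "ru-RU": "https://yandex.ru/search/",
    }
    GLOBAL_FALLBACK = "https://www.google.com/search"

    def search_service(locale):
        # Try the full locale first (e.g. zh-CN), then the language only (zh),
        # before falling back to the global default.
        for key in (locale, locale.split("-")[0]):
            if key in DEFAULT_SEARCH_SERVICE:
                return DEFAULT_SEARCH_SERVICE[key]
        return GLOBAL_FALLBACK

    print(search_service("zh-CN"))  # https://www.baidu.com/s
    print(search_service("es-ES"))  # https://www.google.com/search

The real work obviously lies in implementing each provider-specific request format, which is precisely where the lack of industry standards mentioned above becomes costly.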
Another example of service adaptation that may be required to meet
the expectations of local users is related to the way local payments are made.
While some credit card brands are very popular in many countries, other forms
of payment exist. When popular forms of payment are unsupported, users are
left frustrated and customers are lost. As an example, the publisher of the Clash
of Clans mobile game encountered an issue with Chinese users because it was
soliciting in-application payments through a specific market store application
which was unsupported in China.20 For this reason, it is now becoming customary
for global businesses to support local payment providers, such as allpago in Brazil
or Alipay in China.21, 22

6.3.3 Core functionality


Any application that processes language in the form of text (be it in terms of
search, segmentation, sorting, linguistic checking, translation, classification,
summarization) may be subject to core functional adaptation. For instance, a word
processing application is expected to offer spell-checking, grammar checking
and possibly style-checking functionality to its users. Having the interface of
such an application translated is useful, and so is the ability to enter text in any
given language using a popular input method (e.g. Chinese using Pinyin). But if
a key feature like spell-checking is not adapted for a given locale, the value of the
application will diminish. For this reason, applications like Microsoft Office ‘offer
multiple spelling-checker engines. They also offer considerable assistance to the
international user through a grammar checker, thesaurus, hyphenation options,
and bilingual translation dictionaries’ (Dr. International 2003: 417). Similarly,
a spam-filtering application may need to have its core functionality adapted
because the technology used to classify email as spam relies on text-processing-
based feature engineering (e.g. keyword extraction). Since the notion of what
constitutes a word is not always clear (let alone that of a keyword), specific
adaptation work may be envisaged to ensure consistent spam detection results across languages.
However, quality language resources (such as word lists, dictionaries or tools)
are often unavailable in the standard library of a programming language. Very
often, specific libraries do not offer consistent language coverage (e.g. while the
Natural Language ToolKit has built-in support for some corpora and trained
models, most of these resources are English-specific).23 This means that data
acquisition and resource creation may have to be considered as essential steps
in an adaptation-based localization workflow. For example, the META-NET
network of excellence published a number of white papers describing the level of
maturity and support per European language for a number of language processing
domains (such as machine translation, speech processing, text analytics, and
speech and text resources).24 Interestingly, no language achieved excellent
support and only one language achieved good support (English). French and
Spanish achieved moderate support in all four categories. This classification
suggests that developing truly multilingual applications is a challenging task,
which requires some substantial investment. This also means that by default
most applications (especially in their first versions) do not support a wide range
of languages. For instance, MongoDB, a popular, recent, open-source database
technology, announced support for text search for a number of languages. All of
these languages are European, which means that Asian users currently do not
benefit from a complete feature set.25 This is due
to the fact that Asian languages require specific tools to perform basic tasks
such as word segmentation, and most tools (or libraries) offering support for
both European and Asian languages in a robust manner tend to be commercial
solutions.26, 27
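The segmentation issue can be illustrated in a couple of lines of Python: whitespace-based tokenization, which is a reasonable first approximation for many European languages, returns a single token for a Chinese sentence because Chinese does not insert spaces between words (the snippet is a deliberately naive sketch):

    # -*- coding: utf-8 -*-
    english = u"The weather is nice today"
    chinese = u"今天天气很好"  # roughly, 'the weather is nice today'

    print(len(english.split()))  # 5 tokens
    print(len(chinese.split()))  # 1 token: there are no spaces to split on

    # A dedicated segmenter (e.g. the open-source jieba library for Chinese)
    # would be required to recover word boundaries for the second sentence.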
Whether an application uses rules, statistics or machine learning to process
text, functional adaptation is likely to be required. LanguageTool is another good
example of a rules-based application, which relies on dictionaries and language-
specific tools (such as sentence splitters, tokenizers and part-of-speech taggers).
While most of these tools have been commoditized for a number of languages,
their accuracy remains language-specific, so locale-specific testing may be required
to ensure consistent performance across languages. To conclude this section, it is
also worth briefly mentioning that this linguistic challenge is not limited to text,
since speech-based applications (which are becoming increasingly popular with
the advent of hands-free, voice-based communication methods) must be able
to handle the language(s) mostly spoken by their users. Some of these challenges and
associated solutions have been experienced by Google when working on various
applications (such as YouTube transcription, Voice Search in Desktop or Voice
Actions in Android).28
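To give a flavour of why such checking rules are language-specific, the following sketch (which is only loosely inspired by what rule-based checkers do and does not reflect any particular tool's implementation) flags a single typographical issue that is relevant for French, where a space is expected before certain punctuation marks; applying the same rule to English text would only generate false alarms:

    # -*- coding: utf-8 -*-
    import re

    # Simplified French typography rule: a space is expected before ? ! : ;
    MISSING_SPACE_FR = re.compile(r"\S[?!:;]")

    def check_fr(text):
        # Return the positions of likely violations of this French-only rule.
        return [match.start() for match in MISSING_SPACE_FR.finditer(text)]

    print(check_fr(u"Comment allez-vous?"))   # [17]: the '?' lacks a preceding space
    print(check_fr(u"Comment allez-vous ?"))  # []: compliant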

6.4 Adaptation of location


The last section of this chapter focuses on the location of the physical
infrastructure that is required to serve translated content, applications or services
to end-users (or even other systems). While this topic may not seem relevant
to translators at first sight, it is, however, extremely important if translations
produced by translators are to be consumed by end-users. The discussion in this
section is limited to providing a high-level overview of some of the challenges
and opportunities associated with this topic.
The term infrastructure is used to refer to the physical servers that are used to
make content or functionality available to users (whether users are individuals
or services controlled by individuals). As far as Web applications are concerned,
content and functionality are usually made available by a Web server, which is
a piece of software that is being executed on a physical or virtual system. Even
in the case of a virtual system, a physical server is still required, which means
that there are always a number of hops between the source device (making the
request) and the destination server (responding to the request).29 Obviously the
more hops between the source and the destination, the higher the latency (which
is the time delay experienced by a system). From an end-user’s perspective, this
can result in frustration if requested content takes a long time to be served.
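To get a rough feel for this problem, the time needed simply to open a connection to a remote server can be measured in a few lines of Python (the host names below are placeholders and the figures obtained depend entirely on the network path between the client and the server):

    import socket
    import time

    def connect_time(host, port=443, timeout=5):
        # Measure how long it takes to establish a TCP connection to a host.
        start = time.time()
        connection = socket.create_connection((host, port), timeout)
        connection.close()
        return (time.time() - start) * 1000  # milliseconds

    for host in ("www.example.com", "www.example.org"):
        print("%s: %.0f ms" % (host, connect_time(host)))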
In order to work around this problem, some service providers have taken a
fresh approach to serving localized content. Rather than relying on a master
server located in one physical location to serve multiple regions, multiple servers
can be used throughout the world. Each content request is then analysed in
order to determine the best server to use. This system is known as a content
delivery network or content distribution network (CDN). An additional
benefit of using such a system is that it provides high availability (if there was a
problem with one server, another one could be used to serve the content, thus
minimizing or eliminating downtime). This approach has been leveraged by
localization offerings, such as Smartling’s Global Delivery Network or Reverso
Localize, to bypass traditional workflows tied to a single master multilingual
content management system.30, 31 This new approach relies on a combination
of surface localization (as described in Section 4.2.8) and the use of local servers
to publish localized content. Such an approach can be extremely effective for
static content (i.e. content requiring little or no server-side validation) because
users are presented with content which is served by a server located next to them
(from a geographical perspective). Such an approach also has the benefit of
avoiding potential internationalization issues that may exist in a master system.
Internationalization best practices, such as those suggested by Dr. International
(2003: 409), recommend using ‘Unicode data types’ in databases. If such practices
are not followed when developing a database-powered application originally
targeting one locale, it can be extremely difficult (and costly) to make changes to
accommodate additional locale support. Such a change might require selecting a
new encoding, which may lead to increased storage costs. Using a global delivery
network approach alleviates this challenge because the original system does not
have to be modified when systems containing localized content are being created.
Obviously this approach entails giving up some control to the organization
handling the target content so a careful analysis of the pros and cons should be
performed.
Using a distributed infrastructure, or completely separate infrastructure, is an
approach that is increasingly being used by service providers who acknowledge
that the distance between the source and the destination is not the only factor
that may increase latency. The systems that are used to pass requests to servers
may perform a number of data checks, which can slow down the information
exchange. For instance, Evernote decided to set up a dedicated platform in China,
not only to localize the experience from a linguistic perspective, but also to avoid
passing ‘over the Great Chinese Firewall to work’.32 During this infrastructure
change process, the company decided to give the localized platform a different
name, Yinxiang Biji (memory notes or impression notes), suggesting that the real
value of the brand is not necessarily in the words used to describe it, but rather
in the user experience provided. In a blog post, the company explained why they
decided to set up a separate service in China, working with Chinese partners
and payment methods to match Chinese Internet expectations, and providing
Chinese-language customer support based in China.33 An additional benefit of
adapting the location of an application’s infrastructure is the compliance with
local laws and regulations. As mentioned in Section 1.1.3, some countries rely
on strict legal data-related frameworks, so ensuring that the handling of data
generated by an app does not violate any law is key.
The latency problem described earlier in this section can be further accentuated
if there is a lot of content to serve (e.g. a Web page with text, multiple graphics and
possibly embedded video files). The situation in 2014 in some parts of the world is
very different from the one described by Yunker (2003: 295), when ‘even in the
U.S., only 10 per cent of all homes have high-speed connections’. At that time,
it was therefore best practice to ensure that Web pages were of average weight (89
KB) by removing unnecessary graphics or functionality. While the situation has
improved a lot as far as home-based Internet connections are concerned, mobile
connections, which are increasingly being used to consume content, tend to be
slower. The situation is likely to change in the years to come, possibly with the
advent of 5G technology, which could be 1,000 times faster than 4G technology.34
For the time being, content publishers, such as game publishers, must remember
that application size matters (when users have to download an application during
the acquisition phase) and that subsequent information exchange with external
services during the usage phase (such as advertising or analytics) will have an
impact on the user experience.

6.5 Conclusions
This chapter covered a number of topics that may not be of primary concern to
translators whose main day-to-day activity is translation. For those translators,
however, who are seeking to diversify their activities by offering additional
services to customers, concepts such as transcreation should be extremely
relevant. Globalization project managers should also be particularly interested in
all of the topics covered in this chapter, since crucial business decisions related
to topics such as culture and location must be taken into account before delving
into a traditional localization process centred around translation. Once again, it
is worth emphasizing that the translation act is only relevant if it serves a specific
need. Whether the need is related to the generation of local content used to
convince a customer to purchase an application or service or to the generation of
support content used to assist customers, the expectations of the target content
consumer(s) should always be a priority for anyone involved in the
translation process. This chapter has hopefully demonstrated that in specific
cases, translation is not sufficient to meet the expectations of a target customer.
Various levels of adaptation (be it at cosmetic or functional level) are often
required to truly localize an application so that it can be competitive against
native applications in a specific domain. The following section offers two tasks
that are related to the topics introduced and discussed in this chapter.

6.6 Tasks
This section is divided into two tasks, covering the topics of transcreation and
functional adaptation.

6.6.1 Understanding transcreation


In this task, you should identify online marketing content in a language that
you are proficient with, but that is not your native language. For example, you
could look for marketing content on a subsection of a multinational company’s
Web site or on the Web site of an indigenous company. Ideally the marketing
content should contain elements other than text (such as images or videos).
You should spend some time trying to analyse this content to determine what
the original intent of the content creator was. In particular, try to identify the
cultural references being used and list the emotions that the author wanted
to provoke. Now that you have identified the message’s main goal and style,
determine how these should be transferred into your native language. Do you
think that the format should be preserved? For instance, if a video was used, do
you think a video would be as effective in your target language? Or would another
format be more appropriate? For instance, an image or a piece of text? Once
you have identified the most appropriate format for the target message, compose
a narrative in order to achieve the same impact as the one used in the source
language. Marketing content tends to be perishable, so it is possible that the link
provided in the endnote will be out-of-date. A few minutes of online research,
however, should yield a few interesting candidates.35

6.6.2 Adapting functionality


The goal of this task is to identify the components of a multilingual application
that may be subject to functional adaptation if support for an additional language
of your choice was considered. In order to get started with this task, you should
identify an open-source application that deals with language processing (such as
a language checker or a machine-translation system).36, 37 Ideally, this application
should have a list of supported languages and a guide for developers who are
interested in extending the application. Based on the information provided,
are you able to determine whether your language is already supported? If it is
partially supported, are the instructions on how to extend the default application
sufficient? If it is not currently supported, do you think supporting it might be
possible given the current architecture of the application? In order to answer this
question, you could look for native applications that support your language to try
to identify those components that would be needed.

Notes
1 http://thenextweb.com/insider/2013/03/23/how-we-tripled-our-user-base-by-getting-
localization-right/
2 http://www.amara.org
3 http://pculture.org
4 http://www.ted.com/pages/translation_quick_start
5 http://youtubecreator.blogspot.fr/2013/02/get-your-youtube-video-captions.html
6 https://blogs.adobe.com/globalization/adobe-flash-guidelines/
7 http://makeappmag.com/iphone-app-localization-keywords/
8 https://www.mozilla.org/en-US/styleguide/communications/translation/
9 Since more time is required, the activity may be paid by the hour instead of the
word: http://www.smartling.com/blog/2014/07/21/six-ways-transcreation-differs-
translation/
10 http://blogs.adobe.com/globalization/marketing-localization-at-adobe-what-works-
whats-challenging/
11 http://thecontentwrangler.com/2011/08/23/personas-in-user-experience/
12 http://www.w3.org/International/questions/qa-lang-priorities
13 http://www.w3.org/International/questions/images/fr-lang-settings-ok.png Copyright
© [2012-08-20] World Wide Web Consortium, (Massachusetts Institute of
Technology, European Research Consortium for Informatics and Mathematics,
Keio University, Beihang). All Rights Reserved. http://www.w3.org/Consortium/
Legal/2002/copyright-documents-20021231
14 Copyright © [2012-08-20] World Wide Web Consortium, (Massachusetts Institute
of Technology, European Research Consortium for Informatics and Mathematics,
Keio University, Beihang). All Rights Reserved. http://www.w3.org/Consortium/
Legal/2002/copyright-documents-20021231
15 https://developer.amazon.com/appsandservices/apis/manage/ab-testing
16 https://www.optimizely.com
17 http://www.w3.org/TR/html5/text-level-semantics.html#the-span-element
18 http://returnonnow.com/internet-marketing-resources/2013-search-engine-market-
share-by-country/
19 http://support.apple.com/kb/ht5380
20 http://techcrunch.com/2013/12/07/gamelocalizationchina/
21 http://www.allpago.com/
22 http://www.techinasia.com/evernote-china-alipay/
23 http://www.nltk.org/nltk_data/
24 http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison
25 http://docs.mongodb.org/manual/reference/text-search-languages#text-search-
languages
26 http://www.basistech.com/text-analytics/rosette/base-linguistics/
27 http://www.oracle.com/us/technologies/embedded/025613.htm
28 http://www.clsp.jhu.edu/user_uploads/seminars/Seminar_Pedro.pdf
29 http://en.wikipedia.org/wiki/Hop_(networking)
30 http://www.smartling.com/translation-software-solutions
31 http://localize.reverso.net/Default.aspx?lang=en
32 http://techcrunch.com/2013/05/07/evernote-launches-yinxiang-biji-business-taking-
its-premium-business-service-to-china/
33 http://blog.evernote.com/blog/2012/05/09/evernote-launches-separate-chinese-
service/
34 http://mashable.com/2014/01/26/south-korea-5g/
35 http://bit.ly/ms-xp-support-end
36 https://languagetool.org/languages/
37 http://wiki.apertium.org/wiki/List_of_language_pairs
7 Conclusions

The global software industry, including the localization industry, is going through
many changes, which makes it very different from what it was at the beginning
of the 2000s (or even 2010s). Some of these changes, such as continuous
localization, are extremely disruptive and have a profound impact on the daily
work of translators and localizers. In this last chapter, the topics that have been
covered in Chapters 2, 3, 4, 5 and 6 will be briefly revisited in the light of current
and future trends, such as mobile and cloud computing. As much as possible,
additional research opportunities will also be identified. The second part of this
chapter will attempt to be even more future-facing and briefly discuss some of the
new directions that global application publishers could embrace in the years to
come in order to understand the impact they may have on the world of translators.

7.1 Programming
In Chapter 2, basic programming concepts were introduced for two reasons: to
introduce localizers to key software development concepts (so that they become
more comfortable with technical aspects of the localization process), but also
to introduce technically-oriented localizers to some programming and text
processing techniques that could boost their productivity.
Software development practices are shifting increasingly towards a continuous
delivery model. Regular version updates are increasingly being replaced by a
stream of incremental or disruptive updates. As far as end-users are concerned,
product version numbers do not matter so much – what matters most to them is
the functionality that is provided by a given application. Version numbers are
unlikely to completely disappear since they are useful to determine why users
may be experiencing certain issues. However, software publishers are increasingly
aligning the release of updates with usage data. This is especially the case for
hosted or cloud-based applications, since application or service providers are able
to test in live conditions the impact of an update only with a subset of their user
base. If the outcome of the test is deemed negative, then the update to the whole
user base can be postponed or even discarded.
As far as industry trends are concerned, mobile and cloud computing are
affecting not only the IT industry but also the lives of a large proportion of the
world’s population. There have never been more mobile handsets in use in the
world and this increase is unlikely to stop any time soon. A couple of platforms
are currently dominating the mobile market, namely Android and iOS, but the
Windows Phone operating system is likely to challenge these two thanks to
Microsoft’s recent acquisition of Nokia. Other open-source platforms, such as
Firefox OS and Ubuntu, could grow in popularity, especially in emerging markets.
From a translator’s perspective, the proliferation of platforms can only have a
positive impact. If more than one platform exists, then it means localized resources
will be required. Obviously such platforms can share localized resources, such as
those offered by the Unicode Common Locale Data Repository.1 But additional
resources, such as interface strings or help content, will have to be localized.
Cloud computing is another trend that has been affecting both the IT industry
and the language service industry. What used to be performed by local servers
can now be accomplished more easily using scalable, online infrastructures. This
perceived ease of accomplishing computing tasks must, however, be offset against
some of the privacy and security risks that still characterize cloud-based services.
Some entities are still reluctant to fully trust such services and regular data leaks
or surveillance scandals do not improve the situation. Surveillance scandals
could actually have a profound negative impact on the localization industry since
governments, companies or individuals may be tempted to favour local providers
for trust reasons instead of relying on global providers offering localized services.2
As far as the future of the Python programming language is concerned, the debate
about versions 2 and 3 is bound to continue for some time. The choice that was
made and justified in Section 1.5 to introduce version 2 in this book is supported
by voices in the Python community, who maintain that version 3 (especially its
support for Unicode) is not as ideal as may once have been presented.3 In any
case, the support for the 2.x series of Python has recently been extended until
2020. This is both good news and bad news for the Python community. On the
one hand, it gives library developers or code maintainers more time to port their
code base to a new version. On the other hand, it enhances the division between
two community camps, which could lead to an unresolved situation. As far as
novice programmers are concerned, this situation should be acknowledged but
not necessarily seen as a blocker to embrace the language. It has never been easier
to get started with the language in an exploratory and collaborative manner,
thanks to online services such as those presented in Chapter 2 (or others such
as Wakari.IO or nbviewer), so the entry barrier has never been lower.4, 5 Even
if the goal in learning a programming language is not necessarily to compete
with experienced coders, it must be emphasized that being able to automate tasks
using a language without having to rely on a developer can be a great advantage.
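As a reminder of what is at stake in the version debate mentioned above, the following snippet (assuming a UTF-8 source file) behaves differently depending on the interpreter used, because Python 2 treats an unprefixed literal as a sequence of bytes whereas Python 3 treats it as a sequence of characters:

    # -*- coding: utf-8 -*-
    s = "déjà vu"
    # Python 2: s is a byte string, so len(s) == 9 (é and à take two bytes each in UTF-8).
    # Python 3: s is a Unicode string, so len(s) == 7 (the number of characters).
    print(len(s))

    u = u"déjà vu"
    print(len(u))  # 7 under both versions: the u prefix makes the intent explicit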

7.2 Internationalization
The concept of internationalization was the focus of Chapter 3. The discussion
focused mainly on the way content may be internationalized in order to make
downstream processes (such as translation or adaptation) more efficient.
Specifically, global writing guidelines were reviewed from a perspective of
(technical) content authoring. Special emphasis was also placed on the way user
interface strings should be handled in source code so that they can be easily
extracted and localized in multilingual applications. The discussion on functional
adaptation in Chapter 6 also highlighted the advantage of using mature frameworks
and libraries in order to leverage functionality that can handle multiple language
inputs (be it from a text, speech or even graphical perspective) and formats. While
it is possible to argue that mature internationalized frameworks and libraries exist
(such as the Django framework that was introduced in Chapter 3), one may
regret that the use of internationalization features is not enabled by default in
most programming languages. For instance, the Python programming language
allows developers to declare string variables without having to mark them
explicitly (so that they can be extracted using the gettext mechanism). Similarly,
the Java programming language does not force developers to use properties files
and resource bundles by default. Since the use of these mechanisms requires
extra typing (and potentially extra overhead if it is not required), it is easy to
understand why it is often ignored when applications are first developed. This is
especially true when applications originate from research prototypes, which are
often developed in a quick and unstructured manner without any guarantee they
can be turned into successful, global solutions. In short, the benefits that can
be achieved by using such mechanisms are often not clear to the person who is
writing the source code. The situation is obviously very different if the use of such
mechanisms is mandated as part of a list of requirements. Again, it is often the
case that the first version of a product will not accommodate such a requirement
for two main reasons:

• It may be better to check the success of an application in one target market


before thinking about targeting other markets, so using one language seems
reasonable.
• If the application becomes successful overnight, it may be preferable to
stagger this success over time by adding support for additional languages at
regular intervals.

Staggering success may seem counter-intuitive, but quick success can have
unwanted consequences. First, the infrastructure supporting a service may not
be ready to accommodate thousands or millions of users so it may be preferable
to restrict access by not supporting certain languages. Second, a company
having to justify growth to shareholders or venture capitalists may prefer to
distribute registrations or downloads over time. Obviously these two reasons
do not mean that internationalization techniques have to be ignored. A careful
global, planning process may mandate the use of these techniques and postpone
localization activities for a later phase. However, it is extremely easy to neglect
such techniques when the requirement to serve global users competes with other
equally important requirements (e.g. improving the user experience, the stability
of an infrastructure or the security of an application or service).
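Returning to the string-marking mechanism mentioned earlier in this section, the following minimal sketch shows the difference between an unmarked string, which extraction tools such as pygettext or xgettext simply ignore, and a string wrapped in the conventional _() function provided by Python's gettext module (the domain and directory names are placeholders):

    import gettext

    # Install _() in the current namespace; translations would normally be
    # loaded from compiled .mo files located under the 'locale' directory.
    gettext.install("myapp", "locale")

    greeting = "Hello, world!"        # unmarked: invisible to string extraction
    farewell = _("Goodbye, world!")   # marked: will end up in the .pot template

    print(greeting)
    print(farewell)

Nothing forces a developer to write the second form rather than the first, which is precisely why unmarked strings are so common in first versions of applications.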
To the author’s knowledge, no programming language has been designed
with default internationalization principles in mind, but this situation may
change in the future. Designing such a language would be challenging because
its capability may be limited by the environment (e.g. the operating system) on
which it would be executed. However, this prospect may be more comforting
than the situation that affects many programming languages, even very popular
ones. One good example is JavaScript, whose level of internationalization
maturity is very low (due to issues caused by legacy browsers, the lack of well-
established specifications and a multitude of tools and utilities that tend to
reinvent the wheel). This situation is problematic from multiple perspectives:
first, it may discourage developers from providing internationalized support by
default because navigating the complexity of specifications and libraries can be
an extremely daunting task. Second, it can lead to a situation where it seems
easier for a developer to come up with their own scheme, which contributes
to this unfortunate status quo. Even if the possibility of having a global
programming language does not materialize, developing a resource repository
recommending best practices in terms of internationalization per programming
language would be extremely valuable. While a working group from the W3C
consortium specializes in internationalization-related topics as far as Web
technologies are concerned, its work is mostly limited to markup languages,
such as HTML and XML, so there seems to be a gap as far as programming
languages are concerned. The Unicode consortium, whose goal is to enable
people around the world to use computers in any language, may have a role to
play in such an endeavour.
From a research perspective, certain internationalization-related questions
are still open. While the scanning of source code to detect unmarked strings is
well understood (and supported by multiple tools), detecting internationalization
issues due to the use of existing or new functions may be less straightforward to
accomplish. For instance, a developer writing a global travel booking application
may use functions that process user input. For example, a normalization function
may be used to automatically correct spelling mistakes that may have been
made by a user when searching for a destination. From an internationalization
perspective, should (and if so, how?) the developer be alerted when they write
code in a locale-specific manner? One could argue that this type of work is the
remit of quality assurance, but global efficiency gains may be realized if these
issues were taken care of during the development process.
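A classic illustration of code that looks locale-neutral but is not: lowercasing user input before comparing it. The sketch below (which merely illustrates the problem rather than any particular detection technique) works for English but produces the wrong result for Turkish, where the lowercase form of the letter I is the dotless ı:

    # -*- coding: utf-8 -*-
    def normalize(destination):
        # Looks harmless, but .lower() applies language-independent casing rules.
        return destination.lower()

    print(normalize(u"ISTANBUL"))
    # Prints 'istanbul', whereas the expected Turkish form is 'ıstanbul';
    # a locale-aware casing routine (such as the one offered by PyICU)
    # would be needed to handle this case correctly.

An internationalization-aware development environment could conceivably flag this kind of call, which is exactly the type of open question raised above.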

7.3 Localization
Chapter 4 focused on the language-based localization processes affecting mostly
text present in an application’s user interfaces and its associated documentation
content. These processes fall into two categories. The first one is related to
sequential workflows involving the extraction of strings for translation before they
can be merged back into resources required to build a target or multilingual
application. The other relies on a more visual approach, so
that translation can be performed in-context. While the second approach is
currently limited to desktop or Web-based applications, it is likely to gain in
popularity, especially if mobile applications can be localized in such a way in the
future (by possibly leveraging emulators, which can duplicate the behaviour of a
mobile application, say, in a Web browser). The proliferation of cloud services is
obviously also favouring this second approach since test or staging environments
where translators can translate in-context can now be set up in seconds (instead
of weeks or days).

7.4 Translation
In Chapter 5, multiple translation technologies were discussed, since these
technologies tend to make translators more productive. Expectations around
translation turnaround times will only become more aggressive, which is
to be expected because the timeliness of a translation contributes greatly to its
usefulness. Obviously, other factors such as quality are to be taken into account,
which is why the use of machine translation technology is often coupled
with a post-editing process, during which translators are expected to validate
or edit translations. Such a task is obviously very different in nature from a
translation task whose goal is to create target text that does not strictly follow
the structure of the source text. While recent advances in machine translation
have made its use ubiquitous (especially in situations when users are unwilling
to pay for any direct human intervention), human validation is still required in
situations where information accuracy is critical. Recent research efforts have
therefore focused on investigating whether it is possible to (i) identify those
parts of documents that require editing and (ii) possibly determine whether
the editing would require more effort than translating from scratch. MT quality
estimation has made progress in recent years (Soricut et al. 2012; Rubino et
al. 2013). However, more work is required to improve the accuracy of such
systems, especially as far as the second task is concerned. Without relying on
external characteristics (such as the domain knowledge of the post-editor, how
familiar and enthusiastic the post-editor is about post-editing, or even how fresh
the post-editor is when completing the task), it seems very difficult to rely purely
on textual or system-dependent features to determine how much effort would
be required. Ultimately, one could argue that it should not matter whether a
translator decides to post-edit or re-translate from scratch a segment that has
been deemed to be of insufficient quality by a prediction system. But it does
matter if the translator is not paid fairly for the amount of time spent on the
task. The compensation of post-editing is indeed a topic of debate because it is
poorly supported by the traditional model based on word count and translation
memory matching. For this reason, it is difficult to imagine how sustainable
human post-editing services offering flat fees of a few cents per word will be.6
The amount of time spent on a task seems a more appropriate way
to compensate workers. Obviously time-tracking is not without its pitfalls (e.g.
what happens when somebody takes a coffee break or answers an email about
a new task request?), but some post-editing systems, such as the ACCEPT
system, have the ability to track how much time was spent on a given segment
(Roturier et al. 2013). Another aspect of post-editing, which may require further
investigations in the years to come, is the environment where the post-editing
task is conducted. Traditional translation environments have been desktop-
based for years and have recently been challenged by Web-based environments
thanks to increased network speeds. However, mobile-based environments
are now being considered as an alternative to mature environments favoured
by professional translators. For instance, the first version of a post-editing
application specifically designed for a mobile environment, Kanjingo, was
recently tested with a few users and received positive feedback, although areas
for improvement, such as support for auto-completion and synonyms, were
mentioned (O’Brien et al. 2014).
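As a crude illustration of the type of textual signal such prediction systems can build on, the similarity between a raw machine-translated segment and its post-edited version can be computed in a few lines (the sentences are invented); a low ratio suggests heavy editing, although, as argued above, textual features alone say little about the time and cognitive effort actually involved:

    import difflib

    mt_output = "The file were delete from the folder shared."
    post_edit = "The files were deleted from the shared folder."

    # A character-based similarity ratio between 0 and 1: the closer to 1,
    # the less editing was performed on the machine-translated segment.
    ratio = difflib.SequenceMatcher(None, mt_output, post_edit).ratio()
    print("Similarity: %.2f" % ratio)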

7.5 Adaptation
Adaptation is a very generic term that can encompass many different activities
required to create a multilingual application (or transform an existing
application into a multilingual one). The emergence of transcreation as an
activity is not surprising since the competition between global and local
companies has never been as fierce. While some companies may have got
away with source-oriented translated messaging in the past (simply because
there was little or no competition), locally targeted or personalized messages
must now be used in order to win. As far as video content is concerned,
in-context subtitling has become a mature process. One of the next challenges
would be to investigate the feasibility of in-context dubbing, since this mode
of communication may be favoured in certain locales. Chapter 6 also showed
that adaptation is not limited to the content used to convince users to buy a
particular application or service. Some resources, which may be core to the
functionality of an application, sometimes have to be adapted in order to truly
meet (or even exceed) the needs and expectations of global users. Whether
the adaptation of functionality should be considered as a localization-related
activity (as it was in this book) or an internationalization activity is a moot
point. What matters for end-users is that the applications they have decided
to install (possibly after purchasing them) behave in a way that is consistent
with the environment in which they operate. As surprising as it may seem,
the localization literature focuses predominantly on the textual aspect of the
localization process rather than on its functional aspect. More research seems
therefore required to understand better how a global application differs from
a native application in terms of behaviour. Even though the feature lists of
competing applications can be used as a starting point to identify overlaps and
gaps, thorough functional evaluations could be envisaged in order to highlight
areas where systems or applications perform significantly differently. Since
many systems now expose functionality through APIs, it might even be possible
to semi-automate such comparative evaluations.7
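A very first step towards such semi-automated comparisons could be as simple as diffing the lists of languages (or features) that two competing applications claim to support; the lists below are purely illustrative:

    # Illustrative language support claimed by two hypothetical applications.
    global_app = {"en", "fr", "de", "es", "pt"}
    native_app = {"zh", "ja", "ko", "en"}

    print(sorted(global_app & native_app))  # overlap: ['en']
    print(sorted(native_app - global_app))  # gaps to close: ['ja', 'ko', 'zh']

More meaningful comparisons would obviously require exercising the functionality itself (for instance through the APIs mentioned above) rather than trusting published feature lists.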

7.6 New directions


To conclude this book, one may be tempted to look into the future to try and
identify trends that could disrupt the localization industry in the next five to
ten years. Two of these trends have been touched on in previous chapters and
will be revisited in the next two sections: real-time localization and non-textual
localization.

7.6.1 Towards real-time text localization


Real-time translation has always been one of the goals of machine translation
developers. This is due to the fact that translation is sometimes required at a
specific moment, and when that moment passes, the translation need vanishes.
One can think of menus in restaurants in a country whose main local language
does not match the language(s) spoken by a visitor. Over the last few years, a
number of applications, such as the Google Translate application for mobile,
have appeared on the market in order to offer real-time translations by allowing
users to enter text manually or by taking pictures of words.8 While these
applications can be useful in certain contexts, the quality they provide remains
a source of concern in situations where information accuracy is of paramount
importance. For this reason, one could envisage these applications evolving in
such a way that the translation output they produce could be validated on-the-
fly by qualified, domain-expert translators. This type of service would be akin to
the instant interpreting services that can be accessed by calling a phone number.
This scenario would not be limited to mobile applications and could be extended
to any type of content that is present on the Web. In order to have a truly
multilingual Web, users must be able to find and interact with content in the
language(s) they are comfortable with. In an ideal world, all content would be
available in any language at an acceptable quality level by relying on traditional
translation workflows. However, the sheer amount of content available online
makes this goal an unrealistic one. What may be more realistic, however, is the
ability to have translations produced by automated systems verified by a large
pool (or crowd) of human translators. For instance, this crowd-based concept
has been investigated in the area of word processing with mixed success:
Bernstein et al. (2010) found that crowd workers were able to detect spelling
and grammar errors that standard checkers were unable to find. Obviously, the
work produced by crowd workers can vary greatly, so people with the desired
skill-set would need to be targeted. Since more and more translations are provided by
online services that make human translation or post-editing available as an API
call, it is possible that such a scenario could be envisaged in a not-too-distant
future. One of the main challenges with such an approach, however, would be
trying to anticipate where parts of a translated document or application may pose
comprehension problems to an end-user. The work on quality estimation that
was introduced in Section 5.7.5 may help in this area. One could also envisage
relying on visual cues to (i) infer specific user characteristics (such as their ability
to perform specific tasks, as described by Toker et al. (2013)) or (ii) detect that
a user is actually having difficulty understanding a translated segment. Using
gaze data to achieve this goal seems more intuitive than having to rely on the
user to acknowledge comprehension difficulties (for example by pressing a
confusion button as used in experiments conducted by Conati et al. (2013) in
the area of visualization software). Since gaze data are not always reliable for
detecting confusion, one could envisage relying on both eye-tracking technology
and facial expression recognition. Access to such technologies is becoming easier
as more and more devices (including glasses) are equipped with Web cams. However, these
technologies are obviously extremely intrusive and would pose serious privacy
issues, so their role in collecting real translation user feedback (through gaze
data) requires further research. Ultimately, it is clear that the translation model
is likely to move away from a push model towards a pull model. Traditionally, the push
model has proved popular because it was probably easier to translate as much
content as possible (depending on human and financial resources available)
rather than anticipate what content would be useful in a translated form. As more
and more translation usage and user data are being collected, however, the pull
model presents some advantages because translations can be triggered when there
seems to be a clear need for them (e.g. when a user requests a Web page using a
browser with specific language preferences). As mentioned earlier, however, the
main challenge lies in making content of acceptable quality available to users
as quickly as possible. If this challenge can be addressed successfully for simple
translation tasks, there is no reason why the model could not apply to more
advanced localization tasks (e.g. by personalizing or adapting the content that
will be made available to a particular user as suggested in Steichen and Wade
(2010)) or other communication modalities.
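In such a pull model, the trigger could be as simple as inspecting the Accept-Language header sent by the user's browser, as the following simplified sketch suggests (the header value is an example and the parsing ignores some of the subtleties of real headers):

    def preferred_languages(accept_language):
        # Turn a header such as "fr-FR,fr;q=0.8,en;q=0.5" into a list of
        # language tags ordered by decreasing user preference.
        weighted = []
        for part in accept_language.split(","):
            pieces = part.strip().split(";q=")
            quality = float(pieces[1]) if len(pieces) > 1 else 1.0
            weighted.append((quality, pieces[0]))
        return [tag for quality, tag in sorted(weighted, reverse=True)]

    available = {"en", "de"}  # translations that already exist
    for tag in preferred_languages("fr-FR,fr;q=0.8,en;q=0.5"):
        if tag.split("-")[0] in available:
            print("Serve the existing translation: " + tag)
            break
    else:
        print("No match: trigger an on-demand translation request")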

7.6.2 Beyond text localization


In this book, the focus has been mainly on text, even though other communication
forms such as graphics and videos have been discussed at some level. While text
is likely to remain a very important medium of communication in the foreseeable
future, it is conceivable that other types of interactions will become popular
as new computing devices and paradigms emerge. As mentioned earlier, spoken
input is gaining in popularity thanks to the pervasiveness of mobile devices and
applications that fulfil the role of personal assistants (such as Apple’s Siri or
Microsoft’s Cortana).9 Being able to deal with multiple accents and idiosyncratic
oral language presents non-trivial adaptation challenges (both from an
internationalization and localization perspective). Such challenges are likely to
apply to another type of modality if interactions with devices and systems become
increasingly based on gestures. Hand and body gestures can vary greatly from one
country to another (or even from one region to another) so a future challenge
will lie in the ability to associate a meaning to a particular gesture based on a
user profile in order to transfer that meaning accurately during a localization
process. In a way, this is something that interpreters have had to do for years
(i.e. not only translating words but also sometimes the meaning that is expressed
by body language) so it will be interesting to monitor whether the future role of
translators may overlap with that of interpreters.

Notes
1 http://cldr.unicode.org/
2 http://www.reuters.com/article/2015/02/25/us-china-tech-exclusive-idUSKBN0LT1B020150225
3 http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/
4 https://www.wakari.io/
5 http://nbviewer.ipython.org/
6 https://www.unbabel.com/
7 http://www.programmableweb.com/
8 https://www.google.ie/mobile/translate/
9 http://readwrite.com/2014/04/16/microsoft-cortana-siri-google-now
Bibliography

Adams, A., Austin, G., and Taylor, M. (1999). Developing a resource for multinational
writing at Xerox Corporation. Technical Communication, pages 249–54.
Adriaens, G. and Schreurs, D. (1992). From Cogram to Alcogram: Toward a controlled
English grammar checker. In Proceedings of the 14th International Conference on
Computational Linguistics, COLING 92, pages 595–601, Nantes, France.
Aikawa, T., Schwartz, L., King, R., Corston-Oliver, M., and Lozano, C. (2007). Impact
of controlled language on translation quality and post-editing in a statistical machine
translation environment. In Proceedings of MT Summit XI, pages 1–7, Copenhagen,
Denmark.
Alabau, V. and Leiva, L. A. (2014). Collaborative Web UI localization, or how to build
feature-rich multilingual datasets. In Proceedings of the 17th Annual Conference of the
European Association for Machine Translation (EAMT’l4), pages 151–4, Dubrovnik,
Croatia.
Alabau, V., Leiva, L. A., Ortiz-Martínez, D., and Casacuberta, F. (2012). User evaluation of
interactive machine translation systems. In Proceedings of the 16th EAMT Conference,
pages 20–3, Trento, Italy.
Allen, J. (1999). Adapting the concept of ‘translation memory’ to ‘authoring memory’ for a
controlled language writing environment. In Translating and the Computer 21: Proceedings
of the Twenty-First International Conference on ‘Translating and the Computer’, London.
Allen, J. (2001). Post-editing: an integrated part of a translation software program.
Language International, April, pages 26–9.
Allen, J. (2003). Post-editing. In Somers, H., editor, Computers and Translation: A
Translator’s Guide, pages 297–317, John Benjamins Publishing Company, Amsterdam.
Amant, K. S. (2003). Designing effective writing-for-translation intranet sites. IEEE
Transactions on Professional Communication, 46(1): 55–62.
Arnold, D., Balkan, L., Meijer, S., Humphreys, R., and Sadler, L. (1994). Machine
Translation: an Introductory Guide. Blackwells-NCC, London.
Austermuhl, F. (2014). Electronic Tools for Translators. Routledge, London.
Aziz, W., Castilho, S., and Specia, L. (2012). PET: a tool for post-editing and assessing
machine translation. In Calzolari, N., Choukri, K., Declerck, T., Dogan, M. U.,
Maegaard, B., Mariani, J., Odijk, J., and Piperidis, S., editors, Proceedings of the Eighth
International Conference on Language Resources and Evaluation (LREC-2012), pages
3982–7, Istanbul, Turkey. European Language Resources Association (ELRA), Paris.
Barrachina, S., Bender, O., Casacuberta, F., Civera, J., Cubel, E., Khadivi, S., Lagarda,
A. L., Ney, H., Tomás, J., Vidal, E., and Vilar, J. M. (2009). Statistical approaches to
computer-assisted translation. Computational Linguistics, 35(1): 3–28.
Barreiro, A., Scott, B., Kasper, W., and Kiefer, B. (2011). OpenLogos machine translation:
philosophy, model, resources and customization. Machine Translation, 25(2): 107–26.
Baruch, T. (2012). Localizing brand names. MultiLingual, 23(4): 40–2.
Bel, N., Papavasiliou, V., Prokopidis, P., Toral, A., and Arranz, V. (2013). Mining and
exploiting domain-specific corpora in the panacea platform. In BUCC 2012, The 5th
Workshop on Building and Using Comparable Corpora: “Language Resources for Machine
Translation in Less-Resourced Languages and Domains”, pages 24–6, Istanbul, Turkey.
Bernstein, M. S., Little, G., Miller, R. C., Hartmann, B., Ackerman, M. S., Karger, D.
R., Crowell, D., and Panovich, K. (2010). Soylent: a word processor with a crowd
inside. In Proceedings of the 23nd Annual ACM Symposium on User Interface Software
and Technology, pages 313–22, ACM, New York.
Bernth, A. (1998). EasyEnglish: Preprocessing for MT. In Proceedings of the Second
International Workshop on Controlled Language Applications (CLAW 1998), pages 30–41,
Pittsburgh, PA.
Bernth, A. and Gdaniec, C. (2002). MTranslatability. Machine Translation, 16: 175–218.
Bernth, A. and McCord, M. C. (2000). The effect of source analysis on translation
confidence. In White, J., editor, Envisioning Machine Translation in the Information
Future: Proceedings of the 4th Conference of the Association for MT in the Americas,
AMTA 2000, Cuernavaca, Mexico, pages 89–99, Springer-Verlag, Berlin, Germany.
Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. O’Reilly
Media, Inc., Sebastopol, CA, 1st edition.
Blatz, J., Fitzgerald, E., Foster, G., Gandrabur, S., Goutte, C., Kulesza, A., Sanchis, A., and
Ueffing, N. (2004). Confidence estimation for machine translation. In Proceedings of the
20th International Conference on Computational Linguistics, pages 315–21, Association
for Computational Linguistics, Stroudsburg, PA.
Bowker, L. (2005). Productivity vs quality? a pilot study on the impact of translation
memory systems. Localisation Focus 4(1): 13–20.
Brown, P. E., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics
of statistical machine translation: Parameter estimation. Computational Linguistics, 19:
263–311.
Bruckner, C. and Plitt, M. (2001). Evaluating the operational benefit of using machine
translation output as translation memory input. In MT Summit VIII, MT evaluation:
who did what to whom (Fourth ISLE workshop), pages 61–5, Santiago de Compostela,
Spain.
Byrne, J. (2004). Textual Cognetics and the Role of Iconic Linkage in Software User
Guides. PhD thesis, Dublin City University, Dublin, Ireland.
Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012).
Findings of the 2012 workshop on statistical machine translation. In Proceedings of the
Seventh Workshop on Statistical Machine Translation, pages 10–51, Montreal, Canada,
Association for Computational Linguistics, New York.
Carl, M. (2012). Translog-II: a program for recording user activity data for empirical
reading and writing research. In Proceedings of the Eighth International Conference on
Language Resources and Evaluation (LREC-2012), pages 4108–12, Istanbul, Turkey,
European Language Resources Association (ELRA), Paris.
Casacuberta, F., Civera, J., Cubel, E., Lagarda, A. L., Lapalme, G., Macklovitch, E.,
and Vidal, E. (2009). Human interaction for high-quality machine translation.
Communications of the ACM – A View of Parallel Computing, 52(10): 135–8.
Chandler, H. M., Deming, S. O., et al. (2011). The Game Localization Handbook. Jones &
Bartlett Publishers, Sudbury, MA.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation.
In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
(ACL ’05), pages 263–70, Morristown, NJ, Association for Computational Linguistics,
New York.
Choudhury, R. and McConnell, B. (2013). TAUS translation technology landscape
report. Technical report, TAUS, Amsterdam.
Clémencin, G. (1996). Integration of a CL-checker in an operational SGML authoring
environment. In Proceedings of the First Controlled Language Application Workshop
(CLAW 1996), pages 32–41, Leuven, Belgium.
Collins, L. and Pahl, C. (2013). A service localisation platform. In SERVICE
COMPUTATION 2013, The Fifth International Conferences on Advanced Service
Computing, pages 6–12, IARIA, Wilmington, NC.
Conati, C., Hoque, E., Toker, D., and Steichen, B. (2013). When to adapt: Detecting
user’s confusion during visualization processing. In Proceedings of 1st International
Workshop on User-Adaptive Visualization (WUAV 2013), Rome, Italy.
D’Agenais, J. and Carruthers, J. (1985). Creating Effective Manuals. South-Western Pub.
Co, Cincinnati, OH.
Deitsch, A. and Czarnecki, D. (2001). Java Internationalization. O’Reilly Media, Inc,
Sebastopol, CA.
Denkowski, M. and Lavie, A. (2011). Meteor 1.3: Automatic metric for reliable
optimization and evaluation of machine translation systems. In Proceedings of the
EMNLP 2011 Workshop on Statistical Machine Translation, Edinburgh, U.K.
DePalma, D. A. (2002). Business Without Borders. John Wiley & Sons, Inc., New York.
DePalma, D., Hegde, V., and Stewart, R. G. (2011). How much does global contribute to
revenue? Technical report, Common Sense Advisory, Lowell, MA.
Dirven, R. and Verspoor, M. (1998). Cognitive Exploration of Language and Linguistics. John
Benjamins Publishing, Amsterdam.
Dombek, M. (2014). A study into the motivations of internet users contributing to
translation crowdsourcing: the case of Polish Facebook user-translators. PhD thesis,
Dublin City University.
Drugan, J. (2014). Quality in Professional Translation. Bloomsbury Academic, London.
Dunne, K. (2011a). From vicious to virtuous cycle customer-focused translation quality
management using iso 9001 principles and agile methodologies. In Dunne, K. J. and
Dunne, E., editors, Translation and Localization Project Management: The Art of the
Possible, pages 153–88, John Benjamins Publishing, Amsterdam
Dunne, K. (2011b). Managing the fourth dimension: Time and schedule in translation and
localization project. In Dunne, K. J. and Dunne, E., editors, Translation and Localization
Project Management: The Art of the Possible, pages 119–52, American Translators
Association Scholarly Monograph Series, John Benjamins Publishing, Amsterdam.
Dunne, K. J. and Dunne, E. S. (2011). Translation and Localization Project Management:
The Art of the Possible. John Benjamins Publishing, Amsterdam.
Elming, J. and Bonk, R. (2012). The Casmacat workbench: a tool for investigating the
integration of technology in translation. In Proceedings of the International Workshop
on Expertise in Translation and Post-editing – Research and Application, Copenhagen,
Denmark.
Esselink, B. (2000). A Practical Guide to Localization. John Benjamins Publishing, Amsterdam.
Esselink, B. (2001). Web design: Going native. Language International, 2: 16–18.
Esselink, B. (2003a). The evolution of localization. The Guide from Multilingual Computing
& Technology: Localization, 14(5): 4–7.
Esselink, B. (2003b). Localisation and translation. In Somers, H., editor, Computers and
Translation: A Translator’s Guide, pages 67–86, John Benjamins Publishing, Amsterdam.
Federico, M., Bertoldi, N., and Cettolo, M. (2008). IRSTLM: an open source toolkit for
handling large scale language models. In Interspeech ’08, pages 1618–21.
Federmann, C., Eisele, A., Uszkoreit, H., Chen, Y., Hunsicker, S., and Xu, J. (2010).
Further experiments with shallow hybrid MT systems. In Proceedings of the Joint Fifth
Workshop on Statistical Machine Translation and MetricsMATR, pages 77–81, Uppsala,
Sweden, Association for Computational Linguistics, New York.
Flournoy, R. and Duran, C. (2009). Machine translation and document localization
production at Adobe: From pilot to production. In Proceedings of the Machine Translation
Summit XII, Ottawa, Canada.
Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., O’Regan, J., Ortiz-Rojas, S., Pérez-Ortiz,
J. A., Sánchez-Martínez, F., Ramírez-Sánchez, G., and Tyers, F. M. (2011). Apertium:
a free/open-source platform for rule-based machine translation. Machine Translation,
25(2): 127–44.
Friedl, J. (2006). Mastering Regular Expressions. O’Reilly Media, Inc., Sebastopol, CA, 3rd
edition.
Fukuoka, W., Kojima, Y., and Spyridakis, J. (1999). Illustrations in user manuals: Preference
and effectiveness with Japanese and American readers. Technical Communication,
46(2): 167–76.
Gallup, O. (2011). User language preferences online. Technical report, European
Commission, Brussels.
Gauld, A. (2000). Learn to Program Using Python: A Tutorial for Hobbyists, Self-Starters, and
All Who Want to Learn the Art of Computer Programming. Addison-Wesley Professional,
Reading, MA.
Gdaniec, C. (1994). The Logos translatability index. In Technology Partnerships for
Crossing the Language Barrier: Proceedings of the First Conference of the Association for
Machine Translation in The Americas, pages 97–105, Columbia, MD.
Gerson, S. J. and Gerson, S. M. (2000). Technical Writing: Process and Product. Prentice
Hall, Upper Saddle River, NJ.
Giammarresi, S. (2011). Strategic views on localisation project management: The
importance of global product management and portfolio management. In Dunne, K. J.
and Dunne, E., editors, Translation and Localization Project Management: The Art of the
Possible, pages 17–50, American Translators Association Scholarly Monograph Series,
John Benjamins Publishing Company, Amsterdam.
Godden, K. (1998). Controlling the business environment for controlled language. In
Proceedings of the Second Controlled Language Application Workshop (CLAW), pages
185–9, Pittsburgh, PA.
Godden, K. and Means, L. (1996). The controlled automotive service language (CASL)
project. In Proceedings of the First Controlled Language Application Workshop (CLAW
1996), pages 106–14, Leuven, Belgium.
Hall, B. (2009). Globalization Handbook for the Microsoft .NET Platform. CreateSpace.
Hammerich, I. and Harrison, C. (2002). Developing Online Content: The Principles of
Writing and Editing for the Web. John Wiley & Sons, Inc., Toronto, Canada.
Hayes, P., Maxwell, S., and Schmandt, L. (1996). Controlled English advantages for
translated and original English documents. In Proceedings of the First Controlled Language
Application Workshop (CLAW 1996), pages 84–92, Leuven, Belgium.
He, Y., Ma, Y., Roturier, J., Way, A., and van Genabith, J. (2010). Improving the post-
editing experience using translation recommendation: a user study. In Proceedings of the
Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010),
pages 247–56, Denver, CO, Association for Machine Translation in the Americas.
Hearne, M. and Way, A. (2011). Statistical machine translation: A guide for linguists and
translators. Language and Linguistics Compass, 5(5): 205–26.
International, D. (2003). Developing International Software. Microsoft Press, Redmond,
WA, 2nd edition.
Jiménez-Crespo, M. A. (2011). From many one: Novel approaches to translation quality
in a social network era. In O’Hagan, M., editor, Linguistica Antverpiensia New Series –
Themes in Translation Studies: Translation as a Social Activity – Community Translation
2.0, pages 131–52, Artesis University College, Antwerp.
Jiménez-Crespo, M. A. (2013). Translation and Web Localization. Routledge, London.
Kamprath, C., Adolphson, E., Mitamura, T., and Nyberg, E. (1998). Controlled language
for multilingual document production: Experience with caterpillar technical English.
In CLAW ’98: 2nd International Workshop on Controlled Language Applications,
Pittsburgh, PA.
Kaplan, M. (2000). Internationalization with Visual Basic: The Authoritative Solution. Sams
Publishing, Indianapolis, IN.
Karsch, B. I. (2006). Terminology workflow in the localization process. In Dunne, K.
J., editor, Perspectives on Localization, pages 173–91, John Benjamins Publishing,
Amsterdam.
Kelly, N., Ray, R., and DePalma, D. A. (2011). From crawling to sprinting: Community
translation goes mainstream. In O’Hagan, M., editor, Linguistica Antverpiensia New
Series – Themes in Translation Studies: Translation as a Social Activity – Community
Translation 2.0, pages 75–94, Artesis University College, Antwerp, 10th edition.
Knight, K. and Chander, I. (1994). Automated post-editing of documents. In Proceedings
of the Twelfth National Conference on Artificial Intelligence (Vol. 1), pages 779–84,
American Association for Artificial Intelligence, Seattle, WA.
Koehn, P. (2010a). Enabling monolingual translators: Post-editing vs. options. In Proceedings
of Human Language Technologies: The 2010 Annual Conference of the North American
Chapter of the Association for Computational Linguistics, pages 537–45, Los Angeles, CA.
Koehn, P. (2010b). Statistical Machine Translation. Cambridge University Press, Cambridge.
Koehn, P., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Moran,
C., Dyer, C., Constantin, A., and Herbst, E. (2007). Moses: Open source toolkit for
statistical machine translation. In ACL-2007: Proceedings of Demo and Poster Sessions,
Prague, Czech Republic.
Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In
Proceedings of the 2003 Conference of the North American Chapter of the Association for
Computational Linguistics on Human Language Technology – NAACL ’03, pages 48–54,
Association for Computational Linguistics, Morristown, NJ.
Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M. (2009). Controlled
experiments on the web: survey and practical guide. Data Mining and Knowledge
Discovery, 18(1): 140–81.
Kohl, J. R. (2008). The Global English Style Guide: Writing Clear, Translatable Documentation
for a Global Market. SAS Institute, Cary, NC.
Krings, H. P. (2001). Repairing Texts: Empirical Investigations of Machine Translation Post-
Editing Processes. The Kent State University Press, Kent, OH.
Kumaran, A., Saravanan, K., and Maurice, S. (2008). wikiBABEL: community creation
of multilingual data. In Proceedings of the Fourth International Symposium on Wikis,
WikiSym ’08, New York, NY, ACM, New York.
Künzli, A. (2007). The ethical dimension of translation revision, an empirical study. The
Journal of Specialised Translation, 8: 42–56.
Lagoudaki, E. (2009). Translation editing environments. In MT Summit XII, The Twelfth
Machine Translation Summit: Beyond Translation Memories: New Tools for Translators
Workshop, Ottawa, Canada.
Langlais, P. and Lapalme, G. (2002). TransType: Development-evaluation cycles to boost
translator’s productivity. Machine Translation, 17(2): 77–98.
Lardilleux, A. and Lepage, Y. (2009). Sampling-based multilingual alignment. In
International Conference on Recent Advances in Natural Language Processing (RANLP
2009), Borovets, Bulgaria.
Lo, C.-K. and Wu, D. (2011). MEANT: An inexpensive, high-accuracy, semi-automatic
metric for evaluating translation utility via semantic frames. In Proceedings of the
49th Annual Meeting of the Association for Computational Linguistics: Human Language
Technologies-Volume 1, pages 220–9, Association for Computational Linguistics,
Morristown, NJ.
Lombard, R. (2006). A practical case for managing source-language terminology.
In Dunne, K. J., editor, Perspectives on Localization, pages 155–71, John Benjamins
Publishing, Amsterdam.
Lutz, M. (2009). Learning Python. O’Reilly & Associates, Inc., Sebastopol, CA, 4th
edition.
Lux, V. and Dauphin, E. (1996). Corpus studies: a contribution to the definition of a
controlled language. In Proceedings of the First Controlled Language Application Workshop
(CLAW 1996), pages 193–204, Leuven, Belgium.
McDonough Dolmaya, J. (2011). The ethics of crowdsourcing. In O’Hagan, M., editor,
Linguistica Antverpiensia New Series – Themes in Translation Studies: Translation as a
Social Activity – Community Translation 2.0, pages 97–110, Artesis University College,
Antwerp, 10th edition.
McNeil, J. (2010). Python 2.6 Text Processing: Beginners Guide. Packt Publishing Ltd,
Birmingham.
Melby, A. K. and Snow, T. A. (2013). Linport as a standard for interoperability between
translation systems. Localisation Focus, 12(1): 50–55.
Microsoft (2011). French Style Guide. Microsoft, Redmond, WA.
Mitamura, T., Nyberg, E., and Carbonell, J. (1991). An efficient interlingua translation
system for multilingual document production. In Proceedings of the Third Machine
Translation Summit, Washington, DC, pages 2–4.
Moore, C. (2000). Controlled language at Diebold Incorporated. In Proceedings of the
Third International Workshop on Controlled Language Applications (CLAW 2000), pages
51–61, Seattle, WA.
Moorkens, J. (2011). Translation memories guarantee consistency: Truth or fiction? In
Proceedings of ASLIB 2011, London.
Moorkens, J. and O’Brien, S. (2013). User attitudes to the post-editing interface.
In O’Brien, S., Simard, M., and Specia, L., editors, Proceedings of MT Summit XIV
Workshop on Post-editing Technology and Practice, pages 19–25, Nice, France.
Muegge, U. (2001). The best of two worlds: Integrating machine translation into standard
translation memories. A universal approach based on the TMX standard. Language
International, 13(6): 26–9.
Myerson, C. (2001). Global economy: Beyond the hype. Language International, 1: 12–15.
Nielsen, J. (1999). Designing Web Usability: The Practice of Simplicity. New Riders
Publishing, Thousand Oaks, CA.
Nyberg, E., Mitamura, T., and Huijsen, W. O. (2003). Controlled language for authoring
and translation. In Somers, H., editor, Computers and Translation: A Translator’s Guide,
pages 245–81, John Benjamins Publishing Company, Amsterdam.
O’Brien, S. (2002). Teaching post-editing: A proposal for course content. In 6th EAMT
Workshop ‘Teaching Machine Translation’, Manchester, pages 99–106.
O’Brien, S. (2003). Controlling controlled English: An analysis of several controlled
language rule sets. In Proceedings of EAMT-CLAW-03, pages 105–14, Dublin, Ireland.
O’Brien, S. (2014). Error typology benchmarking report. Technical report, TAUS Labs,
Amsterdam.
O’Brien, S. and Schäler, R. (2010). Next generation translation and localization: Users
are taking charge. In Proceedings of Translating and the Computer 32, Aslib, London.
O’Brien, S., Moorkens, J., and Vreeke, J. (2014). Kanjingo – a mobile app for post-editing.
In EAMT2014: The Seventeenth Annual Conference of the European Association for
Machine Translation (EAMT), pages 137–41, Dubrovnik, Croatia.
Och, F. J. (2003). Minimum error rate training in statistical machine translation. In
Proceedings of the 41st Annual Meeting on Association for Computational Linguistics,
volume 1, pages 160–7, Sapporo, Japan.
Och, F. J. and Ney, H. (2002). Discriminative training and maximum entropy models for
statistical machine translation. In Proceedings of the 40th Annual Meeting on Association
for Computational Linguistics, pages 295–302, Association for Computational
Linguistics, Stroudsburg, PA.
Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment
models. Computational Linguistics, 29(1): 19–51.
Ogden, C. K. (1930). Basic English: A General Introduction with Rules and Grammar. Paul
Treber, London.
O’Hagan, M. and Ashworth, D. (2002). Translation-mediated Communication in a Digital
World: Facing the Challenges of Globalization and Localization, volume 23, Multilingual
Matters, Clevedon.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic
evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL 2002), pages 311–18, Philadelphia, PA.
Pedersen, J. (2009). A subtitler’s guide to translating culture. MultiLingual, 20(3): 44–48.
Perez, F. and Granger, B. E. (2007). IPython: a system for interactive scientific computing.
Computing in Science & Engineering, 9(3): 21–9.
Perkins, J. (2010). Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing,
Birmingham.
Pfeiffer, S. (2010). The Definitive Guide to HTML5 Video. Apress, New York.
Pym, A. (2004). The Moving Text: Localization, Translation, and Distribution, volume 49,
John Benjamins Publishing, Amsterdam.
Pym, P. J. (1990). Preediting and the use of simplified writing for MT: an engineer’s
experience of operating an MT system. In Mayorcas, P., editor, Translating and the
Computer 10: The Translation Environment 10 Years on, pages 80–96, ASLIB, London.
Raman, M. and Sharma, S. (2004). Technical Communication: Principles and Practice.
Oxford University Press, Oxford.
Ray, R. and Kelly, N. (2010). Reaching New Markets Through Transcreation. Common
Sense Advisory, Lowell, MA.
Richardson, S. D. (2004). Machine translation of online product support articles using
a data-driven MT system. In Frederking, R. and Taylor, K., editors, Proceedings of the
6th Conference of the Association for MT in the Americas, AMTA 2004, pages 246–51,
Washington, DC, Springer-Verlag, New York.
Rockley, A., Kostur, P., and Manning, S. (2002). Managing Enterprise Content: A Unified
Content Strategy. New Riders, Indianapolis, IN.
Roturier, J. (2006). An investigation into the impact of controlled English rules on
the comprehensibility, usefulness and acceptability of machine-translated technical
documentation for French and German users. PhD thesis, Dublin City University,
Ireland.
Roturier, J. (2009). Deploying novel MT technology to raise the bar for quality: A review
of key advantages and challenges. In MT Summit XII: Proceedings of the Twelfth Machine
Translation Summit, Ottawa, Canada.
Roturier, J. and Lehmann, S. (2009). How to treat GUI options in IT technical texts for
authoring and machine translation. The Journal of Internationalisation and Localisation,
1: 40–59.
Roturier, J., Mitchell, L., and Silva, D. (2013). The ACCEPT post-editing environment:
a flexible and customisable online tool to perform and analyse machine translation
post-editing. In O’Brien, S., Simard, M., and Specia, L., editors, Proceedings of the MT
Summit XIV Workshop on Post-editing Technology and Practice (WPTP 2013), Nice,
France.
Rubino, R., Wagner, J., Foster, J., Roturier, J., Samad Zadeh Kaljahi, R. and Hollowood,
F. (2013). DCU-Symantec at the WMT 2013 quality estimation shared task. In
Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 392–7, Sofia,
Bulgaria.
Savourel, Y. (2001). XML Internationalization. Sams, Indianapolis, IN.
Schwitter, R. (2002). English as a formal specification language. In Proceedings of the 13th
International Workshop on Database and Expert Systems Applications, pages 228–32.
Senellart, J., Yang, J., and Rebollo, A. (2003). Systran intuitive coding technology. In
Proceedings of MT Summit IX, New Orleans, LA.
Simard, M., Ueffing, N., Isabelle, P., and Kuhn, R. (2007). Rule-based translation with
statistical phrase-based post-editing. In Proceedings of the Second Workshop on Statistical
Machine Translation – StatMT ’07, pages 203–6, Association for Computational
Linguistics, Morristown, NJ.
Smith, J., Saint-Amand, H., Plamada, M., Koehn, P., Callison-Burch, C., and Lopez, A.
(2013). Dirt cheap web-scale parallel text from the common crawl. In Proceedings of
ACL 2013, Sofia, Bulgaria.
Smith-Ferrier, G. (2006). .NET Internationalization: The Developer’s Guide to Building
Global Windows and Web Applications. Addison-Wesley Professional, Upper Saddle
River, NJ.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. (2006). A study of
translation edit rate with targeted human annotation. In Proceedings of the Seventh
Conference of the Association for Machine Translation in the Americas, Cambridge, MA.
Somers, H. (2003). Machine translation: Latest developments. In Mitkov, R., editor, The
Oxford Handbook of Computational Linguistics, pages 512–28, Oxford University Press,
New York.
Soricut, R., Bach, N., and Wang, Z. (2012). The SDL language weaver systems in
the WMT12 quality estimation shared task. In Proceedings of the Seventh Workshop
on Statistical Machine Translation, pages 145–51, Association for Computational
Linguistics, Morristown, NJ.
Souphavanh, A. and Karoonbooyanan, T. (2005). Free/Open Source Software: Localization.
United Nations Development Programme–Asia Pacific Development Information
Programme.
Spyridakis, J. (2000). Guidelines for authoring comprehensible web pages and evaluating
their success. Technical Communication, 47(3): 301–10.
Steichen, B. and Wade, V. (2010). Adaptive retrieval and composition of socio-semantic
content for personalised customer care. In International Workshop on Adaptation in
Social and Semantic Web, pages 1–10, Honolulu, HI.
Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. In Proceedings of the
Seventh International Conference on Spoken Language Processing (ICSLP 2002), Denver,
CO.
Surcin, S., Lange, E., and Senellart, J. (2007). Rapid development of new language pairs at
Systran. In Proceedings of MT Summit XI, pages 10–14, Copenhagen, Denmark.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Calzolari, N., Choukri,
K., Declerck, T., Dogan, M. U., Maegaard, B., Mariani, J., Odijk, J., and Piperidis,
S., editors, Proceedings of the Eighth International Conference on Language Resources and
Evaluation (LREC’12), Istanbul, Turkey, European Language Resources Association
(ELRA), Paris.
Toker, D., Conati, C., Steichen, B., and Carenini, G. (2013). Individual user characteristics
and information visualization: connecting the dots through eye tracking. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, pages 295–304,
ACM.
Torresi, I. (2010). Translating Promotional and Advertising Texts. St. Jerome Publishing,
Manchester.
Turian, J., Shen, L., and Melamed, D. (2003). Evaluation of machine translation and its
evaluation. In Proceedings of MT Summit IX, pages 61–3, Edmonton, Canada.
Underwood, N. and Jongejan, B. (2001). Translatability checker: a tool to help decide
whether to use MT. In Proceedings of MT Summit VIII, Santiago de Compostela, Spain.
Van Genabith, J. (2009). Next generation localisation. Localisation Focus: The International
Journal of Localisation, 8(1): 4–10.
Vasiļjevs, A., Skadiņš, R., and Tiedemann, J. (2012). LetsMT!: A cloud-based platform
for do-it-yourself machine translation. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics (ACL2012), pages 43–8, Jeju, Republic of
Korea.
Vatanen, T., Väyrynen, J. J., and Virpioja, S. (2010). Language identification of short text
segments with n-gram models. In Calzolari, N., Choukri, K., Maegaard, B., Mariani,
J., Odijk, J., Piperidis, S., Rosner, M., and Tapias, D., editors, Proceedings of the Seventh
International Conference on Language Resources and Evaluation (LREC-2010), Valletta,
Malta, European Language Resources Association, Paris.
Wagner, E. (1985). Post-editing Systran: A challenge for commission translators.
Terminologie & Traduction, 3.
Wass, E. S. (2003). Addressing the World: National Identity and Internet Country Code
Domains. Rowman & Littlefield, Lanham, MD.
Wojcik, R. and Holmback, H. (1996). Getting a controlled language off the ground at
Boeing. In Proceedings of the First Controlled Language Application Workshop (CLAW
1996), pages 114–23, Leuven, Belgium.
Yang, J. and Lange, E. (2003). Going live on the internet. In Somers, H., editor, Computers
and Translation: A Translator’s Guide, pages 191–210, John Benjamins Publishing,
Amsterdam.
Yunker, J. (2003). Beyond Borders: Web Globalization Strategies. New Riders, San Francisco,
CA.
Yunker, J. (2010). The Art of the Global Gateway. Byte Level Research LLC, Ashland, OR.
Zouncourides-Lull, A. (2011). Applying PMI methodology to translation and localization
projects: Project integration management. In Dunne, K. J. and Dunne, E., editors,
Translation and Localization Project Management: The Art of the Possible, pages 71–94,
American Translators Association Scholarly Monograph Series, John Benjamins
Publishing Company, Amsterdam.
Index

.NET framework 49, 64, 67, 94
ACCEPT 80, 146, 190
Accept-Language HTTP header 176
access: challenge 6; to context during translation 65; to Web content 68; via the global gateway 58–60
adaptation 1, 9–10, 165; audio 169–70; functionality 57–8, 177–180; graphics 166–9; location 180–2; strings 20; textual 173–7; video 170–2
Adobe: Flash 14–15; FrameMaker 52; globalization at 129; MT post-editing at 105; transcreation at 175
AECMA 72, 74
agile 17, 95
Amara 170–1
Android 18, 186; app localization 114; App Translation Service 120–1; speech-based applications 7, 180; translation guidelines 128
Anymalign 132–3
API 118, 120, 133–4, 161, 190–1
app 1–3; global 9–10, 50–5; lifecycle 164; stores 167–8; Web 18; see also Android; iOS
Apple see iOS, OSX
application see app
ASCII: character-encoding scheme 22; non-ASCII characters 56, 60
Bash see command line
bigram see n-gram
BLEU 142
bugs 19, 151
CASMACAT 147
CAT see Computer Assisted Translation
catalog file 33–4; compilation 89; generation 86–7, 96; see also PO; RESX
CDN 180
characters: corrupted 152; display 46; escape 32–3; language 22, 107; processing 58, 100; selection of 26–7; sequences of 20; shortcut 62; syntax 99; wildcard 39; see also ASCII; encodings; hotkeys; input; ITS; tag; translation memory; Unicode
checking: appropriateness 169; controlled language 74; language 74–7, 80, 179; rule creation 81; terminology 134–5; translation quality 152–7, 161
CheckMate 152–4
CL see controlled language
CNGL 145
command line: accessing an online Bash 44; building an SMT system using Moses and 161; listing directories using 39; running a Python program from 43–4; starting a Python prompt using 40–1
compilation: code 20, 94; documentation 107; resources 61, 89, 96
Computer Assisted Translation 9; see also machine translation; post-editing; translation environment; translation memory
controlled language 71–5, 109
DBCS 22
dictionary: for segmentation 140; machine translation 136–7; normalization 137; Python 30; search 147; word form 76, 131, 135
DITA 35, 83
Django framework 50–1; internationalization 57, 61–4; merging and compilation 89; pseudo-localization 67; string extraction 86
Docbook 35, 52, 69, 107
domain name 59–60
DTD 34–7
DTP 52, 117
DVD 8
email: address of technical support contact 10; classification of 179; for communication 123; for sending documents to translation 117; templates localization 6
encodings 1, 21–6, 46, 56, 107, 181; see also ASCII; GB18030; UTF-8
EPUB 54
evaluation: frameworks 157; machine translation 141–3; post-editing 148; quality 122; translation 152
Facebook 121–2
files 8, 11, 33; see also catalog file; encodings; HTML; PO; RESX; TBX; TMX; XML
FO see XSL
formats: data exchange 119; date 57–8; file 7, 14, 22, 64, 98–9, 178; source code 112; see also files; strings; terminology
FTP 117
GALA 161
GB18030 46
Gengo 120, 172
gettext 33, 97, 155, 187; see also xgettext
GIZA++ 140
global gateway 56, 58–9, 76–7, 88–9
globalization 9–10; management systems 7; scope 14; Web 68; workflows 85
GMS see globalization
GNU see gettext
graphics: future 192; impact on latency 181; internationalization of 68; localization of 9, 14, 166–9; see also SVG
GUI see user interface
guidelines: adaptation 174; evaluation of translation 123; localization 21, 172; machine translatability 73; post-editing 145; subtitling 171; translation 86–9, 105–6, 150; writing 70–1
hardware 4, 14, 57
hotkeys 87, 90–2, 95, 112, 125
HTER 142
HTML 7, 18; annotation 157; audio 69; conversion from text to 99; Django templates 62–3; documentation 35, 50, 54–5; editing 98; HTML5 14, 69, 98; internationalization 56, 188; static Web sites using 51; translation 87–8, 105, 152, 159; for Web apps 50, 52
hyperlink 88, 108
i18n see internationalization
icon 59, 93, 107, 169
ICU 57, 101
iOS 18, 114, 121, 186
IME 57
input 2, 25, 28, 32, 56–7, 192–3
internationalization 1, 9, 49; content 68–71; Python strings 60–8; software 55–8
IP address 59
IPython 42–3
IRSTLM 141, 156
ISO code 59–60
ITS 69–70, 157
Java 64, 83, 187; see also LanguageTool
JavaScript 50–1, 188
l10n see localization
language code 86; see also ISO code
LanguageTool 74–6, 80–2, 131, 153, 180
Linux 7, 18, 33, 39–40, 51, 54, 57, 77, 91, 121, 128
LISA 36, 55, 134, 151, 157
localization 1; automation 96; binary 94–5; in-context 97–8; testing 92–4, 106; updates 95–6; user assistance content 98–100
look and feel 150, 176
Mac see OSX
machine translation 135; controlled language for 71–3; hybrid 143–4; online 109; rules-based 136–7; statistical 137–43, 160–1
markup language see HTML, XML
MERT 141
META-NET 179
MQM 157
Microsoft 8, 52, 103, 105–6, 109, 114, 134, 143, 146; see also Windows
MT see machine translation
multilingual: apps 50, 159, 179–80; content 114, 181; text 21–2; Web 58–9, 191
multimedia see audio, video
natural language processing 7, 72, 108; see also machine translation; part-of-speech; text
NBA4ALL 51–5, 58, 62, 65–9
NGO 5, 12
n-gram 141, 142, 155
NLP see natural language processing
NLTK 131
OASIS 47
OAXAL 99
ODF 99
Okapi 100; see also CheckMate; Rainbow; Ratel
OLIF 134
operating system 1, 18, 50, 57, 59, 93, 188; see also Windows; Linux; OS X
OPUS 132, 139
OS X 18, 40, 57, 121, 178
output 2, 20, 23, 56–7; see also formats; machine translation
part-of-speech 131, 136; see also tag
PDF 7–8, 50, 54–5, 99
PE see post-editing
personalization 6–7, 175–7
PHP 33, 83, 111
placeholder 29, 87, 112
PO 33–4, 60, 79, 89–91, 94, 99, 111, 133
Pontoon 97–8
Pootle 112–13, 124
POS see part-of-speech
post-editing 144; analysis 148–9; process 105; tools 146–7; types of 145–6; see also ACCEPT; guidelines
programming 18–21, 49, 185–8; see also Python
properties files see Java
pseudo-localization 67, 83
Python 12–13; code 19–21; environment setup 40–1; programs 43–4; statements 41–3; versions 186; see also command line; encodings; internationalization; strings
PythonAnywhere 23, 41–4
QA see quality assurance
quality assurance 9; localization 92–4; standards 157; translation 149–57
Rainbow 100–1, 104, 106, 129–30, 161
Ratel 101–2
regular expressions 19, 38–9, 45, 69, 74, 137; see also CheckMate; LanguageTool; SRX
regulations 1, 4–6, 177, 181
RESX 64
reuse 36, 52–5, 65, 103–5; see also translation memory; UTX
ROI 5
screenshots see graphics
SDL 86, 94, 125
search engine optimization 173
SEO see search engine optimization
Smartling 181
SMT see machine translation
software development 17–18
SRILM 141
SRX 101
strings 18–20, 26–8; concatenating 28–32; expansion 94; extraction 86; hard-coding 28; merging 89; Python methods for 25
subtitling 170–2, 190
SVG 46
SymEval 148, 161
SYSTRAN 72, 136
tag: HTML 152; part-of-speech 75–6, 136; XML 35–6; see also ITS
TAUS 145, 157, 160–1
TBX 134
terminology: acquisition of 132–3; control 72; glossaries 133–5; guidelines 89; monolingual extraction of 129–31; omissions 150; significance 127–9; see also checking
text: extraction 79; replacements 45, 161, 173; segmentation 100–3, 131; see also checking; files; strings
TM see translation memory
TMX 36–7, 105, 127, 154
transcreation 10, 169, 174–5, 182–3, 190
Transifex 111, 119, 121, 125, 153
translation: api-driven 120; collaborative 121; crowdsourcing 122–3; integrated 120–1; see also guidelines; machine translation; translation environment; translation kit; translation management system; translation memory
translation environment 123–4; advantages 86–8, 117; desktop-based 125–6; online account creation 111; post-editing 147; terminology 135; Web-based 124–5
translation kit 33, 86, 100–1, 104, 127
translation management system 96, 100, 117–23
translation memory 5, 95, 104–5, 124, 126–7, 175
Twitter 7, 88, 113, 121
Unicode 22–6, 57, 67, 83, 181, 186, 188
user assistance 9–10, 98–108, 112–13, 166; see also guidelines
user interface 50–2, 58–9, 90, 94–5, 128, 171, 176
UTF-8 22–6, 46, 56
UTX 134
video 9, 14, 69, 109, 165–6, 168, 178, 190; see also voice-over
Virtaal 60–1, 125
voice-over 169–72
W3C 68–9, 188; see also ITS
Web service 2, 177–8
Windows 8, 17–18, 40–1, 83, 88, 94
WYSIWYG 97–8
xgettext 60–1, 77–8, 86
XLIFF 34–8, 111
XML 43–8; extracting text from 79–80; internationalization 8; localization 98–100, 106; reuse 52–5; see also ITS; OLIF; RESX; SRX; TBX; TMX; XLIFF; W3C
XPath 97
XSL 34, 54
