Quick Reference For Translating Complex File Formats
While I have dealt with the more common office formats in earlier sections
(see Office Suites on page 117), in this section I have attempted to categorize
the most commonly required more advanced file formats. You will find
descriptions of the programs for which these are written, how to distinguish
between the translatable vs. untranslatable parts, and how these formats are
supported by computer-assisted translation tools.
The categories of file formats are the following:
There are two different approaches to translating DTP files that depend on
whether they come from programs that are intended for design-oriented
publications (Adobe PageMaker, QuarkXPress, and Adobe InDesign) or from
programs for content-oriented publications (Adobe FrameMaker and Corel
Ventura).
First of all, any of these formats is, of course, directly translatable in its own
environment—i.e., you can overwrite the text of a PageMaker file within
PageMaker—but you will have to save these formats to a non-compiled format
(i.e., text-based format) to process them in a computer-assisted translation
tool.
Another time-consuming task with any of these formats results from text
expansion: the stories will have to be resized after translation, so you need to
make sure that you take that into consideration when accepting a job or
quoting for a job in any of these programs!
This is not where the problems stop, though. Especially QuarkXPress (up to
version 6.5) and PageMaker are still very "last century" when it comes to
processing multilingual text. Though Unicode (see page 4) is a widely
accepted standard that makes it easy to mix and match different writing
systems on web pages and all kinds of other documents, DTP programs are
not up to par on this. Even though Quark does now support Unicode with its
version 7 (released in the summer of 2006), PageMaker most likely will not
because the folks at Adobe have a better choice when it comes to processing
Unicode: InDesign.
InDesign
After a fairly unsuccessful version 1, InDesign really gained traction beginning
with version 2. Presently you will encounter InDesign files that are created in
versions 2, CS (3), or CS2 (4). To efficiently translate in InDesign you will
need a program that exports all the stories (the above-mentioned text boxes)
into one large file which can be processed in a computer-assisted translation
tool. (Of course, it is possible to translate directly within InDesign, but the
emphasis was on "efficient.")
Trados offers little plug-ins as part of all its versions of the Workbench product
that support InDesign versions 2 or CS (the plug-ins are stored under
C:\Program Files\TRADOS\Txx_xx\FI\IND—follow the instructions in the
help file on how to install the plug-ins). Once you have installed the plug-in
and opened the InDesign file, you will see a new Trados menu with all the
necessary commands to export and re-import your file.
As you can see in the above illustration, the text file is not just a "normal" text
file; instead, it is a "tagged" text file where only the smallest part is actually
translatable (essentially everything that is not enclosed by <tag markers>)
and all the other data stores information about details such as formatting.
While it theoretically would be possible to translate this within Microsoft Word
or a text editor, it would be foolish to even try—chances are that you would
break the code or overlook text.
Programs such as Trados TagEditor or Déjà Vu, however, recognize these files
as InDesign files, protect all coding information, and display only translatable
text.
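To make the idea of "protected" tags a little more concrete, here is a minimal sketch in Python of how a tool might separate tags from translatable text. The tag names in the sample line are invented for illustration and are not meant to reproduce the actual InDesign export syntax, and the assumption that a tag is simply anything between < and > is a simplification.

```python
import re

# Assumption: a tag is anything between "<" and ">".
# Real export formats may use escapes that this pattern does not handle.
TAG_PATTERN = re.compile(r"<[^<>]*>")


def split_tagged_text(line):
    """Return a list of (kind, text) pairs, where kind is 'tag' or 'text'."""
    parts = []
    position = 0
    for match in TAG_PATTERN.finditer(line):
        if match.start() > position:
            parts.append(("text", line[position:match.start()]))
        parts.append(("tag", match.group()))
        position = match.end()
    if position < len(line):
        parts.append(("text", line[position:]))
    return parts


def translatable_segments(line):
    """Keep only the non-empty text that sits outside of tags."""
    return [text for kind, text in split_tagged_text(line)
            if kind == "text" and text.strip()]


# Hypothetical sample line; the tag names are made up.
sample = "<pstyle:Body><cf:Bold>Warning:<cf:> Switch off the device."
print(translatable_segments(sample))
```

Because the tags are kept as separate, untouched pieces, joining all the parts back together reproduces the original line exactly, which is what makes a safe re-import possible.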
The latest version of Star Transit (with a separate plug-in) offers the option of
translating InDesign files, but just for versions 2 and CS. Heartsome as well as
the latest versions of Trados and SDLX support InDesign CS2 files in their
InDesign-specific XML format (.inx) which, like the tagged text files, can be
reimported once the translation is finished.
PageMaker
To translate PageMaker files (an increasingly rare occurrence because Adobe
is trying to push InDesign over PageMaker) with a computer-assisted
translation tool, you could either use Star Transit with a separate plug-in with
support for PageMaker 6-7 or a plug-in that comes with the Trados product
called Story Collector for PageMaker and supports versions 6.5 and 7.
To install the Trados plug-in, open the help file under C:\Program
Files\TRADOS\Txx_xx\FI\PM for further instruction. Once the plug-in is
installed, open the PageMaker file in PageMaker and you'll find the command
Trados Story Collector under Utilities> Plug-ins.
Export all the stories into one large PageMaker-specific text file, save the
original PageMaker file (important!), and translate the exported text file with
TagEditor or any other application that supports the PageMaker format. The
import process is virtually the same as the export and should go seamlessly.
All of the above is true for Western languages and to some degree for Eastern
European languages. Any of the more complex languages, however, including
the bi-directional languages (Hebrew and Arabic) or the Asian double-byte
languages, are flat-out not supported in the Western versions of PageMaker.
QuarkXPress
Despite the fact that Quark has never been very popular in the translation
community (because of a lack of Unicode support until recently and different
and more expensive versions for different languages, etc.), it is (still) the
market leader in desktop publishing, so it is not too surprising that there is
decent support for different versions of Quark among the translation
environment tools.
• Star Transit offers a separate plug-in that supports the batch processing of
the English (and Passport) versions 3-6 for both the Windows and Mac
platforms.
• Trados offers plug-ins for versions 4.1-6 for English (and Passport) and
version 4.1 for Japanese.
• SDLX offers a plug-in for the English (and Passport) versions 4-6 for the
Mac.
The European language Passport edition of Quark, which has additional spell-
checking and hyphenation capabilities for Western and European languages, is
supported by the above mentioned tools. If you only have the (cheaper)
English version, you need to make sure to ask your client to save the file as a
"Single Language" file; if the Passport edition was used, you will not be able to
open the file otherwise.
QuarkXPress’s last Arabic edition was for version 4.1. Fortunately, however,
there are XTensions—QuarkXPress-specific plug-ins—for the English version
of Quark that extend its ability to write in Hebrew, Arabic, Farsi, and Jawi.
ArabicXT, HebrewXT, FarsiXT, and JawiXT are all available at
www.arabicsoftware.net for versions 5 and 6 of Quark.
It becomes much hairier with the Asian double-byte languages. The
Japanese version 4.1 is supported by the Trados plug-in and several other
versions by CopyFlow, but at the very least this means that you have to have
several versions of Quark for different languages, plug-ins, and platforms.
Should the .fm files be displayed with an icon in the form of a question
mark, you need to delete them from the book with the appropriate
command from the Edit menu and then re-add them from within the
Add menu. Once the files are added, you can easily change the order of
the files by simply dragging them within the .book interface.
You will need to save the compiled .fm format within FrameMaker by selecting
File> Save as and selecting the text-based .mif format. To avoid the
individual opening and saving of each file, you can use the well-liked MifSave
(see http://home.comcast.net/~bruce.foster/MifSave.htm) to do this as a
batch process for a whole book.
(By the way, it's totally okay to ask your client to do this for you if you do not
have FrameMaker on your computer.)
Once all your files are preprocessed, they are supported in any of the larger
translation environment tools (Trados, SDLX, Déjà Vu, Transit), most of
whose representatives will tell you that their FrameMaker processing is one of
their strongest features—which only goes to show that FrameMaker is a very
translator-friendly format.
There are slight differences in the way that the different tools process the .mif
files. In Trados you need to convert the MIF files into .rtf files (so-called "STF"
files) with a separate program that is part of the Trados suite of tools, the so-
called S-Tagger for FrameMaker (usually located under Start> Programs>
Trados XX> Filters), before you can translate them in either Word or
TagEditor. The process of converting the files is slightly confusing if you do it
for the first time, but the principle is this:
You will need to create two different directories, one of which will contain a set
of files that will only serve as reference files so that the FrameMaker .mif files
can be reassembled once they are translated in .rtf format. The other will
contain the set of files that are actually prepared for translation. Keep that in
mind in both conversion processes (into .rtf and back into .mif) so that you
select the correct directories either way.
Figure 151: Trados S-Tagger for FrameMaker with tabs to convert files and verify tags
Both Trados and SDLX create additional files (the Trados version is called
ancillary.rtf; the SDLX files are SDLX_ix.mif, SDLX_xr.mif, and SDLX_vr.mif),
which contain background information such as index markers (SDLX only),
cross-references, and information from the master page.
The other tools process the .mif files directly and translate all the background
information individually for each file.
Trados is the only CAT application that supports the Ventura format—but don't
worry, there are very few translation projects in that format.
The process for translating Ventura files within Trados is very simple: You will
need to export the content of the original .vp files to text files (File> Export
Text> ANSI text), translate those in TagEditor, and reimport the translated
text at the place where you want the text to be inserted (File> Import
Text).
Graphic Formats
Pixel-Based Formats
Most graphic formats (including .jpg, .gif, .bmp, .tiff, and various others) don't
contain text. This is true even if it appears to be readable text because the
text is nothing more than pixels (little colored dots) on a virtual canvas. While
they may form shapes that represent letters, these have nothing to do with
the editable letters or words you will deal with in a text editor.
Short of recreating these kinds of graphics from scratch, you will need to get
ahold of the "source files." (Yes, I know that clients hate to be asked for that,
but typically it helps to mention that otherwise they will have to pay ten times
as much.)
Most any of the .jpg-, .gif-, .bmp-, or .tiff-like files were created in a layered
file that included one (or several) layers with real, editable text. Since they
were most likely created in Adobe Photoshop, they will have a .psd extension
and can be opened in, well, Adobe Photoshop. Nice thing is, Adobe is offering
a very low-priced version of its program (see www.adobe.com/products/
This all may not be good enough, though. Especially if you have a large
number of graphics and/or a translation memory database that contains much
of the translation that is contained within the graphics, you will not want to
perform the translation "manually."
There are a variety of tools that allow for the extraction of text from .psd files
into a translation environment tool-specific format:
• Enlaso's and Yves Savourel's Rainbow tool does a lot of different things
(see www.translate.com/technology/tools), including the seamless
preparation of .psd files into XLIFF (for a definition of XLIFF, see page 208)
or Wordfast- or Trados-specific RTF formats. Though this feature is not
included in the freely downloadable version, you can request a key to have
this feature added (under File> Preferences). If you are not a competitor
of Enlaso (i.e., a translation agency), chances are the key will be provided.
Figure 152: XLIFF file generated from .psd file with Rainbow
Vector-Based Formats
The above graphic types are pixel-based graphics. Another kind of graphic
that is often used, especially in manuals, is vector-based graphics. You can
recognize them by their typical extensions, .eps or .ai. They are very different
from pixel-based graphics because they are formed by mathematical formulas
rather than by simple dots. So, rather than displaying a wheel by arranging a
lot of pixels in a circle, a vector-based graphic would calculate it with some
kind of pi-based formula.
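As a rough illustration of that difference, the following Python sketch describes a circle only by its center and radius and calculates as many points as are needed when the graphic is drawn. The function name and parameters are my own and do not correspond to any particular graphics format.

```python
import math


def circle_points(center_x, center_y, radius, steps):
    """Calculate `steps` points on a circle from its formula."""
    points = []
    for i in range(steps):
        angle = 2 * math.pi * i / steps
        points.append((center_x + radius * math.cos(angle),
                       center_y + radius * math.sin(angle)))
    return points


# The same description yields a small or a large wheel without loss of quality.
small = circle_points(0, 0, 10, 8)
large = circle_points(0, 0, 1000, 360)
```

A pixel-based graphic, by contrast, would have to store every dot of the wheel at one fixed size.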
Tagged Formats
Tagged files are files that are text-based and that typically contain a mixture
of "normal" translatable text and "tags," elements that allow for the
structuring of the content, page layout, text formatting, insertion of images,
etc. Examples of tagged files are the exported text-based formats for the
translation of content in some desktop publishing programs (see the example
on page 222), but more typically tagged formats include HTML, XML, or SGML
files (see the definition on page 118).
Because tagged text files are "just" text files, they can be translated with a
text editor. The reason why this is typically not a good idea is that
• the tags are quite sensitive to corruption, i.e., just deleting or adding a
part of a tag may utterly corrupt a file;
• though it would be possible to process tagged text files as plain text files in
computer-assisted translation programs, it would mess up your translation
memories with a lot of unwanted coding information; at the same time,
you will not really benefit from your translation memory content because
there will be very few matches for heavily coded sentences.
This is relatively easy to do with HTML because it is a defined format that does
not allow any deviation, but it is more difficult with XML and SGML files. These
files are by definition user-definable and require you to "teach" the program
how to interpret any given file. Any of these file types refers to a "Document
Type Definition" or .dtd file that determines how each element of the file
should be treated.
While the .dtd file for HTML is a global declaration that any of the supporting
tools refer to, XML gives a somewhat universal access through a supporting
technology that describes how to format or transform the data in an XML
document, the so-called Extensible Stylesheet Language (XSL). Many
translation environment tools offer a predefined XML filter based on a
common set of XSL variables that is often sufficient to process XML files.
As SGML files have no such common denominator, you will need to create a
specific "filter" (in Déjà Vu terminology) or "settings" (in Trados terminology)
to process these files.
In Trados, tagged files are processed in the TagEditor (thus the name). Upon
opening any of these file types in TagEditor, the following dialog may appear
(if it is not displayed automatically, you can open it through Tools> Tag
Settings):
You can see the predefined HTML and XSL (for XML) settings.
You can change the properties of any of these settings files by selecting Edit
(something you should only do when you are unhappy with the result the
existing settings files produce) or create a new settings file by selecting Add.
Filter, and the wizard will lead you through the creation of a very
customizable filter file. Unlike in Trados, it is possible to forego the import of a
DTD file and you can choose to directly import an SGML or XML file to create a
filter.
As you import the XML or SGML file into Déjà Vu, you will need to make sure
to select the appropriate SGML/XML filter file during the import process under
Properties.
Most tools, including both Déjà Vu and Trados, allow the fine-tuning of the
filters so that you can exactly determine which parts inside or outside a tag
are translatable or to be protected. Typically, it is enough to go through the
process of creating a filter or settings file for an XML/SGML project only once
because usually all files will adhere to one standard.
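What such a filter boils down to can be sketched in a few lines of Python. This is not how Déjà Vu or Trados store their filters; the element names, the attribute rule, and the sample document are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical filter rules for an invented document type:
# which elements and which attributes hold translatable text.
TRANSLATABLE_ELEMENTS = {"title", "para"}
TRANSLATABLE_ATTRIBUTES = {("image", "alt")}

SAMPLE = """<topic id="t1">
  <title>Installation</title>
  <para>Insert the disc.</para>
  <code>setup.exe /quiet</code>
  <image src="disc.png" alt="Disc drive"/>
</topic>"""


def extract_translatables(xml_text):
    """Return (location, text) pairs for the content the filter exposes.

    Simplification: only the text directly at the start of an element is
    read; text that follows an inline child element is ignored.
    """
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        text = (element.text or "").strip()
        if element.tag in TRANSLATABLE_ELEMENTS and text:
            found.append((element.tag, text))
        for name, value in element.attrib.items():
            if (element.tag, name) in TRANSLATABLE_ATTRIBUTES:
                found.append((element.tag + "/@" + name, value))
    return found


for location, text in extract_translatables(SAMPLE):
    print(location, "->", text)
```

In this sketch the content of the code element and the id and src attributes never reach the translator, which is exactly the kind of protection a well-tuned filter provides.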
• binary files, i.e., files that cannot be opened and edited with a text editor,
and
• flat files, i.e., text-based files that can be opened and edited with a text
editor.
The binary files traditionally include formats such as .exe, .dll, or .ocx files. To
translate these files, you will either
• need a specific software localization tool that allows the direct translation
and necessary strings as well as further language-specific development
work and testing (for further information on processing binary file formats,
see Software Localization Tools on page 206) or
In much the same way that tagged files work, it would be possible to translate
RC files in a text editor, but it is not advisable to do that, because a) you will
most likely overlook text that needs to be translated, b) you may overwrite
code where that should not happen, and c) there is just no reason to not use
your translation memory for this. In fact, software files are rarely translated
Many newer programming languages do not use a compiled format for their
resource files. Often this takes the form of XML-based formats (see for
instance the .resx example on page 235) that are more or less supported by
all major translation environment tools, while in other instances other formats
are used.
Java applications typically use the so-called Java Properties files (.properties).
Java Properties files are supported by SDLX, OmegaT, and Déjà Vu as well as
several localization tools (see Software Localization Tools on page 206).
Trados supports them as well but only through the tedious T-Windows
application which is nothing short of frustrating to use.
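For readers who have never looked inside a Properties file, here is a minimal Python sketch of how the keys and translatable values can be told apart. It assumes the simplest case only (one key=value pair per line, comment lines starting with # or !); the full format also allows other separators and line continuations, which this sketch ignores. The sample keys are invented.

```python
def read_properties(text):
    """Return (key, value) pairs from simple key=value lines."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            entries.append((key.strip(), value.strip()))
    return entries


# Hypothetical sample content.
SAMPLE = """# Main window
menu.file=File
menu.file.open=Open...
dialog.confirm = Are you sure?
"""

for key, value in read_properties(SAMPLE):
    print(key, "->", value)
```

Only the values to the right of the equal sign are translated; the keys on the left must stay exactly as they are.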
Extensions are always a first indication of what the file type could be,
but they will often fail you with software files. If you are not sure about
the file type, open the file in a text editor and study its structure. If the
translatables are enclosed in quotation marks, try to process the file as an RC file
(or, in the case of Déjà Vu or SDLX, you can also test it with one of the other
software filters). If the translatables are preceded by an equal sign, try to process
the file with the Properties filter. As all of these files are text-based, this will not
damage the files, and very often you will find that you "get lucky," even though the
file at hand may be neither one nor the other.
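The rule of thumb above can be written down as a small Python sketch. The two patterns and the simple "whichever occurs more often" decision are my own assumptions and merely a starting point, not a reliable file type detector.

```python
import re

# Text enclosed in quotation marks, as in RC-style string tables.
QUOTED = re.compile(r'"[^"\n]+"')
# A key followed by an equal sign and a value, as in Properties-style files.
KEY_VALUE = re.compile(r"^\s*[\w.\-]+\s*=\s*\S", re.MULTILINE)


def guess_filter(text):
    """Suggest which software filter to try first: 'rc', 'properties', or 'unknown'."""
    quoted = len(QUOTED.findall(text))
    assigned = len(KEY_VALUE.findall(text))
    if quoted == 0 and assigned == 0:
        return "unknown"
    return "rc" if quoted > assigned else "properties"


# Hypothetical sample files.
rc_sample = 'STRINGTABLE\nBEGIN\n    IDS_OPEN "Open file"\n    IDS_SAVE "Save file"\nEND\n'
properties_sample = "menu.file=File\nmenu.edit=Edit\n"

print(guess_filter(rc_sample))
print(guess_filter(properties_sample))
```

If the suggestion turns out to be wrong, nothing is lost; you simply try the other filter.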
Another emerging text-based software standard is GNU gettext with its .po and .pot
files. These are the translatable language resource files used in the free GNU
gettext concept for translating software and documentation. GNU gettext is
the de facto standard in many open source projects, and it works with a large
variety of programming languages. .po files are typically translated or
pretranslated files, whereas .pot files are the translatable templates.
Aside from the internal tools that gettext offers (see www.gnu.org/software/
gettext), Déjà Vu seems to be the only translation environment tool that
handles these files seamlessly.
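To give an idea of what these files look like, the following Python sketch reads the source (msgid) and target (msgstr) pairs from a tiny, invented .po sample and lists the strings that still need translation. It only handles single-line entries; multi-line strings, plural forms, and escaped quotation marks are left out on purpose.

```python
import re

# Matches lines such as: msgid "Open file"
ENTRY_LINE = re.compile(r'^(msgid|msgstr)\s+"(.*)"\s*$')


def read_po_entries(text):
    """Return (source, target) pairs from single-line msgid/msgstr entries."""
    entries = []
    current_id = None
    for line in text.splitlines():
        match = ENTRY_LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, unsupported constructs
        keyword, value = match.groups()
        if keyword == "msgid":
            current_id = value
        elif current_id is not None:
            entries.append((current_id, value))
            current_id = None
    return entries


def untranslated(entries):
    """List the source strings whose target is still empty."""
    return [source for source, target in entries if source and not target]


# Hypothetical sample content.
SAMPLE = '''#: main.c:12
msgid "Open file"
msgstr "Ouvrir le fichier"

#: main.c:20
msgid "Save file"
msgstr ""
'''

print(untranslated(read_po_entries(SAMPLE)))
```

In a .pot template every msgstr is empty, so all entries would show up in that list.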
Help Systems
Help systems (i.e., the documentation resources that are typically part of a
software program and can be accessed through the help menu) are a huge
topic on their own. I’m not planning to cover this in its entirety, especially
because this has already been done so masterfully in Bert Esselink’s "A
Practical Guide to Localization" (John Benjamins, 2000). But there are a few
questions that I have been confronted with over and over again, and here are
some quick answers for those.
WinHelp
The compiled WinHelp system typically consists of two files, the .cnt file and
the .hlp file. While the .cnt file is a text-based file that contains the table of
contents for the help system, the .hlp file is a compiled file that is made up of
any number of RTF files.
These RTF files have to follow strict guidelines as to how they are
created so that hyperlinks, index markers, section breaks, etc. function
correctly. Most larger translation environment tools (especially those
that have been around for a while and seen the heyday of WinHelp)
have facilities to accommodate these special features (such as hidden
text for hyperlinks or the various kinds of footnotes).
Figure 160: View of an RTF file before its compilation into a WinHelp
In case you receive a .cnt and .hlp file for quoting or even translation
purposes, there’s an easy way to "decompile" the .hlp file into its RTF
components. While there are a number of expensive commercial tools for
compiling and decompiling WinHelps (for instance, see the well-known
RoboHelp at www.adobe.com/products/robohelp or RoboHelps rightful
One file that is also created in the process is an .hpj file, the help project file.
Though this file is not to be translated, it is important because it contains the
information on how to re-compile the project once the translation is done. The
free Microsoft program that can be used to do just that is called Microsoft Help
HTMLHelp
The process for HTMLHelp is similar but much simpler. Unlike the WinHelp
system, HTMLHelp only consists of one file, the .chm file. True to its name,
most of the translatable content of an HTMLHelp system is contained in HTML
files. To "get to" the HTML files, you will also need to decompile the help file.
Fortunately, both the compilation and decompilation are done with the same
freely available and easy-to-use tool: HTML Help Workshop.
To decompile an existing help file, just select File> Decompile, locate the
.chm file, and choose a location to which you would like to export the files.
You could receive a great number of different file formats, but the most typical
are:
• .hhp: the non-translatable project file (you will need this file to recompile
the help),
• graphic files: these are often translatable and/or have to be replaced with
newly created target counter-parts, and
• lots and lots of .html files with lots and lots of translatable content.
Before you start with the translation of your HTMLHelp project, here is one
thing you should be doing first: Talk to your client about the format in which
the authoring of this project took place. Chances are that it was either
authored in FrameMaker (like this present manual and help system), in some
kind of XML form, or even within Word. While it is entirely possible and really
quite easy to translate the HTMLHelp directly, your client may be much better
served if you are able to work in the original format. Typically the original
authoring environment is set up so that the output can be done in various
formats (PDF, printed materials, web based, help systems, etc.), whereas it is
much more complicated to do this when you start with a help system.
If your client asks you to translate the help system directly, translate the
above-mentioned files, replace the graphics (save them under the same name
and the same location), and then recompile the individual files with HTML Help
Workshop.
Well, usually it doesn’t work quite so easily, because chances are some link
was corrupted, some graphic is missing, or some file was renamed. So it is
advisable to do a quality assurance check which compares the original files
and your newly translated files. SDL offers an excellent product for doing just
that in HtmlQA (see www.sdl.com/products/htmlqa.htm).
The equally helpful sibling product for the WinHelp process is called
HelpQA (see www.sdl.com/products/helpqa.htm).
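If you do not have such a product at hand, the basic idea of the check can be sketched in a few lines of Python. This is not how HtmlQA works internally; it is merely my own minimal illustration that compares the link and graphic references of an original HTML file with those of its translation. The sample file names are invented.

```python
from html.parser import HTMLParser


class ReferenceCollector(HTMLParser):
    """Collect the values of all href and src attributes."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.references.append(value)


def collect_references(html_text):
    collector = ReferenceCollector()
    collector.feed(html_text)
    return sorted(collector.references)


def compare_references(original_html, translated_html):
    """Return (missing, added) references relative to the original file."""
    original = collect_references(original_html)
    translated = collect_references(translated_html)
    missing = [ref for ref in original if ref not in translated]
    added = [ref for ref in translated if ref not in original]
    return missing, added


# Hypothetical sample: the link target was accidentally "translated."
original = '<p><a href="install.htm">Install</a> <img src="disc.gif"></p>'
translated = '<p><a href="instalacion.htm">Instalar</a> <img src="disc.gif"></p>'

print(compare_references(original, translated))
```

Anything that shows up as missing or added deserves a closer look before you recompile.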
Once you’ve fixed your errors, you can proceed with the compilation in HTML
Help Workshop. Just select the .hhp file (make sure that it’s placed at the root
of your project folder), select File> Compile, and your help file will be all
ready to go.
You can also use HTML Help Workshop to convert existing WinHelp
projects. When you convert a WinHelp project to an HTML Help project,
the New Project Wizard converts the WinHelp project (.hpj) file to an
HTML Help project (.hhp) file, the WinHelp topic (.rtf) files to HTML Help
topic (.htm, .html) files, the WinHelp contents (.cnt) files to HTML Help
contents (.hhc) files, and the WinHelp index to HTML Help index (.hhk) files.
Database-Based Data
It’s a strange thing with data in databases. So much of today’s translatable
content is stored in databases for easy and quick user access (this is
especially true for web-based content), but translators are often met with a
bit of suspicion when it comes to the translation of that data. And there is
probably something to that suspicion. Much like software development files
• depending on the database, the data may not only not have context but
may also be concatenated (i.e., one string consists of many pieces that are
not necessarily displayed together), thus making it very difficult for the
translator to translate appropriately, and
It is therefore not surprising that a number of tools have tried to come up with
solutions to directly translate database content within the database
environment. Though there are a great number of different database formats,
there are also standards that allow the communication with almost all
database formats. ODBC, Open Database Connectivity, is a native interface
that allows access to most database management systems and allows for the
use of SQL, Structured Query Language, the universally used language to
"talk" to databases. By using this interface and this language, a number of
computer-assisted translation tools are now able to directly translate database
content, even from as complex an environment as Oracle or MS SQL
databases.
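The principle behind this can be illustrated with a short Python sketch. It uses Python's built-in sqlite3 module purely as a stand-in for an ODBC connection, and the table and column names are invented; an actual tool would connect to the client's database and use its real structure.

```python
import sqlite3

# Stand-in database with one hypothetical table that holds an English
# source column and a (still empty) German target column.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name_en TEXT, name_de TEXT)"
)
connection.executemany(
    "INSERT INTO products (id, name_en) VALUES (?, ?)",
    [(1, "Cordless drill"), (2, "Safety goggles")],
)


def fetch_untranslated(conn):
    """Use SQL to select only the rows that still lack a translation."""
    rows = conn.execute(
        "SELECT id, name_en FROM products WHERE name_de IS NULL ORDER BY id"
    )
    return rows.fetchall()


def store_translation(conn, row_id, target_text):
    """Write the translation back into the target column of the same row."""
    conn.execute(
        "UPDATE products SET name_de = ? WHERE id = ?", (target_text, row_id)
    )


pending = fetch_untranslated(connection)
store_translation(connection, 1, "Akku-Bohrmaschine")
remaining = fetch_untranslated(connection)
print(pending, remaining)
```

The translated text goes straight back into the record it came from, so there is no separate export and import step that could go wrong.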