opentag.com - XML FAQ: Localization

You will find here answers to some of the frequently asked questions about the localization of XML documents and related technologies. If you find any mistakes or have suggestions for additional useful information, please send an email.

How do I translate XML documents?
What are the guidelines to create XML document types that will be easier to localize?
What are the guidelines to author XML documents that will be easier to localize?
What are the features a translation tool for XML should have?
How can I specify that elements with complex rules are translatable?
How do CDATA sections fare during translation?
Is there any translatable text in a CSS style-sheet?
Is there a standard set of localization directives?
Can I see some examples of XML documents causing localization problems?

How do I translate XML documents?

There are various general translation tools that support XML. For example (in alphabetical order):

Catalyst (Alchemy Software)
Déjà Vu (Atril)
SDLX (SDL International)
TagEditor (TRADOS)
Transit (STAR)

There are several globalization suites that also support XML. For example (in alphabetical order):

Ambassador (GlobalSight)
WorldServer (Idiom)

Note that not all tools and suites handle XML perfectly. Some have problems that, depending on how your XML documents are set up and which XML features they use, may make the localization process difficult.

You can also use filters to convert the XML documents into color-coded RTF files that can be easily translated with or without translation tools. Rainbow offers such a filter. Once in RTF, XML documents can be translated with other tools such as WordFast.

In all cases, to correctly translate XML documents the localizers need to know the following:

Which elements and attributes are translatable/non-translatable.
Which elements have pre-formatted content (content where white space matters).
Which elements are inline (they should not break a sentence), and which are structural.
Whether any element has content that needs to be treated with different rules (e.g. scripts).

Note that not all of this information is explicit in a DTD or a schema.

This information is needed to create the "parameter file" (its name varies with each tool) that will allow the tool to process the documents correctly for translation. How to create this parameter file depends on each tool.

What are the guidelines to create XML document types that will be easier to localize?

The W3C ITS Working Group is working to produce such guidelines. Some of the main aspects of the guidelines are the following:

Avoid using attributes for translatable data.
Provide a way to specify the language of your elements, and use xml:lang for this.
Provide specific elements to delimit content that is coming from an external source (e.g. error messages or prompts from a resource file).
Provide a mechanism of IDs for translatable elements.
When naming your elements, think about what their purpose is, not how you imagine their content will be rendered. For example: if an element is used to emphasize a text run, call it <emph>, not <bold>.

Read also the excellent paper on Localisation Considerations in DTD Design by Richard Ishida.

What are the guidelines to author XML documents that will be easier to localize?

The W3C ITS Working Group will produce such guidelines. Meanwhile you can start with the following:

Always keep encoding in mind. Use the encoding declaration.
Use entity references wisely. Re-using the same text in different places does not work the same way for all languages. Using entity references may also cause problems when leveraging from translation memories (e.g. the leveraged segment contains an entity reference not declared in the new document).
Never use infinite naming schemes (e.g. <str001>, <str002>, ...).

What are the features a translation tool for XML should have?

Some of the XML-specific features that a translation tool supporting XML documents should have are the following:

Support for all XML constructs (CDATA, namespaces, NCRs, etc.). Avoid writing your own parser: using a standard public domain parser is the best way to ensure the tool will parse XML documents correctly.
Support localization directives in the XML document to translate.
Support the xml:lang attribute. If an element uses xml:lang to specify a language that is not the source language, do not offer to translate its content. When outputting the translated file, make sure all values of relevant xml:lang attributes are changed to the target language code. Make this support optional to allow for flexibility.
Make sure the mechanism to specify what elements/attributes are translatable allows you to specify nodes rather than just elements/attributes, so the tool can address complex cases. A set of rules based on XPath expressions would be a good way to do this.
If possible, use a standard file format to store the description of what elements/attributes are translatable so that definition file can be shared with other tools.
Make sure the tool can deal with content where white space is significant (e.g. <pre> in XHTML), or not significant (e.g. <p> in XHTML). Also make sure the xml:space attribute is supported.
If possible, preserve the CDATA sections used in the original document.
Always preserve entity references.
Allow the tool to use specific 'sub-parsers' to process the content of any given element or the value of any given attribute. This makes it easy to support, for example, the content of a <script> element in XHTML, but also data in other formats such as SQL queries, embedded binary data in base64, CSS styles, or any other types of content the tool may support.
If possible, implement the tool so that it can optionally process XML documents that have a DTD declared but not available.
Make sure to have integrated support for XLIFF, the XML Localisation Interchange File Format.
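
The xml:lang handling described in the list above can be sketched roughly as follows. This is only an illustration, not how any particular tool works: the function name retarget and the sample document are hypothetical, and the sketch looks only at the text directly inside each element (it ignores mixed content and attributes).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def retarget(xml_text, source, target):
    """Hypothetical helper: collect the source-language strings and rewrite
    every explicit xml:lang equal to `source` to `target`."""
    root = ET.fromstring(xml_text)
    texts = []

    def walk(elem, inherited):
        # xml:lang is inherited from the nearest ancestor that declares it.
        lang = elem.get(XML_LANG, inherited)
        if lang == source:
            if XML_LANG in elem.attrib:
                elem.set(XML_LANG, target)
            if elem.text and elem.text.strip():
                texts.append(elem.text.strip())
        # Content in any other language is left alone and not offered.
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return root, texts

sample = '<doc xml:lang="en"><p>Hello</p><p xml:lang="fr">Bonjour</p></doc>'
root, texts = retarget(sample, "en", "fr")
print(texts)
```

Only the first paragraph is offered for translation here, because the second one already declares a language other than the source.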

How can I specify that elements with complex rules are translatable?

Currently very few translation tools allow you to specify complex rules to define what elements are translatable (or not). For example, in the following document only the content of the second <item> element is translatable, but most tools will not allow you to specify the properties of <item> according to the value of one (or more) of its attributes:

<?xml version="1.0" ?>
<repository>
 <item type="type">Type definition</item>
 <item type="text">Text to translate</item>
</repository>

One way to solve this issue is to pre-process the document to add a temporary element that the translation tools will be able to use directly. For instance, the document above could become the following, where the added element <tbt> indicates the content to translate:

<?xml version="1.0" ?>
<repository>
 <item type="type">Type definition</item>
 <item type="text"><tbt>Text to translate</tbt></item>
</repository>

The translation tools can use the added elements to easily recognize the content to localize. Such temporary elements are simply removed after localization. The pre-processing can be done using a simple XSL template. For example, the following one transforms our first listing into the second:

<?xml version="1.0" ?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>
 <xsl:template match="//item[@type='text']">
  <xsl:copy>
   <xsl:apply-templates select="@*"/>
   <tbt><xsl:apply-templates/></tbt>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>

More complex templates can be used, by simply adapting the XPath expression to match a specific set of nodes in the original document. See for instance, this other example.

See also the Internationalization Tag Set (ITS). It offers rules to define translatable and non-translatable parts of an XML document.
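
Removing the temporary <tbt> elements after localization, as mentioned above, could look something like the following sketch. It assumes Python's standard xml.dom.minidom module; the function name strip_markers and the translated sample are made up for illustration.

```python
from xml.dom import minidom

def strip_markers(xml_text, marker="tbt"):
    """Hypothetical post-processing step: unwrap every <tbt> element,
    keeping its (translated) content in place."""
    doc = minidom.parseString(xml_text)
    # Work on a copy of the list, since the tree is modified while looping.
    for node in list(doc.getElementsByTagName(marker)):
        parent = node.parentNode
        # Move each child in front of the marker, then drop the empty marker.
        while node.firstChild is not None:
            parent.insertBefore(node.firstChild, node)
        parent.removeChild(node)
    return doc.documentElement.toxml()

translated = (
    '<repository>'
    '<item type="type">Type definition</item>'
    '<item type="text"><tbt>Texte traduit</tbt></item>'
    '</repository>'
)
print(strip_markers(translated))
```

The same clean-up could equally be written as an XSL template; the point is only that the markers disappear while the translated text stays where it was.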

How do CDATA sections fare during translation?

Some translation tools do not support CDATA sections very well, although a good tool should handle them correctly. However, it might be difficult for the tool to retain the CDATA notation, especially if such a section occurs within content (rather than enclosing the whole content). For example:

<p>The codes <![CDATA[<tag1> and <tag2> ]]>indicate new items.</p>

once translated, will most likely end up as:

<p>Les codes &lt;tag1> and &lt;tag2> indiquent des nouveaux articles.</p>

instead of:

<p>Les codes <![CDATA[<tag1> and <tag2> ]]>indiquent des nouveaux articles.</p>

This is perfectly correct from the XML syntax viewpoint. CDATA sections are just a way to escape a block of text. It is a good idea to stay away from CDATA from a localization viewpoint anyway: NCRs cannot be used in CDATA sections, and this may lead to problems if you need to convert the encoding of a document.
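
A quick way to convince yourself of this equivalence is to parse both forms and compare the resulting text. The following sketch uses Python's standard xml.etree.ElementTree module and the English sentence from the example above; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

with_cdata = "<p>The codes <![CDATA[<tag1> and <tag2> ]]>indicate new items.</p>"
escaped = "<p>The codes &lt;tag1> and &lt;tag2> indicate new items.</p>"

# After parsing, the CDATA notation is gone: both forms yield the same text.
text_from_cdata = ET.fromstring(with_cdata).text
text_from_escaped = ET.fromstring(escaped).text
same_content = text_from_cdata == text_from_escaped
print(same_content)
```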

In some cases, however, you may want to restore the CDATA syntax in some elements. This can be done easily with XSLT: list the elements whose content should be written as CDATA in the cdata-section-elements attribute of the <xsl:output> element of the stylesheet. For example, the following stylesheet creates CDATA sections for all elements named prog and text.

<?xml version="1.0" ?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
 <xsl:output encoding="utf-8"
  cdata-section-elements="prog text"/>
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>

Is there any translatable text in a CSS style-sheet?

Translatable text in CSS is rare, but it does occur. As shown below, the value of the content property of the :before and :after pseudo-elements is often translatable:

warning-para:before { content: "Warning! "; border: 2px solid; color: red; }
summary-body:after { content: "End of Summary"; }

The choice of characters for the quotes property may also need to be changed to correspond to the appropriate locale preferences.

Any generated content needs to be looked at carefully. This includes numbered list generation, markers, and so forth. See more details on CSS generated content in the CSS specifications.

And obviously, any style defined in a CSS style-sheet may need to be modified for a given language.

Is there a standard set of localization directives?

Yes and no. There is a standard called the Internationalization Tag Set (ITS), which is a W3C Recommendation. While ITS is not exactly a standard for localization directives, some of its features can help you with this. ITS can be used as a namespace in any XML document, as in the following example:

<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
      xmlns:its="http://www.w3.org/2005/11/its">
 <head><title>Title</title></head>
 <body>
  <h1 id="100">Introduction to 
   <span its:term="yes">Document Management</span></h1>
  <p id="101">Our company, <span its:translate="no">Infinite Wisdom Inc.</span>,
   provides quality courses on how to manage your documentation.</p>
 </body>
</html>

Related information: the ITS Specification

Can I see some examples of XML documents causing localization problems?

Here are a few examples:

Example 1

The following document, while well-formed, is not easy to localize because there is, most of the time, no easy way to tell the translation tools that only elements whose names start with "msg" are translatable. In addition, using IDs in element names will most likely cause many other problems.

<?xml version="1.0" ?>
<error-messages xml:lang="en">
 <msg001>Cannot open file {0}.</msg001>
 <note001>Tip: {0} is the name of a file.</note001>
 <msg002>Invalid parameter.</msg002>
 <msg999>Connection not available. Please try later.</msg999>
</error-messages>

Instead, use an attribute to hold the changing ID:

<?xml version="1.0" ?>
<error-messages xml:lang="en">
 <msg myId="1">
  <text>Cannot open file {0}.</text>
  <note>Tip: {0} is the name of a file.</note>
 </msg>
 <msg myId="2">
  <text>Invalid parameter.</text>
 </msg>
 <msg myId="999">
  <text>Connection not available. Please try later.</text>
 </msg>
</error-messages>

Example 2

The following multilingual document is difficult to localize because most, if not all, of the translation tools currently available cannot select the elements to translate based on the value of xml:lang. In addition, to have a quick turn-around it would be better to translate French and Japanese at the same time, which means some extra work to merge the translations afterward.

<?xml version="1.0" ?>
<messages xml:lang="en">
 <prompt id="100">
  <data xml:lang="en">Press Enter to start.</data>
  <data xml:lang="fr">Press Enter to start.</data>
  <data xml:lang="ja">Press Enter to start.</data>
 </prompt>
 <prompt id="101">
  <data xml:lang="en">Press Cancel to stop.</data>
  <data xml:lang="fr">Press Cancel to stop.</data>
  <data xml:lang="ja">Press Cancel to stop.</data>
 </prompt>
</messages>

Instead, use one document per language, at least for the material you send to the localizer. If needed, after translation, you can group all entries in a single file, but treat that step as a "compilation-like" step to be done after localization.
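
Producing such a single-language document from the multilingual file could be sketched as follows. The element names come from the example above; the function name extract_language is hypothetical, and the reverse "compilation" step is not shown.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_language(xml_text, lang):
    """Hypothetical helper: keep only the <data> entries of one language."""
    root = ET.fromstring(xml_text)
    for prompt in root.findall("prompt"):
        # findall() returns a list, so removing entries while looping is safe.
        for data in prompt.findall("data"):
            if data.get(XML_LANG) != lang:
                prompt.remove(data)
    root.set(XML_LANG, lang)
    return root

multilingual = """<messages xml:lang="en">
 <prompt id="100">
  <data xml:lang="en">Press Enter to start.</data>
  <data xml:lang="fr">Press Enter to start.</data>
  <data xml:lang="ja">Press Enter to start.</data>
 </prompt>
 <prompt id="101">
  <data xml:lang="en">Press Cancel to stop.</data>
  <data xml:lang="fr">Press Cancel to stop.</data>
  <data xml:lang="ja">Press Cancel to stop.</data>
 </prompt>
</messages>"""

french_only = extract_language(multilingual, "fr")
print(ET.tostring(french_only, encoding="unicode"))
```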

Example 3

The following document has one small problem: the content of <srcid> is a unique ID that would be perfect to use for safe leveraging, but tools have more difficulty picking up an ID from element content than from an attribute.

<?xml version="1.0" ?>
<resources>
 <res>
  <srcid>Module123.description::top</srcid>
  <text>Moves to the top of the list.</text>
 </res>
 <res>
  <srcid>Module123.description::bottom</srcid>
  <text>Moves to the end of the list.</text>
 </res>
</resources>

Many tools would work better if the unique ID were stored in an attribute of either <res> or <text>. For example:

<?xml version="1.0" ?>
<resources>
 <res srcid="Module123.description::top">
  <text>Moves to the top of the list.</text>
 </res>
 <res srcid="Module123.description::bottom">
  <text>Moves to the end of the list.</text>
 </res>
</resources>

Example 4

The following document offers a challenge when it comes to specifying what is translatable and what is not. The only way to distinguish the translatable item <data type="text">Cancel</data> from the non-translatable item <data type="text">images/cancel.gif</data> is the value of the type attribute of their respective <component> parent elements. As of today most translation tools do not have a mechanism to specify translatable parts that can address this. Most tools can only specify translatability at the element and attribute level, without conditions.

<?xml version="1.0" ?>
<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <data type="text">Cancel</data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>

The best way for tools to address this type of conditional translation would be to support XPath expressions. For instance, in our example the translatable nodes can be expressed as:

//component[@type='caption']/data[@type='text']
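
To try the expression on the document above, a short sketch using Python's standard xml.etree.ElementTree module is enough. Note one assumption: ElementTree supports only a subset of XPath and does not accept an absolute path on an element, so the expression is written with a leading ".//" instead of "//".

```python
import xml.etree.ElementTree as ET

dialogue = """<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <data type="text">Cancel</data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>"""

root = ET.fromstring(dialogue)
# Select only the text data of caption components.
nodes = root.findall(".//component[@type='caption']/data[@type='text']")
translatable = [node.text for node in nodes]
print(translatable)
```

The image path is not selected, even though it sits in an element with the same name and the same type attribute as the caption text.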

If you want to avoid such conditional situations, an alternative way to design this XML format would be to have distinct elements for the data types of a <component>, as shown below:

<?xml version="1.0" ?>
<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <url>images/cancel.gif</url>
   <coordinates>12,20,50,14</coordinates>
  </component>
  <component id="789" type="caption">
   <text>Cancel</text>
   <coordinates>12,34,50,14</coordinates>
  </component>
 </rsrc>
</dialogue>

If the format cannot be changed, the solution is to run an XSL template before translation that adds extra elements delimiting the content to translate, as shown below with the <tbt> (to-be-translated) element. These elements can be removed afterward.

<?xml version="1.0" ?>
<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <data type="text"><tbt>Cancel</tbt></data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>

The XSL template to get this <tbt> addition is the following:

<?xml version="1.0" ?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>
 <xsl:template match="//component[@type='caption']/data[@type='text']">
  <xsl:copy>
   <xsl:apply-templates select="@*"/>
   <tbt><xsl:apply-templates/></tbt>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>

Example 5

The following document has several translatable items ("Corporate Policy", "ABC Corporation - Policy Repository" and "List of Available Policies"), but there is no easy way to identify them, even using an XPath expression: the <string> elements represent both keys and values, and in addition not all values are translatable.

<?xml version="1.0" ?>
<resources>
 <section id="Homepage">
  <arguments>
   <string>standard_page</string>
   <string>childlist</string>
  </arguments>
  <variables>
   <string>POLICY</string>
   <string>Corporate Policy</string>
  </variables>
  <keyvalue_pairs>
   <string>bar_Title</string>
   <string>ABC Corporation - Policy Repository</string>
   <string>bgColor</string>
   <string>NavajoWhite</string>
   <string>title</string>
   <string>List of Available Policies</string>
  </keyvalue_pairs>
 </section>
</resources>

A better way to store those data would be to have only one element for each key/value pair, and a distinction between translatable values and non-translatable ones, as shown below:

<?xml version="1.0" ?>
<resources>
 <section id="Homepage">
  <arguments>
   <string key="standard_page">childlist</string>
  </arguments>
  <variables>
   <textstring key="POLICY">Corporate Policy</textstring>
  </variables>
  <keyvalue_pairs>
   <textstring key="bar_Title">ABC Corporation - Policy Repository</textstring>
   <string key="bgColor">NavajoWhite</string>
   <textstring key="title">List of Available Policies</textstring>
  </keyvalue_pairs>
 </section>
</resources>

Example 6

In the following document the text to translate is to be processed with some proprietary tool before being displayed and the XML document is only a storage medium. The problem for localization here is that non-standard markup exists in the document. The [var Destination], [bold] and [/bold] inline codes cannot be protected easily.

<?xml version="1.0" ?>
<KBase>
 <Section id="a123">
  <Title>ProSentinel prevented access to port [var Destination]</Title>
  <Para id="a123-1">The ProSentinel firewall stopped network traffic from 
reaching your computer. No security breach has occurred. [bold]Your data 
are safe[/bold].</Para>
 </Section>
</KBase>

It would be much better to use XML-type inline codes (if necessary using a different namespace than the container format), and to use an XSL template to transform the content to the proprietary codes. The original XML document would then be much easier to localize and look something like this:

<?xml version="1.0" ?>
<KBase>
 <Section id="a123">
  <Title>ProSentinel prevented access to port <var name="Destination"/></Title>
  <Para id="a123-1">The ProSentinel firewall stopped network traffic from 
reaching your computer. No security breach has occurred. <emphasis>Your data 
are safe</emphasis>.</Para>
 </Section>
</KBase>
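
The text above suggests an XSL template for converting the XML inline codes back to the proprietary ones. As a rough illustration of the same idea, here is a sketch in Python instead; the function name to_proprietary is hypothetical, and the mapping of <emphasis> to [bold] is inferred from the two listings.

```python
import xml.etree.ElementTree as ET

def to_proprietary(elem):
    """Hypothetical converter: render an element's content as a string,
    replacing XML inline codes with the proprietary bracket codes."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "var":
            parts.append("[var %s]" % child.get("name"))
        elif child.tag == "emphasis":
            parts.append("[bold]" + to_proprietary(child) + "[/bold]")
        else:
            # Unknown inline elements: keep their text, drop the markup.
            parts.append(to_proprietary(child))
        # Text that follows the inline element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)

title = ET.fromstring(
    '<Title>ProSentinel prevented access to port <var name="Destination"/></Title>'
)
print(to_proprietary(title))
```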