opentag.com - Text Filtering

Text is stored in many formats: tagged documents, database tables, resource files, and many other kinds of repositories. Some formats are easily accessible for translation, and tools have been developed to handle them directly (Word documents or HTML files, for example). In many other cases, accessing the text to translate is not that simple, even without using specific translation tools: the translatable data is mixed with codes, or even stored in binary formats.

To solve that problem, localization processes often use some form of dedicated tool that filters the original file and creates a temporary copy of it in an alternate format where the text to translate is easily accessible.

There are different approaches to storing the non-translatable parts: some tools place the translatable text and the codes in different files (extraction), while others simply encapsulate the codes in place. Extraction has the small advantage that the method is usable regardless of the original format, while encapsulation has some limitations when it comes to native formats that are not text-based.
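The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the "codes" are taken to be HTML-like tags, and the function names, the skeleton structure, and the `[[ ]]` markers are hypothetical rather than taken from any of the tools mentioned here.

```python
import re

# Assumption: inline codes look like HTML/XML tags.
CODE = re.compile(r"<[^>]+>")

def extract(source):
    """Extraction: separate the codes (skeleton) from the translatable units."""
    skeleton, units = [], []
    for part in re.split(r"(<[^>]+>)", source):
        if not part:
            continue
        if CODE.fullmatch(part) or not part.strip():
            # Tags and whitespace-only runs stay in the skeleton.
            skeleton.append(("code", part))
        else:
            units.append(part)
            skeleton.append(("text", len(units) - 1))
    return skeleton, units

def merge(skeleton, units):
    """Rebuild a file from the skeleton and a (possibly translated) unit list."""
    return "".join(units[v] if kind == "text" else v for kind, v in skeleton)

def encapsulate(source):
    """Encapsulation: keep one file, but wrap each code in protective markers."""
    return CODE.sub(lambda m: "[[" + m.group() + "]]", source)
```

With `source = "<p>Hello <b>world</b></p>"`, `extract` yields two text units that can be translated on their own and merged back with `merge`, while `encapsulate` keeps everything in one stream with the tags marked as protected.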

The text extraction method has been used for quite a while in localization. Early examples include the filters developed at Eurosoft and ILE in the late 1980s, the mechanism used in Joust/TSS by Alpnet (one of the first TM systems), the filtering process used in IDOC's XL8 tool, and the early S-Tagger application developed by ITP for MIF and Interleaf files.

There are two non-proprietary formats for extracted text: OpenTag and XLIFF. Both are XML applications and take advantage of the internationalization support built into XML. OpenTag was the forerunner of XLIFF. XLIFF is now widely regarded as the industry standard.
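As a rough illustration of what extracted text looks like in such a format, the following sketch writes a list of text units into a minimal XLIFF document using Python's standard library. It assumes XLIFF version 1.2 (the text above does not name a version), and the helper names `q` and `build_xliff`, the default `datatype` value, and the sample file name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: XLIFF 1.2 and its namespace.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)

def q(name):
    """Return the namespace-qualified tag name."""
    return "{%s}%s" % (XLIFF_NS, name)

def build_xliff(original, units, source_lang="en"):
    """Wrap extracted text units in a minimal XLIFF document."""
    root = ET.Element(q("xliff"), {"version": "1.2"})
    file_el = ET.SubElement(root, q("file"), {
        "original": original,
        "source-language": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, q("body"))
    for number, text in enumerate(units, start=1):
        unit = ET.SubElement(body, q("trans-unit"), {"id": str(number)})
        ET.SubElement(unit, q("source")).text = text
    # Special characters such as & are escaped by the serializer.
    return ET.tostring(root, encoding="unicode")
```

Calling `build_xliff("menu.rc", ["Open", "Save & Close"])` returns one `trans-unit` per string, each with a `source` element that a translation tool can pair with a `target`.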

One of the problems of filtering translatable text is knowing which text is translatable. Some formats have very clear rules that allow filters to do the job easily, but many are more challenging.

XML vocabularies are an example of this difficulty: while parsing XML is easy, knowing which elements and attributes to localize is another matter, especially since anyone can define their own XML format. This could be addressed by defining XML localization properties.
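One way to picture such localization properties is as a small rule set that tells a generic XML filter which elements and attributes hold translatable text. The sketch below assumes a made-up rule format (a Python dictionary) and a made-up vocabulary; it is not an actual localization properties specification, and it ignores mixed content for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for a hypothetical vocabulary.
RULES = {
    "elements": {"title", "para"},       # element content to translate
    "attributes": {("image", "alt")},    # (element, attribute) pairs to translate
}

def translatable_items(xml_text, rules):
    """Return (location, text) pairs for content the rules mark as translatable."""
    root = ET.fromstring(xml_text)
    items = []
    for el in root.iter():
        # Only the text directly inside the element is considered here.
        if el.tag in rules["elements"] and el.text and el.text.strip():
            items.append((el.tag, el.text.strip()))
        for name, value in el.attrib.items():
            if (el.tag, name) in rules["attributes"] and value.strip():
                items.append(("%s/@%s" % (el.tag, name), value))
    return items
```

The same filter can then handle a different vocabulary by swapping the rules, without any change to the parsing code.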

Server-side formats, where the source documents are really made of two different formats combined, and files with scripts embedded in various places are other cases where identifying translatable text can be difficult. One way to address this, when it cannot be solved by normal internationalization practices, is to use localization directives.
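The idea can be sketched as follows, assuming that directives are special comments placed in the source file to tell the filter what to leave alone. The `loc:skip`, `loc:begin-skip`, and `loc:end-skip` names, the `//` comment style, and the rule that every double-quoted string is translatable by default are all hypothetical choices made for this illustration.

```python
import re

# Double-quoted string literals, allowing backslash escapes.
STRING = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')
# Hypothetical directive syntax inside // comments.
DIRECTIVE = re.compile(r"//\s*loc:(begin-skip|end-skip|skip)\b")

def extract_strings(lines):
    """Collect string literals, honoring skip directives found in comments."""
    found = []
    in_block = False    # inside a begin-skip / end-skip block
    skip_next = False   # a single-line skip applies to the next line
    for line in lines:
        directive = DIRECTIVE.search(line)
        if directive:
            name = directive.group(1)
            if name == "begin-skip":
                in_block = True
            elif name == "end-skip":
                in_block = False
            else:
                skip_next = True
            continue
        if in_block or skip_next:
            # Note: loc:skip applies to the very next line, even a blank one.
            skip_next = False
            continue
        found.extend(STRING.findall(line))
    return found
```

A script author would then mark strings such as keys or protocol names with a directive, and the filter would pass over them while still extracting the user-visible messages.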