opentag.com - XML FAQ: Character Representation

Here you will find answers to some frequently asked questions about character representation in XML and related technologies. If you find any mistakes or have suggestions for additional useful information, please send an email.

What characters can I use in an XML document?
What is an NCR?
Does 146 in `&#146;` correspond to Unicode or the document's encoding?
Can I use character entity references like `&aacute;` in an XML document?
How do I define a character entity?
Can element and attribute names use non-ASCII characters?
How do I represent extended characters in CSS?
How do I represent extended characters in URIs (URNs/URLs)?

What characters can I use in an XML document?

The XML specifications define the list of the characters allowed.

Most Unicode characters are allowed in an XML document. They are: U+0009, U+000A, U+000D, [U+0020-U+D7FF], [U+E000-U+FFFD], and [U+10000-U+10FFFF].
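
As a rough illustration, the ranges listed above can be checked with a few lines of Python. The names `XML_CHAR_RANGES`, `is_xml_char` and `find_invalid` are hypothetical helpers introduced here; they are not part of any library.

```python
# Code point ranges allowed in an XML 1.0 document, as listed above.
XML_CHAR_RANGES = (
    (0x0009, 0x000A),
    (0x000D, 0x000D),
    (0x0020, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)

def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch is allowed in XML 1.0."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)

def find_invalid(text: str) -> list:
    """List (index, code point) pairs for characters not allowed in XML 1.0."""
    return [(i, f"U+{ord(c):04X}") for i, c in enumerate(text) if not is_xml_char(c)]

print(find_invalid("a\x00b"))  # [(1, 'U+0000')]
```

A check like this only covers what is allowed, not what is recommended: compatibility characters, for example, pass the test.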

In XML, use normalized characters, as described in the document Character Model for the World Wide Web (Working Draft).

Note that the use of compatibility characters is not recommended. For more detailed information on the non-suitability of some Unicode characters in XML, see the Unicode Technical Report #20.

Some limitations in XML 1.0 prevent certain languages from using all of their characters in XML names (element or attribute names, for example). XML version 1.1 addresses these limitations.

What is an NCR?

NCR stands for Numeric Character Reference. It is the term often used to designate a character written in hexadecimal or decimal format in XML. The hexadecimal form is the preferred form (easier to refer to the Unicode value).

Hexadecimal notation: `&#xHHH;` where HHH is the hexadecimal value of the given Unicode character. For example, `&#xE9;` or `&#x00E9;` for the character é (e-acute).
Decimal notation: `&#DDD;` where DDD is the decimal value of the given Unicode character. For example, `&#233;` for the character é (e-acute).
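
The two notations can be produced from a character's Unicode value, as in this minimal Python sketch. The function names `ncr_hex` and `ncr_dec` are hypothetical and chosen only for this illustration.

```python
def ncr_hex(ch: str) -> str:
    """Hexadecimal NCR for a single character."""
    return f"&#x{ord(ch):X};"

def ncr_dec(ch: str) -> str:
    """Decimal NCR for a single character."""
    return f"&#{ord(ch)};"

print(ncr_hex("é"))  # &#xE9;
print(ncr_dec("é"))  # &#233;
print(ncr_hex("€"))  # &#x20AC;
```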

NCRs (and character entity references) cannot be used in element and attribute names, in CDATA sections, in processing instructions, or in comments.

Conversion Issue - You need NCRs only when a character is not included in the encoding you are using. However, in some cases it is advisable to use NCRs even if the encoding supports the character. That is the case, for example, with a few Japanese characters that present conversion problems when going from one encoding to another, or from an encoding to Unicode. The problem and the characters are documented in the XML Japanese Profile.
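
To illustrate the first point, Python's built-in `xmlcharrefreplace` error handler writes a decimal NCR only for the characters the target encoding cannot represent. The sample string is made up for this sketch. The handler does not escape `&` or `<`, so it is not a complete XML serializer.

```python
text = "Café: 5 €"

# é exists in ISO-8859-1 and is written as a raw byte; € does not, so it becomes an NCR.
latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(latin1)  # b'Caf\xe9: 5 &#8364;'

# In US-ASCII neither character exists, so both become NCRs.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'Caf&#233;: 5 &#8364;'
```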

Does 146 in `&#146;` correspond to Unicode or the document's encoding?

The value of an NCR is always the value of the Unicode code-point, not the encoding of the file. So 146 in `&#146;` corresponds to the Unicode decimal code-point 146: the control character PRIVATE USE TWO (U+0092), not the character RIGHT SINGLE QUOTATION MARK (U+2019).

For example, if your document needs to display the character ’ (RIGHT SINGLE QUOTATION MARK, U+2019), you should use `&#x2019;` (or `&#8217;`), not `&#x92;` (or `&#146;`), even if the file is encoded in windows-1252, where 146 is the RIGHT SINGLE QUOTATION MARK.

Note that with some browsers, especially on Windows, when displaying XHTML, the notation `&#146;` will actually display the character ’ correctly. This is not standard behavior.
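
The difference can be observed with a conforming parser. This sketch uses Python's standard `xml.etree.ElementTree` module; the `<p>` wrapper element is only there to make the input well-formed.

```python
import xml.etree.ElementTree as ET

# Byte 146 (0x92) interpreted as windows-1252 is RIGHT SINGLE QUOTATION MARK.
cp1252_char = bytes([146]).decode("cp1252")

# The NCR with value 146 resolves to the Unicode code point 146 (U+0092),
# whatever the encoding of the file is.
ncr_146 = ET.fromstring("<p>&#146;</p>").text

# The NCR with value 8217 (hexadecimal 2019) resolves to U+2019.
ncr_8217 = ET.fromstring("<p>&#8217;</p>").text

print(f"U+{ord(cp1252_char):04X}")  # U+2019
print(f"U+{ord(ncr_146):04X}")      # U+0092
print(f"U+{ord(ncr_8217):04X}")     # U+2019
```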

Can I use character entity references like `&aacute;` in an XML document?

Yes, but only if you define them. There are however five character entities that can be referenced without any declaration. They are:

Entity	Entity Reference	Description	Value
amp	`&amp;`	ampersand (&)	U+0026
lt	`&lt;`	less-than (<)	U+003C
apos	`&apos;`	apostrophe (')	U+0027
quot	`&quot;`	double quote (")	U+0022
gt	`&gt;`	greater-than (>)	U+003E

Most HTML parsers are forgiving and allow the use of character entity references such as `&aacute;` without declaration. This should not be permitted with XML.
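
As a quick check of this behavior, the sketch below feeds three small documents to Python's standard `xml.etree.ElementTree` parser. The helper name `parses` and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(xml_text: str) -> bool:
    """Return True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(parses("<p>R&amp;D &lt;tag&gt;</p>"))  # True: predefined entities
print(parses("<p>caf&eacute;</p>"))          # False: undeclared entity
print(parses("<p>caf&#xE9;</p>"))            # True: NCR
```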

Because UTF-8 and UTF-16 support all valid characters, there is really no reason to use NCRs or character entity references with these encodings (except to escape special characters and, in some special cases, to avoid conversion problems). If you do want to use some form of "escaped" notation for the characters, you probably want to use NCRs rather than character entity references.

How do I define a character entity?

A character entity (or any entity for that matter) is defined outside the body of your document, often in a separate file so it can be reused with several documents.

The declaration statement is:

<!ENTITY entity_name "NCR">

For example a character entity for the Euro currency symbol could be declared as:

<!ENTITY euro "&#x20AC;">

And it would be used in a document as follows:

<p>the character &euro; is the symbol for the Euro.</p>
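
To see the entity being resolved, here is a small sketch using Python's standard `xml.etree.ElementTree` module. For brevity it assumes the declaration is placed in an internal DTD subset of the document itself rather than in a separate file.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset (an assumption made
# for this sketch), then referenced in the body of the document.
doc = """<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY euro "&#x20AC;">
]>
<p>the character &euro; is the symbol for the Euro.</p>"""

p = ET.fromstring(doc)
print(p.text)  # the character € is the symbol for the Euro.
```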

Entity declaration statements should be in the DTD or in a module included in the DTD. To include an entity declaration set in a DTD use for example:

<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;

You can find entity declarations for many characters in the XHTML Entity Sets:
- http://www.w3.org/TR/2001/WD-xhtml1-20011004/DTD/xhtml-lat1.ent,
- http://www.w3.org/TR/2001/WD-xhtml1-20011004/DTD/xhtml-special.ent,
- http://www.w3.org/TR/2001/WD-xhtml1-20011004/DTD/xhtml-symbol.ent.

Note: Always keep in mind that, from a localization viewpoint, using numeric character references (NCRs) is better than using character entity references.

Can element and attribute names use non-ASCII characters?

Yes. Any Unicode character that is valid in an XML name can be used. For example, here is a document that uses a Russian document type with Japanese content.

<?xml version="1.0" encoding="utf-8" ?>
<Собирание версия="1.2-3">
 <Объект id="12">
  <НомерОбъекта>45-3454-123</НомерОбъекта>
  <ВНаличии>123</ВНаличии>
  <Описание xml:lang="ja">第二発電機</Описание>
 </Объект>
 <Объект id="64">
  <НомерОбъекта>45-7894-456</НомерОбъекта>
  <ВНаличии>123</ВНаличии>
  <Описание xml:lang="ja">手動ウォーター・ポンプ</Описание>
 </Объект>
</Собирание>

As XML names cannot use NCRs, the encoding of a document must support all characters used in the element and attribute names it contains.
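
One way to test this constraint before saving a document is to try encoding each name in the target encoding. In this rough sketch, `names_not_in_encoding` is a hypothetical helper, and the names are taken from the sample document above.

```python
def names_not_in_encoding(names, encoding):
    """Return the names that cannot be written in the given encoding."""
    bad = []
    for name in names:
        try:
            name.encode(encoding)
        except UnicodeEncodeError:
            bad.append(name)
    return bad

names = ["Объект", "id", "Описание"]

print(names_not_in_encoding(names, "utf-8"))       # []
print(names_not_in_encoding(names, "koi8-r"))      # []
print(names_not_in_encoding(names, "iso-8859-1"))  # ['Объект', 'Описание']
```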

Note that in XML 1.0 the definition of XML names does not allow certain languages to make use of all their characters. XML 1.1 helps in solving this potential problem.

Warning: Many translation tools may not handle non-ASCII element and attribute names very well.

How do I represent extended characters in CSS?

A CSS file can be in any encoding, so most of the time, specifying the right encoding will allow you to use any extended character directly in the CSS code. However, at times you may not be able to represent a given character in its raw form. You can then use the hexadecimal notation \HHH where HHH is the hexadecimal Unicode value of the given character.

To avoid confusion between the hexadecimal value and the text that follows (for example, is R\26D R+\26+D or R+\26D?), you may want to add a trailing space after the value. The first space after the hexadecimal notation is ignored, so "R\26 D" correctly displays as "R&D".

Example of CSS code with extended character in hexadecimal notation:

@import "main.css";
/* Translatable data */
arguments:before {
 content: "Param\E8tres\0A:\0A";
 font-weight: bold
}
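
The notation can be generated mechanically. The following Python sketch uses a hypothetical helper, `css_escape`, that replaces every non-ASCII character with its \HHH form and always appends the trailing space, so the text that follows is never read as part of the value. It does not handle quotes, backslashes or line breaks already present in the string.

```python
def css_escape(text: str) -> str:
    """Replace non-ASCII characters with the CSS \\HHH notation plus a space."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # The trailing space ends the escape and is ignored by CSS.
            out.append(f"\\{ord(ch):X} ")
    return "".join(out)

print(css_escape("Paramètres"))  # Param\E8 tres
```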

How do I represent extended characters in URIs (URNs/URLs)?

[This section needs to be updated]

URIs (Uniform Resource Identifiers) can contain non-ASCII characters. The way considered best to encode such URIs is the following:

Normalize the characters if needed.
Encode the string in UTF-8.
Replace each byte of the UTF-8 string that is greater than 127 (and any other byte that is considered unsafe as described in RFC 2396) by its escaped form %HH where HH is the hexadecimal value of the given byte.

The following examples show both the raw and coded versions of URIs:

urn:TéléCom-Schémas:Facture:v2
urn:T%C3%A9l%C3%A9Com-Sch%C3%A9mas:Facture:v2

http://www.Работа.bg
http://www.%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%B0.bg
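
The three steps can be sketched in Python with the standard `unicodedata` and `urllib.parse` modules. The helper name `encode_uri`, the choice of NFC as the normalization form, and the set of characters left unescaped (`:` and `/`) are assumptions made for these two examples; real URIs with queries or fragments need a larger set.

```python
import unicodedata
from urllib.parse import quote

def encode_uri(uri: str) -> str:
    """Normalize, encode in UTF-8, then %HH-escape the bytes above 127."""
    normalized = unicodedata.normalize("NFC", uri)
    # quote() encodes the string in UTF-8 by default before escaping.
    return quote(normalized, safe=":/")

print(encode_uri("urn:TéléCom-Schémas:Facture:v2"))
# urn:T%C3%A9l%C3%A9Com-Sch%C3%A9mas:Facture:v2

print(encode_uri("http://www.Работа.bg"))
# http://www.%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%B0.bg
```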

More information on internationalized URIs and some code samples of conversion routines are available on the W3C Internationalization page discussing URIs and other identifiers.