Question about \uxxxx etc. for 21-bit code points - need advice

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue May 23 2000 - 14:41:44 EDT


Hello,

We (ICU) are trying to figure out how best to specify non-BMP (21-bit) code points in strings, using escape sequences or something similar.

Problem:
The C language has \ooo with octal digits for bytes of whatever encoding, and modern compilers also know \xhh with hexadecimal digits (with variable numbers of digits).
Java introduced \uhhhh with (always 4) hexadecimal digits for Unicode code units.

But how does one write a non-BMP code point in this fashion?

Below I list some suggestions and make a proposal. I would also like to know what you are doing, and what other people/standards/organizations/languages are planning to do.

- One could use a pair of code units, UTF-16 style:
  \ud89a\udcba
  This is clumsy because
  + it is long
  + the code point needs to be factored into surrogates
  + it works all right only if the underlying string encoding
    is UTF-16; if UTF-32 or UTF-8 are used internally, then
    the escape-sequence parser actually needs to detect two
    subsequent \u's, make sure that they form a matched pair,
    and combine them into a code point.
    For UTF-8, it then has to be factored again into bytes.

- In UTR 18, Mark Davis suggests a syntax
  \vhhhhhh
  with exactly 6 hexadecimal digits.
  Drawback: I am afraid of confusion with the ANSI C language
  \v
  for the vertical TAB.

- How about - and I propose this here -
  \whhhhhh
  with, again, 6 hexadecimal digits?
  It is simple, and for English speakers it has the benefit of
  being mnemonic because of connotations with "wide" and the
  letter being called a "double u" - which is more than a "\u" :-)
  It is not used in C.

- Should there be a delimited, variable-length form like
  \whh...h;
  or
  \w{hh...h}
  or similar, closer to HTML?
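
To make the first and third options concrete, here is a minimal Python sketch of an escape-sequence parser. It is only an illustration, not ICU code: the names to_surrogates, from_surrogates and unescape are hypothetical, and \whhhhhh is the form proposed above, not an existing standard. The parser detects two consecutive \u escapes that form a matched surrogate pair and combines them into one code point, and it reads the 6-digit form directly. It does not handle an escaped backslash or the variable-length forms.

```python
import re

def to_surrogates(cp):
    """Factor a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(hi, lo):
    """Combine a lead/trail surrogate pair into a single code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a matched surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# Alternatives are tried left to right, so a matched pair wins over a lone \u.
_ESCAPE = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"
    r"|\\u([0-9a-fA-F]{4})"   # single 4-digit escape
    r"|\\w([0-9a-fA-F]{6})"   # hypothetical 6-digit escape proposed above
)

def unescape(text):
    """Replace \\uhhhh and \\whhhhhh escapes in text with the characters they name."""
    def repl(m):
        hi, lo, u4, w6 = m.groups()
        if hi is not None:
            return chr(from_surrogates(int(hi, 16), int(lo, 16)))
        cp = int(u4 if u4 is not None else w6, 16)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("unpaired surrogate in escape")
        if cp > 0x10FFFF:
            raise ValueError("code point out of range")
        return chr(cp)
    return _ESCAPE.sub(repl, text)

# The pair from the first option and the 6-digit form name the same code point.
same = unescape(r"\ud89a\udcba") == unescape(r"\w0368ba")
```

With a UTF-8 internal encoding, the combined code point would then still have to be factored into bytes (in Python, by calling .encode("utf-8") on the result), which is the extra step mentioned in the first option.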

Of course, a longer form would coexist with the common \uhhhh, so in practice the longer one would be used only for code points >0xffff. This seems to remove the motivation for a variable-length form. For ICU resource bundles, the 2-digit \xhh (for the Latin-1 subset) and the 4-digit \uhhhh already coexist.
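
As a rough sketch of how the fixed-length forms could coexist, the following Python function picks the shortest form for each character. The function name escape is hypothetical, \xhh is assumed to take exactly 2 digits as described for resource bundles, and \whhhhhh is the proposed form, not an existing one.

```python
def escape(text):
    """Escape everything outside printable ASCII with the shortest fixed-length form."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x7F and ch != "\\":
            out.append(ch)                    # printable ASCII stays as-is
        elif cp <= 0xFF:
            out.append("\\x%02x" % cp)        # 2 digits: Latin-1 subset
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)        # 4 digits: rest of the BMP
        else:
            out.append("\\w%06x" % cp)        # 6 digits: code points > 0xffff
    return "".join(out)

sample = escape("A\u00e9\u20ac" + chr(0x368BA))
```

Because every form has a fixed number of digits, a digit that follows an escape cannot be misread as part of it, which is the reason a delimiter is not needed here.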

I don't know what Java is planning to do, or if C/C++ standards actually deal with Unicode and related issues at all (beyond what I read in the ANSI C standard from 1990).

What are Microsoft or Apple planning?

Markup languages for comparison:

HTML, XML, and related languages already have a mechanism for referencing any Unicode code point, although only the XML specification explicitly states that the range reaches up to 0x10ffff. The older HTML specification only refers to "ISO 10646 character numbers"; since it refers to the ISO UCS, I assume that it actually allows code points up to 0x7fffffff.

Syntactically, however, &#dd...d; and &#xhh...h; do not fit in well with backslash-escapes.
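
For comparison, this small Python sketch decodes both numeric character reference forms for one non-BMP code point. It uses html.unescape from the Python standard library; the choice of U+1D11E as the sample code point is arbitrary.

```python
import html

# Decimal and hexadecimal references to the same non-BMP code point, U+1D11E.
decimal_ref = html.unescape("&#119070;")
hex_ref = html.unescape("&#x1D11E;")

# Both forms are delimited by ';', so the number of digits can vary freely.
refs_agree = decimal_ref == hex_ref == chr(0x1D11E)
```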

Please advise!

markus

HTML and XML references:

HTML: http://www.w3.org/TR/html401/charset.html#entities Chapter 5.3 "Character references" specifies the decimal and hexadecimal numeric character references as "ISO 10646 character numbers" without explicitly mentioning the range of those numbers.

XML: http://www.w3.org/TR/REC-xml Chapter 4.1 "Character and Entity References" refers to _code points_ of ISO/IEC 10646.
In the same document, Chapter 2.2 "Characters" specifies the character range for XML to be that of UTF-16 (minus characters that are not legal in XML).


