RE: Question about \uxxxx etc. for 21-bit code points - need advice

From: Paul Dempsey (paulde@microsoft.com)
Date: Tue May 23 2000 - 15:00:40 EDT


From section 2.2 of a final draft of ISO/IEC FDIS 14882, Programming
languages --- C++:

-------------------------------
-2- The universal-character-name construct provides a way to name other
characters.

hex-quad:
        hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN is that
character whose character short name in ISO/IEC 10646 is NNNNNNNN; the
character designated by the universal-character-name \uNNNN is that
character whose character short name in ISO/IEC 10646 is 0000NNNN. If the
hexadecimal value for a universal character name is less than 0x20 or in the
range 0x7F-0x9F (inclusive), or if the universal character name designates a
character in the basic source character set, then the program is ill-formed.

-------------------------------
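
As a rough illustration of the grammar quoted above, here is a small sketch
of the two forms in string literals. It assumes a C++11 or later compiler:
the char16_t/char32_t string types and the u"" / U"" prefixes are not part of
the draft quoted here, and the function name is made up for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: checks how the \u and \U forms come out in
// UTF-32 and UTF-16 string literals (C++11 or later assumed).
void check_universal_character_names() {
    // U+1D11E lies outside the BMP, so it needs the eight-digit \U form.
    const std::u32string clef32 = U"\U0001D11E";
    const std::u16string clef16 = u"\U0001D11E";

    // One code unit in UTF-32, a surrogate pair in UTF-16.
    assert(clef32.size() == 1 && clef32[0] == 0x1D11E);
    assert(clef16.size() == 2 && clef16[0] == 0xD834 && clef16[1] == 0xDD1E);

    // \uNNNN is shorthand for \U0000NNNN.
    const std::u32string euro = U"\u20AC";
    assert(euro == U"\U000020AC");
}
```

The source text always names the code point itself; how many code units it
turns into depends only on the type of the literal.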

--- Paul
> -----Original Message-----
> From: Markus Scherer [mailto:markus.scherer@jtcsv.com]
> Sent: Tuesday, May 23, 2000 11:37 AM
> To: Unicode List
> Subject: Question about \uxxxx etc. for 21-bit code points -
> need advice
>
>
> Hello,
>
> we (ICU) are trying to figure out how best to specify non-BMP
> (21-bit) code points with escape sequences or similar in strings.
>
> Problem:
> The C language has \ooo with octal digits for bytes of
> whatever encoding, and modern compilers also know \xhh with
> hexadecimal digits (with variable numbers of digits).
> Java introduced \uhhhh with (always 4) hexadecimal digits for
> Unicode code units.
>
> But how does one write a non-BMP code point in this fashion?
>
> I am trying to list some suggestions, make a proposal, and
> ask you for what you are doing or other
> people/standards/organizations/languages are planning to do.
>
> - One could use a pair of code units, UTF-16 style:
> \ud89a\udcba
> This is clumsy because
> + it is long
> + the code point needs to be factored into surrogates
> + it works all right only if the underlying string encoding
> is UTF-16; if UTF-32 or UTF-8 are used internally, then
> the escape-sequence parser actually needs to detect two
> subsequent \u's, make sure that they form a matched pair,
> and combine them into a code point.
> For UTF-8, it then has to be factored again into bytes.
>
> - In UTR 18, Mark Davis suggests a syntax
> \vhhhhhh
> with exactly 6 hexadecimal digits.
> Drawback: I am afraid of confusion with the ANSI C language
> \v
> for the vertical TAB.
>
> - How about - and I propose this here -
> \whhhhhh
> with, again, 6 hexadecimal digits?
> It is simple, and for English speakers it has the benefit of
> being mnemonic because of connotations with "wide" and the
> letter being called a "double u" - which is more than a "\u" :-)
> It is not used in C.
>
> - Should there be a delimited, variable-length form like
> \whh...h;
> or
> \w{hh...h}
> or similar, closer to HTML?
>
> Of course, a longer form would coexist with the common
> \uhhhh, so that the longer one would be in practice used only
> for code points >0xffff. This seems to remove the motivation
> for a variable-length form. For ICU resource bundles, the
> 2-digit \xhh (for the Latin-1 subset) and the 4-digit \uhhhh
> already coexist.
>
> I don't know what Java is planning to do, or if C/C++
> standards actually deal with Unicode and related issues at
> all (beyond what I read in the ANSI C standard from 1990).
>
> What are Microsoft or Apple planning?
>
> Markup languages for comparison:
>
> The HTML and XML and related languages already have a
> mechanism for referencing any Unicode code point, although
> only the XML specification actually explicitly talks about
> the range reaching up to 0x10ffff. The older HTML
> specification only refers to "ISO 10646 character numbers",
> but by referring to the ISO UCS, I assume that they actually
> allow code points up to 0x7fffffff.
>
> Syntactically, however, &#dd...d; and &#xhh...h; do not fit
> in well with backslash-escapes.
>
>
> Please advise!
>
> markus
>
>
> HTML and XML references:
>
> HTML: http://www.w3.org/TR/html401/charset.html#entities
> Chapter 5.3 "Character references" specifies the decimal and
> hexadecimal numeric character references as "ISO 10646
> character numbers" without explicitly mentioning the range of
> those numbers.
>
> XML: http://www.w3.org/TR/REC-xml Chapter 4.1 "Character and
> Entity References" refers to _code points_ of ISO/IEC 10646.
> In the same document, Chapter 2.2 "Characters" specifies the
> character range for XML to be that of UTF-16 (minus
> characters that are not legal in XML).
>
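
On the surrogate-pair option in the quoted message: the extra work described
there (check that two \u escapes form a matched pair, combine them, then
factor the result into bytes for UTF-8) might look roughly like the sketch
below. The function names are invented for this example and do not come from
ICU or any other library; the escape parsing itself is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Combine a UTF-16 surrogate pair into one code point.
// Returns false (and leaves cp untouched) if hi/lo are not a matched
// high/low surrogate pair.
bool combine_surrogates(std::uint16_t hi, std::uint16_t lo, std::uint32_t& cp) {
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF) {
        return false;
    }
    cp = 0x10000u + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
                  + (static_cast<std::uint32_t>(lo) - 0xDC00u);
    return true;
}

// Encode a supplementary code point (U+10000..U+10FFFF) as four UTF-8 bytes.
std::string utf8_supplementary(std::uint32_t cp) {
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    std::string out;
    out += static_cast<char>(0xF0u | (cp >> 18));
    out += static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu));
    out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
    out += static_cast<char>(0x80u | (cp & 0x3Fu));
    return out;
}
```

With the pair from the example, \ud89a\udcba, this yields code point 0x368BA
and the UTF-8 bytes F0 B6 A2 BA. An eight-digit \U escape skips the pairing
step, since the code point is written out directly.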


