Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Wed May 24 2000 - 05:54:14 EDT


Markus Scherer wrote:
>
> we (ICU) are trying to figure out how best to specify non-BMP (21-bit) code
> points with escape sequences or similar in strings.
>
> Problem:
> The C language has \ooo with octal digits for bytes of whatever encoding, and
> modern compilers also know \xhh with hexadecimal digits (with variable numbers
> of digits).

The situation you describe is a bit dated. I doubt that any compiler that
does *not* support \xhh is still in production use these days
(unless you are writing for the PDP-11 or something similar ;-)).

The ISO C standard, formerly ANSI C, has had \xhh since the beginning (I
understand the first copies were floating around in 1985; the standard was
formally accepted in 1989 by ANSI, and in 1990 by ISO). It standardized existing
practice in this area, so there were a lot of "non-modern" compilers (ones that
do not understand prototypes, for instance) that knew about \xhh even before 1985.

The new revision, nicknamed C99, as well as the C++ Standard (1998), add the
\uxxxx and \Uxxxxxxxx notations (each x being a hexadecimal digit). New compilers
are now shipping with this support (I admit there are not many of them).

> Java introduced \uhhhh with (always 4) hexadecimal digits for Unicode code units.
>
> But how does one write a non-BMP code point in this fashion?

Use \Uxxxxxxxx. BTW, if a project like ICU is going to use such notation,
this will put some pressure on compiler providers (be they GNU/FSF or
traditional vendors) to sort out the issue with Unicode coding (i.e. the
meaning of wchar_t), which in the end will result in greater Unicode use.

 
> I am trying to list some suggestions, make a proposal, and ask you for what you
> are doing or other people/standards/organizations/languages are planning to do.
>
> - One could use a pair of code units, UTF-16 style:
> \ud89a\udcba

C99 explicitly forbids this (i.e., a conforming compiler is required to
issue a diagnostic).

> - In UTR 18, Mark Davis suggests a syntax
> \vhhhhhh
> with exactly 6 hexadecimal digits.
> Drawback: I am afraid of confusion with the ANSI C language \v
> for the vertical TAB.

You are correct, this is not an option.

 
> - How about - and I propose this here -
> \whhhhhh
> with, again, 6 hexadecimal digits?
> It is simple, and for English speakers it has the benefit of
> being mnemonic because of connotations with "wide" and the
> letter being called a "double u" - which is more than a "\u" :-)
> It is not used in C.

Correct, it is reserved for future extensions.

However, its benefit over \Uxxxxxxxx (which requires exactly 8 digits)
is only a small gain in length (the leading digits are usually "000"), and it
has the drawback of not being standard...

Marco Cimarosti answered:
> Frank da Cruz wrote:
> > In the Kermit language, we use:
> > \x{yyy...}
>
> Nice. I wish C was like that. It's certainly more practical than changing C
> and C++ standards every time a character encoding standard adds the next bit.
> ('Cause we *will* see a 32-bit character set sooner or later, won't we?)

Sorry, but the C standard *is* this way (just without the {}).

I mean, the \x notation in C is variable-length, and adjusts to the
underlying encoding (i.e., in an EBCDIC-targeted program, space is \x40;
and in a (theoretical) UTF-16-targeted program, Amacron is \x0100, and the
first code point outside the BMP is \xD800\xDC00).

On the other hand, the \u and \U notations are charset-independent; so \u0024
and \U00000024 both denote the dollar sign ($), whatever the underlying
encoding (be it EBCDIC, UTF-8, etc.)
 

Hope it helps,

Antoine


