Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Frank da Cruz (fdc@columbia.edu)
Date: Tue May 23 2000 - 15:00:58 EDT


> we (ICU) are trying to figure out how best to specify non-BMP (21-bit) code
> points with escape sequences or similar in strings.
>
> Problem:
> The C language has \ooo with octal digits for bytes of whatever encoding,
> and modern compilers also know \xhh with hexadecimal digits (with variable
> numbers of digits). Java introduced \uhhhh with (always 4) hexadecimal
> digits for Unicode code units.
>
> But how does one write a non-BMP code point in this fashion?
>
> I am trying to list some suggestions, make a proposal, and ask you for what
> you are doing or other people/standards/organizations/languages are planning
> to do.
>
Making up new x's for "\x" is not the best way, since, as long as our
programming languages are based on ASCII (a whole different topic), we'll
quickly run out of x's, especially when we are overloading the x by trying to
make it convey two pieces of info: the encoding that follows, and its length.

In the Kermit language, we use:

  \x{yyy...}

where 'x' says what it is (e.g. decimal, hex, octal, whatever), and the
braces delimit the operand, thus allowing it to be any length. This is also
handy for disambiguating expressions like:

  \o0123456

Is that "\o012" followed by "3456" or "\o0123" followed by "456"? In:

  \o{012}3456

it's clear.
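
To make the idea concrete, here is a minimal sketch in Python of how such
brace-delimited escapes could be expanded. It is not Kermit's actual
implementation: the function name `expand`, the type letters x/d/o, and the
0x10FFFF range check are assumptions made for illustration only.

```python
import re

# Hypothetical type letters for this sketch; not claimed to be Kermit's set.
RADIX = {"x": 16, "d": 10, "o": 8}

# A backslash, one type letter, then a brace-delimited operand of any length.
ESCAPE = re.compile(r"\\([xdo])\{([0-9A-Fa-f]+)\}")

def expand(text):
    """Replace each brace-delimited escape with the character it names."""
    def repl(match):
        kind, digits = match.group(1), match.group(2)
        # int() raises ValueError if a digit is invalid for the radix.
        value = int(digits, RADIX[kind])
        if value > 0x10FFFF:
            raise ValueError(f"code point out of range: {match.group(0)}")
        return chr(value)
    return ESCAPE.sub(repl, text)
```

Because the braces mark where the operand ends, expand(r"\o{012}3456") gives
a newline followed by the literal text "3456", and a 21-bit value such as
\x{1D11E} needs no new letter or fixed digit count.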

- Frank


