Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Marco Cimarosti (marco.cimarosti@europe.com)
Date: Wed May 24 2000 - 07:46:31 EDT


Antoine Leca wrote:
> Marco Cimarosti answered:
> > Frank da Cruz wrote:
> > > In the Kermit language, we use:
> > > \x{yyy...}
> >
> > Nice. I wish C was like that. It's certainly more practical
> > than changing C and C++ standards every time a character
> > encoding standard adds the next bit. [...]
>
> Sorry. C standard *is* this way (but without the {}).

Hmmm... I didn't make my point clear enough. I meant "the curly brackets in the Perl syntax are a good idea". You can agree or not, but you cannot remove "the curly brackets" without being left with nonsense: "in the Perl syntax are a good idea".

> I mean, the \x notation in C is variable-length, and adjusts
> according to the underlying encoding (i.e., on an EBCDIC-targeted
> program, space is \x40; and on a (theoretical) UTF16-targeted
> program, Amacron is \x0100, and the first codepoint
> outside the BMP is \xD800\xDC00).

It is not correct to say that it adjusts to the underlying encoding: a C compiler knows no "underlying encoding", apart from the one the source itself is written in. The length of the \x escape sequence depends only on the characters following it: it ends at the first character that cannot be interpreted as a hexadecimal digit (cf. http://www.dinkumware.com/htm_cl/charset.html).
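To make the termination rule concrete, here is a minimal sketch in C. The array names (one_byte, stops_at_g, wide_one) are only illustrative, and the wide literal assumes a wchar_t of at least 16 bits, which is the usual case but is not something this thread establishes.

```c
#include <stddef.h>
#include <wchar.h>

/* "\x40": both characters after \x are hex digits, so the escape
   yields one char with value 0x40, followed by the terminating 0. */
const char one_byte[] = "\x40";

/* "\x4G": 'G' is not a hex digit, so the escape stops after "4".
   The array holds { 4, 'G', 0 }. */
const char stops_at_g[] = "\x4G";

/* L"\x0100": all four hex digits belong to a single escape, giving
   one wide character with value 0x100 plus the terminator.
   Assumption: wchar_t is wide enough to hold 0x100. */
const wchar_t wide_one[] = L"\x0100";

/* Number of elements in each array, terminator included. */
size_t one_byte_count(void)   { return sizeof one_byte; }
size_t stops_at_g_count(void) { return sizeof stops_at_g; }
size_t wide_one_count(void)   { return sizeof wide_one / sizeof wide_one[0]; }
```

In each case the length of the escape is decided purely by where the run of hexadecimal digits ends, not by any declared width.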

And this is precisely what I am not comfortable with, because it makes escape sequences ambiguous. Take for example "\x2Two": it expands to { 2, 'T', 'w', 'o', 0 }. But if you translate the "Two" into French, you get "\x2Deux", which expands to { 45, 'e', 'u', 'x', 0 }, because the "D" is itself a hexadecimal digit and is swallowed by the escape (0x2D is 45).

_ Marco



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:03 EDT