Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Wed May 24 2000 - 11:14:11 EDT


Marco Cimarosti wrote:
>
> Antoine Leca wrote:
> > Marco Cimarosti answered:
> > > Frank da Cruz wrote:
> > > > In the Kermit language, we use:
> > > > \x{yyy...}
> > >
> > > Nice. I wish C was like that. It's certainly more practical
> > > than changing C and C++ standards every time a character
> > > encoding standard adds the next bit. [...]
> >
> > Sorry. C standard *is* this way (but without the {}).
>
> Hmmm... I didn't make my point clear enough. I meant "the curly brackets
> in the Perl syntax are a good idea". [...]

My mistake. I did not get your point, and I stand corrected.

 
> > I mean, the \x notation in C is variable-length, and adjusts
> > according to the underlying encoding (i.e., in an EBCDIC-targeted
> > program, space is \x40; and in a (theoretical) UTF-16-targeted
> > program, Amacron is \x0100, and the first code point
> > outside the BMP is \xD800\xDC00).
>
> It is not correct to say that it adjusts to the underlying encoding:
> a C compiler knows no "underlying encoding", apart from the one the
> source itself is written in.

You are describing one particular kind of C compiler when you say
this. Nothing prevents a C compiler from being a cross-charset
compiler (e.g., with ASCII input and EBCDIC output), and in fact
such compilers, although rare, do exist.
(But they are far rarer than compilers that handle many
multibyte encodings, of course.)
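
To make the distinction concrete, here is a minimal C sketch (the
function names are invented for this illustration). It assumes only
that a character constant such as ' ' is translated to the execution
character set, whereas a \x escape is a plain number that is never
translated. On an ASCII target, space is 0x20; on an EBCDIC target, the
same source would yield 0x40.

```c
#include <assert.h>

/* A character constant takes its value from the execution character
   set, so this result depends on the compilation target. */
int space_code(void) { return ' '; }

/* A \x escape is a raw number: it is the same on every target. */
int hex_20_code(void) { return '\x20'; }
int hex_40_code(void) { return '\x40'; }

/* Hypothetical helper: guesses the target family from the code of space. */
const char *guess_target_charset(void)
{
    if (space_code() == hex_20_code()) return "ASCII-like";
    if (space_code() == hex_40_code()) return "EBCDIC-like";
    return "unknown";
}

/* Hypothetical self-check; call it from main(). */
void check_hex_escapes_are_numeric(void)
{
    assert(hex_20_code() == 0x20);
    assert(hex_40_code() == 0x40);
}
```

So it is the programmer, not the \x notation, who has to pick \x20 or
\x40 to mean "space" for a given target.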

> The length of the \x escape sequence depends only on the characters
> following it: it ends at the first character that cannot be
> interpreted as a hexadecimal digit.

That is correct, and it is the point of your post that I missed entirely.

> And this is precisely what I am not comfortable with, because it
> makes escape sequences ambiguous. Take for example "\x2Two": it
> expands to { 2, 'T', 'w', 'o', 0 }. But if you translate the
> "Two" into French, you get "\x2Deux", where the escape also swallows
> the "De" (both are hexadecimal digits): it is read as \x2DE, which
> does not even fit in an 8-bit char, followed by 'u', 'x'...

Yes. The "correct" (but clumsy) way is to write "\x2" "Deux"
(and it does not hurt to write "\x2" "Два", that is, Dva in
Cyrillic, just in case someone extends the hexadecimal notation
to handle the non-Latin equivalents ;-)).
(In fact, I believe that allowing escape sequences to be mixed with
"normal" characters in a C string is basically a bad idea.)
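
A minimal C sketch of the split-literal workaround, for illustration
only (the variable and function names are made up). It relies on
adjacent string literals being concatenated only after the escape
sequences in each one have been converted, so a \x escape cannot run on
into the next literal. Note that in "\x2Deux" both 'D' and 'e' are
hexadecimal digits, so the unsplit form would be read as \x2DE followed
by "ux".

```c
#include <assert.h>
#include <string.h>

/* Unsplit: 'T' is not a hex digit, so the escape stops after "2". */
const char joined_two[] = "\x2Two";

/* Split: each literal's escapes are converted before the literals are
   joined, so "\x2" stays the single byte 2 whatever follows it. */
const char split_two[]  = "\x2" "Two";
const char split_deux[] = "\x2" "Deux";   /* 2, 'D', 'e', 'u', 'x', 0 */

/* Deliberately not compiled here: "\x2Deux" would be parsed as
   \x2DE followed by "ux", and 0x2DE does not fit in an 8-bit char. */

/* Hypothetical self-check; call it from main() to verify the layout. */
void check_split_literals(void)
{
    assert(sizeof joined_two == 5);
    assert(memcmp(joined_two, split_two, sizeof joined_two) == 0);
    assert(sizeof split_deux == 6);
    assert(split_deux[0] == 2);
    assert(strcmp(split_deux + 1, "Deux") == 0);
}
```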

I understand this is the very reason why the \u and \U escape
sequences in C++/C99 are *not* variable in length, however convenient
it would be to save a few keystrokes.
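
For comparison, a short C99 sketch of the fixed-length forms (names
invented for this example). \u takes exactly four hexadecimal digits
and \U exactly eight, so the "De" of "Deux" is left alone. The sketch
assumes that wchar_t values are Unicode code points for the Basic
Multilingual Plane, which is common but not guaranteed by the standard.

```c
#include <assert.h>
#include <wchar.h>

/* \u consumes exactly four hex digits (0100) and then stops, so the
   'D' and 'e' that follow remain ordinary characters. */
const wchar_t amacron_u4[] = L"\u0100Deux";

/* \U consumes exactly eight hex digits; same character, same result. */
const wchar_t amacron_u8[] = L"\U00000100Deux";

/* Hypothetical self-check; call it from main(). Assumes wchar_t
   stores U+0100 as the value 0x0100. */
void check_fixed_length_escapes(void)
{
    assert(wcslen(amacron_u4) == 5);          /* Amacron + "Deux" */
    assert(amacron_u4[0] == 0x0100);
    assert(amacron_u4[1] == L'D');
    assert(wmemcmp(amacron_u4, amacron_u8, 6) == 0);
}
```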

Antoine
