Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Thu May 25 2000 - 08:42:07 EDT


[ 8859-1 encoded. Reformatted. ]

Jonathan Coxhead wrote:
>
> > and our test code contains strings like
> > // display language (French)
> > { "anglais", "fran\\u00E7ais", "", "grec", "norv\\u00E9gien",
> > "italien", "xx" },
> >
> > which, of course, need to be double-escaped so that the C compiler does not unescape
> > them itself. They are unescaped at runtime by a library function.
>
> Boy, are you in trouble! :-)

Everyone is in trouble when it comes to UCN and encodings... :-)
Me included, see my previous posts and the correction I sent. :-D

 
> In C99 (newly published and implemented almost nowhere), these new \u and \U
> escapes are **NOTHING LIKE** \x, \n etc, despite their very confusing visual
> similarity. They are expanded EVERYWHERE in the source file, not just in strings

You are correct, but it does not apply here (at least for C99; C++ may be
buggy in this area), because this is *not* a \u escape sequence at all! It is
the sequence of characters 'f', 'r', 'a', 'n', '\', 'u', '0', '0', 'E', '7', 'a',
'i', 's', 0 (for my language).
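
As a small check of the point above (a sketch only; the helper name is
hypothetical and not part of the test code under discussion), the doubled
backslash is an ordinary escape for a single '\', so the literal holds 13
characters plus the terminating 0:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if s is 13 characters long and still
   holds a literal backslash followed by 'u' at offsets 4 and 5, i.e.
   the compiler did not turn "\\u00E7" into a single character. */
int holds_literal_backslash_u(const char *s)
{
    assert(s != NULL);
    return strlen(s) == 13 && s[4] == '\\' && s[5] == 'u';
}
```

Calling holds_literal_backslash_u("fran\\u00E7ais") should return 1 with a
compiler that follows the C99 reading described here.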

> (at the same time as trigraphs);

NO. This is the area where C99 differs from C++, because I felt that C++
was tricky for just the reason you are mentioning. Opinions differ on whether
your interpretation should be followed or not by C++ compilers. But that
interpretation is NOT an option when it comes to C99, because we removed
this sentence.

(Now, tools that do the conversion from UCNs to e.g. UTF-8 have to be
smarter, and should not blindly replace a \uxxxx sequence with the equivalent
character when the sequence follows an odd number of \. I copy Torsten for
this purpose, I hope he will not miss the post).
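
The backslash-counting rule can be sketched as follows. This is only an
illustration under stated assumptions, not the code of any existing tool:
the function names are hypothetical, the scan does not track whether it is
inside a string, a comment or an identifier, and UCNs below 00A0 or in the
surrogate range (which C99 does not allow anyway) are left untouched.
Consuming backslashes in pairs from left to right means that a \u preceded
by an odd number of backslashes is never treated as a UCN.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Parse exactly n hex digits at p; return 1 and store the value on success.
   Stops at the first non-hex character, so it never reads past the 0. */
static int parse_hex(const char *p, int n, unsigned long *value)
{
    unsigned long v = 0;
    int i;
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)p[i];
        if (!isxdigit(c))
            return 0;
        /* Assumes an ASCII-like layout where A..F are contiguous. */
        v = v * 16 + (unsigned long)(isdigit(c) ? c - '0' : toupper(c) - 'A' + 10);
    }
    *value = v;
    return 1;
}

/* Write cp as UTF-8 into out; return the number of bytes written. */
static size_t put_utf8(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Copy src to dst, replacing genuine UCNs by UTF-8. dst needs room for
   strlen(src) + 1 bytes (the output is never longer than the input).
   Returns the length of the result. */
size_t ucn_to_utf8(const char *src, char *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (*src) {
        if (src[0] == '\\' && src[1] == '\\') {
            /* An escaped backslash: copy the pair, so that a following
               'u' is seen as a plain letter. */
            dst[n++] = *src++;
            dst[n++] = *src++;
            continue;
        }
        if (src[0] == '\\' && (src[1] == 'u' || src[1] == 'U')) {
            int digits = (src[1] == 'u') ? 4 : 8;
            unsigned long cp;
            if (parse_hex(src + 2, digits, &cp)
                && cp >= 0xA0 && cp <= 0x10FFFF
                && !(cp >= 0xD800 && cp <= 0xDFFF)) {
                n += put_utf8(cp, dst + n);
                src += 2 + digits;
                continue;
            }
        }
        dst[n++] = *src++;   /* anything else is copied unchanged */
    }
    dst[n] = '\0';
    return n;
}
```

With this sketch, the source text fran\\u00E7ais is copied unchanged, while
fran\u00E7ais has its UCN replaced by the two UTF-8 bytes C3 A7.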

\u sequences are expanded in places where \x sequences are not, namely in
identifiers, as you described in your post. But their behaviour in character
constants and strings is very consistent with that of \x sequences, except
of course that the encoding cannot vary.
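
One way to observe this consistency is with wide string literals. The
sketch below assumes a C99 compiler whose wchar_t holds ISO 10646 code
values (as signalled by __STDC_ISO_10646__); the function name is
hypothetical. Under that assumption the \x, \u and \U spellings give the
same wide string. For narrow strings the outcome depends on the execution
character set, so no claim is made about them here.

```c
#include <wchar.h>

/* Returns 1 when the \x, \u and \U spellings of e-acute yield the same
   wide string. Assumes wchar_t values are ISO 10646 code points. */
int wide_spellings_agree(void)
{
    return wcscmp(L"caf\xE9", L"caf\u00E9") == 0
        && wcscmp(L"caf\u00E9", L"caf\U000000E9") == 0;
}
```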

(If you find words that may imply the contrary, please refer to me,
because I --and others-- tried hard to avoid any such problem ;-)).

> and they refer to characters in the SOURCE character set, not the
> execution character set. But what happens to them depends on where
> they appear in the source file.

This is correct.

 
> If you use \u, \U in a string, all the following would compare equal (with
> strcmp(), in a Latin-1 environment):
>
> "café", "caf\xE9", "caf\u00E9", "caf\U000000E9"
>
> but in a Cyrillic (ISO 8859-5) execution environment, the first,
> third and fourth should give an error

Side point: an error is not really one of the possible options.

The standard just says (5.1.1.2 Translation phases, phase 5):
"if there is no corresponding member, it is converted to an
implementation-defined member other than the null (wide) character."

Possible solutions are falling back to 'e', using some substitution
character (like U+FFFD or ISO SUB), or just sending E9!
OTOH, a warning would certainly be a good idea here!
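
The three choices above can be sketched as a conversion routine for an
ISO 8859-5 execution character set. Everything here is illustrative: the
enum, the function name and the one-entry "base letter" table are
assumptions, and a real compiler is free to pick any implementation-defined
non-null member.

```c
#include <assert.h>

/* Hypothetical policies for a character with no ISO 8859-5 equivalent. */
enum fallback { FALLBACK_BASE_LETTER, FALLBACK_SUB, FALLBACK_RAW_BYTE };

/* Convert an ISO 10646 code point to an ISO 8859-5 byte. A non-null
   input never yields 0, in line with the phase 5 wording. */
unsigned char to_iso8859_5(unsigned long cp, enum fallback policy)
{
    assert(cp <= 0x10FFFF);

    if (cp <= 0xA0 || cp == 0xAD)      /* same position as in Latin-1 */
        return (unsigned char)cp;
    if (cp == 0xA7)                    /* SECTION SIGN */
        return 0xFD;
    if (cp == 0x2116)                  /* NUMERO SIGN */
        return 0xF0;
    if (cp >= 0x0401 && cp <= 0x045F
        && cp != 0x040D && cp != 0x0450 && cp != 0x045D)
        return (unsigned char)(cp - 0x0360);   /* Cyrillic block */

    /* No corresponding member: apply the chosen policy. */
    switch (policy) {
    case FALLBACK_BASE_LETTER:
        if (cp == 0xE9)                /* e-acute: the only table entry here */
            return 'e';
        return 0x1A;                   /* otherwise ISO SUB */
    case FALLBACK_RAW_BYTE:
        return cp <= 0xFF ? (unsigned char)cp : 0x1A;
    default:
        return 0x1A;                   /* ISO SUB */
    }
}
```

For U+00E9 the three policies give 'e', 1A and E9 respectively, while a
Cyrillic letter such as U+0436 maps to D6 whatever the policy.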

The rest is OK; I did not quote the whole.

> So, if a C compiler saw your example, it would replace "fran\\u00E7ais" by
> "fran\çais" when it processed \u and \U escapes---very early on. Later, it
> would see \ç, which is not a valid backslash sequence, and give an error.

As I said, that is the impression that a reading of the C++ standard
gives, and we believed (and I still believe) this was the wrong way.
So we modified the text of the standard so that UCNs are interpreted
at about the same time as "normal" escape sequences.

Discussion of what C++ says or should say, and how C++ compilers should
behave, is out of scope here, but I am open to (private) queries about this.

Antoine


