Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Fri May 26 2000 - 08:41:50 EDT


OK, if you believe the discussion is better here, I shall stick to that.

Markus Scherer wrote:
>
> i have not read the new drafts for c and c++,

There are no such things as "new drafts". The C++ Standard, ISO/IEC 14882:1998,
was promulgated almost two years ago. The new revision of the C Standard,
ISO/IEC 9899:1999, was promulgated six months ago. These are *closed* works
now; only minor clarifications are possible.
The next version of C++ is expected around 2005 (in draft form)...
(The numbering scheme is quite different from Java's.)

> but this sounds like they are not well synchronized with each other and with java.
> the java parsing behavior is very well specified in
> http://java.sun.com/docs/books/jls/html/3.doc.html .

We (the C committee, I mean) are quite aware of the Java specs, thanks.

 
> i want to urge everyone with any influence on the c and c++ standards to
> make sure that they do the same for \u parsing (and \U) as java does.

[ \U is an invention of the C++ committee, so I do not see how it applies. ]

Be sure that everything possible has been done to ensure the greatest
possible equivalence. However, there are basic assumptions, such as
the fact that Java sources are required to be ASCII characters, that are
*not* made by the C or C++ standards, and it is *not* the intention
of either committee to align with Java here (for example, as the Java
text stands, a plain UTF-16 source text is not conforming...)

Therefore, the C Standard chose, 12-13 YEARS AGO, to leave to the compiler's
implementor the mapping from physical source characters to the characters
as seen by the parser. So what is described in the Java reference,
subclause 3.3, is in C and C++ left to the implementor (translation
phase 1, 5.1.1.2 in C or 2.1 in C++, with the same text).
The Java way makes a lot of sense, no doubt about it, but the C constraints
were different and led to a different outcome.
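
To make the difference concrete, here is a small sketch of my own (it is
not text from either standard, and the remark about Java only reflects my
reading of JLS 3.3):

    /* ucn_phases.c -- where \u is recognised in C versus Java.
       In C, the six characters \u000A below are ordinary comment text,
       because UCNs only have meaning inside tokens.  In Java, the
       JLS 3.3 pre-pass would turn that \u000A into a real line
       terminator *before* comments are recognised, so the equivalent
       Java line comment would be cut in the middle and the rest of the
       line would be taken as code.                                     */
    // a line comment containing \u000A which a C compiler leaves alone
    int dummy = 0;   /* just so this is a complete translation unit     */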

> otherwise, porting between c/c++ and java would become unnecessarily
> more complicated.

It is my understanding that trying to force one Standard to align exactly
with another is not the best way to prevent portability problems. I think
the far better solution is for programmers to restrict themselves to
suitable subsets of either language.
However, this is a personal opinion.

 
> one of the things that the \u parsing in java recognizes is that even
> numbers of backslashes before a u will not result in \u parsing, which
> is what one would expect with common escaping behavior.

Good. As I explained before, this behaviour is not enforced in the
C++ Standard, and I think this may lead to problems.
I did (note the past tense here) my best to modify the C Standard,
in order to correct this area and issue a revision that is more
"usable". However, by the time we understood the problems, back
in 1997, it was far too late to correct the C++ Standard.

Anyway, as I also explained, in the current text of the C Standard,
it is my impression that the behaviour is exactly the same as the one
you described for Java, at least for well-formed programs (as C++ calls them).
Furthermore, I invite anybody who has doubts here to contact me,
because if I prove to be wrong, I have the means to get the Standard
corrected. However, do not misunderstand me: the fact that I make
this invitation does not mean I believe there is any problem of
"synchronisation". And you will have to convince me of any trouble,
not just raise the issue!
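
To illustrate (this is my own example, not normative text; I use U+00E9
because a UCN is not allowed to name a basic character in C):

    /* The \\ escape is consumed first, so the u00E9 that follows it is
       not the start of a UCN -- which matches, as far as I can tell,
       the "even number of backslashes" rule of Java.                   */
    char a[] = "caf\u00E9";    /* c, a, f, then the character that
                                  corresponds to U+00E9 (e-acute)       */
    char b[] = "caf\\u00E9";   /* c, a, f, a backslash, then the five
                                  ordinary characters u00E9             */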

Now, it should be taken into account that almost NO implementation today
supports UCNs, be it for C or for C++ (and this was even truer
18-24 months ago, when we modified the C Standard toward its present
state). Since almost all C++ compilers are also C compilers (to some
extent), we expect all C++ implementations to behave according to
the C Standard (which we designed in a compatible way, needless to say).

So, NO, I do not expect any portability problems in the use of UCNs,
except in very tricky cases. There is a more in-depth discussion of the
tricky problems, which I presented to the C committee two years ago,
at <URL:http://www.multimania.com/antoineleca/Ucn.htm> (this text
certainly contains plenty of mistakes; I would appreciate your constructive
criticism; I have already noted a problem with the use of "Unicode"!)

Also, one may ask where to look for the C or C++ Standards.
Regarding C, the last public draft available can be found at
<URL:http://anubis.dkuug.dk/JTC1/SC22/WG14/www/docs/n869/>.
It is very close to the actual C Standard in the UCN area.
Regarding C++, look at <URL:http://www.cygnus.com/misc/wp/dec96pub/>
or <URL:http://www.comnets.rwth-aachen.de/doc/c++std/root_frame.html>.
I cannot vouch for their accuracy on the UCN subject, but I believe
they are fairly close.

> it is confusing enough that different \ escapes are unescaped at different
> times of the parsing,

Yes, it is confusing, and for that very reason I did my best to revert (in C)
to the normal state of affairs wherever possible. So for example,
in C (and this is different in C++), UCNs in string and character literals
are handled exactly like "normal" escape sequences.
Only UCNs in identifiers are different; and UCNs are prohibited everywhere else.

So there are no Indic digits to write a number, and no U+2028 to break lines;
and if an implementation intends to fully conform to Unicode with
respect to the source texts, then the implementation-defined phase at the
beginning of translation should be used to map these characters to the
equivalent C source characters, but portability is lost there.
Frankly, I do not see what problem this may raise in practice.
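
As a quick sketch of my own (again, not normative wording), here is where
UCNs may and may not appear in C99 as I read the final text:

    #include <stddef.h>            /* for wchar_t                        */

    /* Allowed: UCNs in identifiers (within the repertoire of Annex D)
       and in character or string literals, where they behave like any
       other escape sequence.                                            */
    int caf\u00E9 = 1;             /* identifier containing U+00E9       */
    wchar_t ch = L'\u00E9';        /* character constant, like an escape */

    /* Not allowed anywhere else: \u0967 (a Devanagari digit) is not a
       way to spell the number 1, and U+2028 is not a way to end a
       source line.                                                      */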

> but having slightly different rules for these programming languages is not
> going to make it easier.
>
> if it is true that the 3 standards define this in 3 different ways, then it is a mess.

It is a mess for "Standards lawyers" like me, true. However, we did our
best to hide this mess from the users of the Standards.

Look, this is exactly the same as with Unicode and 10646: they
are different standards, and they present things in quite different ways.
However, the people who actually write both standards do their
jobs well, and as a result both standards (where they overlap) are more or
less the same, at least when viewed from a user's point of view.

> (it also seems artificial to limit the range of code points to non-ascii values.)

This is partly for performance reasons (C compilers, unlike Java compilers,
have existed for a while, so UCNs are an add-on for them; it was therefore
thought that forcing implementors to add a new filter on input was not
a good idea), and also, much more importantly, because some constructs
might become very difficult to understand if UCNs were allowed to
represent ordinary characters: for example, how should "\u0022" be parsed?
Also, this interacts very badly with trigraphs. And outside the well-known
problem of EBCDIC !/|, and excepting of course the Obfuscated C Contest,
there is absolutely NO USE for such a convention (again, considering
that we have trigraphs to handle the EBCDIC indeterminacy problems).
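
To show the kind of ambiguity I have in mind (my own sketch; the list of
exceptions comes from C99 6.4.3 as I remember it):

    /* If \u0022 (QUOTATION MARK) were an acceptable UCN, a reader would
       have to wonder whether it closes the string literal below.  C99
       avoids the question entirely: a UCN may not name a character
       below U+00A0 (except $, @ and `), so the line is simply invalid. */
    /* const char *p = "\u0022";      constraint violation in C99       */
    const char *q = "\"";          /* the ordinary way to write a quote  */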

 
> failing synchronization of c/c++ with java, i believe that synchronization
> of c with c++ is an absolute must.

Again, be assured that we have this in mind. However, we (both committees)
decided in 1998 that it was better to correct one of the Standards rather
than to have *two* broken standards, in an area that was rather new to the
programming community.

Hope things are clearer now. I remain available for your further questions.

Antoine


