Re: UTF-8 reg tags...

From: Francois Yergeau
Date: Mon Sep 16 1996 - 16:27:28 EDT

At 15:03 16-09-96 -0400, Glenn Adams wrote:
>The UTC discussed the need for a generic designator along the lines
>of your argument. There was general agreement that a generic designator
>would be good for this purpose; however, we didn't come to complete
>closure on this.

Why then is the UTC *opposed* to registering "UTF-8" (you did ask that I
withdraw my draft)?

>What we did agree was that a need for the versioned
>designator exists due to the incompatible changes from UC 1.1 to UC 2.0.
>The problem with using a generic designator to refer to UC 2.0 is that
>one can't be sure it isn't referring to UC 1.0, for which implementations
>may very well exist that haven't been upgraded to either 1.1 or 2.0.

It can very well be specified in the registration that "UTF-8" applies only
to Unicode 2.0 and later. But then "UNICODE-1-1-UTF-8" and "UNICODE-1-0-*"
also need to be registered, if anyone has a need for them.

I still think that registering "UNICODE-2-0-*" would be a bad move,
conducive to less interoperability in the future, as I have explained in my
previous message. Of course, I cannot stop the UTC from registering
whatever it wants, though.

>As for the difference between 10646 & Unicode, there are particular
>assumptions one must make with Unicode to remain conformant that don't
>apply to 10646; e.g., the default usage of level 3, the use of the
>canonical Unicode equivalence algorithm, the Unicode BIDI algorithm,
>the Unicode script shaping rules (not defined by 10646), the Unicode
>character semantics, the fact that Unicode doesn't directly make use
>of collection identifiers or level designators, etc. Unicode has its
>own conformance clause which does not apply to 10646. It is thus
>important to maintain the distinction at the level of MIME designation.

I don't see how this follows from that. Tagging of implementation levels
was discussed on the ISO10646 list, and it was found that distinguishing
them doesn't buy you anything. A level 3 (Unicode) implementation doesn't
care if it receives level-1-only data, and a level 1 implementation is not
made any more capable of dealing with combining characters by receiving a
level-3 tag. The same argument works for collections: as far as the process
of interpreting UTF-8 bytes into characters is concerned, you don't gain
anything from knowing which collection(s) the characters come from; you will
decode them in the same fashion anyway. And I don't quite see how canonical
equivalence, BIDI, shaping and character semantics have any relevance to
this process. But perhaps I am missing something?
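
To make the point concrete, here is a minimal sketch of a UTF-8 decoder in Python. It is an illustration only, not taken from any draft or standard text: the function name `decode_utf8` is hypothetical, it assumes sequences of at most four bytes, and it does not reject overlong forms or surrogates. What it is meant to show is that the decoding routine takes bytes as its only input; no level or collection parameter appears anywhere, and precomposed data and data containing a combining character (not permitted at level 1) go through exactly the same steps.

```python
def decode_utf8(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of integer code points.

    Simplified sketch: handles 1- to 4-byte sequences and checks
    continuation bytes, but does not reject overlong forms or surrogates.
    """
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte alone determines the sequence length.
        if lead < 0x80:
            length, value = 1, lead
        elif lead >> 5 == 0b110:
            length, value = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, value = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, value = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")

        # Every following byte must have the form 10xxxxxx.
        tail = data[i + 1 : i + length]
        if len(tail) != length - 1 or any(b >> 6 != 0b10 for b in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")

        # Each continuation byte contributes six payload bits.
        for b in tail:
            value = (value << 6) | (b & 0x3F)

        code_points.append(value)
        i += length
    return code_points


# A single precomposed character (acceptable to a level 1 implementation).
precomposed = "\u00e9".encode("utf-8")
# Base letter followed by a combining acute accent (needs level 3).
combining = "e\u0301".encode("utf-8")

print([hex(cp) for cp in decode_utf8(precomposed)])
print([hex(cp) for cp in decode_utf8(combining)])
```

Whether the receiver can then render or process a combining sequence is a separate question, which is settled after decoding and which a tag on the byte stream does not change.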


François Yergeau
Alis Technologies Inc., Montréal
Tel: +1 (514) 747-2547
Fax : +1 (514) 747-2561
