Re: UTF-8 reg tags...

From: Francois Yergeau
Date: Mon Sep 16 1996 - 14:35:22 EDT

At 09:22 16-09-96 -0400, Glenn Adams wrote:
>(1) the designation "UTF-8" does not designate a coded character set (i.e.,
>a repertoire and its code set); rather it designates only a transformation
>method that could potentially be used with arbitrary CCSs.

Not so. This issue was raised long ago on the ISO10646 list, and it turned
out that the proper expansion of "UTF-8" is "UCS Transformation Format 8"
(see ISO 10646 amendment 2), which correctly identifies the code set as well
as the encoding. If you want to use the UTF-8 algorithm to encode some
other CCS, *then* you need another MIME tag.

>(2) the designation "UTF-8" is inconsistent with currently specified
>designations. See my previous message.

See my previous reply; this is an extremely weak argument. What is to be
gained by maintaining this consistency?

>>Has a case been made for
>>the need of a version number in the MIME tag(s)?
>If it weren't for the incompatible changes between UC 1.0 and UC 1.1 (removal
>of Tibetan and reassignment of other chars to obtain the merger with 10646)
>and the changes between UC 1.1 and UC 2.0 (reassignment of Korean Hangul), then
>version designation would not be a significant issue. The UTC will *strongly*
>oppose any further incompatible changes; however, we have to deal with history
>as it stands.

Unicode 1.0 is irrelevant to the Internet and of historical interest only;
there is not even a registration for it.

>>Is it based on the change
>>to the Korean encoding? If so, does the UTC consider this incompatible
>>change to be a serious problem in practice?
>Yes, the UTC considers UC 2.0 to be an incompatible change to UC 1.1; thus
>the need for a version designation.

That was not the question; we all know that the Hangul reassignment was an
incompatible change. At issue is whether this matters *in practice*.

The point of "UTF-8" is not to save a few bytes compared with
"UNICODE-X-X-UTF-8" or "ISO-10646-UTF-8"; it is the result of some forward
thinking.

Let's make 3 assumptions:
 1) There will be new versions of Unicode;
 2) There will be no further incompatible changes;
 3) Software will not be upgraded instantly.

I think 1) and 3) go without saying, and you have just reinforced my faith
in 2) (thank you). Now let's say Unicode 3.0 has recently been published,
but I have not yet upgraded my mail user-agent, which knows only about
Unicode 2.0. I get a UTF-8 mail from someone who *has* upgraded his mailer.
If the message is tagged as "UNICODE-3-0-UTF-8", my mailer does not
recognize the tag and fails to decode the message; if, however, the message
is tagged with a generic "UTF-8", my mailer groks it and displays the
message correctly, except for any characters that are new to 3.0.
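
The scenario above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the charset table, the
decode_body helper and the "UNICODE-3-0-UTF-8" tag are hypothetical and do
not describe any real mailer or registration.

```python
# Hypothetical charset table of a mail user-agent released before Unicode 3.0.
# It maps the MIME charset tags it knows (lowercased) to a Python codec name.
KNOWN_CHARSETS = {
    "utf-8": "utf-8",
    "iso-8859-1": "latin-1",
}


def decode_body(charset_tag, octets):
    """Return the decoded text, or None if the charset tag is not recognized."""
    codec = KNOWN_CHARSETS.get(charset_tag.lower())
    if codec is None:
        # Unknown tag: the mailer cannot even attempt to decode the message.
        return None
    # Known tag: decode; malformed sequences are replaced rather than fatal.
    return octets.decode(codec, errors="replace")


body = "Montréal".encode("utf-8")

# Tagged with a version the old mailer has never heard of: not decoded.
versioned = decode_body("UNICODE-3-0-UTF-8", body)

# Tagged with the generic name: decoded by the same old mailer.
generic = decode_body("UTF-8", body)
```

Here versioned is None while generic holds the decoded text, although the
octets are identical in both cases; only the tag differs.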

This is the price for a tag containing the version number, and it needs to
be weighed against the alternative. And this hinges on the *practical*
impact of the Hangul reassignment.

Thinking about it, we don't need "UNICODE-2-0-UTF-8" even if the Hangul
change *is* serious; all we need is "UNICODE-1-1-UTF-8" for data encoded
before the reassignment and "UTF-8" for the rest, ad vitam aeternam
(assuming, again, no further incompatible changes).
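
As a rough sketch of that two-tag scheme (the function name and the use of
version tuples are assumptions made for illustration, and it relies on the
same assumption of no further incompatible changes):

```python
def charset_tag(unicode_version):
    """Pick the MIME charset tag for UTF-8 data under the two-tag scheme.

    unicode_version is a (major, minor) tuple naming the Unicode version
    the data was encoded against.
    """
    if unicode_version < (1, 1):
        # Unicode 1.0 has no registration and is not covered here.
        raise ValueError("no tag for Unicode versions before 1.1")
    if unicode_version < (2, 0):
        # Data encoded before the Hangul reassignment.
        return "UNICODE-1-1-UTF-8"
    # 2.0 and every later, compatible version share the generic tag.
    return "UTF-8"
```

With this rule, charset_tag((1, 1)) gives "UNICODE-1-1-UTF-8", while
(2, 0), (3, 0) and anything later all give "UTF-8".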

>Another issue which you haven't addressed with a simple "UTF-8" designation
>is the distinction between Unicode and 10646. Unicode entails (and requires)
>certain semantics that 10646 does not. Thus it is essential to distinguish
>between a UTF-8 encoding of Unicode and one of 10646.

Now that is new. I have not thought over the question completely, but my
first impression is that this is irrelevant to a MIME charset tag. Unicode
does specify the semantics of characters where 10646 does not, but a MIME
tag is concerned *only* with how to transform a sequence of octets into a
sequence of characters, not with what to do with the characters thereafter.
What do you think?
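
To illustrate what I mean by "octets into characters", here is a minimal
Python sketch (the sample string is arbitrary). The charset tag selects the
decoding step and nothing more; the result is a sequence of code points,
and whatever semantics Unicode attaches to them comes into play only
afterwards.

```python
# Octets as they would arrive in a message body labelled charset="UTF-8".
octets = b"Montr\xc3\xa9al"

# The charset tag governs exactly this step: octets -> characters.
chars = octets.decode("utf-8")

# The outcome is a plain sequence of code points; what an application does
# with them (rendering, collation, and so on) is outside the tag's scope.
code_points = [ord(c) for c in chars]
```

The two octets C3 A9 become the single code point U+00E9, so the nine
octets yield eight characters.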


François Yergeau
Alis Technologies Inc., Montréal
Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561
