Re: comments: UTF-16, an encoding of ISO 10646 to Informational

From: Francois Yergeau
Date: Wed Aug 18 1999 - 11:11:05 EDT

Wow, just a few days away on vacation and I find such a long thread on this
Internet draft! I haven't read all of it yet, but I'm going to address
some of the points, as co-author of the draft.

At 08:03 1999-08-13 -0700, Hart, Edwin F. wrote:
>I have just one suggestion: I would suggest that the introduction label
>UTF-16 as a "character encoding scheme" (as I understand the term) for

The registrations of the "UTF-16", "UTF-16BE" and "UTF-16LE" labels define
more than a "character encoding scheme" (CES); they define the combination
of a CES and a "Coded Character Set" (CCS) as required by RFC 2278 "IANA
Charset Registration Procedures". The CES is that described in the draft
(as well as in 10646 annex C and in Unicode) and the CCS is of course

Regarding the long thread re: the ISO-IR vs the IANA registry:

Michael Everson:
> IANA should NOT "register" character sets. They should make use of
> registered character sets. Full stop.

The two simply do not register the same things. The ISO-IR registers CCSs
(directly usable only within an ISO 2022 environment) whereas the IANA
registers NAMES for complete CCS/CES combinations directly usable in MIME
(Internet mail) and other protocols that leverage the MIME content
labelling mechanisms. Quoting from RFC 2278:

   The term "charset" (see historical note below) is used here to refer
   to a method of converting a sequence of octets into a sequence of
   characters. This conversion may also optionally produce additional
   control information such as directionality indicators.

And the RFC goes on to say that a charset label MUST describe a charset as
defined here.
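
To make that definition concrete, here is a minimal Python sketch (not taken
from the draft or from RFC 2278) of a charset label selecting an
octets-to-characters conversion. The function name decode_labelled is
hypothetical, and the handling of the plain "UTF-16" label (honour a leading
byte order mark, otherwise fall back to big-endian) is an assumption about
the intended behaviour of that label, not something stated in this message.

```python
def decode_labelled(octets: bytes, label: str) -> str:
    """Convert a sequence of octets into characters according to a charset label.

    Hypothetical helper for illustration only.
    """
    name = label.upper()
    if name == "UTF-16BE":
        # Byte order is fixed by the label: most significant octet first.
        return octets.decode("utf-16-be")
    if name == "UTF-16LE":
        # Byte order is fixed by the label: least significant octet first.
        return octets.decode("utf-16-le")
    if name == "UTF-16":
        # Assumption: a leading byte order mark decides the byte order.
        if octets[:2] == b"\xfe\xff":
            return octets[2:].decode("utf-16-be")
        if octets[:2] == b"\xff\xfe":
            return octets[2:].decode("utf-16-le")
        # Assumption: with no byte order mark, treat the text as big-endian.
        return octets.decode("utf-16-be")
    raise ValueError(f"unsupported charset label: {label!r}")


if __name__ == "__main__":
    sample = b"\x00H\x00i"
    print(decode_labelled(sample, "UTF-16BE"))
    print(decode_labelled(sample, "UTF-16LE"))
```

The same four octets come out as "Hi" under the UTF-16BE label and as two
unrelated characters (U+4800 and U+6900) under UTF-16LE, which is why the
label has to identify the complete conversion and not just the character set.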

Regarding the controversial registration of 3 labels instead of one:

At 11:37 1999-08-13 -0700, Frank da Cruz wrote:
>The IETF is in a position to legislate what flies around on the Internet
>wires, and should exercise its power in this case to mandate UTF-16 in one
>and only one form rather than all possible forms including "guess".

The current state of the draft is what it is precisely because of a
perceived consensus that the IETF is NOT in a position to legislate byte
order. Keith Moore sums it up nicely:

At 12:59 1999-08-15 -0700, Keith Moore wrote:
>The alternative is worse: people will still send proprietary stuff,
>(and some vendors will still build MUAs that favor use of their
>proprietary data formats, to try to make their competitors' products
>look bad), but they'll either do so without accurate labelling
>(e.g. mislabelling vendor-proprietary-codepage as iso-8859-1)
>or without precise labelling (e.g. labelling everything as
>application/octet-stream and expecting recipients to guess the
>content-type based on the filename suffix). In fact, both of these
>have been done, the products are widely deployed, and those products
>cause interoperability problems.

The Holy Grail is of course to have only one encoding form for *all* text
going down the Internet wires. The IETF prefers UTF-8, but recognizes that
there must be room, at least for a (long) while if not forever, for other
charsets, including UTF-16. Hence the registration.

Would it work to legislate one byte order? Which one? As mentioned in
this thread, most if not all extant standards specify big-endian. But a
certain OS/processor combination vastly dominates the Internet and happens
to be, alas, little-endian. Hmmmm... Nobody has a perfect crystal ball and
can *prove* that it wouldn't work, but the consensus, especially among those
who have lived through the MIME saga, is that it wouldn't. Hence the multiple


