Re: comments: UTF-16, an encoding of ISO 10646 to Informational

From: Francois Yergeau (yergeau@alis.com)
Date: Wed Aug 18 1999 - 11:11:05 EDT


Wow, just a few days away on vacation and I find such a long thread on this
Internet draft! I haven't read all of it yet, but I'm going to address
some of the points, as co-author of the draft.

At 08:03 1999-08-13 -0700, Hart, Edwin F. wrote:
>I have just one suggestion: I would suggest that the introduction label
>UTF-16 as a "character encoding scheme" (as I understand the term) for
>Unicode/ISO-10646.

The registrations of the "UTF-16", "UTF-16BE" and "UTF-16LE" labels define
more than a "character encoding scheme" (CES): they define the combination
of a CES and a "Coded Character Set" (CCS), as required by RFC 2278, "IANA
Charset Registration Procedures". The CES is the one described in the draft
(as well as in 10646 annex C and in Unicode) and the CCS is of course
Unicode/10646.
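
As a rough illustration of the CCS/CES split (this is not part of the
draft), here is a minimal Python sketch. It assumes Python's built-in codec
names "utf-16-be" and "utf-16-le" as stand-ins for the labels discussed
here; the sample characters and variable names are arbitrary choices made
for the example.

```python
# CCS step: the coded character set maps a character to an integer
# code point in Unicode/10646.
ch = "\u20ac"            # EURO SIGN, picked arbitrarily for illustration
code_point = ord(ch)     # 0x20AC

# CES step: the encoding scheme serializes that code point as octets.
# The same 16-bit value yields two different octet sequences.
be_octets = ch.encode("utf-16-be")   # octets 20 AC
le_octets = ch.encode("utf-16-le")   # octets AC 20

# A character outside the BMP is carried as a surrogate pair,
# i.e. two 16-bit units (here D834 DD1E in big-endian order).
astral = "\U0001d11e"    # MUSICAL SYMBOL G CLEF
astral_be = astral.encode("utf-16-be")

print(hex(code_point), be_octets.hex(), le_octets.hex(), astral_be.hex())
```

The code point is the same in every case; only the serialization differs,
which is why a label has to pin down both parts.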

Regarding the long thread re: the ISO-IR vs the IANA registry:

Michael Everson:
> IANA should NOT "register" character sets. They should make use of
> registered character sets. Full stop.

The two simply do not register the same things. The ISO-IR registers CCSs
(directly usable only within an ISO 2022 environment) whereas the IANA
registers NAMES for complete CCS/CES combinations directly usable in MIME
(Internet mail) and other protocols that leverage the MIME content
labelling mechanisms. Quoting from RFC 2278:

   The term "charset" (see historical note below) is used here to refer
   to a method of converting a sequence of octets into a sequence of
   characters. This conversion may also optionally produce additional
   control information such as directionality indicators.

And the RFC goes on to say that a charset label MUST describe a charset as
defined here.

Regarding the controversial registration of 3 labels instead of one:

At 11:37 1999-08-13 -0700, Frank da Cruz wrote:
>The IETF is in a position to legislate what flies around on the Internet
>wires, and should exercise its power in this case to mandate UTF-16 in one
>and only one form rather than all possible forms including "guess".

The current state of the draft is what it is precisely because of a
perceived consensus that the IETF is NOT in a position to legislate byte
order. Keith Moore sums it up nicely:

At 12:59 1999-08-15 -0700, Keith Moore wrote:
>The alternative is worse: people will still send proprietary stuff,
>(and some vendors will still build MUAs that favor use of their
>proprietary data formats, to try to make their competitors' products
>look bad), but they'll either do so without accurate labelling
>(e.g. mislabelling vendor-proprietary-codepage as iso-8859-1)
>or without precise labelling (e.g. labelling everything as
>application/octet-stream and expecting recipients to guess the
>content-type based on the filename suffix). In fact, both of these
>have been done, the products are widely deployed, and those products
>cause interoperability problems.

The Holy Grail is of course to have only one encoding form for *all* text
going down the Internet wires. The IETF prefers UTF-8, but recognizes that
there must be room, at least for a (long) while if not forever, for other
charsets, including UTF-16. Hence the registration.

Would it work to legislate one byte order? Which one? As mentioned in
this thread, most if not all extant standards specify big-endian. But a
certain OS/processor combination vastly dominates the Internet (see
http://www.statmarket.com/SM?c=Operating_System) and happens to be, alas,
little-endian. Hmmmm... Nobody has a perfect crystal ball and can *prove*
that it wouldn't work, but the consensus, especially among those who have
lived through the MIME saga, is that it wouldn't. Hence the multiple
registrations.
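
To show what is at stake when byte order is left unlabelled, here is a
small Python sketch (an illustration only, not something specified by the
draft). It assumes Python's "utf-16-be", "utf-16-le" and "utf-16" codecs
and the BOM constants from the standard codecs module; the sample octets
are arbitrary.

```python
import codecs

# Two octets with no indication of byte order.
octets = b"\x20\xac"

# The same octets give two different characters depending on the guess.
as_be = octets.decode("utf-16-be")   # U+20AC EURO SIGN
as_le = octets.decode("utf-16-le")   # U+AC20, a Hangul syllable

# With a leading byte order mark, Python's generic "utf-16" codec
# selects the byte order from the mark and strips it from the result.
with_be_bom = (codecs.BOM_UTF16_BE + octets).decode("utf-16")
with_le_bom = (codecs.BOM_UTF16_LE + octets[::-1]).decode("utf-16")

print(hex(ord(as_be)), hex(ord(as_le)))
print(with_be_bom == with_le_bom)
```

An explicit label or a byte order mark removes the guess; without either,
a receiver cannot tell the two readings apart from the octets alone.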

Regards,

-- 
François


