Re: Last Call: UTF-16

From: Michael Everson (
Date: Sun Aug 15 1999 - 14:48:19 EDT

At 13:18 -0400 1999-08-15, Frank da Cruz wrote:

>The larger point of my rant (beyond the UTF-16 byte-order issue) is that NO
>private character sets should be registered for use on the Internet.

I find this troubling. What, exactly, do you mean? The ISO-IR does not
register character sets for particular uses. It registers them to give them
unique identifiers.

>is not a Mac versus PC issue. I fully appreciate how aggravating it is to
>receive (say) email encoded in a PC code page when I have a Macintosh, or
>vice versa. Or for that matter to receive email whose MIME content is
>marked as Microsoft Word when I have a text-based email client on UNIX.

Well I use Eudora, which is flexible and contains a number of filters which
can convert text from one character set to another.
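A filter of this kind can be sketched in a few lines. This is only an illustration and not how Eudora does it: it assumes Python's built-in codecs (here `mac_roman` and `cp1252`), and the function name `transcode` is made up for the example. Characters with no equivalent in the target set are replaced with "?".

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one coded character set to another.

    Hypothetical helper: decodes with the source codec, then re-encodes
    with the target codec, substituting '?' for unmappable characters.
    """
    text = data.decode(source)
    return text.encode(target, errors="replace")


# A Mac-encoded string converted for a Windows recipient.
mac_bytes = "café".encode("mac_roman")
pc_bytes = transcode(mac_bytes, "mac_roman", "cp1252")
```

The lossy case is the one that matters in practice: an fi-ligature exists in Mac Roman but not in Windows code page 1252, so it cannot survive the conversion.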

>This kind of thing should not happen on an open network, at least not unless
>the two end parties go out of their way to agree beforehand to use
>nonstandard character sets, encodings, or application-specific formats. And
>when they DO agree to escape from the standards, there is no need for a
>registry. It should be entirely irrelevant to the IETF that there even is
>such a thing as a PC code page, or Apple Quickdraw, or any other private
>character set.

>Right, but then we need new and better standards like Unicode. We do NOT
>need to send PC code pages through the Internet, because not everybody has
>a PC, and for that matter, not every PC uses the same code pages.

Look, I can't send anything but 8-bit code pages through the internet
because I use Mac OS 8.5 which has no Unicode support. (Note to Deborah, I
know that there's some stuff hidden in the TEI folder but until I can type
a thorn, an fi-ligature, a c-cedilla and a c-caron in the same file in
TeachText without switching fonts it means that there's no Unicode support
in the Mac OS.) So, Frank, I need to use MIME, and I need to use filters
that can convert from one character set to another. If all data got
converted into UTF-8 to go to the net that would be fine, but it will take
years and years for that to happen. We're going to need tagged mail for
ages to come. Software developers of internet applications need to make
sure that the flexibility is there to the end user to add new codings if

>Yes, but the problem is that MIME lets you specify any character set at all
>in the message header, but it is clearly impractical to force every
>application to understand every character set and encoding. Only the
>absolute minimum number of character sets should be used for interchange.

Here's where you get in trouble. EGT is localizing Eudora Light into
Inuktitut with the Baffin Divisional Board of Education in Iqaluit,
Nunavut. Inuktitut Utilities is a Mac OS WorldScript that I developed to
allow them to write Inuktitut in Syllabics. And lo! a new 8-bit coded
character set was born. An 8-bit coded character set which is being used to
code data. Now, we haven't gone and registered this in the ISO-IR yet. But
we should. And there should be a valid MIME alias for everything in the
ISO-IR. We will certainly need to provide UTF-8 conversion for Inuktitut
Syllabics, when we have to convert from Mac Inuit to Unicode. But in the
meantime we have to exchange data with Mac Inuit.

>If I am writing (say) an email client for (say) a Macintosh, then I would
>expect to be required to know about ISO standard character sets and to
>convert between them and Apple ones, but I do not see how I could be
>expected also to know about PC code pages, NeXTSTEP, Data General, Hewlett
>Packard, EBCDIC, and every other conceivable encoding. Where would I even
>find the specifications?

I believe they are on the Unicode CD. All those entities should register
their coded character sets in the ISO-IR, though. There should be standard
plug-in mappings from ISO-IR x to UTF-8 and vice-versa.
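As a rough sketch of what such a plug-in mapping could look like, assume each registered 8-bit set ships as a 256-entry table of Unicode code points. Everything named here is hypothetical: the `REGISTRY` dictionary, the functions, the `x-demo-8bit` label, and above all the table contents, which are invented for illustration and are not the real Mac Inuit mapping or any actual ISO-IR registration.

```python
# Hypothetical registry: label -> 256-entry list of Unicode code points.
REGISTRY = {}


def register(name, table):
    """Add an 8-bit coded character set to the registry."""
    if len(table) != 256:
        raise ValueError("an 8-bit table needs exactly 256 entries")
    REGISTRY[name] = list(table)


def to_utf8(name, data):
    """Map bytes in the named 8-bit set to UTF-8."""
    table = REGISTRY[name]
    return "".join(chr(table[b]) for b in data).encode("utf-8")


def from_utf8(name, data):
    """Map UTF-8 back to the named 8-bit set (KeyError if unmappable)."""
    reverse = {cp: byte for byte, cp in enumerate(REGISTRY[name])}
    return bytes(reverse[ord(ch)] for ch in data.decode("utf-8"))


# Invented demo table: ASCII in the lower half, and the upper half mapped
# onto U+1400..U+147F (Canadian Aboriginal Syllabics block) purely as an
# example. This is NOT the actual Mac Inuit layout.
demo_table = list(range(128)) + [0x1400 + i for i in range(128)]
register("x-demo-8bit", demo_table)

utf8_bytes = to_utf8("x-demo-8bit", b"A\x83")
round_trip = from_utf8("x-demo-8bit", utf8_bytes)
```

With one table per registered set, a mail client would need only a table lookup to support a new coding, rather than new code for each one.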

>In a few years, perhaps, the Internet will carry UTF-8 and UTF-16 (hopefully
>in one form only) on the wire, and then we won't have to worry about losing
>em-dashes (or OE digraphs, or per-mil signs, etc) when interchanging data
>across the Internet, no matter what platforms are on either end.

Maybe, but people with non-Latin-1 character set requirements are having to
exchange data TODAY.

Michael Everson * Everson Gunn Teoranta *
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement)
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire
