Michael Everson <firstname.lastname@example.org> said:
> Frank responded:
> >That note was dashed off in haste. The idea I was trying to get across is
> >that, when a particular writing system may be represented by more than one
> >character set, and one of them is an international standard, and the others
> >are not, there is no reason or justification for the IETF to recognize the
> >nonstandard ones, nor does it serve any useful purpose. For example, if
> >Latin-2 is registered for use on the Internet, there is no reason to also
> >register PC Code Page 852 and/or 1250.
> I could not disagree more. This means that 90% of the PCs in the world,
> which use Windows extended Latin-1 and Mac Roman, aren't served.
Wow, I must really be incoherent today...
The larger point of my rant (beyond the UTF-16 byte-order issue) is that NO
private character sets should be registered for use on the Internet. This
is not a Mac versus PC issue. I fully appreciate how aggravating it is to
receive (say) email encoded in a PC code page when I have a Macintosh, or
vice versa. Or for that matter to receive email whose MIME content is
marked as Microsoft Word when I have a text-based email client on UNIX.
This kind of thing should not happen on an open network, at least not unless
the two end parties go out of their way to agree beforehand to use
nonstandard character sets, encodings, or application-specific formats. And
when they DO agree to escape from the standards, there is no need for a
registry. It should be entirely irrelevant to the IETF that there even is
such a thing as a PC code page, or Apple Quickdraw, or any other private
format.
> irritates me a lot as a Mac user, but also in general, because Latin-1
> itself is defective, lacking proper quotation marks and en- and em-dashes.
Right, but then we need new and better standards like Unicode. We do NOT
need to send PC code pages through the Internet, because not everybody has
a PC, and for that matter, not every PC uses the same code pages.
> The wire should transmit data, not necessarily interpret it. I thought the
> interpretation was supposed to be sorted out by the message headers.
Yes, but the problem is that MIME lets you specify any character set at all
in the message header, but it is clearly impractical to force every
application to understand every character set and encoding. Only the
absolute minimum number of character sets should be used for interchange.
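To see how little MIME itself constrains this, here is a small sketch in modern Python (purely an illustration, obviously not something a 1990s client would contain): the standard library will happily label a message with any charset the sender picks, and nothing obliges the receiver to understand it.

```python
from email.message import EmailMessage

# MIME records whatever charset the sender chooses; the label goes
# into the Content-Type header and the receiver is on its own.
msg = EmailMessage()
msg["Subject"] = "charset demo"
msg.set_content("Grüße aus Brno", charset="iso-8859-2")
print(msg["Content-Type"])  # text/plain; charset="iso-8859-2"
```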
If I am writing (say) an email client for (say) a Macintosh, then I would
expect to be required to know about ISO standard character sets and to
convert between them and Apple ones, but I do not see how I could be
expected also to know about PC code pages, NeXTSTEP, Data General,
Hewlett-Packard, EBCDIC, and every other conceivable encoding. Where would I even
find the specifications?
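The conversion I have in mind is a pivot through a universal set rather than pairwise tables between every platform. A minimal sketch in modern Python (illustrative only; the codec names are Python's):

```python
# "café" as a Mac Roman byte string (é is 0x8E in Mac Roman).
mac_bytes = b"caf\x8e"

# Decode the platform encoding to Unicode, then re-encode to the
# ISO standard set for interchange; no Mac-to-PC tables needed.
text = mac_bytes.decode("mac_roman")
latin1 = text.encode("iso-8859-1")   # é is 0xE9 in Latin-1
print(text, latin1)
```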
In a few years, perhaps, the Internet will carry UTF-8 and UTF-16 (hopefully
in one form only) on the wire, and then we won't have to worry about losing
em-dashes (or OE digraphs, or per-mil signs, etc) when interchanging data
across the Internet, no matter what platforms are on either end.
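The byte-order issue mentioned at the top is easy to demonstrate; again in modern Python, purely as illustration:

```python
s = "A"
print(s.encode("utf-8"))      # b'A'      no byte-order ambiguity
print(s.encode("utf-16-be"))  # b'\x00A'  big-endian on the wire
print(s.encode("utf-16-le"))  # b'A\x00'  little-endian on the wire
# The plain "utf-16" codec prepends a byte-order mark (BOM) so the
# receiver can tell which ordering the sender used.
bom = s.encode("utf-16")[:2]
print(bom)
```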
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT