Re: Last Call: UTF-16

From: Michael Everson
Date: Sun Aug 15 1999 - 12:23:21 EDT

Frank da Cruz and Eric Brunner were disagreeing. Eric and I have talked a
bit offline and some confusions I had are clearing up.

Frank said:

>Internet standards have to do with what goes on the wire. Where character
>sets are concerned, Internet standards should recognize only international
>standard character sets, namely those registered in the ISO International
>Register of Coded Character Sets, as is UTF-16. So far so good.

Eric disagreed:

>I disagree. Not with the wire format part of his observation.

I am not sure why Eric disagrees, but wait a minute everybody. There is a
difference between an international standard character set (such as ISO/IEC
10646, ISO/IEC 6937, ISO/IEC 8859-1, ISO/IEC 8859-14, ISO/IEC 8859-15,
etc.) and a character set which has been registered in the ISO
International Register of Coded Character Sets. A real ISO standard goes
through a long rigorous process of review by National Bodies, who are
responsible for the technical content of the standard. ISO standards are
registered in the ISO-IR.

But other character sets can also be registered in the ISO-IR. For these,
only the requesting body is responsible for the content of the character
set. It is not that difficult to register a character set, either. Rules
are set forth in ISO 2375, which is currently under revision.

Frank responded:

>That note was dashed off in haste. The idea I was trying to get across is
>that, when a particular writing system may be represented by more than one
>character set, and one of them is an international standard, and the others
>are not, there is no reason or justification for the IETF to recognize the
>nonstandard ones, nor does it serve any useful purpose. For example, if
>Latin-2 is registered for use on the Internet, there is no reason to also
>register PC Code Page 852 and/or 1250.

I could not disagree more. This means that 90% of the PCs in the world,
which use Windows extended Latin-1 and Mac Roman, aren't served. This
irritates me a lot as a Mac user, but also in general, because Latin-1
itself is defective, lacking proper quotation marks and en- and em-dashes.
The wire should transmit data, not necessarily interpret it. I thought the
interpretation was supposed to be sorted out by the message headers.
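
As a rough illustration of both points (an editorial sketch, not something
from the thread): the Python below builds a hypothetical 8-bit message whose
header names windows-1252, lets the header decide how the bytes are read,
and then checks which character sets can carry the resulting quotation marks
and em-dash. The message bytes and the helper names decode_body and
can_encode are assumptions made up for this example; Python's built-in
latin-1, cp1252 and mac_roman codecs stand in for ISO/IEC 8859-1, Windows
extended Latin-1 and Mac Roman.

```python
from email import message_from_bytes

# Hypothetical 8-bit message. Bytes 0x93, 0x94 and 0x97 are curly quotes and
# an em-dash in Windows-1252; in ISO/IEC 8859-1 they are C1 control codes.
RAW_MESSAGE = (
    b"Content-Type: text/plain; charset=windows-1252\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"\x93wire\x94 \x97 data"
)


def decode_body(raw: bytes) -> str:
    """Decode the body using the charset named in the message header."""
    msg = message_from_bytes(raw)
    charset = msg.get_content_charset("us-ascii")
    body = msg.get_payload(decode=True)  # body bytes as they arrived
    return body.decode(charset)


def can_encode(text: str, codec: str) -> bool:
    """Report whether every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


text = decode_body(RAW_MESSAGE)
support = {
    codec: can_encode(text, codec)
    for codec in ("latin-1", "cp1252", "mac_roman")
}

print(repr(text))
for codec, ok in support.items():
    print(f"{codec:10} {'can' if ok else 'cannot'} represent the text")
```

The transport only carries bytes here; the charset label in the header is
what turns them into characters, and of the three codecs only latin-1 has
no code positions for the curly quotes and the em-dash.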

>If Old High Blackfoot has not been registered by the ISO, then it should be,
>assuming there is consensus on its form and content. In the meantime, if
>there is (say) a Canadian or Blackfoot national standard for it, then it can
>be used in the interim.

Certainly anyone using a coded character set for special purposes should be
encouraged to register it so that vendors know about it.

>The broader the scope of a registration authority,
>the more it is to be preferred since, presumably, it represents a broader
>consensus. However, this view ignores political questions and so is not
>entirely satisfactory, but politics are everywhere. (This view also raises
>the question of multiple registration authorities, which in turn suggests a
>need for a registration authority for registration authorities.)

No, we don't need multiple registration authorities or multiple registries.

>But this is a tangent. My point was: the Internet should not be registering
>and blessing every corporate character set (e.g. PC code page) that comes

The Internet should not be registering and blessing *anything*. The
Internet should use the ISO-IR. ISO has set up a Registration Authority for
character sets. We do NOT need more than one registry.

>There are two ends of a network connection. If the software at each end has
>to understand every wacky character set that was ever invented, and every
>conceivable byte order for multibyte character sets, it would be
>unnecessarily complex and ungainly, and nobody would bother to support the
>character sets (or byte orders) they didn't care about, so I think this
>approach inhibits rather than fosters open communication.

But you're saying that people have an obligation to support what's in the
ISO-IR? It is pretty large right now, and I have not seen a lot of support
on the part of the vendors. I mean, most of what's there doesn't even have
mapping tables to the UCS. Yet.
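
For readers unsure what a mapping table to the UCS amounts to, here is a
minimal Python sketch (again an editorial illustration, not part of the
thread). It assumes Python's bundled iso8859_2, cp852 and cp1250 codecs as
stand-ins for Latin-2 and the two PC code pages mentioned earlier; the
function name ucs_mapping is hypothetical. A registered set without such a
table would have nothing to put in place of the codec.

```python
def ucs_mapping(codec):
    """Return {byte value: UCS code point} for a single-byte character set."""
    table = {}
    for byte in range(256):
        try:
            table[byte] = ord(bytes([byte]).decode(codec))
        except UnicodeDecodeError:
            continue  # this byte is undefined in the character set
    return table


# U+0104 LATIN CAPITAL LETTER A WITH OGONEK is in all three repertoires,
# but each set assigns it a different byte value.
A_OGONEK = 0x0104

positions = {}
for codec in ("iso8859_2", "cp852", "cp1250"):
    table = ucs_mapping(codec)
    positions[codec] = [b for b, cp in table.items() if cp == A_OGONEK]

for codec, found in positions.items():
    shown = ", ".join(f"0x{b:02X}" for b in found)
    print(f"{codec:10} U+{A_OGONEK:04X} at {shown}")
```

With tables like these, converting between two character sets is a lookup
through the UCS code point rather than a dedicated pairwise converter.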

>(A registration
>authority that registers every character set is like a patent office that
>approves every patent application -- there is also supposed to be a search
>for prior art to prevent duplication.)

ISO 2375 does specify this. But it's duplicate coding that is forbidden,
not field of application.

>If, on the other hand, only a small set of standard character sets are
>allowed on the wire (sufficient, obviously, to represent all desired writing
>systems), then each application on each end system only needs to know the
>standard ones in addition to its own local ones.

I have a real problem with this. "The wire" shouldn't make any judgement
about what is and is not allowed.

Michael Everson * Everson Gunn Teoranta
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement)
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire
