Re: Last Call: UTF-16

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Aug 17 1999 - 15:25:06 EDT


Frank wrote:

> This still does not mean the Internet should endorse or promote the use
> of private character sets, or even allow it.

O.k. This thread has finally got me riled up. It is one thing to
promote the use of a single form of Unicode on the Internet. I think
we can all agree that that is the holy grail we are striving for, since
it would solve so many problems. But thumping the drum for the International
Register of character sets is counterproductive in this regard.

>
> Let us not be so quick to dismiss the International Register. It confers
> numerous benefits:
>
> 1. It maintains a certain consistency. Character sets are required
> to have a certain structure (or, more precisely, to fit into one
> of the predefined formats, or else, like the UCS, into a catch-all
> "other" format).

The consistency is entirely a *formal* consistency. The registrants
are required to fit their garbage into certain numeric shapes, but
there is, beyond that, no real constraint on what kinds of garbage
gets bagged up for registration. There are no real quality controls
here, and no reality checks of implementation consistency.

>
> 2. Character sets registered in the IR come either from the ISO or
> from the standards body of a member nation, and therefore do not
> (in general) reflect the interests of a particular company or
> the quirks of a particular architecture or operating system.
> Instead, they reflect (presumably) a consensus among parties with
> diverse and possibly conflicting interests, just as the Internet itself
> must do.

But they do not reflect any implementation constraints either. Many of
the entries in the IR are the result of standards politics by various
NB's, ungrounded in implementation. In many instances, they represent
the NB equivalent of customers calling up the customer support line
and asking for software features. NB's make up charts for special
purposes, then register them in the IR, where they cannot be stopped,
basically on "spec" -- with the hope that having them in the registry
will get them implemented somehow, or will be a lever to push them into
one of the successful ISO character set standards, such as the 8859 series.
The IR is used as an alternative means of petition, when one or another
group has been rejected for a special pleading character set by
Microsoft or IBM. Ultimately it is the *vendor* character sets that
matter, since those are the ones that carry all the data (including the
vendor implementations of ISO character sets like 8859-x). So everyone
and their brother uses the IR as a way to try to pressure vendors to
implement their particular pipedreams.

For such an example, look at one of the most recent registrations:
IR 204, "Supplementary set for Latin-1 alternative with EURO SIGN."
This is the G1 set of Latin-1, with the EURO SIGN swapped in at 0xA4,
where CURRENCY SIGN is in Latin-1. Shades of the old ISO 646 with the
swapped out local registrations with currency signs at 0x24 DOLLAR SIGN!
What damn use is this? Are vendors going to change out Latin-1 for
Latin-1-IR204-supplem-EURO?? Isn't that what Latin-9 was for (IR 203)?
The three: Latin-1, Latin-9, and now IR 204, are just a prescription
for data chaos and corruption.
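
To make the 0xA4 collision concrete, here is a minimal sketch in Python. The
"latin-1" and "iso8859-15" (Latin-9) codecs are real; decode_ir204_style is a
hypothetical helper that only models IR 204 as described above (Latin-1 with
EURO SIGN swapped in at 0xA4), not an actual codec for the registration.

```python
# Sketch: how one stored byte, 0xA4, reads under three near-identical sets.

def decode_ir204_style(data: bytes) -> str:
    # Hypothetical model of IR 204: Latin-1, except 0xA4 is EURO SIGN.
    # In Latin-1 only byte 0xA4 decodes to U+00A4, so a replace is enough.
    return data.decode("latin-1").replace("\u00a4", "\u20ac")

def differing_positions(codec_a: str, codec_b: str) -> list:
    """Byte values in 0xA0..0xFF that the two codecs decode differently."""
    return [b for b in range(0xA0, 0x100)
            if bytes([b]).decode(codec_a) != bytes([b]).decode(codec_b)]

sample = b"\xa4 100"
as_latin1 = sample.decode("latin-1")      # CURRENCY SIGN
as_latin9 = sample.decode("iso8859-15")   # EURO SIGN
as_ir204 = decode_ir204_style(sample)     # EURO SIGN, but only 0xA4 changed

# Latin-9 changes more than the one position; this lists every difference.
latin1_vs_latin9 = differing_positions("latin-1", "iso8859-15")

print(repr(as_latin1), repr(as_latin9), repr(as_ir204))
print([hex(b) for b in latin1_vs_latin9])
```

The same bytes come out as a currency sign or a euro depending purely on the
label attached to them, and Latin-9 and the IR 204 variant agree at 0xA4 while
disagreeing elsewhere. Unlabeled or mislabeled data cannot be sorted out after
the fact.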

Or look at IR 152, "Residual Characters of ISO 6937-2:1983" "This set
of graphic characters comprises the 25 characters specified in ISO 6937-2:1983
but not in Parts 1 to 9 of ISO 8859." Huh? This is just using the IR
for standards bureaucracy -- a file folder to place a collection of
unmapped characters, and not a registration with any implementation
point.

In my opinion, the result is that the IR is far more chaotic, uncontrolled,
and unprincipled in its content than the important collections of vendor
character sets. Whatever their problems, the Microsoft, IBM, Apple,
HP, and other vendor collections of character sets are *architected*
and maintained.

>
> 3. A unique registration number is assigned which allows the character
> set to be identified in a concise, unambiguous, and language-neutral
> manner.

I can't argue that the registration number itself is concise, unambiguous,
and language-neutral. The whole problem is what is the status of the
c**p that tends to be on the other end of that number.

>
> 4. An escape sequence is assigned to designate the character set in the
> ISO 2022 environment. This is an essential component of any character
> set to those of us concerned with terminal emulation.
>
> 5. The character table is printed in the register so we may see the glyphs.

Badly scanned, in many instances. That is just a quibble for the dozens
of registrations that just rearrange the 7-bit or 8-bit deck chairs for
European Latin letters. But the CNS registrations, for example, are
completely unusable in their online versions.
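
For anyone following along who has not watched the escape-sequence mechanism
from point 4 at work, here is a small sketch using Python's built-in
"iso2022_jp" codec as one concrete ISO 2022 profile. The choice of codec and
sample string is an assumption made for illustration; the register itself
covers far more sets than this profile uses.

```python
# Sketch: escape-sequence designation in an ISO 2022 environment.
ESC = b"\x1b"

# "A" followed by HIRAGANA LETTER A (U+3042).
encoded = "A\u3042".encode("iso2022_jp")

# The encoder emits ESC $ B to designate the two-byte JIS X 0208 set
# (registered as IR 87), then the two bytes for the kana, then ESC ( B to
# switch back to ASCII (IR 6) before the stream ends.
print(encoded)

# Decoding follows the same designations in reverse.
decoded = encoded.decode("iso2022_jp")
print(decoded == "A\u3042")
```

The final byte of each escape sequence is what the registration assigns, so
the sequence tells a terminal which table to use, but nothing about whether
the table behind it is any good.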

>
> 6. The code assignments of each character are given.

Not for all of the registrations. Take a look at IR 161, "Audio Data
Syntax of CCITT Recommendation T.101." "This defines a data syntax for
conveying audio data in a videotext environment. This coding system
is the Audio Data Syntax specified in CCITT Recommendation T.101."
No chart. No names. No specification except by reference. Huh? Are
these even characters?

All of the UCS registrations are just schematics. And all of those are
botched. Now there are useless registrations for UTF-8 Level 1, 2, 3;
UCS-2 Level 1, 2, 3; UTF-16 Level 1, 2, 3; UCS-4 Level 1, 2, 3.

Those registrations for UCS have no relation whatsoever to the precisely
defined versions of the Unicode Standard -- which reflect the reality
of implementations by all the vendors. Instead, all the entries in
the IR for 10646 represent standards fantasies that cannot be correlated
to any specific set of data implemented in most real systems.

>
> 7. An official name is given to each character so we may identify it
> sufficiently to correlate it with instances in other character sets
> for mapping purposes.

Not true for the Asian registrations, of course. And for those, the
problem of correlation is difficult. The IR contributes nothing to the
solution -- the Asian cross-mappings have all been done elsewhere, by
the IRG, the Unicode Consortium, and by various vendors.

And for the other, small character set registrations, all of the problematical
mappings are just left sitting. Since there is no quality control on the
garbage that goes in, there is no way, in many instances, to tell *what*
the character is, in relation to either the UCS or any other character
set.

>
> 8. The registration is on line at:
>
> http://www.itscj.ipsj.or.jp/ISO-IR/

That is true. And the registration authority in Japan has done an admirable
job of ensuring that the registry is available online and is up-to-date.
The problem is what to do with what is *in* the registry.

>
> (except the ISO 10646 code tables are not online or distributed in printed
> format to register subscribers as far as I can tell).

Correct. So the only character encoding that *really* matters for determining
what all the rest of the characters in all the other character sets in the IR
are is the one not available in the registry itself.

Humpf.

--Ken

>
> - Frank
>


