Re: Last Call: UTF-16

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Tue Aug 17 1999 - 16:19:07 EDT


Ken wrote:
> ...
> Humpf.
>
(Great rant!)

OK, but what's the alternative? I'll grant that IBM (for example) does
a far better job of registering and documenting their own character sets
than the IR does, but that's just IBM. If we could get IBM to take
responsibility for everybody else's character sets (Unicode excluded of
course :-) we'd be in character-set heaven.

Except for the structural issues. PC code pages were not designed for
interchange; they were designed for internal use on the PC. Ditto for
Apple QuickDraw, etc. (Right?)

Somebody has to take responsibility for setting *interchange* standards
(and it's clearly a thankless job :-). Of course there is a lot of junk in
the IR, but not nearly as much as in the MIME list, since MIME registers
everything without even asking questions.

Anyway, we've beaten this poor horse to death. Sorry to make such a fuss.

Back to specifics...

> All of the UCS registrations are
> just schematics. And all of those are boshed. Now there are useless
> registrations for UTF-8 Level 1, 2, 3; UCS-2 Level 1, 2, 3; UTF-16
> Level 1, 2, 3; UCS-4 Level 1, 2, 3.
>
> Those registrations for UCS have no relation whatsoever to the precisely
> defined versions of the Unicode Standard -- which reflect the reality
> of implementations by all the vendors. Instead, all the entries in
> the IR for 10646 represent standards fantasies that cannot be correlated
> to any specific set of data implemented in most real systems.
>
Why are they useless? Who put them there? Over whose dead body?
Are Section 2.7 and Chapter 3 of the Unicode Standard 2.0 in conflict
with ISO 10646? If I have a software package that uses ISO registration
numbers or escape sequences to announce character sets (and I do -- is
that a bad thing?) how should I announce UCS-2 if my implementation
doesn't support combining characters or canonical equivalences? I don't
mean to be impertinent -- these are genuine questions.
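
To make the combining-character question concrete, here is a minimal
Python sketch (not part of the original exchange) of the difference
between a precomposed character and its canonically equivalent combining
sequence. It uses only the standard unicodedata module; the helper name
has_combining and the sample strings are illustrative assumptions.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def has_combining(text):
    """Return True if any character has a non-zero combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)


# A plain code-point comparison treats the two spellings as different.
same_code_points = precomposed == decomposed

# Normalizing to NFC composes the sequence, so the two compare equal.
canonically_equal = unicodedata.normalize("NFC", decomposed) == precomposed
```

An implementation that only handles precomposed characters could, under
these assumptions, use a check like has_combining to detect input it
cannot process rather than silently mishandling it.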

From the mail I've seen on this list over the years, it seems that many
people want to say "our software implements Unicode / ISO 10646, but only
for precomposed characters" (not to mention "doesn't handle BIDI", etc.).
I realize this is a touchy topic and that every implementation ought to be
complete, but in fact it's a lot easier to skirt this issue at first and
still cover about 1000 times more languages than you did before. I suspect
this is exactly what many companies are doing, and that it is one reason
for the Levels defined in ISO 10646 (the other being the incompatible
change regarding Jamos, right?).

About the separate ISO 10646 reference:

> Conveniently neglecting the very next sentence of clause 6.3, which
> continues:
>
> "When not serialized as octets, the order of octets may be specified
> by arrangement between sender and recipient (see 16.1 and annex H)."
>
But is data on the Internet not serialized as octets? The TCP/IP protocols
are chock full of references to "network byte order", and the sockets APIs
include functions for switching between local and network order.
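
As a small illustration of the byte-order point (a sketch, not taken
from any of the standards under discussion), the following Python
fragment serializes one 16-bit code unit in both orders and uses the
htons/ntohs functions that Python's socket module exposes. The variable
names and the choice of U+0041 are assumptions made for the example.

```python
import socket
import struct
import sys

code_unit = 0x0041  # U+0041 LATIN CAPITAL LETTER A as one 16-bit code unit

# Explicit serializations of the same 16-bit value.
big_endian = struct.pack(">H", code_unit)     # most significant octet first
little_endian = struct.pack("<H", code_unit)  # least significant octet first
network = struct.pack("!H", code_unit)        # "!" means network byte order

# htons converts from host to network order; ntohs converts back.
swapped = socket.htons(code_unit)
round_trip = socket.ntohs(swapped)

# On a little-endian host htons swaps the octets; on a big-endian host
# it leaves the value unchanged.
host_is_little_endian = sys.byteorder == "little"
```

Network byte order is big-endian, so the "!" and ">" forms produce the
same two octets regardless of the host.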

Again, does anybody really advocate three different forms of UTF-16 for
interchange?
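
For concreteness, here is a hedged sketch of what three forms could look
like on the wire, assuming they are big-endian, little-endian, and
BOM-prefixed UTF-16 (the message itself does not name them). It uses
Python's standard codecs; the sample text is arbitrary.

```python
import codecs

text = "A"

# Two fixed-order forms, with no byte order mark.
be = text.encode("utf-16-be")   # octets 00 41
le = text.encode("utf-16-le")   # octets 41 00

# BOM-prefixed forms: the receiver works out the order from the mark.
with_bom_be = codecs.BOM_UTF16_BE + be
with_bom_le = codecs.BOM_UTF16_LE + le

# The generic "utf-16" decoder consumes the BOM and picks the right order.
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")

# Reading big-endian octets with the wrong label yields a different
# character (U+4100) rather than an error.
misread = be.decode("utf-16-le")
```

The last line shows why an unlabeled or mislabeled byte order is a
problem for interchange: the octets decode without complaint to the
wrong text.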

- Frank


