Re: Last Call: UTF-16

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Aug 17 1999 - 18:23:22 EDT


Frank wrote:

>
> OK, but what's the alternative? I'll grant that IBM (for example) does
> a far better job of registering and documenting their own character sets
> than the IR does, but that's just IBM. If we could get IBM to take
> responsibility for everybody else's character sets (Unicode excluded of
> course :-) we'd be in character-set heaven.
>
> Except for the structural issues. PC code pages were not designed for
> interchange, they were designed for internal use on the PC. Ditto for
> Apple Quickdraw, etc. (Right?)

PC (and Mac and now Windows) code pages were not designed for interchange.
But the networking of PCs has forced them to be used for interchange.
And this problem is not particularly new -- it's 15 years old by this
point. PC code page data is interchanged across platforms, over networks,
all the time. That has just forced their use with other protocols, since
PC code pages are not consistent with the ISO 2022 framework for interchange.

>
> Somebody has to take responsibility for setting *interchange* standards
> (and it's clearly a thankless job :-) Of course there is a lot of junk in
> the IR, but not nearly as much as in the MIME list, since MIME registers
> everything without even asking questions.

I won't step into the issue of MIME types in general, but for MIME charsets,
I grant you that the IANA registry of charsets is basically as big a
mess as the IR. To their credit, the IETF folks have recognized
this and have attempted to clean up and rationalize their process for
charsets.

But ultimately the solution is to make use of the universal character
set wherever possible -- and to keep resisting the addition of more
8-bit character encodings that add to the legacy problem and
that add to the registry messes.

>
> Back to specifics...
>
> > All of the UCS registrations are
> > just schematics. And all of those are boshed. Now there are useless
> > registrations for UTF-8 Level 1, 2, 3; UCS-2 Level 1, 2, 3; UTF-16
> > Level 1, 2, 3; UCS-4 Level 1, 2, 3.
> >
> > Those registrations for UCS have no relation whatsoever to the precisely
> > defined versions of the Unicode Standard -- which reflect the reality
> > of implementations by all the vendors. Instead, all the entries in
> > the IR for 10646 represent standards fantasies that cannot be correlated
> > to any specific set of data implemented in most real systems.
> >
> Why are they useless? Who put them there? Over whose dead body?

I cannot fault the ISO folks for making these registrations. By their own
rules they were obligated to follow through. The relevant escape sequences
are defined in ISO/IEC 10646-1:1993 and in Amendments 1 and 2 for UTF-16
and UTF-8. By law, as it were, they were then obligated to make sure that those
escape sequences made it into the IR. In actuality, the main function
of the IR is to make sure that nobody steps on anybody else's escape
sequences. 10646-1 also internally makes use of already registered escape
sequences, e.g. ESC 02/02 04/03 to identify the full C1 set of ISO/IEC 6429,
which is IR 77, and so forth.
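As an aside on the notation: the column/row form is just a way of spelling
byte values, where CC/RR denotes the byte CC*16 + RR. A minimal sketch in
Python (the helper name is mine, purely for illustration):

    # ISO 2022 "column/row" notation: "CC/RR" denotes the byte CC*16 + RR.
    def iso2022_bytes(*cells):
        return bytes(int(col) * 16 + int(row) for col, row in
                     (cell.split("/") for cell in cells))

    # ESC 02/02 04/03 is therefore the byte sequence 1B 22 43 (ESC " C),
    # the announcer for the full C1 set of ISO/IEC 6429 (IR 77).
    seq = b"\x1b" + iso2022_bytes("02/02", "04/03")
    assert seq == b"\x1b\x22\x43"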

But the existence of those 12 new escape sequences in the text of 10646-1 and
its amendments to refer to forms of 10646 is not something anyone associated
with the Unicode Consortium was pushing for. In my opinion, they were
considered just the necessary chaff required to get an ISO character
encoding standard complete, and part of the price that had to be paid
to enable a merger of the Unicode Standard and the International Standard.

The particular levels are useless because they cut the problems the
wrong way. Level 3 is simply 10646 without restrictions -- that at least
is o.k., and corresponds to Unicode. But Levels 1 and 2 were attempts to
make announceable forms of 10646 that corresponded to the standards
committees' concepts of simpler forms of the standard, without reference
to what made sense in implementations -- either on a script-by-script
basis, or in terms of the implementing technology.

More pertinent, perhaps, are the subsets and the subset announcement
mechanism specified in Clause 16.3 of 10646-1. That mechanism, at least,
provides something explicit that could be used to characterize
implementations that choose to cover only a part of the standard.
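To make that concrete, here is a minimal, hypothetical sketch in Python;
the ranges are an assumed example (Latin-1, Greek, Cyrillic), not actual
clause 16.3 subset identifiers:

    # Characterize an implementation by an explicit repertoire subset,
    # in the spirit of 10646-1 clause 16.3, rather than by Levels.
    SUPPORTED = [(0x0000, 0x00FF),   # Basic Latin + Latin-1 Supplement
                 (0x0370, 0x03FF),   # Greek
                 (0x0400, 0x04FF)]   # Cyrillic

    def in_repertoire(ch):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in SUPPORTED)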

> Are sections 2.7 and Chapter 3 of the Unicode Standard 2.0 in conflict
> with ISO 10646?

No. That question at least has a simple answer!

> If I have a software package that uses ISO registration
> numbers or escape sequences to announce character sets (and I do -- is
> that a bad thing?) how should I announce UCS-2 if my implementation
> doesn't support combining characters or canonical equivalences? I don't
> mean to be impertinent -- these are genuine questions.

You should announce UTF-16 (or UTF-8), Level 3, and pray that your receiver
interprets that as Unicode. Then you need to find another way to
convey what version of Unicode you are supporting (see
http://www.unicode.org/unicode/standard/versions/) and any restrictions
you have on supported subset (i.e. what characters you can and cannot
display, if you have restrictions that cannot be handled by generic
fallbacks). As to why you need to specify the version of Unicode: the
encoding of Hangul was different in Unicode 1.1 from what it is now in
Unicode 2.0 and subsequent versions. If you just announce UCS encoding forms and levels
with the IR escape sequences, you cannot know whether you are garbling
Hangul or not. Within the 10646 framework, the only way
to deal with additions is by referencing specific Amendments, but there
is (correctly) no escape code announcement sequence that corresponds to
10646-1:1993 plus amendments 1, 2, 3, 4, and 5 (or any other
combination of amendments), for example.
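To make the Hangul point concrete, a rough sketch in Python; the ranges
reflect the 1.1 and 2.0 code charts as I read them (Unicode 1.1 put the
Hangul syllables at U+3400..U+4DFF; Amendment 5 moved the full set to
U+AC00..U+D7A3), and the function name is mine:

    # A bare "UCS-2 Level 3" label cannot tell you which version's
    # Hangul encoding produced the data; the zones at least can.
    OLD_HANGUL = range(0x3400, 0x4E00)   # Unicode 1.1 Hangul syllable zone
    NEW_HANGUL = range(0xAC00, 0xD7A4)   # Unicode 2.0 and later

    def hangul_zone(ch):
        cp = ord(ch)
        if cp in OLD_HANGUL:
            return "1.1"      # garbled if interpreted as Unicode 2.0+
        if cp in NEW_HANGUL:
            return "2.0+"
        return None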

If your implementation cannot handle combining characters (at all -- which
means it will be restricted to Latin/Greek/Cyrillic/etc. and CJK, and
perhaps a half-assed Hebrew and Arabic that cannot use points), that is
best expressed by repertoire restrictions. And if you cannot deal
with canonical equivalences, that is another way of saying that
your implementation doesn't normalize Unicode data. Better to state that
your implementation requires Unicode in Normalization Form C or Normalization
Form D. That, at least, is well-defined, and you have a prayer of getting
the correct form of data interchanged with you.
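For example, in terms of Python's unicodedata module (a minimal sketch;
any Unicode library that implements the UTR #15 forms behaves the same way):

    import unicodedata

    decomposed = "e\u0301"        # 'e' + COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)
    assert composed == "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

    # A receiver that requires Normalization Form C normalizes once on
    # input instead of equivalencing on every comparison.
    assert unicodedata.normalize("NFD", composed) == decomposed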

>
> From the mail I've seen on this list over the years it seems like many
> people want to say "our software implements Unicode / ISO 10646 but only
> for precomposed characters" (not to mention "doesn't handle BIDI," etc.).
> I realize this is a touchy topic and every implementation should be full,
> but in fact it's a lot easier to skirt this issue at first and still cover
> about 1000 times more languages than you did before, and I suspect this is
> exactly what many companies are doing, and the reason for Levels defined
> in ISO 10646 (the other being the incompatible change regarding Jamos,
> right?)

There are many different dimensions along which a "Unicode-conformant"
implementation can cut corners to get off the ground. The Unicode
Standard doesn't try to spell all that out with a bunch of parameter
settings, because it is so obviously hopeless for a universal character
encoding. Nobody can force all the Unicode applications to deal with
Tibetan, or to supply a bunch of parameters specifying, if they
do deal with Tibetan, whether they make use of the Sanskrit transliteration
extensions for Tibetan, and whether they decompose all the stacks or make
use of two-part characters for Sanskrit vowels.

Instead, conformance for the Unicode Standard basically comes down
to an admonition that *if* you interpret a particular character from
the standard, you interpret it according to the standard, along with
a set of rules for how not to garble characters that you don't interpret
in a data stream that you are handling.

Your example of "doesn't handle BIDI" comes down to this:
*if* your implementation interprets characters in the main Hebrew,
Arabic, Syriac, or Thaana blocks of the standard, and *if* it does
any display at all (as opposed to backend processing with no
display component), then it *must* conform to Unicode bidirectional
behavior, since that is part of the specified normative behavior
of characters from those blocks. Of course, you could claim conformance
merely to ISO/IEC 10646-1 and not to the Unicode Standard, and then
interchange UCS Arabic characters while blithely ignoring the Unicode
bidirectional algorithm, but your chance of successfully interchanging
that data with many of the Unicode implementations on Windows or
whatever would be pretty small.
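That normative per-character behavior is visible in the character database
itself; a minimal sketch using Python's unicodedata (note: with a current
copy of the Unicode Character Database, which splits Arabic off into its
own 'AL' category):

    import unicodedata

    # The bidirectional category is a normative property of each
    # character; interpreting these blocks means honoring it.
    print(unicodedata.bidirectional("\u05d0"))   # 'R'  -- HEBREW LETTER ALEF
    print(unicodedata.bidirectional("\u0627"))   # 'AL' -- ARABIC LETTER ALEF
    print(unicodedata.bidirectional("A"))        # 'L'  -- LATIN CAPITAL LETTER A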

Of course people are looking for more. The normalization forms
(see UTR #15) are an attempt to fill one of the big holes by providing
a well-defined specification of normalized forms, for use by processes
that cannot -- or for performance reasons choose not to -- do equivalencing
on the fly. And because of the size of fonts and the nature of the
font-support problem for displaying Unicode, people are looking for
good public mechanisms for announcing what portions of Unicode a particular
font can display, and with what level of fidelity and sophistication,
for example.

But the problem is far more complicated than just announcing Level 1
or Level 2 or Level 3 support, and thinking that will accomplish
anything.

>
> About the separate ISO 10646 reference:
>
> > Conveniently neglecting the very next sentence of clause 6.3, which
> > continues:
> >
> > "When not serialized as octets, the order of octets may be specified
> > by arrangement between sender and recipient (see 16.1 and annex H)."
> >
> But is data on the Internet not serialized as octets? TCP/IP protocol is
> chock full of references to "network byte order" and sockets APIs include
> functions for switching between local and network order.
>
> Again, does anybody really advocate three different forms of UTF-16 for
> interchange?

No. But all three of these forms: UTF-16BE, UTF-16LE, and UTF-16 (with
BOM as announcer) exist and are used in interchange. They are also
now precisely defined in the text of the Unicode Standard, Version 3.0
(forthcoming). What is the problem with getting the same forms and names
precisely defined in an RFC and listed in the IANA registry?
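For concreteness, a minimal sketch of the three forms in Python (any
environment with UTF-16 codecs shows the same bytes):

    u = "A"
    print(u.encode("utf-16-be"))   # b'\x00A'  -- UTF-16BE: big-endian, no BOM
    print(u.encode("utf-16-le"))   # b'A\x00'  -- UTF-16LE: little-endian, no BOM
    print(u.encode("utf-16"))      # UTF-16: BOM first, then the platform's
                                   # byte order, e.g. b'\xff\xfeA\x00'

    # On input the BOM disambiguates the byte order:
    assert b"\xff\xfeA\x00".decode("utf-16") == "A"   # little-endian BOM
    assert b"\xfe\xff\x00A".decode("utf-16") == "A"   # big-endian BOM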

Which form(s), if any, are appropriate for use in interchange in any
particular protocol is a separate battle to fight. As
far as I am concerned, if the developers of Internet protocols choose
to mandate that only UTF-8 be allowable for some (or all) such protocols,
without any use of UTF-16BE, UTF-16LE, or UTF-16, that is a condition
I could live with, because all of these forms are interconvertible with
ease. But UTF-16BE, UTF-16LE, and UTF-16 are already in use in
various vendor and other protocols, and it would be nice if we could
get the naming problem out of the way, ditch "UCS-2" for good, and agree
on our labels.

--Ken

>
> - Frank
>


