RE: [OT] non-terrestrial writing systems

From: Kenneth Whistler (
Date: Tue Jun 05 2007 - 14:33:45 CDT


    Ah well, another year has passed, so it must be time again to
    worry about 17 planes not being enough. :-)

    > Doug Ewell wrote on Tuesday, June 5, 2007 at 05:32 to the Unicode Mailing List:
    > >
    > > Back in the day when ISO 10646 was still 31 bits wide and the proposal was
    > > made to limit it to 17 planes, as Unicode already was, there were quite a
    > > few, apparently serious, objections that this would be a regrettable,
    The operative words.

    > > Y2K-like limitation because of the eventual discovery of non-terrestrial
    > > scripts that would need the extra coding space. I think some of us who
    > > remember this being portrayed as a genuine technical flaw in Unicode still
    > > tend to wince when the topic is brought up, even if the humorous intent is
    > > clear to everyone else.
    > If there was a flaw, it was not originally from Unicode, but from the
    > designers of the UTF-16 encoding which was first released as a RFC and then
    > adopted by ISO, before being made part of the Unicode standard.

    I think Philippe has his history mixed up here.

    The RFC for UTF-16 was the last in this sequence. That is RFC 2781,
    dated February, 2000, by Hoffman and Yergeau. It bases its
    definition (as it should) on the then Annex Q of 10646, then
    cited as ISO/IEC 10646-1:1993 plus amendments.

    RFC 2781 in turn refers to RFC 2271 (BCP 18), dated January, 1998,
    by Alvestrand. That RFC refers to UTF-16, although not defining
    it, and also cites ISO/IEC 10646-1:1993 plus amendments.

    Amendment 1 (UTF-16) to ISO/IEC 10646-1:1993 was actually published
    in early 1996.

    UTF-16 and the use of surrogates was also published in 1996 in
    Unicode 2.0 (deliberately, as part of the ongoing synchronization
    with 10646).

    Amendment 1 was actually drafted by Mark Davis, who was then WG2
    project editor. And the first draft was WG2 N970, dated 7 February 1994.

    UTF-16 was *first* presented to WG2 in WG2 N883, Proposal for Extended
    UCS-2, by Joe Becker, dated 21 January 1993 (= X3L2/93-016). The
    extension scheme was known as "UCS-2E" through most of 1993, until
    it was rechristened "UTF-16" around January, 1994, with a revised
    range of code points for the surrogates.
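    The mechanism that won out is easy to sketch. As a minimal illustration
    (using the final UTF-16 surrogate ranges, not the earlier UCS-2E ones),
    a supplementary code point maps to a pair of 16-bit code units like this:

```python
def encode_surrogate_pair(cp: int) -> tuple[int, int]:
    """Map a supplementary code point (U+10000..U+10FFFF) to a UTF-16
    surrogate pair, per the scheme that became Amendment 1 / Unicode 2.0."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000             # 20 bits of payload
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate: U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate: U+DC00..U+DFFF
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF), a typical supplementary character:
print([hex(u) for u in encode_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```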

    What people need to understand, also, is that as of 21 January 1993
    the exact architecture of the merged Unicode and 10646 standards
    was still in play. The concept of an East Asian character set
    "swapping" area in the then "O-Zone" was still being advocated
    as a way to extend the BMP. That was the ghost of ISO 2022, still
    not yet vanquished as an approach to universal character encoding
    at the time.

    The Unicode script committee had submitted a paper, drafted in
    late 1992, demonstrating that the already identified need for
    encoding as yet unencoded scripts would exceed the BMP available
    space by at least 10,000 code points, if the O-Zone were kept
    for an East Asian swapping scheme. It was quite apparent to
    everyone at the time that the BMP simply wasn't going to be
    enough -- particularly with the building pressure to encode more
    Hangul syllables on the BMP (the eventual Amendment 5 to 10646).
    So it was also apparent there had to be an extension mechanism.
    The question was merely *which* extension mechanism would
    be most acceptable and least disruptive going forward.

    And as should be clear from the above, the historical direction
    for UTF-16 was: UTC --> 10646 --> IETF, and not the reverse, as
    implied by Philippe's summary.

    > Nothing was ever prepared to allow a possible extension of UTF-16 to more
    > than 17 planes (and nothing has been done since then to allow more
    > surrogates in the BMP to make this possible using 3 surrogates).

    That at least is correct -- largely because in the 13 years since
    UTF-16 was proposed to WG2 for 10646, there has been no need to.
    Nor *will* there be within the lifetimes of anyone reading this
    email list.

    > If one ever wants to have 31-bit codepoints, the only way is to allocate a
    False, of course, because UTF-32 (and UCS-4, its twin) are 32-bit
    encoding forms, using 32-bit code units to represent code points.

    But I presume what Philippe means to say is that the only way to
    represent characters encoded past U+10FFFF using 16-bit Unicode
    code units is to allocate a...
    > new set of surrogates either within the PUA block (but this may conflict
    > with many current uses of the PUA block in the BMP), or within the special
    > plane 14 (but this will require using 2 supplementary codepoints, each one
    > using 2 normal surrogates (i.e. a total of 4 surrogates, i.e. coding 31 bit
    > codepoints using...64 bits), and this will break parsers that expect
    > codepoints to be terminated after the first 2 surrogates.

    Any allocation beyond U+10FFFF will break many things beyond such
    parsers at this point.
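    The arithmetic behind that ceiling is easy to check: with 1024 high
    surrogates and 1024 low surrogates, one pair can address exactly 2^20
    supplementary code points, so UTF-16 tops out at U+10FFFF -- 16
    supplementary planes plus the BMP. A minimal sketch:

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Recover a supplementary code point from one UTF-16 surrogate pair."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The largest value one pair can produce -- hence the hard limit:
print(hex(decode_surrogate_pair(0xDBFF, 0xDFFF)))           # 0x10ffff
# Total addressable planes (planes 0..16):
print((decode_surrogate_pair(0xDBFF, 0xDFFF) + 1) // 0x10000)  # 17
```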

    But there is little to worry about, because:

    A. Such an extension is not needed.
    B. Such an extension is not going to happen.

    > (The most efficient way to reach the 30 bit limit would be to have another
    > 10-bits wide block in the BMP, but chances are now very low that this will
    > be ever possible, the convenient 1024-codepoints space that remained between
    > the Hangul syllables and existing surrogates being reserved now for Hangul
    > extensions).

    Once again little deterred by the facts...

    There has never been a "convenient 1024-codepoints space ... between
    the Hangul syllables and existing surrogates." Hangul syllables
    stop at U+D7A3. The first surrogate code point starts at U+D800.
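    The sizes involved are trivial to verify from those two boundaries:

```python
# Hangul syllables end at U+D7A3; high surrogates begin at U+D800.
gap = 0xD800 - (0xD7A3 + 1)
print(gap)  # 92 unassigned code points -- nowhere near 1024
```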

    I'm guessing that Philippe is bemoaning the loss of the contiguous
    block of U+A800..U+ABFF *before* the start of Hangul syllables.
    But in my opinion, the current allocation of that area to the
    encoding of Syloti Nagri, Phags-pa, Saurashtra, Kayah Li, Rejang,
    Cham, Tai Viet, and Old Hangul jamo extensions -- all demonstrably
    existent and in demonstrable need of encoding -- is a far wiser
    use of BMP allocation than reserving code points for speculative
    extension schemes for characters that don't exist.

    > But do we need such extension?

    In a word, no.


    > Note that there are now variation selectors
    > to qualify the existing characters, without having to encode many
    > compatibility characters in the future.

    This archive was generated by hypermail 2.1.5 : Tue Jun 05 2007 - 14:37:37 CDT