Re: Unicode Stability

From: Asmus Freytag
Date: Wed Mar 02 2005 - 17:40:03 CST

    At 03:04 PM 3/2/2005, Peter Kirk wrote:
    >Doug Ewell's definition of stability that "it does not change in a way
    >that causes existing implementations or data to break".

    As stated, that definition is clearly nonsense.

    By assumption, existing implementations correctly handle only existing
    data. New data will always be able to contain characters at hitherto
    unassigned positions.
    It is always possible for (badly written) existing implementations to
    'break' when exposed to new data. However, there is some predictability to
    allow more forward compatibility: some ranges for default-ignorable
    characters include unallocated code positions so that it is possible for
    old implementations to have ignored a range and therefore be able to ignore
    'future' ignorables.
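
    As a rough sketch of that forward-compatibility idea: an implementation
    that tests membership in whole ranges, rather than in a list of the
    characters assigned at the time it was written, will also ignore
    characters assigned inside those ranges later. The three ranges below are
    only an illustrative subset modeled on the Default_Ignorable_Code_Point
    property of recent Unicode versions; the function names are hypothetical.

```python
# Illustrative subset only (an assumption for this sketch); the authoritative
# list is the Default_Ignorable_Code_Point property in the Unicode Character
# Database. Each range also covers positions that are currently unassigned.
DEFAULT_IGNORABLE_RANGES = [
    (0x2060, 0x206F),    # format characters, with reserved positions
    (0xFFF0, 0xFFF8),    # reserved positions in the Specials block
    (0xE0000, 0xE0FFF),  # tag characters plus reserved positions
]


def is_default_ignorable(cp: int) -> bool:
    """Return True if the code point falls inside any ignorable range."""
    return any(lo <= cp <= hi for lo, hi in DEFAULT_IGNORABLE_RANGES)


def strip_ignorables(text: str) -> str:
    """Drop every character whose code point lies in an ignorable range."""
    return "".join(ch for ch in text if not is_default_ignorable(ord(ch)))
```

    Because the test is by range, a code point such as U+2065 is dropped
    whether or not the implementation knows what, if anything, has been
    assigned there.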

    More importantly, new implementations must (be able to) act on existing
    data the same way old implementations did. That precludes moving a
    character, otherwise new implementations would apply the new definition and
    mis-interpret old data.

    It also precludes (in principle) a change in definition of a character such
    that some interpretations of that character are no longer supported. So the
    cleanest way for a disunification would always be to add a *pair* of
    characters. (Ken's HYPHEN-MINUS, HYPHEN and MINUS example, or my AB + A + B
    example.)

    However, there are cases where that's a foolish consistency. The cleanest
    case is the use of standard Greek letters for Coptic. In principle, we
    would have needed three alphabets: the existing characters (for potentially
    ambiguous mixed Greek/Coptic use as defined in Unicode 1.0 through 4.0),
    the new Coptic characters (as drafted for Unicode 4.1) and a new set of
    unambiguous new Greek characters so that there is absolutely no possibility
    that these 'might' be Coptic.

    In practice that would not have worked. All the mappings are to the
    existing Greek characters. Users of Greek would have simply continued to
    use the 'ambiguous' ones, which everyone treats as 'Greek' by default
    anyway. All the new characters would have done is to create potential
    alternate spellings.

    It is therefore better to just add the Coptic, which allows users of that
    script the desired unambiguous representation of their script. Greek users
    do not need to change, and to the extent that there is existing Coptic data
    using the mixed model it would continue to be supported - under the same
    restrictions as before, i.e. with use of a Coptic-specific font. (In other
    words, this is the AB + B or AB + A example in terms of my earlier message.)

    This works because Greek use of the existing Greek letters is the
    overwhelming majority of all use, and had been the de-facto default
    interpretation. The occasional use of Greek code points with a Coptic font
    has never been a practical problem for Greek users, so that use can
    continue if needed for the support of existing data.

    In contrast, in the case of the HYPHEN-MINUS, both interpretations are
    equally likely (or nearly so); adding just one other character would
    therefore have incorrectly forced a single default interpretation on the
    character. Clearly, this is a case where being able
    to explicitly distinguish the ambiguous case is useful and desired, so this
    was correctly implemented as an AB + A + B case.
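
    The two disunification patterns can be modeled as sets of permitted
    readings per character. This is only a toy sketch: the dictionaries, the
    reading labels and both function names are hypothetical, and nothing here
    comes from the Unicode data files. It checks the stability condition
    argued for above (no existing character loses a reading it already had)
    and lists which characters are unambiguous for a given reading.

```python
def keeps_old_readings(before: dict, after: dict) -> bool:
    """True if every existing character keeps all readings it had before."""
    return all(
        readings <= after.get(name, set())
        for name, readings in before.items()
    )


def unambiguous_for(repertoire: dict, reading: str) -> list:
    """Names of characters whose only permitted reading is `reading`."""
    return sorted(n for n, r in repertoire.items() if r == {reading})


# AB + A + B: both readings are about equally likely, so both get a new,
# unambiguous character and the old one stays ambiguous.
hyphen_before = {"HYPHEN-MINUS": {"hyphen", "minus"}}
hyphen_after = {
    "HYPHEN-MINUS": {"hyphen", "minus"},
    "HYPHEN": {"hyphen"},
    "MINUS": {"minus"},
}

# AB + B: only the minority reading (Coptic) gets new characters; the
# existing Greek letters keep both readings, with Greek as the default.
greek_before = {"Greek letters": {"greek", "coptic"}}
greek_after = {
    "Greek letters": {"greek", "coptic"},
    "Coptic letters": {"coptic"},
}

# Narrowing the old characters to Greek only would orphan existing
# mixed-model Coptic data, which is what the stability argument rules out.
greek_narrowed = {"Greek letters": {"greek"}, "Coptic letters": {"coptic"}}
```

    In this model both hyphen_after and greek_after pass keeps_old_readings,
    while greek_narrowed fails it. greek_after has no character that is
    unambiguous for "greek", which is exactly the trade-off accepted above.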

