Re: Unicode Stability

From: Asmus Freytag (
Date: Wed Mar 02 2005 - 14:34:42 CST

    It's time to bring a bit more systematic treatment to the
    discussion of stability. Here's a rundown:

    The problem:

    A character AB is used ambiguously to represent both A and B.

    Possible scenarios:

    - Do nothing

       Everybody uses AB as before. Users and software must rely
       on context to distinguish A and B. If context does not
       allow for a reliable distinction, but the shapes of A and B
       are different, software will fail to meet users' expectations.

    - Add both A *and* B as new characters

       Existing data uses AB. Users can use AB if they want to
       represent absence of information (i.e. if the user can't
       decide whether A or B is intended in the text when typing
       it in). New data can use A and B. However, there will be
       a transition period where neither A nor B is supported
       by fonts and software. (This is true for all new character
       additions). New software will need to add explicit support
       to map A and B to AB for searches. Old software will not
       support matching A and B with AB in searches.

    - Add B as a new character

       Existing data uses AB. New software will assume it's A.
       Where the shape of an 'A' matches that of an AB, old data
       will display as before.
       If the user can't decide whether A or B is intended in
       the text when typing it in, AB should be used by default.
       In principle, the use of AB no longer indicates an
       ambiguous situation. However, if the contexts would not
       allow a 'B', software could consider this as an indication
       that *old* data is present, and treat AB as a B.
       New data can use B to unambiguously mark a usage as B.
       However, there will be a transition period where B is not
       yet supported by fonts and software. (This is true for
       all new character additions). New software will need to
       add explicit support to map B to AB for searches. Old
       software will not support matching B with AB in searches.

    - Add A as a new character

       The same. The only difference between these two is which
       interpretation of AB is considered the default. If one
       of the two shapes A or B is vastly more prevalent, or
       if the use of either shape is a permissible fall-back,
       that shape would make the better default.

    - Add a variation selector for B

       Same as adding B as a character, except that all software
       could elect to ignore the variation selector, treating
       all instances of AB as ambiguous. Some software ignores
       all variation selectors, for example in searching and
       sorting. Such software will not be able to make the
       distinction between AB and B. In other words, the 'fix'
       is limited to display and rendering. Software or fonts
       that rely on the presence of the variation selector may
       not display AB the same as old software did for old data.
       (See discussion above).
       There will be a transition period until all software can
       either handle or ignore the variation selector. Until
       then, text with the variation selector may display
       incorrectly, appear broken, or may result in processing
       errors.

    - Add a variation selector for A

       The same.

    - Add a variation selector for AB

       Generally the same as adding a variation selector for
       either A or B, but explicitly supports the use of
       a standalone AB as ambiguous.
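
    The search behavior described for the "add both A and B"
    scenario can be sketched as follows. This is only an
    illustration: the three code points are Private Use Area
    stand-ins for the abstract characters AB, A and B, and the
    helper names fold_for_search and loose_contains are made up
    for this example, not taken from any library.

```python
# Hypothetical stand-ins from the Private Use Area; these are NOT
# real assignments, they only model the abstract AB / A / B.
AB = "\uE000"  # existing character, used ambiguously
A = "\uE001"   # hypothetical new character for reading A
B = "\uE002"   # hypothetical new character for reading B

# Search folding used by "new" software: A and B both map to AB.
FOLD = str.maketrans({A: AB, B: AB})


def fold_for_search(text: str) -> str:
    """Collapse the disunified characters back to AB."""
    return text.translate(FOLD)


def loose_contains(haystack: str, needle: str) -> bool:
    """Substring search that treats A, B and AB as equivalent."""
    return fold_for_search(needle) in fold_for_search(haystack)


old_data = "x" + AB + "y"  # written before the disunification
new_data = "x" + B + "y"   # written with the new character B

# "Old" software compares code points directly and finds nothing.
old_software_match = B in old_data
# "New" software folds both sides before comparing.
new_software_match = loose_contains(old_data, B)
```

    With the same stand-ins, the "add B only" scenario would
    need just the single mapping from B to AB.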

    Stability evaluation:

    This is the complete matrix. In all cases there are
    circumstances where the software violates users' expectations.
    This is true even for the 'do nothing' case, since by
    definition, the use of AB for both A and B is considered
    a problem. Otherwise we would not look for a solution.

    However, there are important differences between the solutions
    that need to be considered. They affect the stability of
    software and the stability of data in different ways.

    One of the working assumptions of Unicode is that data are
    forever. Once data exists in a particular form, it is expected
    that software will continue to handle it. On the other hand,
    software is expected to undergo regular updating. That is
    already needed to handle additions to the Unicode standard,
    as well as other technological changes that are not affected
    by the particular issue (disunification of AB).

    In all scenarios, old software will continue to handle old
    data as before. In no scenario will old software handle new
    data without problems. In the future, if variation selectors
    were correctly implemented by *all* software, the default
    processing of variation selectors might allow 'old' software
    to handle new data as if it was old data in some of the
    scenarios. That, however, is not the state of the art.

    New software will handle old data as before, except in those
    scenarios where only one variation sequence or one new character
    is added. The assumption is that such a scenario would be
    selected only if the shape used for AB would remain unchanged.
    Under that assumption, new software would continue to display
    old data as before.

    The fact that new software and new fonts will be needed to
    support any of these scenarios, other than 'do nothing', is
    a temporary issue. It is no different than for any other
    addition of characters.

    Role of variation selectors:

    Semantically, variation selectors are intended to act as if
    they didn't exist. In other words, processes that act on
    the content of the data are assumed to ignore variation
    selectors, whereas processes acting on the appearance of
    data are supposed to take variation selectors into account.

    Because of this, where a distinction in content is desired,
    the encoding of new characters should be considered instead
    of the addition of variation selectors. Where the issue is
    one of mere appearance, variation sequences can be an
    appropriate solution.

    In other words, variation selectors should not be used to
    encode optional semantic differences, but only optional
    glyphic differences.
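
    As a rough sketch of that division of labor: a process that
    acts on content (searching, matching) drops variation
    selectors before comparing, while a display process passes
    the text through untouched. The function names below are
    made up for this example. The check covers only the two
    blocks U+FE00..U+FE0F and U+E0100..U+E01EF; the Mongolian
    free variation selectors are deliberately left out.

```python
def is_variation_selector(ch: str) -> bool:
    """True for VS1..VS16 and VS17..VS256 (Mongolian FVS not covered)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def content_key(text: str) -> str:
    """Form used by content processes: variation selectors ignored."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


def same_content(left: str, right: str) -> bool:
    """Compare two strings the way a search process would."""
    return content_key(left) == content_key(right)


plain = "x"            # base character alone
marked = "x\uFE00"     # same base character followed by VS1

# Content comparison treats the two as equal; the raw strings,
# which a rendering process would receive, still differ.
content_equal = same_content(plain, marked)
raw_equal = plain == marked
```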


    This archive was generated by hypermail 2.1.5 : Wed Mar 02 2005 - 14:36:27 CST