Re: String name and Character Name

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sat Apr 23 2005 - 14:42:43 CST


    At 06:56 AM 4/23/2005, Peter Kirk wrote:
    >On 21/04/2005 16:22, Doug Ewell wrote:
    >
    >>If the move is on to encourage software vendors to develop their own
    >>proprietary lists of "accurate" character names for character-map UIs
    >>and such, instead of using the official, non-perfect Unicode character
    >>names, ...
    >
    >Has anyone actually suggested this? In my opinion, non-standardised
    >proprietary names are even worse than the official but sometimes
    >inaccurate names. What we need is a list which is both correct (or at
    >least correctible when errors are found!) and standardised. And I accept
    >that CLDR rather than Unicode proper may be the best place to go for this.

    As has been noted here before, the use that SC2 had for unique names was to
    make sure that the characters in the 8859 series (and potentially other SC2
    and ISO standards) could easily and uniquely be correlated to their 10646
    counterparts. (Some of the early arguments about character names were
    driven by the fact that 8859 and 10646 names were not identical).

    If that is your primary purpose, then only a single, standard and immutable
    list of character names will do. Multiple lists are merely a useless
    annoyance, but multiple non-standard lists are worse than useless, as there
    is the very real potential of names that cannot be correlated to their code
    point unless one has access to a private list, or worse, of multiple lists
    using the same name for two different characters.

    At the moment, most of this discussion is theoretical - there is a need for
    people to surface some names for characters in user interfaces, but it is
    not clear what the effective constraints on that process are; typos are
    annoying to users, but not harmful; some of the character names, while
    misleading, are not problematic enough to overcome a pretty clear
    identification of the character via its representative glyph; users report
    confusion even for some properly constructed character names.

    But in the spirit of hypothesizing a solution, I would consider using an
    alias mechanism in the way aliases are used for Property names the best
    solution. For properties (and their values) there exist multiple aliases,
    which are all considered unique.

    This mechanism has been used to fix typos in the names of properties and
    their values. For example, the Line_Break property value "Inseparable" had
    originally been spelled "Inseperable". Instead of changing that name, the
    correct name has become the preferred alias and the incorrect name has been
    retained as an alias. (A similar thing was done for an incorrect block
    name: "Cyrillic_Supplement" is now preferred over the incorrect
    "Cyrillic_Supplementary"). The benefits of such a solution are:

    1) users can use a 'correct' name to refer to a property and don't need to
    use an 'incorrect' name
    2) users are guaranteed that software will continue to understand the old
    name, as all aliases are considered equivalent descriptions of the property
    3) the UTC guarantees that all aliases from the same name space are unique
    4) users can rely on no alias ever being retired
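
    A minimal sketch of these four guarantees, assuming a hypothetical
    in-memory registry for a single name space. The class AliasRegistry and
    its loose-matching rule (ignore case, spaces, underscores and hyphens) are
    illustrative assumptions, not part of any Unicode data file or API:

```python
class AliasRegistry:
    """Toy model of ONE alias name space (hypothetical, for illustration)."""

    def __init__(self):
        self._by_alias = {}    # normalized alias -> entity identifier
        self._preferred = {}   # entity identifier -> preferred alias

    @staticmethod
    def _norm(name):
        # Assumed loose matching: ignore case, spaces, underscores, hyphens.
        return "".join(ch for ch in name.lower() if ch not in " _-")

    def add(self, entity, alias, preferred=False):
        key = self._norm(alias)
        owner = self._by_alias.get(key)
        if owner is not None and owner != entity:
            # Guarantee 3: an alias is unique within its name space.
            raise ValueError(f"{alias!r} already names {owner!r}")
        self._by_alias[key] = entity
        if preferred or entity not in self._preferred:
            # Guarantee 1: a corrected name can become the preferred one.
            self._preferred[entity] = alias

    # Guarantee 4: there is deliberately no method for removing an alias.

    def resolve(self, alias):
        # Guarantee 2: every alias, old or new, resolves to the same entity.
        return self._by_alias[self._norm(alias)]

    def preferred(self, entity):
        return self._preferred[entity]


line_break_values = AliasRegistry()
line_break_values.add("IN", "Inseperable")                 # original typo
line_break_values.add("IN", "Inseparable", preferred=True)  # correction
```

    With this toy registry, both spellings resolve to the same entity, the
    corrected spelling is the one reported as preferred, and an attempt to
    reuse either spelling for a different entity is rejected.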

    The current use of aliases for Unicode *character* names does not follow
    any of these rules. They are merely alternate names that are known to be
    used by some user community. However, if people other than Peter Kirk
    consider the current situation in need of a formal solution, then this more
    formal form of aliasing would be a way forward. It would have the benefit
    of making all the naming and name-stability rules for entities related to
    the Unicode Standard more uniform. At the same time, as long as one of the
    aliases is formally identified as the alias corresponding to the 10646
    character name, there is no direct synchronization issue. Unicode has
    always provided additional information for characters.

    How could this be done? One very limited way would be to add to the list of
    Unicode 1.0 character names. That would allow the use of a single alternate
    formal alias per character, which should be quite suitable for corrections
    to the names with obvious errors. These would be printed with a special
    convention (for example all uppercase). The existing use of informal
    character name aliases (in lower or mixed case) would continue as before.

    A more extensive approach would be to introduce a full-fledged
    CharacterNameAliases.txt file, which would not put an arbitrary constraint
    on the number of aliases. Even in this case, the aliases in the file should
    be restricted to formal aliases only, which would tend to keep their number
    between 1 and 2 for almost all characters (since the original name is
    itself considered an alias, the numbers are 1 and 2 rather than 0 and 1).

    This is pretty far from an actual proposal, but I wanted to point out that
    we have solved a related problem in the space of property names in the
    meantime, so perhaps now would be the time to consider whether our issues
    with character names are severe enough to warrant working out such a solution.

    A./

    PS: the property value aliases are found in
    http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
    and the property aliases are found in
    http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt

    Note that each property has a separate name space for its values,
    so that both Script and Block can have a value of "Cyrillic".
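
    A rough sketch of what separate name spaces mean in practice, assuming
    semicolon-separated lines of the form "property ; short alias ; long alias
    [; further aliases]". The SAMPLE lines and the function
    parse_value_aliases are illustrative assumptions, not a copy of the real
    file or an official parser:

```python
from collections import defaultdict

# Illustrative sample only; the real data file is much larger.
SAMPLE = """\
# property ; short alias ; long alias [; further aliases]
sc ; Cyrl ; Cyrillic
blk; Cyrillic ; Cyrillic
blk; Cyrillic_Sup ; Cyrillic_Supplement ; Cyrillic_Supplementary
lb ; IN ; Inseparable ; Inseperable
"""


def parse_value_aliases(text):
    """Build one alias table per property: {property: {alias: long name}}."""
    tables = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        prop, aliases = fields[0], fields[1:]
        # Assumption: the second alias, when present, is the long name.
        long_name = aliases[1] if len(aliases) > 1 else aliases[0]
        for alias in aliases:
            known = tables[prop].setdefault(alias.lower(), long_name)
            if known != long_name:
                # Uniqueness is enforced per property, not globally.
                raise ValueError(f"{alias!r} is ambiguous within {prop!r}")
    return tables


tables = parse_value_aliases(SAMPLE)
```

    Because each property keeps its own table, "Cyrillic" can be looked up
    under both "sc" and "blk" without conflict, while a clash inside a single
    property's table would raise an error.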


