Re: String name and Character Name

From: Richard T. Gillam
Date: Thu Apr 21 2005 - 08:22:30 CST

    [Meant this to go to the list. Sorry, Peter...]

    -----Original Message-----
    From: Richard T. Gillam
    Sent: Wednesday, April 20, 2005 6:29 PM
    To: 'Peter Kirk'
    Subject: RE: String name and Character Name

    >A list of character names is useful only if it is entirely reliable, or
    >at least is moving towards being so. If this list contains only one
    >error (and there are a lot more) which is not going to be corrected,
    >then the list is worthy of nothing but to be thrown out and replaced -
    >if only by another almost identical list, which can be corrected.

    Is it just me, or is this topic getting kind of out of hand, and maybe a
    bit unnecessarily heated?

    If the character names are simply intended to be alternate internal
    identifiers for the characters-- alternatives that are a little more
    mnemonic than the hex code point values-- they seem to be serving their
    purpose perfectly well. In fact, almost anything would work. You could
    say the name for U+0041 is "SDFLKJSDLFJSLK" and it'd work fine. (Okay,
    that's not too mnemonic. Maybe "POINTY THING WITH CROSSBAR".) In fact,
    if they're official internal identifiers, having them be consistent is
    way more important than having them be mnemonic.

    But because they were originally intended to be mnemonic, they wind up
    taking on a resonance beyond just being programmatic identifiers. They
    appear to describe the thing they identify. In most cases, they do. In
    some cases ("<control>", "CJK UNIFIED IDEOGRAPH-XXXX"), they really
    don't. And in a few cases, they mislead.
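
    As a rough illustration of the three situations above, here is a
    minimal sketch that assumes Python's standard unicodedata module.
    The helper name describe() and the "<no name>" placeholder are
    assumptions made for this example, not anything defined by the
    standard.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME', or a placeholder when no name is available."""
    ch = chr(code_point)
    # unicodedata.name() raises ValueError for characters it has no name
    # for unless a default is supplied; control characters are in that group.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{code_point:04X} {name}"


# A descriptive name, a control character, and an ideograph whose name
# merely repeats its code point.
for cp in (0x0041, 0x0000, 0x4E00):
    print(describe(cp))
```

    The first name says something about the glyph; the other two carry
    no more information than the code point itself.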

    The problem seems to be in expecting these identifiers to do more than
    they were intended to do. I would argue that software that exposes them
    in a user interface is pushing them beyond their boundaries (at least if
    anyone other than the Unicoderati is to use them). Even in cases where
    the names _are_ descriptive, I'm not sure they should be used (at least
    exclusively) in user interfaces-- if I can find U+002E only by searching
    for "FULL STOP" and not by searching for "PERIOD", I lose, and if my
    native language isn't English, I lose no matter what the names say. But is
    this the fault of the Unicode standard or the fault of the application?

    Maybe what the character names do and do not represent could be better
    documented (I didn't look exhaustively, but a quick check in a few of
    the obvious places didn't turn up an explanation of the "name"
    property). And maybe it'd be worth it to add another character
    property, "Alternate Names", where corrections and alternatives to the
    formal name could be placed (maybe with some indication of when the
    formal name is misleading). Applications that operate on character
    names would then have a machine-readable list of alternatives they
    should recognize in addition to the formal names.
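
    One way an application might consume such a list, sketched under
    stated assumptions: the ALTERNATE_NAMES table and the find_char()
    helper below are hypothetical, the single entry is illustrative
    only, and nothing here is drawn from an actual Unicode data file.
    Only unicodedata.lookup() is a real library call.

```python
import unicodedata
from typing import Optional

# Hypothetical alternate-name table (illustrative entry only).
ALTERNATE_NAMES = {
    "PERIOD": "\u002E",  # formal name: FULL STOP
}


def find_char(name: str) -> Optional[str]:
    """Resolve a formal name first, then fall back to the alternate table."""
    key = name.strip().upper()
    try:
        # unicodedata.lookup() raises KeyError when the name is unknown.
        return unicodedata.lookup(key)
    except KeyError:
        return ALTERNATE_NAMES.get(key)


print(find_char("FULL STOP") == find_char("period"))
```

    Trying the formal name first keeps the formal names authoritative;
    the alternates only widen what a search will accept.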

    Failing this, it seems to me that things like Andrew West's "Unicode
    Bloopers" list or the "decode Unicode" project can help a lot here,
    although part of me feels like the names that are just flat wrong really
    ought to be called out in the standard somewhere, or at least on the
    Unicode Web site somewhere.

    By the way, one famous "blooper" missing from the "Unicode Bloopers"
    page is U+2118 SCRIPT CAPITAL P, which is neither script nor capital
    (although it is, at least, a P).

    --Rich Gillam
      Language Analysis Systems, Inc.
