Re: Unique character names

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 04 2007 - 13:31:54 CST

    > > I'm wondering if this rule applies to the string "LETTER" in the
    > > following character names:
    > >
    > > U+210C BLACK-LETTER CAPITAL H
    > > U+2111 BLACK-LETTER CAPITAL I
    > > U+211C BLACK-LETTER CAPITAL R
    > > U+2128 BLACK-LETTER CAPITAL Z
    > > U+212D BLACK-LETTER CAPITAL C
    > it most certainly does.
    > >
    > > In other words, would a hypothetical character name "BLACK CHARACTER
    > > CAPITAL H" violate this rule?
    > >
    > > (This is not meant as a joke, by the way; I'm playing around with
    > > algorithms for efficient storage of character names.)
    > Believe it or not, the Consortium uses software to make sure that these
    > rules are followed.
    >
    > A./
    >
    > PS: I know, I wrote one of the tools used in checking drafts of the
    > nameslist during my tenure as code chart editor.

    In addition to the check that Asmus wrote into the tools for
    checking drafts of the names list, I have an independently
    written tool that checks for code point and name
    duplications. It catches collisions under the loose name
    matching rules as well as out-and-out duplications, which can
    easily creep in during names list preparation, since editing
    the initial names lists for proposals typically involves a lot
    of copy/paste.

    Note that the scope for name duplication is the union
    of the *character* names in UnicodeData.txt (plus the
    algorithmically generated, and therefore unproblematic,
    unified ideograph names and Hangul syllable names) and the
    *named sequences* in NamedSequences.txt. So the checking has
    to be done on both files together, not just on UnicodeData.txt.
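    As a rough illustration of what "the union" means in practice, here
    is a minimal Python sketch that pulls names out of both sources into
    one stream. It is not the tool described in this message. The
    function names (character_names, sequence_names, all_names) are
    hypothetical, and the sketch assumes the usual layouts: UnicodeData.txt
    as semicolon-delimited records with the name in the second field, and
    NamedSequences.txt as "name;code points" lines with "#" comments.

```python
def character_names(unicode_data_lines):
    """Yield (code_point, name) for each explicitly named character."""
    for line in unicode_data_lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code, name = fields[0], fields[1]
        # Skip placeholders such as <control> and range markers such as
        # <CJK Ideograph, First>: those names are generated by rule and
        # are not listed individually in the file.
        if name.startswith("<"):
            continue
        yield code, name


def sequence_names(named_sequences_lines):
    """Yield (code_point_sequence, name) for each named sequence."""
    for line in named_sequences_lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        name, code_points = line.split(";", 1)
        yield code_points.strip(), name.strip()


def all_names(unicode_data_lines, named_sequences_lines):
    """Chain both sources so a duplicate check sees one namespace."""
    yield from character_names(unicode_data_lines)
    yield from sequence_names(named_sequences_lines)
```

    With real files, the two arguments would simply be the open file
    objects; a duplicate check then runs once over the combined stream
    rather than once per file.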

    Run against the current UnicodeData-5.1.0.txt and
    NamedSequences-5.1.0.txt, that tool correctly detects
    the one grandfathered exception to the loose name matching
    rule:

    116C HANGUL JUNGSEONG OE
    1180 HANGUL JUNGSEONG O-E

    I also run the tool against an artificially hacked-up
    UnicodeData.txt with various bogus additional characters
    added, with names like SUBSCRIPT DIGIT THREE and
    SUBSCRIPTDIGIT THREE (as opposed to the actually encoded
    SUBSCRIPT THREE), to verify that the tool actually
    detects other possible classes of duplications.

    I just added "BLACK CHARACTER CAPITAL H" to the test file,
    and it popped right out as a violation of the name
    duplication rule.
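
    For readers who want to experiment, the check discussed in this
    thread can be approximated in a few lines of Python. This is only a
    sketch under stated assumptions, not the tool described above: it
    assumes the uniqueness rule compares names after dropping spaces,
    hyphens, and the strings LETTER, CHARACTER, and DIGIT, and it
    simplifies by dropping every hyphen rather than only medial ones.
    The names loose_key and find_collisions are hypothetical.

```python
from collections import defaultdict

# Strings assumed to be ignored when comparing names for uniqueness.
IGNORED_WORDS = ("LETTER", "CHARACTER", "DIGIT")


def loose_key(name):
    """Reduce a name to the key used for the duplication check."""
    key = name.upper()
    for word in IGNORED_WORDS:
        key = key.replace(word, "")
    # Simplification: all hyphens are dropped, not only medial ones.
    return key.replace(" ", "").replace("-", "")


def find_collisions(names):
    """Map each colliding key to the sorted list of names that share it.

    Exact repeats of the same name are reported too, since they land
    in the same group.
    """
    groups = defaultdict(list)
    for name in names:
        groups[loose_key(name)].append(name)
    return {key: sorted(group) for key, group in groups.items()
            if len(group) > 1}


if __name__ == "__main__":
    sample = [
        "BLACK-LETTER CAPITAL H",
        "BLACK CHARACTER CAPITAL H",   # bogus test name
        "SUBSCRIPT THREE",
        "SUBSCRIPT DIGIT THREE",       # bogus test name
        "SUBSCRIPTDIGIT THREE",        # bogus test name
        "HANGUL JUNGSEONG OE",
        "HANGUL JUNGSEONG O-E",
        "LATIN SMALL LETTER A",
    ]
    for key, group in find_collisions(sample).items():
        print(key, group)
```

    Run on the sample list, this groups the names from this message
    into three collisions: the two BLACK ... CAPITAL H names, the three
    SUBSCRIPT ... THREE names, and the grandfathered Hangul pair. A
    real checker would keep an explicit allowance for that last pair.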

    --Ken


