Re: VOWEL, CONSONANT, ...: allow recognition of shorter names?

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Apr 11 2008 - 13:35:56 CDT

    Henrik said:

    > I noticed that for some scripts, e.g. Khmer, character names are still
    > a mouthful. I also noticed that when I additionally ignored
    > CONSONANT, VOWEL, and INDEPENDENT, the Unicode names are still unique
    > and it would improve writing (at least) Khmer character names a lot.
    >
    > I was wondering whether it would be feasible to tighten the condition
    > in TR#34 so that no upcoming Unicode versions had ambiguous names if
    > CONSONANT, VOWEL, and INDEPENDENT were ignored, too.

    As Mark indicated, this is always something you could formally
    propose to the UTC as something for them to consider.

    Personally, however, I would not be in favor of this kind of
    change.

    First, it further complicates the checking that has to be done
    when new characters, formal name aliases, and named sequences are
    proposed. Granted, this can all be done mechanically, but it
    already requires a specialized algorithm that is not generally
    available to (or often understood by) character proposers
    or those reviewing the proposals.
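
A minimal sketch of the kind of mechanical check described above, assuming the only extra rule is "drop CONSONANT, VOWEL, and INDEPENDENT before comparing". The function names are illustrative, and the proposed name in the usage note is made up for the example. The real matching rules also fold case, spaces, and medial hyphens, and cover formal aliases and named sequences, none of which is modeled here.

```python
# Words Henrik suggested ignoring; this set is an assumption of the sketch.
IGNORABLE = frozenset({"CONSONANT", "VOWEL", "INDEPENDENT"})


def shorten(name, ignorable=IGNORABLE):
    """Drop every ignorable word from a space-separated character name."""
    return " ".join(w for w in name.split() if w not in ignorable)


def conflicts(proposed, existing, ignorable=IGNORABLE):
    """Return the existing names whose shortened form equals that of `proposed`."""
    key = shorten(proposed, ignorable)
    return [name for name in existing if shorten(name, ignorable) == key]


# A few real Khmer names, hard-coded so the sketch does not depend on
# the Unicode version of any particular library.
EXISTING = [
    "KHMER LETTER KA",             # U+1780
    "KHMER INDEPENDENT VOWEL QI",  # U+17A5
    "KHMER VOWEL SIGN AA",         # U+17B6
]
```

With this sample, a hypothetical proposal such as "KHMER INDEPENDENT LETTER KA" would be flagged because it shortens to the same string as KHMER LETTER KA, while "KHMER LETTER KHA" would pass. Every additional ignorable word widens the set of names a proposer has to check against.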

    Second, any such restriction would have to be written into
    ISO/IEC 10646, as well as the Unicode Standard. I can tell
    you from experience that it was a considerable problem getting
    even the limited constraints now documented to consensus for
    documentation in 10646, and getting that through ballots and
    publication. National Bodies are (justifiably, I think) concerned
    about algorithmic constraints on their ability to
    name things, particularly when the constraints get complicated
    to the point that they can't remember all the details or
    envision being able to check manually for uniqueness.

    The requirement that the unique namespace include formal aliases
    and named sequences, as well as character names per se, has
    already pushed this constraint off the edge, in terms of the
    degree of complication that the average standardizer will
    tolerate.

    > Of course, there may be more ignorable words, so the question is where
    > to stop. 'VOWEL' is in 360 words, which is more than 'CHARACTER',
    > which is in only 106. But CONSONANT and INDEPENDENT are relatively
    > seldom. Here are a few other words that occur very frequently that
    > can currently be ignored without ambiguity:
    >
    > VOWEL in 360 names
    > CONSONANT in 66 names
    > INDEPENDENT in 19 names (seldom, but also a mouthful)
    > SYLLABICS in 630 names
    > LIGATURE in 508 names
    > FORM in 798 names
    > PATTERN in 297 names

    This illustrates the problem: where *do* you stop? I have run
    into similar data from another point of view -- in examining
    the Unicode names list for redundancies that allow creation
    of specialized algorithms to pack it down into much smaller
    storage without making use of generic compression algorithms
    like LZW.
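
One common way to exploit this kind of redundancy is a word table: store each distinct word once and represent every name as a list of small integers. The sketch below illustrates that general idea only; it is not the specialized packing scheme referred to above, whose details are not given, and the function names are illustrative.

```python
from collections import Counter


def build_word_table(names):
    """Assign an index to each distinct word, most frequent first."""
    counts = Counter(w for name in names for w in name.split())
    # Sort by descending frequency, then alphabetically so the table is deterministic.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    index = {w: i for i, w in enumerate(words)}
    return words, index


def encode(name, index):
    """Replace each word of a name with its index in the word table."""
    return [index[w] for w in name.split()]


def decode(codes, words):
    """Rebuild the original name from a list of word indices."""
    return " ".join(words[c] for c in codes)


# A few real Khmer names, hard-coded for the example.
NAMES = [
    "KHMER INDEPENDENT VOWEL QI",   # U+17A5
    "KHMER INDEPENDENT VOWEL QII",  # U+17A6
    "KHMER VOWEL SIGN AA",          # U+17B6
    "KHMER VOWEL SIGN I",           # U+17B7
]

WORDS, INDEX = build_word_table(NAMES)
```

Frequent words such as KHMER and VOWEL land at the low indices, which is where the savings come from when the indices are stored in a variable-length encoding.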

    >
    > For stability reasons, it would be very nice if we knew that upcoming
    > Unicode versions had the same nice unambiguity, because then I could
    > officially ignore those words so my users could enjoy more concise
    > character names.

    It is unlikely that the UTC or WG2 will depart significantly from
    the patterns they already have in naming characters. And that
    means that you'd likely be pretty safe in assuming you could
    ignore (and/or delete) such redundant terms when doing name
    recognition.

    But as an example of the pitfalls here, "VOWEL" and "LETTER"
    could both be deleted without loss of uniqueness, but
    "VOWEL SIGN" cannot be. Vide: DEVANAGARI LETTER I versus
    DEVANAGARI VOWEL SIGN I. But if you just omit "LETTER" and
    "VOWEL" in this case, you end up with shortened names that
    aren't actually very apropos for Devanagari: DEVANAGARI I
    versus DEVANAGARI SIGN I. More appropriate shortenings would
    be to DEVANAGARI INDEPENDENT I versus DEVANAGARI I or
    perhaps DEVANAGARI LETTER I versus DEVANAGARI MATRA I or
    something else.
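
The Devanagari pair can be checked directly. This is a small sketch, assuming names are compared word by word after the ignorable words are dropped; the helper names are illustrative.

```python
def shorten(name, ignorable):
    """Drop every ignorable word from a space-separated character name."""
    return " ".join(w for w in name.split() if w not in ignorable)


LETTER_I = "DEVANAGARI LETTER I"      # U+0907
MATRA_I = "DEVANAGARI VOWEL SIGN I"   # U+093F


def still_distinct(ignorable):
    """True if the two names remain different after shortening."""
    return shorten(LETTER_I, ignorable) != shorten(MATRA_I, ignorable)
```

Ignoring {"VOWEL", "LETTER"} keeps the two apart (DEVANAGARI I versus DEVANAGARI SIGN I), but adding "SIGN" to the set collapses both to DEVANAGARI I. Uniqueness alone also says nothing about whether the surviving short names are the ones a user would expect.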

    In general, I don't think that simple algorithmic transforms
    on the Unicode names list do a very good job of creating
    the most usable names for end users.

    --Ken


