Contiguous Weight Ranges and Ignorables

From: Jesse Hallam (unicode.org@fentrax.com)
Date: Thu Apr 17 2008 - 13:17:21 CDT

    [I accidentally sent this message between subscribing and confirming my
    subscription. I do not know if it arrived. I apologize if this is received
    in duplicate.]

    Good day,

    I am pursuing an implementation of the UCA, and am attempting to employ the
    table reduction technique known in the UCA as "Contiguous Weight Ranges". In
    that technique, we read the following:

    *Whenever collation elements have different primary weights, the ordering of
    their secondary weights is immaterial.*

    I clearly see how this applies to collation elements with different,
    primary, non-zero weights. How can this statement hold true for primary
    ignorables?

    For example, consider lines 27167 and 27168 of CollationTest_NON_IGNORABLE.txt:

    *1E00 0334; # () LATIN CAPITAL LETTER A WITH RING BELOW [0FD0 | 0020
    008C 0080 | 0008 0002 0002 |]*
    *0332 0061; # () COMBINING LOW LINE [0FD0 | 0021 0020 | 0002 0002 |]*

    After normalization, we are comparing the code points:

    *<41><334><325>*
    *<332><61>*

    These compare equal on the primary level (since only <41> and <61> have
    primary weights, and those weights are equal). The comparison then proceeds
    to the secondary level, where <41> is compared with <332>. Although <41> and
    <332> have different primary weights (<332> is, of course, a primary
    ignorable), the ordering of their secondary weights is critical. Were my
    implementation of the UCA to re-weight each secondary level according to the
    "Contiguous Weight Ranges" technique, I might very well obtain an incorrect
    collation result in this example.
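
    The concern can be sketched in a few lines of Python. This is only an
    illustration, not a UCA implementation: the weights are read off the two
    sort keys quoted above, their assignment to individual characters is an
    assumption based on the normalized code point order, and the renumbered
    secondary weight 0002 for <332> is purely hypothetical.

```python
# Collation elements as (primary, secondary, tertiary) tuples.
# Assumption: weights are assigned to characters by reading the quoted
# sort keys left to right against the normalized code point sequences.
A_UPPER = (0x0FD0, 0x0020, 0x0008)        # <41>
A_LOWER = (0x0FD0, 0x0020, 0x0002)        # <61>
TILDE_OVERLAY = (0x0000, 0x008C, 0x0002)  # <334>, primary ignorable
RING_BELOW = (0x0000, 0x0080, 0x0002)     # <325>, primary ignorable
LOW_LINE = (0x0000, 0x0021, 0x0002)       # <332>, primary ignorable


def sort_key(elements):
    """Build a three-level sort key: for each level, append the non-zero
    weights in order, then a 0 as the level separator."""
    key = []
    for level in range(3):
        key.extend(ce[level] for ce in elements if ce[level] != 0)
        key.append(0)
    return key


first = [A_UPPER, TILDE_OVERLAY, RING_BELOW]  # <41><334><325>
second = [LOW_LINE, A_LOWER]                  # <332><61>

# Hypothetical renumbering: the ignorable's secondary weight is moved
# below the letter's secondary weight of 0020.
LOW_LINE_RENUMBERED = (0x0000, 0x0002, 0x0002)
second_renumbered = [LOW_LINE_RENUMBERED, A_LOWER]

original_order = sort_key(first) < sort_key(second)
renumbered_order = sort_key(first) < sort_key(second_renumbered)
```

    With the quoted weights, the first string sorts before the second because
    0020 < 0021 at the secondary level. Under the hypothetical renumbering the
    first secondary comparison becomes 0020 against 0002, and the order flips.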

    I'm certain I am simply missing something in the language of the UCA. For
    one, I note that the example given in the UCA for this technique renumbers
    the secondary weights for the letter 'O', restricting the lower bound to the
    initial lower bound of 0020; I see nothing in the language that would
    prevent me from starting that lower bound lower, perhaps at 0002, yet for
    some reason, this was not done.

    Also, under "3.1.4 Default Values", we read:

    *Both in the Default Unicode Collation Element Table and in typical
    tailorings, most unaccented letters differ in the primary weights, but have
    secondary weights (such as a1) equal to MIN2. The primary ignorables
    will have secondary weights greater than MIN2.*

    Why primary ignorables will have weights greater than MIN2 is not
    specified, but perhaps this is a hint to implementors such as myself. Does
    it relate to the above issue? I'm not certain.

    Any insight or clarification into the above matter would be greatly
    appreciated!

    -- 
    Jesse Hallam
    University of Waterloo Junior
    "For scarcely for a righteous man will one die: yet peradventure for a good
    man some would even dare to die. But God commendeth his love toward us, in
    that, *while we were yet sinners*, Christ died for us. " (Romans 5:7, 8)
    

