Re: Contiguous Weight Ranges and Ignorables

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Apr 17 2008 - 19:38:17 CDT

    Jesse Hallam asked:

    > I am pursuing an implementation of the UCA, and am attempting to employ the
    > table reduction technique known in the UCA as "Contiguous Weight Ranges". In
    > that technique, we read the following:
    >
    > *Whenever collation elements have different primary weights, the ordering of
    > their secondary weights is immaterial.*
    >
    > I clearly see how this applies to collation elements with different,
    > primary, non-zero weights. How can this statement hold true for primary
    > ignorables?

    If you are simply comparing two collation elements, and one has
    a non-zero primary weight while the other has a zero primary weight, then
    the comparison again depends only on the primary weight, and the secondary
    weights are irrelevant.

    Anticipating the example below, if you are simply comparing:

    U+0041 LATIN CAPITAL LETTER A --> [.1141.0020.0008.0041]

    and

    U+0332 COMBINING LOW LINE --> [.0000.0021.0002.0332]

    then the comparison is decided entirely by the primary weight, so
    U+0332 < U+0041, and you don't care that the secondary weight in the
    collation element for U+0332 is larger than the secondary weight in
    the collation element for U+0041.
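
    A minimal Python sketch of that single-element comparison, assuming
    each collation element is modeled as a (primary, secondary, tertiary)
    tuple. The names are illustrative only and do not come from the UCA:

```python
# Weights copied from the two collation elements quoted above.
LETTER_A = (0x1141, 0x0020, 0x0008)   # [.1141.0020.0008.0041]
LOW_LINE = (0x0000, 0x0021, 0x0002)   # [.0000.0021.0002.0332]


def compare_elements(x, y):
    """Return -1, 0 or 1. Tuples compare level by level, so the first
    differing weight (here the primary) decides the result."""
    return (x > y) - (x < y)


result = compare_elements(LOW_LINE, LETTER_A)   # decided at the primary level
secondary_is_larger = LOW_LINE[1] > LETTER_A[1]
```

    The low line sorts first even though its secondary weight is the
    larger of the two, because the comparison never reaches level 2.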

    > For example, consider line 27167/27168 of CollationTest_NON_IGNORABLE.txt:
    >
    > *1E00 0334; # () LATIN CAPITAL LETTER A WITH RING BELOW [0FD0 | 0020
    > 008C 0080 | 0008 0002 0002 |]*
    > *0332 0061; # () COMBINING LOW LINE [0FD0 | 0021 0020 | 0002 0002 |]*
    >
    > After normalization, we are comparing the code points:
    >
    > *<41><334><325>
    > <332><61>
    > *

    Then you are comparing *strings*, not single collation elements.
    When you weight out the full strings (after normalization), you get
    (I'm using the UCA 5.1 tables here, so the primary weights differ from
    those in the CollationTest file you cite above, but the end results are the
    same):

    [.1141.0020.0008.0041][.0000.008C.0002.0334][.0000.0080.0002.0325]

    versus:

    [.0000.0021.0002.0332][.1141.0020.0002.0061]

    When you then use those arrays of collation elements to construct
    the keys to be compared, you get, for the first 3 levels:

    [1141 | 0020 008C 0080 | 0008 0002 0002 ]
    [1141 | 0021 0020 | 0002 0002 ]
            ^^^^
            
    And the *string* comparison is decided by the first comparison of
    secondary weights. The primary weights don't make the difference
    here, because there is only one primary weight in each string, and
    it is the same in both. So you move on, and the secondary
    weights then make the difference.
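
    The construction above can be sketched in Python as follows. This is
    only an illustration under stated assumptions: collation elements are
    (primary, secondary, tertiary) tuples, zero weights are dropped at
    each level, and levels are joined with a 0000 separator where the
    display above uses "|". The function name sort_key is hypothetical.

```python
LEVEL_SEPARATOR = 0x0000  # stands in for the "|" in the display above


def sort_key(elements, levels=3):
    """Collect the non-zero weights of each level in turn."""
    key = []
    for level in range(levels):
        if level > 0:
            key.append(LEVEL_SEPARATOR)
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return key


# Collation elements for the two normalized strings (UCA 5.1 weights as quoted).
first = [(0x1141, 0x0020, 0x0008), (0x0000, 0x008C, 0x0002), (0x0000, 0x0080, 0x0002)]
second = [(0x0000, 0x0021, 0x0002), (0x1141, 0x0020, 0x0002)]

key_first = sort_key(first)
key_second = sort_key(second)
first_sorts_before_second = key_first < key_second  # lists compare weight by weight
```

    Because the ignorable's zero primary is skipped, its secondary weight
    0021 ends up lined up against the 0020 of the letter in the other
    string, which is exactly where the comparison is decided.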

    The particular example illustrated with the two lines in CollationTest
    may be a bit obscure, because the second string contains a defective
    combining character sequence, and is testing an unusual edge case.

    But in principle, this is no different than the outcome you
    would get in comparing, for example, <a, acute> versus <a, grave>,
    where it would be the secondary weight of the acute or the grave
    that would make the difference for the comparison.

    >
    > These compare equal on a primary level (since only <41> and <61> have
    > primary weights, both of which are equal).

    Correct.

    > The comparison then proceeds to
    > compare <41> and <332>.

    No, that's where you are heading astray. Once you convert the character strings
    to arrays of weights, you are doing the following:

    Compare 1141 to 1141 --> equal, continue
    Compare 0020 to 0021 --> unequal, exit with result.
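
    Those two steps can be traced with a short Python sketch. It assumes
    the weights have already been split into one array per level, as in
    the bracketed display earlier in this message; trace_comparison is a
    hypothetical helper, not part of any UCA implementation.

```python
from itertools import zip_longest

# One list of weights per level (primary, secondary, tertiary).
first = [[0x1141], [0x0020, 0x008C, 0x0080], [0x0008, 0x0002, 0x0002]]
second = [[0x1141], [0x0021, 0x0020], [0x0002, 0x0002]]


def trace_comparison(a_levels, b_levels):
    """Walk the weights level by level and stop at the first difference.

    A missing weight is treated as 0, so a shorter level sorts first.
    """
    steps = []
    for level, (a, b) in enumerate(zip(a_levels, b_levels), start=1):
        for x, y in zip_longest(a, b, fillvalue=0):
            if x != y:
                steps.append((level, x, y, "unequal"))
                return steps, (-1 if x < y else 1)
            steps.append((level, x, y, "equal"))
    return steps, 0


steps, result = trace_comparison(first, second)
```

    Note that the walk never asks which character a weight came from; it
    only sees positions within each level's array.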

    > Noting that <41> and <332> have different primary
    > weights (<332> is, of course, a primary ignorable), we nevertheless see that
    > the ordering of their secondary weights is critical. Were my implementation
    > of the UCA to re-weight each secondary level according to the "Contiguous
    > Weight Ranges" technique, I may very well obtain an incorrect collation
    > result in this example.

    Not if properly done.
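
    One way to see what "properly done" has to preserve is the Python
    sketch below. Both renumbering tables are hypothetical and made up
    for this example only. The first compacts the secondaries of the
    primary ignorables while keeping them above 0020; the second starts
    them at 0002, below the secondary weight of the base letter.

```python
# (primary, secondary) pairs for the two strings in the example.
first = [(0x1141, 0x0020), (0x0000, 0x008C), (0x0000, 0x0080)]
second = [(0x0000, 0x0021), (0x1141, 0x0020)]

# Hypothetical renumberings, keyed by (primary, old secondary).
order_preserving = {
    (0x1141, 0x0020): 0x0020,
    (0x0000, 0x0021): 0x0021,
    (0x0000, 0x0080): 0x0022,
    (0x0000, 0x008C): 0x0023,
}
careless = {
    (0x1141, 0x0020): 0x0020,
    (0x0000, 0x0021): 0x0002,   # ignorable now sorts below the base letter
    (0x0000, 0x0080): 0x0003,
    (0x0000, 0x008C): 0x0004,
}


def secondary_weights(elements, remap=None):
    """Level-2 weights of a string, optionally renumbered via remap."""
    if remap is None:
        return [s for _, s in elements if s != 0]
    return [remap[(p, s)] for p, s in elements if s != 0]


# The primaries are equal (1141 in both strings), so level 2 decides.
original = secondary_weights(first) < secondary_weights(second)
compacted = secondary_weights(first, order_preserving) < secondary_weights(second, order_preserving)
broken = secondary_weights(first, careless) < secondary_weights(second, careless)
```

    In this sketch the compacted table gives the same answer as the
    original weights, and the careless one reverses it, since the
    secondaries of primary ignorables do get compared against the
    secondaries of elements with other primaries.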

    >
    > I'm certain I am simply missing something in the language of the UCA. For
    > one, I note that the example given in the UCA for this technique renumbers
    > the secondary weights for the letter 'O', restricting the lower bound to the
    > initial lower bound of 0020; I see nothing in the language that would
    > prevent me from starting that lower bound lower, perhaps at 0002, yet for
    > some reason, this was not done.

    The allkeys.txt table is deliberately constructed so as to enable easy
    use of this technique.

    Tertiary weights are all in the range 0002..001F inclusive.
    Secondary weights (for UCA 5.1) are all in the range 0020..01DE inclusive.
    Primary weights start at 0200 and go up from there.

    These are deliberately not overlapped and are deliberately set up so
    that all primary weights (except zero) are greater than all secondary
    weights (except zero), which in turn are greater than all tertiary
    weights.
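
    As a small illustration, the non-overlapping ranges mean a non-zero
    weight can be classified by value alone. The Python sketch below uses
    the UCA 5.1 bounds quoted above; weight_level and the "unassigned"
    label for values outside those ranges are assumptions of the sketch.

```python
TERTIARY = range(0x0002, 0x001F + 1)    # 0002..001F inclusive
SECONDARY = range(0x0020, 0x01DE + 1)   # 0020..01DE inclusive (UCA 5.1)
PRIMARY_MIN = 0x0200                    # primaries start here


def weight_level(weight):
    """Name the level a weight belongs to, judging by its value only."""
    if weight == 0:
        return "zero"
    if weight in TERTIARY:
        return "tertiary"
    if weight in SECONDARY:
        return "secondary"
    if weight >= PRIMARY_MIN:
        return "primary"
    return "unassigned"   # falls in a gap between the quoted ranges


levels = [weight_level(w) for w in (0x1141, 0x0021, 0x0008, 0x0000)]
```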

    >
    > Also, under "3.1.4 Default Values", we read:
    >
    > *Both in the Default Unicode Collation Element Table and in typical
    > tailorings, most unaccented letters differ in the primary weights, but have
    > secondary weights (such as a1) equal to MIN2. The primary ignorables
    > will have secondary weights greater than MIN2.*

    And in allkeys.txt, the actual value of MIN2 is 0020.

    >
    > Why primary ignorables will have weights greater than MIN_2 is not
    > specified, but perhaps this is a hint to implementors such as myself. Does
    > it relate to the above issue? I'm not certain.

    See above. Hope that helps.

    --Ken

    >
    > Any insight or clarification into the above matter would be greatly
    > appreciated!


