Re: Unicode Collation Algorithm

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Wed Apr 26 2006 - 18:51:25 CST


    ----- Original Message -----
    From: "Mike" <mike-list@pobox.com>
    To: <unicode@unicode.org>
    Sent: Wednesday, April 26, 2006 8:48 PM
    Subject: Unicode Collation Algorithm

    > Hello,
    >
    > I am implementing the UCA and am having trouble
    > passing the conformance test. The problem is
    > that I believe my code is correct and the test
    > is wrong. For example the sequence:
    >
    > 09C7 1D165 09BE 0061
    >
    > is supposed to come before
    >
    > 09C7 0001 09D7 0061
    >
    > according to the test. What I am observing is
    > that 09C7 combines with 09BE according to steps
    > S2.1.1 thru S2.1.3. The intervening 1D165 is
    > ignored since it is not of combining class 0.
    > The combination 09C7 09BE becomes 09CB, which
    > sorts after 09C7.

    The sequence 09C7 1D165 09BE 0061 is not canonically equivalent to 09C7
    09BE 1D165 0061, so, apart from the algorithm's definition, why should the
    two automatically be sorted as equivalent? (The character U+09BE is a
    spacing combining mark of combining class 0.)
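
    The non-equivalence can be checked mechanically. The following is a
    minimal sketch, assuming Python's standard unicodedata module (whose
    Unicode data is newer than 4.1.0, though the properties involved here are
    long-standing); the helper name canonically_equivalent is made up for
    illustration. Two strings are canonically equivalent exactly when their
    NFD forms are identical.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Hypothetical helper: compare the NFD forms of two strings."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Mike's sequence, and the same sequence with U+09BE moved next to U+09C7.
original = "\u09C7\U0001D165\u09BE\u0061"
moved = "\u09C7\u09BE\U0001D165\u0061"

# U+09BE has combining class 0, so canonical reordering never moves it
# past U+1D165 (which has a non-zero combining class).
class_09be = unicodedata.combining("\u09BE")
class_1d165 = unicodedata.combining("\U0001D165")

equivalent = canonically_equivalent(original, moved)
```

    Under these assumptions, equivalent comes out false: canonical reordering
    only ever moves characters of non-zero combining class.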

    > Note that this is the NON_IGNORABLE test. I
    > have the same problem with the SHIFTED test.
    > And also this is for version 4.1.0.

    > If I comment out the code that implements
    > steps S2.1.1 through S2.1.3, then things break
    > that were working correctly. Has anyone been
    > able to resolve this problem?

    I was able to get through the test - once I'd decided that unpaired
    surrogates should not be converted to the replacement character! However,
    looking at my code, I think my implementation of S2.1 may be wrong!

    The problem lies in the interpretation of 'combining mark'. I'd taken it to
    mean a character with non-zero combining class. Moreover, I think this is
    what was intended! Having built up a maximal sequence S of consecutive
    characters (in form NFD) that has a match in the collation element table,
    one then looks for a subsequent, but not immediately subsequent, element C
    that can be added to extend the matching sequence. The element C cannot be
    added to S ('is blocked') if an element of combining class zero intervenes,
    or an element of the same combining class as C. This is almost the same as
    saying that C can be added to S if there is a canonically equivalent
    character sequence in which C does follow S. It is not the same, for S2.1.2
    does not prohibit C from being of combining class 0, but if C is of
    combining class 0, then moving C next to S creates a string that is not
    canonically equivalent to the original sequence.
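
    The two readings of the blocking rule can be sketched as follows. This is
    an illustration only, not a UCA implementation: the function names and the
    require_nonzero_class flag are invented here, the collation element table
    lookup is left out entirely, and the blocking test follows the wording
    paraphrased above (an intervening character of class zero, or of the same
    class as C).

```python
import unicodedata

def is_blocked(text, s_end, c_index):
    """True if text[c_index] is blocked from the match text[:s_end].

    Blocked means some character strictly between the match and C has
    combining class 0 or the same combining class as C.
    """
    c_class = unicodedata.combining(text[c_index])
    for ch in text[s_end:c_index]:
        b_class = unicodedata.combining(ch)
        if b_class == 0 or b_class == c_class:
            return True
    return False

def can_extend(text, s_end, c_index, require_nonzero_class):
    """Hypothetical check of whether C may be appended to the match S.

    require_nonzero_class=False is the literal reading of 'combining mark';
    require_nonzero_class=True is the reading proposed in this message.
    """
    if require_nonzero_class and unicodedata.combining(text[c_index]) == 0:
        return False
    return not is_blocked(text, s_end, c_index)

# Mike's sequence: S is U+09C7 (index 0), C is U+09BE (index 2).
mike = "\u09C7\U0001D165\u09BE\u0061"
literal = can_extend(mike, 1, 2, require_nonzero_class=False)
strict = can_extend(mike, 1, 2, require_nonzero_class=True)
```

    On Mike's sequence the literal reading lets U+09BE skip over U+1D165 and
    join U+09C7, while the non-zero-class reading does not.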

    The sentence, 'The reason for considering the extra combining marks C is
    that otherwise irrelevant characters could interfere with matches in the
    table' does not greatly clarify matters. The example it gives is of <a,
    combining_cedilla, combining_ring> being treated as <a, combining_ring,
    combining_cedilla> when <a, combining_cedilla> has no special significance
    (i.e. no contraction of its own). The point here is that there is in
    general no fundamental precedence of combining types - the canonical order
    is essentially arbitrary, but one is needed for definiteness. (There is a
    slight naturalness in the numbering of combining classes, but not much.)
    However, there is a very definite bracketing of association when two
    European marks have the same combining class.
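
    Both halves of this point can be seen with ordinary normalization. The
    sketch below again assumes Python's unicodedata module, and the variable
    names are arbitrary. Cedilla (U+0327) and ring (U+030A) have different
    combining classes, so either input order normalizes to the same NFD
    string; acute (U+0301) and ring share a class, so their relative order is
    significant and is preserved.

```python
import unicodedata

CEDILLA, RING, ACUTE = "\u0327", "\u030A", "\u0301"

def nfd(text):
    """Shorthand for canonical decomposition with canonical reordering."""
    return unicodedata.normalize("NFD", text)

# Different combining classes: the two orders are canonically equivalent,
# and NFD puts the lower class (cedilla) first.
ring_first = "a" + RING + CEDILLA
cedilla_first = "a" + CEDILLA + RING
different_classes_equivalent = nfd(ring_first) == nfd(cedilla_first)

# Same combining class: reordering is not allowed, so the two orders
# remain distinct after normalization.
acute_then_ring = "a" + ACUTE + RING
ring_then_acute = "a" + RING + ACUTE
same_class_equivalent = nfd(acute_then_ring) == nfd(ring_then_acute)
```

    Here different_classes_equivalent is true and same_class_equivalent is
    false, which is why a mark of the same class as C has to block it.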

    I think the rule should be amended by replacing 'combining mark' by
    'character of non-zero combining class', but a more elegantly phrased
    alternative would be still better.

    Richard.


