Re: Unicode Collation Algorithm

From: Mike (
Date: Thu Apr 27 2006 - 13:53:04 CST

    >> I am implementing the UCA and am having trouble
    >> passing the conformance test....
    > The problem lies in the interpretation of 'combining mark'. I'd taken
    > it to mean a character with non-zero combining class. Moreover, I think
    > this is what was intended!

    That was the problem. I modified my code to stop
    trying to form contractions when it encounters a
    combining mark whose combining class is 0. Now it
    passes the conformance tests (as long as I throw out
    the level-four collation data in the NON_IGNORABLE test).
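
For anyone hitting the same thing, here is a minimal sketch of that change, not my actual code. It assumes Python's standard `unicodedata` module; the function name `candidate_marks` is made up for illustration. It only shows the stopping condition: scanning ends at the first character with combining class 0, even if that character is a combining mark by general category. It deliberately leaves out the rest of the contraction matching (for example, blocking by marks of equal combining class).

```python
import unicodedata


def candidate_marks(text, start):
    """Return the indices after `start` that may still be tried when
    extending a contraction that begins at text[start].

    Hypothetical helper: scanning stops at the first character whose
    canonical combining class is 0, whatever its general category.
    """
    indices = []
    for i in range(start + 1, len(text)):
        if unicodedata.combining(text[i]) == 0:
            # Class 0: stop here, even for a spacing combining mark (Mc).
            break
        indices.append(i)
    return indices
```

U+093E DEVANAGARI VOWEL SIGN AA is a handy test character: it is a combining mark (general category Mc) but has combining class 0, so with this rule it ends the scan rather than being considered for a contraction.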

    > I was able to get through the test - once I'd decided that unpaired
    > surrogates should not be converted to the replacement character!

    Well, I had to ignore the tests with surrogates in
    them. All my code deals in UTF-8 strings, so to
    be conformant in UTF-8 processing, it raises an
    exception when a surrogate (paired or not) is found.
    I am comfortable with that.
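
To illustrate the behaviour I mean, here is a small sketch in Python (again, not my implementation; both function names are invented for the example). The first function relies on Python's built-in strict UTF-8 decoder, which rejects byte sequences that encode surrogate code points. The second does the equivalent check on an already-decoded string.

```python
def decode_utf8_strict(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on ill-formed input,
    including byte sequences that encode surrogates (paired or not)."""
    return data.decode("utf-8", errors="strict")


def reject_surrogates(text: str) -> str:
    """Raise ValueError if `text` contains any surrogate code point
    (U+D800..U+DFFF); otherwise return it unchanged."""
    for i, ch in enumerate(text):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            raise ValueError(f"surrogate U+{ord(ch):04X} at index {i}")
    return text
```

With this approach the surrogate cases in the conformance file cannot be fed through the collator at all, which is why those tests have to be skipped rather than passed.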

    > I think the rule should be amended by replacing 'combining mark' by
    > 'character of non-zero combining class', but a more elegantly phrased
    > alternative would be still better.

    Yes, that would eliminate the confusion.


    This archive was generated by hypermail 2.1.5 : Thu Apr 27 2006 - 13:57:47 CST