RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (
Date: Mon Jul 12 2010 - 22:33:50 CDT

    > From: "Kenneth Whistler" <>
    > Philippe Verdy wrote:
    > > "Kenneth Whistler" <> wrote:
    > > > Huh? That is just preprocessing to delete portions of strings
    > > > before calculating keys. If you want to do so, be my guest,
    > > > but building in arbitrary rules of content suppression into
    > > > the UCA algorithm itself is a non-starter.
    > >
    > > I have definitely not asked for adding such preprocessing.
    > No, effectively you just did, by asking for special weighting for
    > things like parentheticals at the end of strings.

    And you have completely misinterpreted it. I have not said that it
    implied preprocessing for special weighting.

    It was used as a justification for the fact that the end of the string
    need not be scanned at all in the VERY FREQUENT cases where the
    string contains multiple words (parenthetical clarifications are one
    example, but the same holds for any kind of phrase, sentence or
    text). It is also why we should be able to compare all collation
    levels within each word in isolation: the rest of the string needs
    to be scanned only if the words compare as binary identical.
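
    The word-by-word comparison described above can be sketched roughly
    as follows. This is only an illustration under loud assumptions:
    toy_weights is a hypothetical stand-in for a real collation table
    (it is not the DUCET), words are split on whitespace only, and the
    secondary level is compared forwards. The point is the early exit:
    all three levels are compared inside one word before the next word
    is even looked at.

```python
import unicodedata
from itertools import zip_longest


def toy_weights(ch):
    """Hypothetical (primary, secondary, tertiary) weights for one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    primary = ord(base.lower())            # base letter, case folded
    secondary = sum(ord(m) for m in marks)  # 0 when there is no accent
    tertiary = int(base.isupper())          # 1 for upper case
    return (primary, secondary, tertiary)


def word_key(word):
    """All primaries of the word, then all secondaries, then all tertiaries."""
    weights = [toy_weights(c) for c in word]
    return tuple(tuple(w[level] for w in weights) for level in range(3))


def compare_by_word(a, b):
    """Return (result, words_scanned); result is -1, 0 or 1."""
    scanned = 0
    for wa, wb in zip_longest(a.split(), b.split()):
        scanned += 1
        if wa is None:   # a ran out of words first
            return -1, scanned
        if wb is None:   # b ran out of words first
            return 1, scanned
        ka, kb = word_key(wa), word_key(wb)
        if ka != kb:     # decided inside this word: stop scanning here
            return (-1 if ka < kb else 1), scanned
    return 0, scanned
```

    With these toy weights, compare_by_word("résumé (noun)",
    "resume (verb)") is decided by the accent difference in the first
    word, so the parenthetical is never scanned. Note that this ordering
    differs from comparing all primary weights of the whole string
    first, where "(noun)" against "(verb)" would decide the result.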

    The only special weighting I spoke about was related to a single
    special empty collation element (implicitly FFFF.0000.0000.0000, with
    FFFF treated as -1), whose insertion is useful between the fields of
    multi-field sorts (such as with SQL's SELECT ORDER BY and GROUP BY
    clauses). It requires shifting all primary weights by 1 if you want
    non-negative values only. Note, however, that when serializing
    collation weights into a byte stream to compute collation keys,
    additional shifts occur anyway, for example to avoid signed-byte
    ordering problems in Java, or simply to compress the weights to just
    the number of bits needed for each collation level. In fact, this
    does not require any change to the format or data of the DUCET, as
    there will never be any collision of primary weights.
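
    As a rough illustration of the field separator idea, here is a
    minimal sketch restricted to the primary level. The names
    toy_primary, multi_field_primaries and serialize are hypothetical,
    toy_primary simply uses the code point instead of a real DUCET
    lookup, and the 3-byte big-endian encoding is an arbitrary choice
    made only so that the shifted values fit.

```python
FIELD_SEPARATOR = -1  # primary weight FFFF read as a signed 16-bit value


def toy_primary(ch):
    """Hypothetical stand-in for a primary weight lookup (not the DUCET)."""
    return ord(ch)


def multi_field_primaries(fields):
    """Primary weights of every field, each field followed by the separator."""
    weights = []
    for field in fields:
        weights.extend(toy_primary(c) for c in field)
        weights.append(FIELD_SEPARATOR)
    return weights


def serialize(weights):
    """Shift every weight by 1 so the separator becomes 0, then emit
    unsigned big-endian bytes that can be compared with plain memcmp."""
    return b"".join((w + 1).to_bytes(3, "big") for w in weights)


rows = [("abc", ""), ("ab", "c")]
ordered = sorted(rows, key=lambda r: serialize(multi_field_primaries(r)))
```

    Because the separator sorts below every real primary weight, the
    row ("ab", "c") comes before ("abc", ""), which matches a
    field-by-field comparison. Plain concatenation of the two fields
    could not tell these rows apart.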

    When just generating sort key strings, the unsigned 16-bit collation
    weights decoded from the DUCET or tailored tables will be stored in a
    standard 32-bit or 64-bit signed integer register or local variable,
    which can easily hold the additional -1 value of the special primary
    level for the field separator. When comparing strings, it is not
    even needed (you just have to know if the other field from the other
    compared row is also at end, because you'll compare these fields

    -- Philippe.

    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 22:35:49 CDT