RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jul 13 2010 - 09:20:10 CDT

    > From: "CE Whitehead" <cewcathar@hotmail.com>
    > To: verdy_p@wanadoo.fr, unicode@unicode.org
    > Cc:
    > Subject: RE: UTS#10 (collation) : French backwards level 2, and word-breakers.
    >
    >
    > Hi, I am sort of confused; so there is no way now to put some of the weights in reverse order at the secondary level while skipping word boundaries?
    > Philippe Verdy's suggestion seems reasonable, in general; however I think that not reversing the weights at word boundaries at level 2 should be simply an option for French; also I do believe that there is already a way to identify word boundaries at the primary level in the DUCET but I may be wrong -- that is the characters that define word boundaries, non-spacing characters, white space, are defined.
    >
    > So is the point then to define all word separators -- whether in the form of white space, a mandatory line break, etc. -- with a single weight in the DUCET? (Sorry to be so confused.)

    There's already a standard annex covering word boundaries (UAX #29),
    and the standard defines other boundaries as well: lines, sentences,
    default grapheme clusters (including ZWJ and ZWNJ, the 8 Thai/Lao
    prepended characters, and sequences that include double
    diacritics)... plus the combining sequence boundaries that are part
    of the core standard for normalization.

    Word boundaries are among the simplest boundaries to compute, at
    least for alphabetic scripts (this is certainly more complex for East
    Asian scripts, but for those scripts these boundaries are not very
    useful for collation purposes).

    No need to reinvent the wheel specifically for UTS#10 collation, which
    already needs the default grapheme cluster boundaries as the smallest
    boundaries (each spanning one or more entire combining sequences),
    possibly extended to cover multiple default grapheme clusters in
    language-specific clusters (for M-to-1 and M-to-N weight mappings).

    Yes I know also that isolated combining characters may also receive
    their own collation weights, because they are not necessarily combined
    within M-to-1 or M-to-N weight mappings.

    But the UCA and the DUCET are built to make sure that the result
    will be consistent within at least the default grapheme clusters,
    independently of language-specific tailorings (but I'm not sure that
    the UCA algorithm addresses all the consistency issues needed to make
    sure that this will be true for all the default grapheme clusters,
    including in tailorings, when only the combining sequence boundaries
    are really secured).

    UTS#10 is also helping us to define the "non-default" grapheme
    clusters perceived in various languages. It is really a complement to
    the existing UAX for boundaries, that goes beyond just the purpose of
    sorting and can be used even without considering any collation weights
    and independently of collation levels. For example these boundaries
    can be used in full text indexing, in spell checkers, and in
    semantic analysis of encoded texts, and they may also help for
    enhancing the usability of text editors, or for text
    selection/extraction in browsers.

    Backwards levels were apparently introduced in UCA specifically for
    French collation, but their definition forgot one aspect of French
    collation: the reversal is only wanted within single words (the most
    significant secondary differences of accents are to be found at the
    end of each word separately, not at the end of texts of arbitrary
    length).
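
    To make the difference concrete, here is a minimal Python sketch. It
    is not the UCA: it assumes toy weights (primary = lowercased base
    letter, secondary = code point of the last combining mark on that
    letter, 0 if none) and treats only whitespace as a word separator.
    The names weights, key_global and key_per_word are made up for this
    illustration.

```python
import unicodedata

def weights(text):
    """Return toy (primary, secondary) weight lists, one entry per base character."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Fold the accent into the secondary weight of the preceding base letter.
            if secondary:
                secondary[-1] = ord(ch)
        else:
            primary.append(ch.lower())
            secondary.append(0)
    return primary, secondary

def key_global(text):
    """Backwards level 2 over the whole string (reversal ignores word boundaries)."""
    primary, secondary = weights(text)
    return (primary, secondary[::-1])

def key_per_word(text):
    """Backwards level 2 applied to each word separately (the behaviour argued for here)."""
    primary, secondary = weights(text)
    out, word = [], []
    for p, s in zip(primary, secondary):
        if p.isspace():
            out.extend(reversed(word))  # flush the finished word, reversed
            out.append(s)               # the separator keeps its position
            word = []
        else:
            word.append(s)
    out.extend(reversed(word))
    return (primary, out)

words = ["côté", "cote", "coté", "côte"]
print(sorted(words, key=key_global))

a, b = "cote côté", "coté cote"
print(key_global(a) < key_global(b), key_per_word(a) < key_per_word(b))
```

    For single words both keys give the same order. For the two phrases,
    the whole-string reversal lets the accents of the last word decide,
    while the per-word reversal lets the first word decide.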

    And even if Kenneth at Sybase thinks that this would complicate things
    or slow down collation, my own experience demonstrates just the
    opposite, precisely for French collation. He recognizes himself that
    French UCA collation is slow, but the cause of this slowness is that
    word boundaries were forgotten in an algorithm that already considers
    smaller boundaries. He also admits that a small input buffer will be
    needed anyway for correct handling of normalization, canonical
    equivalence of results, and Unicode process conformance. Finally,
    handling word boundaries has absolutely no impact on the effective
    performance of collations without a backwards level (including the
    default locale-neutral "root" collation directly derived from the
    DUCET).
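
    The buffering argument can be illustrated with a short Python sketch.
    It measures nothing and is not taken from any real collation engine;
    it only shows that, with per-word reversal, secondary weights can be
    emitted as soon as a word ends, so the buffer never grows beyond one
    word. The function name and the (is_separator, weight) input format
    are assumptions made for this example.

```python
def per_word_backwards(stream):
    """Yield secondary weights, reversed within each word.

    `stream` yields (is_separator, weight) pairs, a hypothetical stand-in
    for the output of the collation element lookup.
    """
    buffer = []
    for is_separator, weight in stream:
        if is_separator:
            # Flush the buffered word in reverse order, then the separator itself.
            while buffer:
                yield buffer.pop()
            yield weight
        else:
            buffer.append(weight)
    # Flush the last word at end of input.
    while buffer:
        yield buffer.pop()

elements = [(False, 1), (False, 2), (True, 0), (False, 3), (False, 4), (False, 5)]
print(list(per_word_backwards(elements)))
```

    A whole-string backwards level, by contrast, cannot emit its first
    secondary weight before the end of the input has been reached.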

    -- Philippe.


