RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: CE Whitehead (cewcathar@hotmail.com)
Date: Tue Jul 13 2010 - 19:25:45 CDT


    Hi. Thanks Philippe very much for your information.
    Unfortunately I must admit that I have no experience using the collation algorithm to sort words or strings of words -- so I cannot help back up your data on efficiency.
    If a low-weight field/space separator provides uniformity then perhaps I can vote for your low-weight field separator -- would it be at all like http://www.unicode.org/reports/tr10/tr10-21.html#Combining_Grapheme_Joiner ?
    (I still think a space character's level can simply be customized, and that this character might be used to produce a different sort between di lillo and dilillo and also be used to block reordering and allow word-by-word comparison with the right code.)
    Word-by-word comparison should be more efficient of course as it does enable you to stop after reading in enough words to distinguish one string from another.
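
    A minimal sketch of the word-by-word idea, under loud assumptions: words are split on white space only, and plain lowercase code point order stands in for real collation weights (this is not the UCA or the DUCET). The names words and compare_word_by_word are hypothetical, chosen only for illustration.

```python
import re
from itertools import zip_longest


def words(text):
    """Lazily yield the lowercase white-space-separated words of text."""
    for match in re.finditer(r"\S+", text):
        yield match.group().lower()


def compare_word_by_word(a, b):
    """Return -1, 0 or 1, comparing a and b one word at a time.

    ASSUMPTION: code point order stands in for real collation weights.
    """
    for word_a, word_b in zip_longest(words(a), words(b), fillvalue=""):
        if word_a != word_b:
            # The first differing word decides; later words are never read.
            return -1 if word_a < word_b else 1
    return 0
```

    With this sketch "di zeta" sorts before "dilillo", because the first word "di" is a prefix of "dilillo" and the comparison stops there; a comparison that ignored the space would compare "dizeta" with "dilillo" and give the opposite order.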
    At the same time, as I think about it, I can think of no proper names in French where the words differed only by level 2 accents* -- although I am not French.

    So I can see Kenneth's point somewhat here, too.

    Sorry that I am not more help.
     

    (* the French words distinguished by level 2 marks include ou and où, and then there are the varieties of cote with and without accents; see: http://www.wordreference.com/fren/cote ; the only other thing I can think of is masculine past participles such as râpé, which sometimes are distinct from related verbs/nouns/adjectives only by an accent;
    this also happens in Spanish, particularly among the pronouns;
    see:
    http://users.ipfw.edu/jehle/courses/pronoun1.htm
    [also note that 'éste' is a demonstrative masculine pronoun, 'this one,' while 'este' is the masculine form of the adjective 'this', or else 'este' can mean 'East',
    and then 'esté' is the present subjunctive of 'estar'
    http://www.wordreference.com/es/en/translation.asp?spen=este])

    Best wishes with this,

    --C. E. Whitehead
    cewcathar@hotmail.com

     

    From: Philippe Verdy (verdy_p@wanadoo.fr)
    Date: Tue Jul 13 2010 - 09:20:10 CDT
    > From: "CE Whitehead" <cewcathar@hotmail.com>
    > To: verdy_p@wanadoo.fr, unicode@unicode.org
    > Cc:
    > Subject: RE: UTS#10 (collation) : French backwards level 2, and word-breakers.
    >
    >
    >> Hi, I am sort of confused; so there is no way now to put some of the weights in reverse order at the secondary level while skipping word boundaries?
    >> Philippe Verdy's suggestion seems reasonable, in general; however I think that not reversing the weights at word boundaries at level 2 should be simply an option for French; also I do believe that there is already a way to identify word boundaries at the primary level in the DUCET but I may be wrong -- that is the characters that define word boundaries, non-spacing characters, white space, are defined.
    >>
    >> So is the point then to define all word separators -- whether in the form of white space, a mandatory line break, etc. -- with a single weight in the DUCET? (Sorry to be so confused.)

    > There's already a standard annex covering word boundaries, and other
    > boundaries : lines, sentences, default grapheme clusters (including
    > ZWJ and ZWNJ, and the 8 Thai/Lao prepended characters, and sequences
    > that include double diacritics)... plus the combining sequence
    > boundaries that are part of the core standard for normalizations.
    Yes.
    > Word boundaries are among the simplest boundaries to compute (at
    > least for alphabetic scripts; this is certainly more complex for East
    > Asian scripts, but in those scripts these boundaries are not very
    > useful for collation purposes).

    > No need to reinvent the wheel specifically for UTS#10 collation, which
    > already needs the default grapheme cluster boundaries, as the smallest
    > boundaries (spanning entirely one or more combining sequences),
    > possibly extended to cover multiple default grapheme clusters in
    > language-specific clusters (for M-to-1 and M-to-N weight mappings).
    Yes.
    > Yes I know also that isolated combining characters may also receive
    > their own collation weights, because they are not necessarily combined
    > within M-to-1 or M-to-N weight mappings.

    > But the way UCA and the DUCET are built is to make sure that the result
    > will be consistent within at least the default grapheme clusters,
    > independently of language-specific tailorings (but I'm not sure that
    > the UCA algorithm addresses all the consistency issues to make sure
    > that this will be true for all the default grapheme clusters,
    > including in tailorings, when only the combining sequence boundaries
    > are really secured).

    > UTS#10 is also helping us to define the "non-default" grapheme
    > clusters perceived in various languages. It is really a complement to
    > the existing UAX for boundaries, that goes beyond just the purpose of
    > sorting and can be used even without considering any collation weights
    > and independently of collation levels. For example these boundaries
    > can be used in full text indexing, in orthographic correctors, and in
    > semantic analysis of encoded texts, and they may also help for
    > enhancing the usability of text editors, or for text
    > selection/extraction in browsers.

    > As the introduction of backwards levels in UCA was made apparently
    > specifically for French collation, it really forgot one aspect of
    > French collation: that this is only wanted within single words (the
    > most significant secondary differences of accents are to be found at
    > end of each word separately, but not at end of texts of arbitrary
    > length).
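
    The difference between reversing secondary weights over the whole string and reversing them inside each word can be sketched with a toy two-level key. Everything here is an assumption made for illustration: the base letter stands in for the primary weight, the code point of the combining accent (0 if none) stands in for the secondary weight, at most one accent per letter is handled, and words are split on white space. The names split_weights, key_whole_string and key_per_word are hypothetical; none of this is the actual UCA implementation.

```python
import unicodedata


def split_weights(word):
    """Return toy (primary, secondary) weight lists for one word.

    ASSUMPTION: primary = base letter, secondary = code point of the
    last combining mark on that letter (0 when unaccented).
    """
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] = ord(ch)
        else:
            primary.append(ch)
            secondary.append(0)
    return primary, secondary


def key_whole_string(text):
    """Backwards level 2 applied to the whole string at once."""
    parts = [split_weights(w) for w in text.split()]
    flat = [weight for _, sec in parts for weight in sec]
    return [p for p, _ in parts], flat[::-1]


def key_per_word(text):
    """Backwards level 2 applied inside each word, words kept in order."""
    parts = [split_weights(w) for w in text.split()]
    return [p for p, _ in parts], [sec[::-1] for _, sec in parts]
```

    On single words both keys give the usual French order cote, côte, coté, côté. On "côte cote" versus "cote coté" they disagree: the whole-string key lets the accent at the very end of the second string decide, so "côte cote" sorts first, while the per-word key decides on the first word, so "cote coté" sorts first.
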

    And even if Kenneth at Sybase thinks that this would complicate things
    or slow down collation, my own experience demonstrates just the
    opposite, exactly for French collation (he himself recognizes that
    French UCA collation is slow, but the cause of this slowness is
    that word boundaries were forgotten in an algorithm that is already
    considering smaller boundaries, and he also admits that a small input
    buffering will be needed for correct handling of normalization,
    canonical equivalence of results, and Unicode process conformance),
    and it has absolutely no impact on the effective performance for
    collations without backwards level (including the default
    locale-neutral "root" collation directly induced from the DUCET).

    -- Philippe.
     

                                                   


