UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Jul 04 2010 - 19:15:06 CDT

    Collation (for French) normally uses backwards ordering of collation
    weights at level 2:

    «
      4.3 Form Sort Key
      Step 3. The sort key is formed by successively appending all
    non-zero weights from the collation element array. The weights are
    appended from each level in turn, from 1 to 3. (Backwards weights are
    inserted in reverse order.)
    »
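
    As an illustration of the quoted step, here is a minimal sketch in
    Python. It is not an implementation of UTS#10: the weight table
    TOY_CES, the helper to_ces() and the function form_sort_key() are
    hypothetical names, the weights are made-up toy values (not DUCET
    values), and 0 is assumed as the level separator.

```python
# Toy collation elements (primary, secondary, tertiary). The values are
# illustrative assumptions only and are NOT taken from the DUCET.
TOY_CES = {
    "c": [(0x10, 0x20, 0x02)],
    "o": [(0x11, 0x20, 0x02)],
    "t": [(0x12, 0x20, 0x02)],
    "e": [(0x13, 0x20, 0x02)],
    "\u00e9": [(0x13, 0x20, 0x02), (0x00, 0x24, 0x02)],  # e + acute accent
    "\u00f4": [(0x11, 0x20, 0x02), (0x00, 0x27, 0x02)],  # o + circumflex
}


def to_ces(text):
    """Map each character to its toy collation elements."""
    ces = []
    for ch in text:
        ces.extend(TOY_CES[ch])
    return ces


def form_sort_key(ces, backwards_levels=(2,)):
    """Append the non-zero weights level by level, from 1 to 3.

    Weights of a level listed in backwards_levels are appended in
    reverse order, over the WHOLE string.
    """
    key = []
    for level in (1, 2, 3):
        weights = [ce[level - 1] for ce in ces if ce[level - 1] != 0]
        if level in backwards_levels:
            weights.reverse()
        if level > 1:
            key.append(0)  # assumed level separator
        key.extend(weights)
    return tuple(key)


# cote, coté, côte, côté in shuffled order
words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
french = sorted(words, key=lambda w: form_sort_key(to_ces(w)))
forward = sorted(words, key=lambda w: form_sort_key(to_ces(w), backwards_levels=()))
print(french)   # backwards level 2: the LAST accent difference decides
print(forward)  # forward level 2: the FIRST accent difference decides
```

    With these toy weights, the backwards variant places « côte » before
    « coté », while the forward variant places « coté » before « côte ».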

    However, I think that this creates over-long sequences, because it
    reverses ALL secondary weights of arbitrarily long texts. Not only
    would this rule have a severe performance impact, it is actually not
    needed for French.
    What is needed is JUST to reverse the collation weights associated with
    single words (or compound words, including those containing an
    apostrophe). So the reversal should only apply to separate spans of
    text after word-breaking (see UAX #29).

    For example, with the sentence
      «Pour être heureux, ne vivons pas cachés ! »,
    it is quite enough to reverse the secondary weights within each of
    the following spans:
      <span>Pour </span>
      <span>être </span>
      <span>heureux, </span>
      <span>ne </span>
      <span>vivons </span>
      <span>pas </span>
      <span>cachés !</span>

    A word-breaking step (UAX #29), based on "extended grapheme
    clusters" as above, or on shorter "legacy grapheme clusters" where
    spaces, punctuation and spacing marks would be separated, should be
    used at the end of step 4.1, before step 4.2 of the UCA algorithm.

    Step 4.3 then only needs to be applied between those word breaks,
    instead of on the complete string.
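
    The difference between the two behaviours can be sketched as follows.
    This is only a rough model under stated assumptions: str.split() on
    whitespace stands in for a real UAX #29 word breaker, spaces are
    assumed to be fully ignorable, and TOY_CES, to_ces() and sort_key()
    are hypothetical names with made-up weights (not DUCET values).

```python
# Toy collation elements (primary, secondary, tertiary). The values are
# illustrative assumptions only and are NOT taken from the DUCET.
TOY_CES = {
    "c": [(0x10, 0x20, 0x02)],
    "o": [(0x11, 0x20, 0x02)],
    "t": [(0x12, 0x20, 0x02)],
    "e": [(0x13, 0x20, 0x02)],
    "\u00e9": [(0x13, 0x20, 0x02), (0x00, 0x24, 0x02)],  # e + acute accent
}


def to_ces(word):
    """Map each character of a word to its toy collation elements."""
    ces = []
    for ch in word:
        ces.extend(TOY_CES[ch])
    return ces


def sort_key(text, per_word=True):
    """Build a 3-level key with backwards secondary weights.

    per_word=True  reverses secondary weights inside each word span;
    per_word=False reverses them over the whole string (one big span).
    """
    # Assumption: whitespace splitting replaces a UAX #29 word breaker.
    spans = [to_ces(word) for word in text.split()]
    if not per_word:
        spans = [[ce for span in spans for ce in span]]

    primary, secondary, tertiary = [], [], []
    for span in spans:
        primary += [p for p, _, _ in span if p]
        secondary += [s for _, s, _ in reversed(span) if s]
        tertiary += [t for _, _, t in span if t]
    # 0 is the assumed level separator.
    return tuple(primary + [0] + secondary + [0] + tertiary)


a = "cot\u00e9 cote"  # accent in the first word
b = "cote cot\u00e9"  # accent in the second word

# Whole-string reversal: the accent in the LAST word decides.
print(sort_key(a, per_word=False) < sort_key(b, per_word=False))
# Per-word reversal: the accent in the FIRST word decides.
print(sort_key(b) < sort_key(a))
```

    Both strings have the same primary weights, so only the secondary
    level decides, and the two strategies order the pair in opposite ways.
    The per-word variant also only needs to buffer one word's secondary
    weights at a time before emitting them.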

    This will then correctly sort an itemized list of definitions like:
       * être (en anglais, “to be”) : v. aux. irrégulier du 3è groupe
       * été (en anglais, “summer”) : n. m. – 2è saison de l’année.
       * ...
    Or other, simpler lists of person names, toponyms, book titles...
    because it would actually apply the reversal of accent differences
    only within the first word of each item (other words would be
    considered only if two items have the same initial word).

    Note that the punctuation and spaces that may cause a word break to
    be detected will often be ignored at the first two collation levels
    (i.e. they would have a 0000 collation weight at these levels),
    notably in collations tailored for specific locales (such as French),
    as opposed to the generic locale-neutral collation (the "root" locale
    of CLDR, using the DUCET).

    Can UTS#10 (currently in review), which describes the UCA algorithm,
    say where a word breaker may be used? This would also offer huge
    optimization opportunities for computing collation weights in most
    languages (not just French), notably because it would greatly reduce
    the internal buffering needed to create each substring of collation
    weights for each separate collation level.

    It would also be useful to reserve in the DUCET a specific collation
    weight at the primary level (with a lower value than the value of the
    collation-level separator, if it is used), or a range of such weights,
    that could be used for word separation (or other kinds of hierarchical
    logical separation). This could really speed up the process of
    computing collation weights for long sentences (notably, it would
    allow collation strings to be appended directly on the fly by
    separating them with this separator weight).

    My opinion is that, by default, at least the most basic
    word-breaker (on breakable whitespace including explicit line-break
    controls, possibly on sentence breaks if available) should be used to
    limit the effect of backwards reordering of collation weights at any
    level, in any practical implementation of the UCA. This applies
    notably to implementations of the UCA with the French locale in
    database engines, for building their indexes and for supporting the
    « ORDER BY » clause, text comparison operators like >, <, >=, <=, and
    « BETWEEN...AND », aggregates like « MIN() » and « MAX() », and
    operators based on text similarity such as =, !=, and « LIKE ».

    Philippe.


