RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: CE Whitehead (cewcathar@hotmail.com)
Date: Tue Jul 06 2010 - 22:38:58 CDT


    Hi, I am somewhat confused: is there currently no way to put the secondary-level weights in reverse order within each word, without the reversal running across word boundaries?
    Philippe Verdy's suggestion seems reasonable in general; however, I think that confining the level 2 reversal to individual words should simply be an option for French. I also believe, though I may be wrong, that the DUCET already provides a way to identify word boundaries at the primary level: the characters that mark word boundaries (non-spacing characters, white space) are defined there. So is the point then to give all word separators, whether white space, a mandatory line break, etc., a single weight in the DUCET? (Sorry to be so confused.)

    From: Philippe Verdy (verdy_p@wanadoo.fr)
    Date: Sun Jul 04 2010 - 19:15:06 CDT
    > Collation (for French) normally uses backwards ordering of collation
    > weights at level 2:
    > «
    > 4.3 Form Sort Key
    > Step 3. The sort key is formed by successively appending all
    > non-zero weights from the collation element array. The weights are
    > appended from each level in turn, from 1 to 3. (Backwards weights are
    > inserted in reverse order.)
    > »
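    To make the quoted step concrete, here is a minimal sketch of sort-key formation with an optional backwards second level. The weights are toy values invented for the illustration (they are not DUCET weights), only three accents are handled, and the function names are hypothetical.

```python
import unicodedata

# Toy weights for illustration only; these are NOT DUCET values.
COMMON = 1                                               # secondary weight of a plain letter
ACCENTS = {"\u0301": 2, "\u0300": 3, "\u0302": 4}        # acute, grave, circumflex


def collation_elements(text):
    """Return toy (primary, secondary, tertiary) triples for `text`."""
    elements = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in ACCENTS:
            # An accent is its own element: ignorable at level 1.
            elements.append((0, ACCENTS[ch], 1))
        elif ch.isalpha():
            elements.append((ord(ch.lower()), COMMON, 2 if ch.isupper() else 1))
        # Everything else is treated as fully ignorable in this sketch.
    return elements


def sort_key(text, backwards_secondary=False):
    """Append the non-zero weights level by level, reversing level 2 on request."""
    elements = collation_elements(text)
    key = []
    for level in range(3):
        weights = [ce[level] for ce in elements if ce[level] != 0]
        if level == 1 and backwards_secondary:
            weights.reverse()
        key.extend(weights)
        key.append(0)                                    # level separator, below every weight
    return tuple(key)


words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=sort_key)
french = sorted(words, key=lambda w: sort_key(w, backwards_secondary=True))
```

    With these toy weights the forward order is cote, coté, côte, côté, while the backwards second level gives cote, côte, coté, côté, which is the usual French dictionary order.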

    > However I think that this creates over-long sequences which would
    > reverse ALL secondary weights of arbitrarily long texts. Not only
    > would this rule have a severe performance impact, but it is actually
    > not needed for French.
    > What is needed is JUST to reverse the collation weights associated with
    > single words (or compound words, including those containing an
    > apostrophe).
    > So the reversal should only apply to separate spans of
    > text after word-breaking (see UAX #29).
    Seems valid.
    > For example, with the sentence
    > «Pour être heureux, ne vivons pas cachés ! »,
    > it is quite enough to reverse the secondary weights within each of these spans:
    > <span>Pour </span>
    > <span>être </span>
    > <span>heureux, </span>
    > <span>ne </span>
    > <span>vivons </span>
    > <span>pas </span>
    > <span>cachés !</span>
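    The difference between reversing over the whole string and reversing inside each span can be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for a real UAX #29 word breaker, the secondary weights are invented toy values, and the function names are hypothetical.

```python
import unicodedata

# Toy secondary weights for illustration only; these are NOT DUCET values.
COMMON = 1
ACCENTS = {"\u0301": 2, "\u0300": 3, "\u0302": 4}        # acute, grave, circumflex


def secondary_weights(span):
    """Forward list of toy secondary weights for one span of text."""
    weights = []
    for ch in unicodedata.normalize("NFD", span):
        if ch in ACCENTS:
            weights.append(ACCENTS[ch])
        elif ch.isalpha():
            weights.append(COMMON)
        # Spaces and punctuation are ignored at this level in the sketch.
    return weights


def backwards_whole(text):
    """Reverse the secondary weights of the complete string."""
    return secondary_weights(text)[::-1]


def backwards_per_word(text):
    """Reverse the secondary weights inside each word, keeping word order."""
    weights = []
    for word in text.split():                            # stand-in for a UAX #29 word breaker
        weights.extend(reversed(secondary_weights(word)))
    return weights


a, b = "côte cote", "cote coté"
whole_says_a_first = backwards_whole(a) < backwards_whole(b)
per_word_says_b_first = backwards_per_word(b) < backwards_per_word(a)
```

    Both strings have the same letters, so only the accents decide. Whole-string reversal looks at the last word first and puts "côte cote" ahead; per-word reversal lets the first word decide and puts "cote coté" ahead.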

    > A word-breaking step (UAX #29), based on "extended grapheme
    > clusters" as above, or on shorter "legacy grapheme clusters" where
    > spaces, punctuation and spacing marks would be separated, should be
    > used at the end of step 4.1, before step 4.2 of the UCA algorithm.

    > And step 4.3 then only needs to be applied between those word-breaks,
    > instead of on the complete string.

    > And then, this will correctly sort an itemized list of definitions like:
    > * être (en anglais, “to be”) : v. aux. irrégulier du 3è groupe
    > * été (en anglais, “summer”) : n. m. – 2è saison de l’année.
    > * ...
    > Or other simpler lists of person names, toponyms, book titles...
    > because it would actually apply the reversal of accent differences
    > only within the first word of each item (other words would still be
    > considered, but only if two items have the same initial word).

    > Note that the punctuation and spaces that may cause a word-break to
    > be detected will often be ignored at the first two levels of collation
    > (i.e. they would have a 0000 collation weight at these levels),
    > notably in collations tailored for specific locales (such as French)
    > and not the generic locale-neutral collation (in the "root" locale of
    > CLDR and using the DUCET).
    These can be re-tailored, right? As things stand currently? Or am I confused?
    > Can UTS#10 (currently in review), which describes the UCA algorithm, say
    > where a word breaker may be used? This would also offer huge
    > optimization opportunities for computing collation weights in most
    > languages (not just French), notably because it would greatly reduce the
    > internal buffering needed to create each substring of collation weights
    > for each separate collation level.

    > And it would be useful to reserve in the DUCET a specific collation
    > weight, at the primary level
    > (with a lower value than the value of the
    > collation-level separator, if it is used), or a range of such weights,
    > that could be used for word separation (or other kinds of hierarchical
    > logical separation)
    As I understand things, white space and punctuation get a primary-level weight in the DUCET and can be ignored or not as needed. (See my sort of names below: after a string of characters, a null before white space sorts ahead of any alphabetic or other non-null, non-space character, so the white space, hard carriage returns, and punctuation are not ignored in that case.) They can therefore be used for separating words! Line breaks are defined in UAX #14.
    It would perhaps be nice to give hard carriage returns, white space, and the other places where words may be separated and lines broken without a hyphen a single weight, and to use that weight in collation. (Is that what you are getting at? Otherwise I am completely confused; sorry.)
    > could really speedup the process of computing
    > collation weights for long sentences (notably, it would allow
    > collation strings to be appended directly on the fly by separating
    > them with this separator weight).
    { Sorry, I am lost here; will not comment. }
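    One possible reading of the separator idea, as a minimal sketch: each word gets its own short key (primary weights, a level separator, then the secondary weights reversed), and the per-word keys are appended one after another with a reserved word-separator weight between them. The reserved values WORD_SEP and LEVEL_SEP, the toy accent weights and the function names are all assumptions made for the illustration; nothing like them is defined in the DUCET.

```python
import unicodedata

WORD_SEP = 0     # hypothetical reserved word-separator weight (lowest of all)
LEVEL_SEP = 1    # hypothetical level separator, just above WORD_SEP
COMMON = 2       # toy secondary weight of a plain letter
ACCENTS = {"\u0301": 3, "\u0300": 4, "\u0302": 5}        # acute, grave, circumflex


def word_key(word):
    """Two-level toy key for a single word, with level 2 reversed."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if ch in ACCENTS:
            secondary.append(ACCENTS[ch])
        elif ch.isalpha():
            primary.append(ord(ch.lower()))
            secondary.append(COMMON)
    if not primary:
        return []                                        # e.g. a lone "!" contributes nothing
    return primary + [LEVEL_SEP] + secondary[::-1]


def streaming_key(words):
    """Append per-word keys on the fly; only one word is buffered at a time."""
    key = []
    for word in words:
        chunk = word_key(word)
        if not chunk:
            continue
        if key:
            key.append(WORD_SEP)
        key.extend(chunk)
    return key


sentence_key = streaming_key("Pour être heureux, ne vivons pas cachés !".split())
```

    Because each chunk is complete as soon as its word ends, the buffering needed for the reversal never exceeds one word. The price is that the comparison becomes word by word, so an accent difference in the first word outranks any difference in later words.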
    > And my opinion is that, by default, at least the most basic
    > word-breaker (on breakable whitespace including explicit linebreak
    > controls, possibly on sentence breaks if available) should be used to
    > limit the effect of backwards reordering of collation weights at any
    > level, in any practical implementation of the UCA
    As an option, this is fine.

    I collate as follows (note that i' stands for i with a grave accent):

    (EXAMPLE 1 -- my sort)
    di Silva, Fred
    di Silva, John
    di Si'lva, Fred
    di Si'lva, John
    Disilva, Fred
    Disilva, John

    and not:

    (EXAMPLE 2 -- sort from UTS #10 samples)
    di Silva, Fred
    di Si'lva, Fred
    Disilva, Fred
    di Silva, John
    di Si'lva, John
    Disilva, John

    (As an aside: I am rather shocked by the second ordering, taken from the example in Table 6, section 1.6, of UTS #10, because that is not how I sort. I suppose the example's purpose is to show
    collation across word boundaries, but I am still trying to get used to it; I am apparently what one would call in English a "narrow" person when it comes to collation and sorting.
    I gather, however, that the second option is how search engines collate: they may treat hyphens as equivalent to white space, and may equate two-word and one-word variants of an otherwise identical string, just to get more matches in the hope of getting the best one. That is good, because we make mistakes, but I still cannot accept the sort in Table 6.)
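    The two ways of looking at the name list can be sketched in a few lines. This is a toy model under stated assumptions: the weights are invented (not DUCET values), the space gets a hypothetical primary weight below every letter when it is not ignored, commas are always ignored, and the function names are made up for the illustration. Field-by-field comparison with a non-ignorable space reproduces Example 1; a single merged key with the space ignored lets the first name decide before any accent or case difference in the surname, which is the grouping seen in Example 2. The order inside each first-name group depends on how secondary and tertiary weights are assigned, so the sketch makes no claim about it.

```python
import unicodedata

GRAVE = "\u0300"
SPACE_PRIMARY = 1        # hypothetical primary weight, below every letter


def level_key(text, ignore_space):
    """Toy three-level key: (primary tuple, secondary tuple, tertiary tuple)."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch == GRAVE:
            secondary.append(2)                          # accent: level 2 only
        elif ch.isalpha():
            primary.append(ord(ch.lower()))
            secondary.append(1)
            tertiary.append(2 if ch.isupper() else 1)
        elif ch == " " and not ignore_space:
            primary.append(SPACE_PRIMARY)
            secondary.append(1)
            tertiary.append(1)
        # Commas and anything else are ignored in this sketch.
    return (tuple(primary), tuple(secondary), tuple(tertiary))


def field_by_field(name):
    """Compare the surname at all levels first, then the first name."""
    return tuple(level_key(field.strip(), ignore_space=False)
                 for field in name.split(","))


def merged(name):
    """One key for the whole entry, spaces and commas ignored."""
    return level_key(name, ignore_space=True)


names = ["Disilva, John", "di Sìlva, Fred", "di Silva, John",
         "Disilva, Fred", "di Sìlva, John", "di Silva, Fred"]
example_1 = sorted(names, key=field_by_field)
merged_order = sorted(names, key=merged)
```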

    Best,
    C. E. Whitehead
    cewcathar@hotmail.com
    > (and notably in
    > implementations of UCA with the French locale, in database engines for
    > building their index and for supporting the « ORDER BY » clause and
    > text compare operators like >, <, >=, <=, and « BETWEEN...AND », and
    > aggregates like « MIN() » and « MAX() », and operators based on text
    > similarity such as =, !=, and « LIKE »).
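    As a small illustration of how a database engine can delegate ORDER BY to a collation, the sketch below registers a toy comparison function with SQLite through Python's sqlite3 module. The collation name french_toy, the toy accent weights and the helper names are assumptions for the example; a real deployment would use a full UCA implementation rather than this two-level toy.

```python
import sqlite3
import unicodedata

# Toy secondary weights for illustration only; these are NOT DUCET values.
COMMON = 1
ACCENTS = {"\u0301": 2, "\u0300": 3, "\u0302": 4}        # acute, grave, circumflex


def french_key(text):
    """Two-level toy key: primary weights, then secondary weights reversed."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch in ACCENTS:
            secondary.append(ACCENTS[ch])
        elif ch.isalpha():
            primary.append(ord(ch.lower()))
            secondary.append(COMMON)
    return (primary, secondary[::-1])


def french_compare(a, b):
    """Three-way comparison in the form SQLite expects from a collation."""
    ka, kb = french_key(a), french_key(b)
    return (ka > kb) - (ka < kb)


def order_by_french(words):
    """Load `words` into an in-memory table and read them back in collation order."""
    conn = sqlite3.connect(":memory:")
    conn.create_collation("french_toy", french_compare)
    conn.execute("CREATE TABLE words (w TEXT)")
    conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])
    rows = conn.execute("SELECT w FROM words ORDER BY w COLLATE french_toy")
    result = [row[0] for row in rows]
    conn.close()
    return result


ordered = order_by_french(["côté", "cote", "coté", "côte"])
```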

    > Philippe.