RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jul 13 2010 - 09:20:10 CDT

Next message: announcements@unicode.org: "Unicode 5.0 now in Chinese"

Previous message: Jeroen Ruigrok van der Werven: "Re: Bengali Script"
Maybe in reply to: Philippe Verdy: "UTS#10 (collation) : French backwards level 2, and word-breakers."
Next in thread: CE Whitehead: "RE: UTS#10 (collation) : French backwards level 2, and word-breakers."

> From: "CE Whitehead" <cewcathar@hotmail.com>
> To: verdy_p@wanadoo.fr, unicode@unicode.org
> Cc:
> Subject: RE: UTS#10 (collation) : French backwards level 2, and word-breakers.
>
>
> Hi, I am sort of confused; so there is no way now to put some of the weights in reverse order at the secondary level while skipping word boundaries?
> Philippe Verdy's suggestion seems reasonable, in general; however I think that not reversing the weights at word boundaries at level 2 should be simply an option for French; also I do believe that there is already a way to identify word boundaries at the primary level in the DUCET but I may be wrong -- that is the characters that define word boundaries, non-spacing characters, white space, are defined.
>
> So is the point then to define all word separators -- whether in the form of white space, a mandatory line break, etc. -- with a single weight in the DUCET? (Sorry to be so confused.)

There's already a standard annex covering word boundaries and other
boundaries: lines, sentences, default grapheme clusters (including
ZWJ and ZWNJ, the 8 Thai/Lao prepended characters, and sequences
that include double diacritics)... plus the combining sequence
boundaries that are part of the core standard for normalization.

Word boundaries are among the simplest boundaries to compute, at
least for alphabetic scripts (this is certainly more complex for East
Asian scripts, but for those same scripts these boundaries are not
very useful for collation purposes).
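
As a rough illustration of how little is needed for alphabetic scripts,
here is a minimal Python sketch. It is only an approximation and not a
conformant implementation of the standard word-boundary rules: the
function name rough_words and the regular expression are assumptions
made for this example, and it only keeps an apostrophe between letters
attached to the word.

```python
import re
import unicodedata

# Runs of letters/digits, optionally joined by an apostrophe between them.
# [^\W_] means "a word character other than underscore".
WORD = re.compile(r"[^\W_]+(?:['\u2019][^\W_]+)*")

def rough_words(text):
    """Return (start, end, word) spans for a rough word split.

    Offsets refer to the NFC-normalized text, because combining marks
    would otherwise interrupt a run of letters.
    """
    nfc = unicodedata.normalize("NFC", text)
    return [(m.start(), m.end(), m.group()) for m in WORD.finditer(nfc)]

print([w for _, _, w in rough_words("La côte d'Azur, en été.")])
# ['La', 'côte', "d'Azur", 'en', 'été']
```

Anything beyond this (East Asian scripts, numbers with separators,
language-specific tailorings) needs the real boundary rules.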

No need to reinvent the wheel specifically for UTS#10 collation, which
already needs the default grapheme cluster boundaries as the smallest
boundaries (each spanning entirely one or more combining sequences),
possibly extended to cover multiple default grapheme clusters in
language-specific clusters (for M-to-1 and M-to-N weight mappings).

Yes, I also know that isolated combining characters may receive
their own collation weights, because they are not necessarily combined
within M-to-1 or M-to-N weight mappings.

But the UCA and the DUCET are built to make sure that the result
will be consistent within at least the default grapheme clusters,
independently of language-specific tailorings (but I'm not sure that
the UCA algorithm addresses all the consistency issues to make sure
that this will be true for all the default grapheme clusters,
including in tailorings, when only the combining sequence boundaries
are really secured).

UTS#10 also helps us define the "non-default" grapheme
clusters perceived in various languages. It is really a complement to
the existing UAX for boundaries, one that goes beyond just the purpose
of sorting and can be used even without considering any collation
weights and independently of collation levels. For example, these
boundaries can be used in full-text indexing, in spelling checkers,
and in semantic analysis of encoded texts, and they may also help
enhance the usability of text editors, or text selection/extraction
in browsers.

As backwards levels were apparently introduced in UCA specifically
for French collation, their design really forgot one aspect of
French collation: the reversal is only wanted within single words (the
most significant secondary differences of accents are to be found at
the end of each word separately, not at the end of texts of arbitrary
length).
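
To make the difference concrete, here is a minimal Python sketch that
contrasts reversing the secondary weights over the whole string with
reversing them inside each word. It is not the UCA: the weights are a
toy scheme (base letter as primary weight, combining marks as secondary
weight), words are split on single spaces only, and the names
toy_weights, key_whole_string and key_per_word are invented for this
example.

```python
import unicodedata

def toy_weights(text):
    """Return (primary, secondary) weight lists under a toy scheme.

    primary: the base characters; secondary: for each base character,
    a tuple of its combining marks (empty tuple = unaccented).
    """
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text.lower()):
        if not unicodedata.combining(ch):
            primary.append(ch)
            secondary.append(())
        elif secondary:
            secondary[-1] += (ord(ch),)  # attach accent to previous base
    return primary, secondary

def key_whole_string(text):
    """Backwards level 2 applied to the entire string."""
    primary, secondary = toy_weights(text)
    return primary, secondary[::-1]

def key_per_word(text):
    """Backwards level 2 applied separately inside each word."""
    primary, secondary = [], []
    for word in text.split(" "):
        p, s = toy_weights(word)
        primary += p + [" "]         # word separator at level 1
        secondary += s[::-1] + [()]  # separator carries no accent
    return primary, secondary

phrases = ["côte rose", "cote rosé"]
print(sorted(phrases, key=key_whole_string))  # côte rose, then cote rosé
print(sorted(phrases, key=key_per_word))      # cote rosé, then côte rose
```

With the whole-string reversal, the accent at the very end of the text
decides first; with the per-word reversal, the first word is settled
before the second one is looked at. For single words both keys give
the same order (cote, côte, coté, côté).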

And even if Kenneth at Sybase thinks that this would complicate things
or slow down collation, my own experience demonstrates just the
opposite, exactly for French collation. He recognizes himself that
French UCA collation is slow, but the cause of this slowness is that
word boundaries were forgotten in an algorithm that already
considers smaller boundaries; he also admits that a small input
buffer will be needed for correct handling of normalization,
canonical equivalence of results, and Unicode process conformance.
Handling word boundaries has absolutely no impact on the effective
performance of collations without a backwards level (including the
default locale-neutral "root" collation directly derived from the
DUCET).
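
The buffering argument can be sketched as follows. This measures
nothing and is not any existing implementation; it only illustrates,
with toy accent weights and invented names (reversed_accents_by_word,
compare_secondary), that a per-word backwards level needs a buffer of
one word and can stop at the first word that differs, whereas a
whole-string reversal has to reach the end of both texts first.

```python
import re
import unicodedata
from itertools import zip_longest

def reversed_accents_by_word(text):
    """Lazily yield, for each space-delimited word, its accent weights
    in reverse order (toy weight: tuple of combining marks per base)."""
    for match in re.finditer(r"[^ ]+", text):
        accents = []
        for ch in unicodedata.normalize("NFD", match.group()):
            if not unicodedata.combining(ch):
                accents.append(())
            elif accents:
                accents[-1] += (ord(ch),)
        yield accents[::-1]  # only this one word is ever buffered

def compare_secondary(a, b):
    """Compare two strings at the toy secondary level, word by word.

    Returns (sign, words_examined). Assumes the primary level has
    already compared equal.
    """
    examined = 0
    pairs = zip_longest(reversed_accents_by_word(a),
                        reversed_accents_by_word(b), fillvalue=[])
    for wa, wb in pairs:
        examined += 1
        if wa != wb:
            return (-1 if wa < wb else 1), examined
    return 0, examined

tail = " mot" * 1000
print(compare_secondary("cote" + tail, "côte" + tail))  # (-1, 1)
```

Here the comparison is decided after one word, although both strings
are about a thousand words long.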

-- Philippe.

This archive was generated by hypermail 2.1.5 : Tue Jul 13 2010 - 09:25:34 CDT