RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (
Date: Thu Jul 15 2010 - 01:59:58 CDT

    "CE Whitehead" <> wrote:
    > At the same time, as I think about it, I can think of no proper names in French where the words differed only by level 2 accents* -- although I am not French.

    The most common case is not within proper names, but within the
    various forms of conjugated verbs and their past participles in the
    first conjugation group.

    Backwards ordering comes from French phonology, which places stress
    on the final pronounced syllable (ignoring additional mute 'e' and
    mute consonants at the ends of words, so effectively ignoring some
    orthographic syllable boundaries). This allows dictionaries to sort
    together (and present within the same article) many grammatical
    variations of the same word, according to how they are effectively
    pronounced. Nothing similar is used in English, whose orthography is
    only very weakly transparent, so English simply uses the default
    ordering of orthographic base letters regardless of their role in
    the phonology.
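
To illustrate, here is a toy sketch (the weight assignment is invented for the demo and is not real UCA/DUCET data): the classic quadruple « cote, côte, coté, côté » only reaches French dictionary order when the secondary (accent) differences are compared starting from the END of the word:

```python
import unicodedata

def french_key(word, backwards=False):
    # Toy two-level key: primary = base letters, secondary = accent weights.
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            accents[-1] = ord(ch)  # weight of the accent carried by the letter
        else:
            base.append(ch)
            accents.append(0)      # unaccented letter
    if backwards:
        accents.reverse()          # French rule: compare the LAST accent first
    return (base, accents)

words = ["cote", "côte", "coté", "côté"]
print(sorted(words, key=lambda w: french_key(w, backwards=False)))
# ['cote', 'coté', 'côte', 'côté']  (forward: NOT dictionary order)
print(sorted(words, key=lambda w: french_key(w, backwards=True)))
# ['cote', 'côte', 'coté', 'côté']  (backwards: French dictionary order)
```

The primary keys of all four words are identical ("cote"), so the secondary level alone decides, and the direction in which it is scanned changes the result.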

    The phonology of a language is so significant within the « natural »
    ordering/collation of its words that, when the orthography is
    sufficiently transparent, spans of letters used as digraphs/trigraphs
    are treated as a single letter (this occurs in Spanish, for example,
    because its orthography is strongly transparent).

    Ideally, in French, we should also sort the '-er' termination of verbs
    in the same collation group as '-é', '-ée', '-ées', after the
    collation group for '-e', '-es', '-ent'; but this is not done in
    French because there are too many exceptions (it would require
    semantic knowledge, a requirement too complicated in most cases for a
    collation/sort algorithm, but something that is actually used in
    French dictionaries, which also consider other regularly conjugated
    forms, as well as other regular variations of nouns and adjectives
    that are collated together in the same group sharing the same stem).
    Because the stress is on the final syllable, any difference of accents
    in this syllable takes on much greater importance in collation than
    variations of accents on prior syllables. This is the MAJOR
    justification of why backwards ordering of level 2 differences should
    occur ONLY within word boundaries (separate words carry their own
    separate stress on important vowels that are clearly differentiated
    by their accent).

    And you should also note that backwards ordering of differences will
    be needed only for very few cases of accented letters in French
    that exhibit significant differences:
    - /ô$/ versus /o$/ :
      open versus close 'o', in very few word pairs like « môle » vs. « mole ».
    - /e(s|nt)?$/ versus /( e[s]?[t]?[s]? | [èê][stz][s]? )$/ versus
    /(é[e]?[s]?)$/ :
      open versus mid-close versus closed versus mute 'e'.

    A pure lexical analysis cannot completely disambiguate the phonology
    (for example the '-ent' finals, which are mute in verbs but nasalized
    in adverbs). And the first case above is now often insignificant for
    many modern French speakers, as is the difference between mid-closed
    and closed 'e', so this just leaves the case of differentiating the
    closed 'é' from the open 'è'/'ê' (which also includes the unaccented
    'e' in some foreign words like toponyms, or in French words where 'e'
    appears before a non-mute final consonant).
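
A rough sketch of how such ending classes could be tested (the regular expressions here are simplified stand-ins for the patterns above, not a complete phonological classifier for French):

```python
import re

# Simplified stand-ins for the ending classes discussed above (assumptions
# for illustration, not a full phonological analysis):
FINAL_E_CLOSED = re.compile(r"é(e|es|s)?$")  # closed 'é' finals: -é, -ée, -és, -ées
FINAL_E_MUTE = re.compile(r"e(s|nt)?$")      # mute 'e' finals: -e, -es, -ent

for w in ["chanté", "chantées", "chante", "chantent"]:
    if FINAL_E_CLOSED.search(w):
        kind = "closed 'é'"
    elif FINAL_E_MUTE.search(w):
        kind = "mute 'e'"
    else:
        kind = "other"
    print(w, "->", kind)
```

As the post notes, the '-ent' case remains ambiguous without semantic knowledge (mute in verbs, nasalized in adverbs), so a lexical classifier like this can only approximate the phonology.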

    In summary, only one letter is really concerned by a significant
    differentiation by its accent, and this single letter is 'é', with its
    acute accent, which creates a major difference with all other forms of
    the letter 'e'. All other French accents are unimportant (and the
    circumflex is already progressively disappearing from the orthography
    above 'i' and 'u' in most cases, just because it is NEVER pronounced
    by itself and makes no additional semantic difference in written
    French).
    When you consider this, backwards processing will only really be
    needed in a word if there's a 'é' (E WITH ACUTE) as the last vowel
    within that word, and if this differentiates it from the same word
    (ignoring case) without accents. And almost all of these frequent
    cases occur with French verbs in the very large 1st group, and with
    their past participles, which may also be used separately as
    adjectives.

    This clearly limits the interest of backwards processing of level 2,
    given that the other accents may just as well be sorted in the
    forward direction. Do you see what I mean?

    It's that you can treat it very easily using ONLY FORWARD levels, by
    assigning a higher collation weight on the PRIMARY level for the
    letter 'é' or 'É' ONLY when it occurs at the end of a word (possibly
    followed by 's' or 'es' before the word separation). This just
    requires a small look-ahead of at most 3 characters to detect the end
    of words.

    In summary, for French, it will just be enough to write a primary
    level tailoring like:

      /e/ < /é(e|es|s)?$/ < 'f'

    exactly as if this final 'é' were a digraph (similar to the special
    level-1 tailoring of 'ch' as a distinct base letter in Spanish), and
    where "$" means a word-end boundary, detected using UAX #29 (which
    also uses forward processing with a small look-ahead length).
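
A minimal sketch of that primary-level tailoring (the weight scheme is invented for illustration, and a regex word boundary stands in for a real UAX #29 word breaker; a real implementation would tailor DUCET weights instead):

```python
import re

# Word-final 'é', possibly followed by 'e', 'es' or 's' — needs at most
# 3 characters of look-ahead past the 'é' (\b is a crude word-end test).
FINAL = re.compile(r"é(e|es|s)?\b")

def primary_key(word):
    # Toy primary weights: ('e', 1) slots the final-'é' "digraph" strictly
    # between plain 'e' ('e', 0) and 'f' ('f', 0).
    key = []
    for i, ch in enumerate(word):
        if ch == "é" and FINAL.match(word, i):
            key.append(("e", 1))       # word-final 'é': new primary slot
        elif ch == "é":
            key.append(("e", 0))       # non-final 'é': same primary as 'e'
        else:
            key.append((ch, 0))
    return key

words = ["chante", "chanté", "chantes", "chantés", "chanter", "chantez"]
print(sorted(words, key=primary_key))
# ['chante', 'chanter', 'chantes', 'chantez', 'chanté', 'chantés']
```

Note how the whole '-é'/'-és' group sorts after '-e', '-es', '-ez', '-er', exactly as the tailoring rule /e/ < /é(e|es|s)?$/ < 'f' prescribes, with purely forward processing.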

    And then you can safely use ONLY forward processing for French
    collation at ALL collation levels, using the SAME and very small
    look-ahead input buffer:

    - for standard Unicode normalizations (NFC/NFD) of combining
    characters (required for Unicode process conformance).

    - for standard UAX#29 combining sequences (grouping also sequences
    joined by ZWJ and optionally terminated by ZWNJ).

    - for standard UAX#29 grapheme cluster boundaries (not necessarily the
    "extended" boundaries, so you'll not include prepended Thai/Lao
    vowels).
    - for standard UAX#29 word boundaries (not necessarily the "extended"
    boundaries, so you'll not include trailing whitespace), and only if
    collating at level 2 or more (the additional level-1 tailoring above
    may be dropped when collation has been limited to level 1 only).

    - for other UCA tailorings using M-to-1 or M-to-N collation mappings.

    - and finally to buffer ONLY words when comparing strings level by
    level, and only when comparing pairs of texts. This buffer will be
    cleared completely at each word boundary (for French, this means that
    the input buffer will contain AT MOST 25 characters).

    - if you are generating a collation string from a single input text,
    this last step is avoided, but you keep the need for the two previous
    steps; in fact, for French, it reduces to an input buffer with a
    look-ahead of only 4 UTF-16 code units, or 4 code points.
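
The word-buffered comparison in the last steps can be sketched as follows (a toy two-level key replaces real UCA weights, and a crude regex stands in for the UAX #29 word breaker — both are assumptions for illustration):

```python
import re
import unicodedata

WORD = re.compile(r"\w+|\s+")  # crude stand-in for a UAX #29 word breaker

def word_key(word):
    # Toy two-level key: primary = base letters, secondary = accent weights.
    nfd = unicodedata.normalize("NFD", word.casefold())
    primary = "".join(c for c in nfd if not unicodedata.combining(c))
    secondary = tuple(ord(c) for c in nfd if unicodedata.combining(c))
    return (primary, secondary)

def compare_texts(a, b):
    # All levels are resolved word by word: each iteration needs only one
    # word of each input in memory, and the buffer is cleared after it.
    for wa, wb in zip(WORD.findall(a), WORD.findall(b)):
        ka, kb = word_key(wa), word_key(wb)
        if ka != kb:
            return -1 if ka < kb else 1
    return 0 if len(a) == len(b) else (-1 if len(a) < len(b) else 1)

print(compare_texts("côte atlantique", "cote atlantique"))  # 1
print(compare_texts("abc", "abc"))                          # 0
```

The point of the sketch is the buffering discipline, not the toy weights: however long the two texts are, the live buffer never holds more than one word of each.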

    Is that clearer? In other words, backwards processing is not even
    needed for French, as we can be much smarter and produce the correct
    intended sort order without it (but ONLY if a word-break is inserted
    in the algorithm, something that will not add any significant cost).

    And this demonstrates that Kenneth at Sybase WAS CLEARLY WRONG when
    he affirmed (without testing the solution, or without understanding
    it) that adding a UAX #29 breaker to the UCA algorithm would reduce
    performance and greatly complicate the UCA algorithm (which ALREADY
    needs an input look-ahead buffer, something he incorrectly implied
    was unnecessary when he said that we don't need any buffer, assuming
    that one would be needed only to generate sort keys before comparing
    pairs of text), when in fact it will greatly increase the performance
    of French collation and will also greatly improve data locality
    (thanks to the MUCH smaller input look-ahead).

    And it will have absolutely no impact on other languages that don't
    need the detection of word boundaries (something that remains to be
    demonstrated, even if the case of French has been accepted but
    "solved" using the wrong tool!).

    I want to expose the details of my solution (which is certainly highly
    preferable to the existing "backwards" trick) both to the UTC members
    and to those currently participating in the ISO technical working
    group preparing an international standard on collation.

    But I also want to make another important note about the very high
    utility of standard UAX#29 segmentations in the collation of ALL
    locales (which cannot safely ignore them completely):

    Think about it again: a word-breaker (or at least a grapheme cluster
    breaker, where word breaks make no difference) will also greatly
    improve the performance of other languages (including English!) when
    comparing pairs of texts (all the needed levels will be treated word
    by word, before going on to the next word, when comparing pairs of
    arbitrarily long texts).
    It will also be usable when generating collation strings from an
    arbitrarily long text. Suppose that you have to generate a level-2
    collation string from an arbitrarily long text containing words like
       "Aaaa Bbbbbb ... Zzzzzz"
    - the word boundaries will be between «Aaaaa», « », «Bbbbbb», « »,
    ..., «Zzzzzz»;
    - let's say that the word «aaaaa» generates higher collation weight
    values «AAAAA» at the primary level, and lower collation weight
    values «aaaaa» at the secondary level;
    - let's say that the space separators (detected as separate UAX#29
    segments) generate higher collation weight values «_» at the primary
    level, and lower collation weight values « » at the secondary level;
    - the usual UCA collation string is built like this, requiring the
    whole input text to be kept in the buffer:
       "AAAAA_BBBB_..._ZZZZaaaaa bbbbb ... zzzz"
    - but you can just as well build the collation string like this,
    where the buffer only needs to keep one UAX#29 segment (a word, for
    example):
       "AAAAAaaaaa_ BBBBbbbbb_ ......_ ZZZZzzzz"

    In other words, the resulting collation string has exactly the same
    length, but you absolutely don't need an arbitrarily long input
    buffer to generate a correct collation string, because the existing
    UAX#29 breaks offer excellent opportunities to generate immediately
    the binary serialization of all levels between each boundary, and
    allow you to clear that input buffer completely of all characters
    from that segment of text (you just keep the unprocessed look-ahead
    characters that may have been read, restored back into that buffer,
    as they are part of the next segment; that's why the input buffer
    should probably be circular, even if it's resizable, in case there's
    ever a longer segment).

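
The segment-by-segment key construction can be sketched like this (with invented toy weights — primary = upper-cased base letters, secondary = lower-cased — standing in for real UCA weights, and a crude regex standing in for the UAX #29 breaker):

```python
import re
import unicodedata

WORD = re.compile(r"\w+|\s+")  # crude stand-in for a UAX #29 word breaker

def segment_weights(seg):
    # Toy weights for the demo: primary = upper-cased base letters,
    # secondary = lower-cased base letters (real UCA weights differ).
    nfd = unicodedata.normalize("NFD", seg)
    base = "".join(c for c in nfd if not unicodedata.combining(c))
    return base.upper(), base.lower()

def collation_string(text):
    # Emit ALL levels of each segment before reading the next one, so the
    # input buffer never needs to hold more than one segment at a time.
    out = []
    for seg in WORD.findall(text):
        primary, secondary = segment_weights(seg)
        out.append(primary + secondary)
    return "".join(out)

print(collation_string("Aaaa Bbbb"))  # AAAAaaaa  BBBBbbbb
```

The resulting string interleaves each segment's levels ("AAAAaaaa" then "BBBBbbbb") instead of serializing each level across the whole text, which is what removes the need for an unbounded input buffer.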
    For English, which ONLY needs the smallest « grapheme cluster »
    boundaries when some accents or foreign characters occur, the input
    buffer needed to generate all the binary weights for each successive
    level will not very frequently go beyond 2 code points, and in fact
    for most texts it will need to keep only 1 character of look-ahead
    (it may grow slightly larger only when using the DUCET to treat all
    non-Basic-Latin Unicode characters, notably some rare Vietnamese,
    Hangul and Indic default grapheme clusters).

    My solution is general enough that each language can specify the
    level of UAX#29 segmentation it needs to process all possible input
    texts, as a parameter within its UCA tailoring specifications,
    because this is the only thing that will really influence (along with
    existing M-to-1 and M-to-N mappings in the tailoring rules) the
    effective buffer length needed for the look-ahead.

    For English, or for the ROOT locale implied by the Unicode DUCET (and
    used as a reference within the CLDR project), the default UAX#29
    break level to be used within UCA will just be the « default grapheme
    cluster » boundaries.

    Note that the UCA algorithm ALREADY requires the detection of AT
    LEAST the « default grapheme cluster » boundaries (you should not use
    any finer level): with a smaller segmentation level, it would not be
    safe to use the algorithm if the input text is not fully normalized,
    and it would not be safe to pass over ZWJ and ZWNJ as if they were in
    segments to be collated separately from the rest of the combining
    sequences.
    -- Philippe.

    This archive was generated by hypermail 2.1.5 : Thu Jul 15 2010 - 02:08:39 CDT