RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (
Date: Thu Jul 15 2010 - 01:59:58 CDT

    "CE Whitehead" <> wrote:
    > At the same time, as I think about it, I can think of no proper names in French where the words differed only by level 2 accents* -- although I am not French.

    The most common case is not within proper names, but within the
    various forms of conjugated verbs and their past participles in the
    first conjugation group.

    Backwards ordering comes from French phonology, which places stress
    on the final pronounced syllable (ignoring additional mute 'e' and
    mute consonants at the ends of words, so effectively ignoring some
    orthographic syllable boundaries). This allows dictionaries to sort
    together (and present within the same article) many grammatical
    variations of the same word, according to how they are effectively
    pronounced. Nothing similar is used in English, whose orthography is
    only very weakly transparent, so English simply uses the default
    ordering of orthographic base letters regardless of their role in
    the phonology.
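
To illustrate, here is a toy sketch (the weight assignment is invented for the demo and is not real UCA/DUCET data): the classic quadruple « cote, côte, coté, côté » only reaches French dictionary order when the secondary (accent) differences are compared starting from the END of the word:

```python
import unicodedata

def french_key(word, backwards=False):
    # Toy two-level key: primary = base letters, secondary = accent weights.
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            accents[-1] = ord(ch)  # weight of the accent carried by the letter
        else:
            base.append(ch)
            accents.append(0)      # unaccented letter
    if backwards:
        accents.reverse()          # French rule: compare the LAST accent first
    return (base, accents)

words = ["cote", "côte", "coté", "côté"]
print(sorted(words, key=lambda w: french_key(w, backwards=False)))
# ['cote', 'coté', 'côte', 'côté']  (forward: NOT dictionary order)
print(sorted(words, key=lambda w: french_key(w, backwards=True)))
# ['cote', 'côte', 'coté', 'côté']  (backwards: French dictionary order)
```

The primary keys of all four words are identical ("cote"), so the secondary level alone decides, and the direction in which it is scanned changes the result.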

    The phonology of a language is so significant within the « natural »
    ordering/collation of its words that, when the orthography is
    sufficiently transparent, spans of letters used as digraphs/trigraphs
    are treated as a single letter (this occurs in Spanish, for example,
    because its orthography is strongly transparent).

    Ideally, in French, we should also sort the '-er' termination of verbs
    in the same collation group as '-é', '-ée', '-ées', after the
    collation group for '-e', '-es', '-ent'; but this is not done in
    French because there are too many exceptions (it would require
    semantic knowledge, a requirement too complicated in most cases for a
    collation/sort algorithm, but something that is actually used in
    French dictionaries, which also consider other regularly conjugated
    forms, as well as other regular variations of nouns and adjectives
    that are collated together in the same group sharing the same stem).
    Because the stress is on the final syllable, any difference of accents
    in this syllable takes on much greater importance in collation than
    variations of accents on prior syllables. This is the MAJOR
    justification of why backwards ordering of level 2 differences should
    occur ONLY within word boundaries (separate words carry their own
    separate stress on important vowels that are clearly differentiated
    by their accent).

    And you should also note that backwards ordering of differences will
    be needed only for very few cases of accented letters in French
    that exhibit significant differences:
    - /ô$/ versus /o$/ :
      open versus close 'o', in very few word pairs like « môle » vs. « mole ».
    - /e(s|nt)?$/ versus /( e[s]?[t]?[s]? | [èê][stz][s]? )$/ versus
    /(é[e]?[s]?)$/ :
      open versus mid-close versus closed versus mute 'e'.

    A pure lexical analysis cannot completely disambiguate the phonology
    (for example the '-ent' finals, which are mute in verbs but nasalized
    in adverbs). And the first case above is now often insignificant for
    many modern French speakers, as is the difference between mid-closed
    and closed 'e', so this just leaves the case of differentiating the
    closed 'é' from the open 'è'/'ê' (which also includes the unaccented
    'e' in some foreign words like toponyms, or in French words where 'e'
    appears before a non-mute final consonant).
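
A rough sketch of how such ending classes could be tested (the regular expressions here are simplified stand-ins for the patterns above, not a complete phonological classifier for French):

```python
import re

# Simplified stand-ins for the ending classes discussed above (assumptions
# for illustration, not a full phonological analysis):
FINAL_E_CLOSED = re.compile(r"é(e|es|s)?$")  # closed 'é' finals: -é, -ée, -és, -ées
FINAL_E_MUTE = re.compile(r"e(s|nt)?$")      # mute 'e' finals: -e, -es, -ent

for w in ["chanté", "chantées", "chante", "chantent"]:
    if FINAL_E_CLOSED.search(w):
        kind = "closed 'é'"
    elif FINAL_E_MUTE.search(w):
        kind = "mute 'e'"
    else:
        kind = "other"
    print(w, "->", kind)
```

As the post notes, the '-ent' case remains ambiguous without semantic knowledge (mute in verbs, nasalized in adverbs), so a lexical classifier like this can only approximate the phonology.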

    In summary, only one letter is really concerned by a significant
    differentiation by its accent, and this single letter is 'é', with its
    acute accent, which creates a major difference with all other forms of
    the letter 'e'. All other French accents are unimportant (and the
    circumflex is already progressively disappearing from the orthography
    above 'i' and 'u' in most cases, just because it is NEVER pronounced
    by itself and makes no additional semantic difference in written
    French).
    When you consider this, backwards processing will only really be
    needed in a word if there's a 'é' (E WITH ACUTE) as the last vowel
    within that word, and if this differentiates it from the same word
    (ignoring case) without accents. And almost all of these frequent
    cases occur with French verbs in the very large 1st group, and with
    their past participles, which may also be used separately as
    adjectives.

    This clearly limits the interest of backwards processing of level 2,
    given that the other accents may just as well be sorted in the
    forward direction. Do you see what I mean?

    It's that you can treat it very easily using ONLY FORWARD levels, by
    assigning a higher collation weight on the PRIMARY level for the
    letter 'é' or 'É' ONLY when it occurs at the end of a word (possibly
    followed by 's' or 'es' before the word separation). This just
    requires a small look-ahead of at most 3 characters to detect the end
    of words.

    In summary, for French, it will just be enough to write a primary
    level tailoring like:

      /e/ < /é(e|es|s)?$/ < 'f'

    exactly as if this final 'é' were a digraph (similar to the special
    level-1 tailoring of 'ch' as a distinct base letter in Spanish), and
    where "$" means a word-end boundary, detected using UAX #29 (which
    also uses forward processing with a small look-ahead length).
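
A minimal sketch of that primary-level tailoring (the weight scheme is invented for illustration, and a regex word boundary stands in for a real UAX #29 word breaker; a real implementation would tailor DUCET weights instead):

```python
import re

# Word-final 'é', possibly followed by 'e', 'es' or 's' — needs at most
# 3 characters of look-ahead past the 'é' (\b is a crude word-end test).
FINAL = re.compile(r"é(e|es|s)?\b")

def primary_key(word):
    # Toy primary weights: ('e', 1) slots the final-'é' "digraph" strictly
    # between plain 'e' ('e', 0) and 'f' ('f', 0).
    key = []
    for i, ch in enumerate(word):
        if ch == "é" and FINAL.match(word, i):
            key.append(("e", 1))       # word-final 'é': new primary slot
        elif ch == "é":
            key.append(("e", 0))       # non-final 'é': same primary as 'e'
        else:
            key.append((ch, 0))
    return key

words = ["chante", "chanté", "chantes", "chantés", "chanter", "chantez"]
print(sorted(words, key=primary_key))
# ['chante', 'chanter', 'chantes', 'chantez', 'chanté', 'chantés']
```

Note how the whole '-é'/'-és' group sorts after '-e', '-es', '-ez', '-er', exactly as the tailoring rule /e/ < /é(e|es|s)?$/ < 'f' prescribes, with purely forward processing.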

    And then you can safely use ONLY forward processing for French
    collation at ALL collation levels, using the SAME and very small
    look-ahead input buffer:

    - for standard Unicode normalizations (NFC/NFD) of combining
    characters (required for Unicode process conformance).

    - for standard UAX#29 combining sequences (grouping also sequences
    joined by ZWJ and optionally terminated by ZWNJ).

    - for standard UAX#29 grapheme cluster boundaries (not necessarily the
    "extended" boundaries, so you'll not include prepended Thai/Lao
    vowels).
    - for standard UAX#29 word boundaries (not necessarily the "extended"
    boundaries, so you'll not include trailing whitespace), and only if
    collating at level 2 or more (the additional level-1 tailoring above
    may be dropped when collation has been limited to level 1 only).

    - for other UCA tailorings using M-to-1 or M-to-N collation mappings.

    - and finally to buffer ONLY words when comparing strings level by
    level, and only when comparing pairs of texts. This buffer will be
    cleared completely at each word boundary (for French, this means that
    the input buffer will contain AT MOST 25 characters).

    - if you are generating a collation string from a single input text,
    this last step is avoided, but you keep the need for the two previous
    steps; in fact, for French, it reduces to an input buffer with a
    look-ahead of only 4 UTF-16 code units, or 4 code points.
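
The word-buffered comparison in the last steps can be sketched as follows (a toy two-level key replaces real UCA weights, and a crude regex stands in for the UAX #29 word breaker — both are assumptions for illustration):

```python
import re
import unicodedata

WORD = re.compile(r"\w+|\s+")  # crude stand-in for a UAX #29 word breaker

def word_key(word):
    # Toy two-level key: primary = base letters, secondary = accent weights.
    nfd = unicodedata.normalize("NFD", word.casefold())
    primary = "".join(c for c in nfd if not unicodedata.combining(c))
    secondary = tuple(ord(c) for c in nfd if unicodedata.combining(c))
    return (primary, secondary)

def compare_texts(a, b):
    # All levels are resolved word by word: each iteration needs only one
    # word of each input in memory, and the buffer is cleared after it.
    for wa, wb in zip(WORD.findall(a), WORD.findall(b)):
        ka, kb = word_key(wa), word_key(wb)
        if ka != kb:
            return -1 if ka < kb else 1
    return 0 if len(a) == len(b) else (-1 if len(a) < len(b) else 1)

print(compare_texts("côte atlantique", "cote atlantique"))  # 1
print(compare_texts("abc", "abc"))                          # 0
```

The point of the sketch is the buffering discipline, not the toy weights: however long the two texts are, the live buffer never holds more than one word of each.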

    Is that clearer? In other words, backwards processing is not even
    needed for French, as we can be much smarter and produce the correct
    intended sort order without it (but ONLY if a word-break is inserted
    in the algorithm, something that will not add any significant cost).

    And this demonstrates that Kenneth at Sybase WAS CLEARLY WRONG when
    he affirmed (without testing the solution, or without understanding
    it) that adding a UAX #29 breaker to the UCA algorithm would reduce
    performance and greatly complicate the UCA algorithm (which ALREADY
    needs an input look-ahead buffer, something he incorrectly implied
    was unnecessary when he said that we don't need any buffer, assuming
    that one would be needed only to generate sort keys before comparing
    pairs of text), when in fact it will greatly increase the performance
    of French collation and will also greatly improve data locality
    (thanks to the MUCH smaller input look-ahead).

    And it will have absolutely no impact on other languages that don't
    need the detection of word boundaries (something that remains to be
    demonstrated, even if the case of French has been accepted but
    "solved" using the wrong tool!).

    I want to expose the details of my solution (which is certainly highly
    preferable to the existing "backwards" trick) both to the UTC members
    and to those currently participating in the ISO technical working
    group preparing an international standard on collation.

    But I also want to make another important note about the very high
    utility of standard UAX#29 segmentations in the collation of ALL
    locales (which cannot safely ignore them completely):

    Think about it again: a word-breaker (or at least a grapheme cluster
    breaker, where word breaks make no difference) will also greatly
    improve the performance of other languages (including English!) when
    comparing pairs of texts (all the needed levels will be treated word
    by word, before going on to the next word, when comparing pairs of
    arbitrarily long texts).
    It will also be usable when generating collation strings from an
    arbitrarily long text. Suppose that you have to generate a level-2
    collation string from an arbitrarily long text containing words like
       "Aaaa Bbbbbb ... Zzzzzz"
    - the word boundaries will be between «Aaaaa», « », «Bbbbbb», « »,
    ..., «Zzzzzz»;
    - let's say that the word «aaaaa» generates higher collation weight
    values «AAAAA» at the primary level, and lower collation weight
    values «aaaaa» at the secondary level;
    - let's say that the space separators (detected as separate UAX#29
    segments) generate higher collation weight values «_» at the primary
    level, and lower collation weight values « » at the secondary level;
    - the usual UCA collation string is built like this, requiring the
    whole input text to be kept in the buffer:
       "AAAAA_BBBB_..._ZZZZaaaaa bbbbb ... zzzz"
    - but you can just as well build the collation string like this,
    where the buffer only needs to keep one UAX#29 segment (a word, for
    example):
       "AAAAAaaaaa_ BBBBbbbbb_ ......_ ZZZZzzzz"

    In other words, the resulting collation string has exactly the same
    length, but you absolutely don't need an arbitrarily long input
    buffer to generate a correct collation string, because the existing
    UAX#29 breaks offer excellent opportunities to generate immediately
    the binary serialization of all levels between each boundary, and
    allow you to clear that input buffer completely of all characters
    from that segment of text (you just keep the unprocessed look-ahead
    characters that may have been read, restored back into that buffer,
    as they are part of the next segment; that's why the input buffer
    should probably be circular, even if it's resizable, in case there's
    ever a longer segment).

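
The segment-by-segment key construction can be sketched like this (with invented toy weights — primary = upper-cased base letters, secondary = lower-cased — standing in for real UCA weights, and a crude regex standing in for the UAX #29 breaker):

```python
import re
import unicodedata

WORD = re.compile(r"\w+|\s+")  # crude stand-in for a UAX #29 word breaker

def segment_weights(seg):
    # Toy weights for the demo: primary = upper-cased base letters,
    # secondary = lower-cased base letters (real UCA weights differ).
    nfd = unicodedata.normalize("NFD", seg)
    base = "".join(c for c in nfd if not unicodedata.combining(c))
    return base.upper(), base.lower()

def collation_string(text):
    # Emit ALL levels of each segment before reading the next one, so the
    # input buffer never needs to hold more than one segment at a time.
    out = []
    for seg in WORD.findall(text):
        primary, secondary = segment_weights(seg)
        out.append(primary + secondary)
    return "".join(out)

print(collation_string("Aaaa Bbbb"))  # AAAAaaaa  BBBBbbbb
```

The resulting string interleaves each segment's levels ("AAAAaaaa" then "BBBBbbbb") instead of serializing each level across the whole text, which is what removes the need for an unbounded input buffer.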
    For English, which ONLY needs the smallest « grapheme cluster »
    boundaries when some accents or foreign characters occur, the input
    buffer needed to generate all the binary weights for each successive
    level will not very frequently go beyond 2 code points, and in fact
    for most texts it will need to keep only 1 character of look-ahead
    (it may grow slightly larger only when using the DUCET to treat all
    non-Basic-Latin Unicode characters, notably some rare Vietnamese,
    Hangul and Indic default grapheme clusters).

    My solution is general enough that each language can specify the
    level of UAX#29 segmentation it needs to process all possible input
    texts, as a parameter within its UCA tailoring specifications,
    because this is the only thing that will really influence (along with
    existing M-to-1 and M-to-N mappings in the tailoring rules) the
    effective buffer length needed for the look-ahead.

    For English, or for the ROOT locale implied by the Unicode DUCET (and
    used as a reference within the CLDR project), the default UAX#29
    break level to be used within UCA will just be the « default grapheme
    cluster » boundaries.

    Note that the UCA algorithm ALREADY requires the detection of AT
    LEAST the « default grapheme cluster » boundaries (you should not use
    any finer level): with a smaller segmentation level, it would not be
    safe to use the algorithm if the input text is not fully normalized,
    and it would not be safe to pass over ZWJ and ZWNJ as if they were in
    segments to be collated separately from the rest of the combining
    sequences.
    -- Philippe.

    This archive was generated by hypermail 2.1.5 : Thu Jul 15 2010 - 02:08:39 CDT