Re: accented Latin characters sort order, non-language dependant

From: Philippe Verdy (
Date: Mon Jul 10 2006 - 15:05:40 CDT

    From: "Jukka K. Korpela" <>
    >> When you have only "a few accented characters" you don't need
    >> to bother whether "â" or "ă" comes first.
    > Since they must inevitably appear in _some_ order and since there is no
    > compelling reason to put them in a particular order, there is little
    > reason not to use the order defined by the Unicode Collating Algorithm.

    If you consider Vietnamese vowels, the accents are of two types: those that change the phoneme and those that change the tone. Clearly the phonemes need to be grouped together and the tones ordered within each group, and this does not match the order specified in the DUCET. This is a compelling reason why the DUCET order may need to be tailored.
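    To make the Vietnamese case concrete, here is a minimal Python sketch of a two-level sort key, using only the standard `unicodedata` module. It is not the UCA or any CLDR tailoring. The alphabet order, the tone order (level, grave, hook above, tilde, acute, dot below) and the names `VI_ALPHABET`, `VI_TONE_RANK` and `vietnamese_key` are my own assumptions for illustration.

```python
import unicodedata

# Assumed Vietnamese alphabet order; each entry counts as a separate letter.
VI_ALPHABET = ["a", "ă", "â", "b", "c", "d", "đ", "e", "ê", "g", "h", "i",
               "k", "l", "m", "n", "o", "ô", "ơ", "p", "q", "r", "s", "t",
               "u", "ư", "v", "x", "y"]
VI_LETTER_RANK = {unicodedata.normalize("NFC", letter): rank
                  for rank, letter in enumerate(VI_ALPHABET)}

# Assumed tone order: level (no mark), grave, hook above, tilde, acute, dot below.
VI_TONE_RANK = {"\u0300": 1, "\u0309": 2, "\u0303": 3, "\u0301": 4, "\u0323": 5}


def vietnamese_key(word):
    """Return (letters, tones): letters compare first, tones only break ties."""
    letters, tones = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch in VI_TONE_RANK and letters:
            tones[-1] = VI_TONE_RANK[ch]   # tone mark: secondary level
        elif unicodedata.combining(ch) and letters:
            letters[-1] += ch              # circumflex, breve, horn: part of the letter
        else:
            letters.append(ch)             # new base letter, level tone so far
            tones.append(0)
    ranked = []
    for letter in letters:
        letter = unicodedata.normalize("NFC", letter)
        if letter in VI_LETTER_RANK:
            ranked.append((VI_LETTER_RANK[letter], ""))
        else:
            ranked.append((len(VI_ALPHABET), letter))  # unknown letters sort last
    return tuple(ranked), tuple(tones)


words = ["mạ", "mâ", "má", "ma", "mă", "mà"]
print(sorted(words, key=vietnamese_key))
```

    With this key, all tonal variants of "ma" stay together, in tone order, before "mă" and then "mâ", because the tone is only consulted when the letters are identical.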

    Also, there are traditional orderings in vocal alphabets that may originate from another traditional script; the vowel order and the consonant order are important for education, as they help in understanding how the language is spelled, especially for languages whose spelling really attempts to reproduce the spoken phonology.

    Once you realize that letters can be ordered as phonemes, you immediately realize that the Latin alphabet is not enough for many languages, and that some phonemes are written with additional diacritics, so the spelled order will also need to be preserved among diacritics, as far as possible, to preserve the tradition (and help education).

    In some languages, a letter with a diacritic will even be considered completely distinct from the letter without the diacritic, only to preserve the oral tradition (or the tradition coming from another historic script), meaning that diacritics may be ordered differently depending on the base letter to which they apply (for example depending on the vocalic or consonantal status of the base letter).

    And in more complex cases, the ordering of letters depends on their context, which changes their interpretation (notably in digraphs and trigraphs used to represent additional phonemes, or to mark historical transcriptions from another script or language that are not considered loanwords because they have a long tradition of use in the current language). Whether orthographic, phonetic or grammatical rules influence how the text is spelled is, however, independent of how letters are perceived in a language. In languages with a strong literary tradition, the spelling often preserves historic forms that are quite far from the modern pronunciation, so many complex orthographic exceptions occur, and written letters tend to be perceived differently from the spoken phonemes and tones, which reduces the number of letters considered for sort orders. This happened in French and English, whereas the Spanish 'ch' is a case where a written digraph is still perceived as one letter and sorted separately.
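    The Spanish 'ch' case can be sketched in Python as a contraction: the key function reads the digraph as one unit and ranks it after 'c'. This is only an illustration of the traditional ordering. The letter list (which also treats 'll' as a letter), the choice to ignore accented vowels, and the names `ES_LETTERS` and `traditional_spanish_key` are assumptions of mine.

```python
import unicodedata

# Assumed traditional Spanish letter order, with "ch" and "ll" as single letters.
ES_LETTERS = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
              "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
              "v", "w", "x", "y", "z"]
ES_RANK = {unicodedata.normalize("NFC", letter): rank
           for rank, letter in enumerate(ES_LETTERS)}


def traditional_spanish_key(word):
    """Split the word into letters, treating 'ch' and 'll' as one letter each."""
    text = unicodedata.normalize("NFC", word.lower())
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in ES_RANK:
            key.append((ES_RANK[pair], ""))      # digraph consumed as one letter
            i += 2
        elif text[i] in ES_RANK:
            key.append((ES_RANK[text[i]], ""))
            i += 1
        else:
            key.append((len(ES_LETTERS), text[i]))  # anything else sorts last
            i += 1
    return tuple(key)


print(sorted(["dama", "chico", "cuna", "cosa"], key=traditional_spanish_key))
print(sorted(["llama", "luz", "lima"], key=traditional_spanish_key))
```

    Here "chico" lands after "cuna" and "llama" after "luz", which a plain letter-by-letter comparison would not produce.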

    Consider also the case of Nordic languages: letters with diacritics are often sorted at the end of the alphabet, after the letters without accents, because they are considered letters in their own right. Where will you then place the other combinations of a base letter and a diacritic that are not among those letters at the end of the alphabet? There's no general solution, and the default Unicode order can't be the best solution for all cases!
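    As one possible answer among several, the following Python sketch takes Swedish as the example: å, ä and ö are ranked as full letters after z, while any other accented letter is folded onto its base letter and distinguished only at a secondary level. The folding policy is an arbitrary choice made for illustration, not a rule taken from the DUCET or from any national standard, and the names `SV_LETTERS` and `swedish_key` are hypothetical.

```python
import unicodedata

# Assumed Swedish alphabet: å, ä, ö are separate letters placed after z.
SV_LETTERS = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")
SV_RANK = {letter: rank for rank, letter in enumerate(SV_LETTERS)}


def swedish_key(word):
    """Return (primary, secondary): letters first, leftover diacritics as tie-break."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch in SV_RANK:
            primary.append(SV_RANK[ch])      # includes å, ä, ö as full letters
            secondary.append("")
        else:
            # Other combinations: fold onto the base letter (one possible policy).
            decomposed = unicodedata.normalize("NFD", ch)
            base, marks = decomposed[0], decomposed[1:]
            primary.append(SV_RANK.get(base, len(SV_RANK)))
            secondary.append(marks or ch)
    return tuple(primary), tuple(secondary)


print(sorted(["öl", "zebra", "ål", "äpple", "apa"], key=swedish_key))
print(sorted(["idé", "ide", "idag"], key=swedish_key))
```

    The é of "idé" is compared as e and only separates "idé" from "ide" when everything else is equal, whereas "ål", "äpple" and "öl" all come after "zebra". A different language, or a different tradition, would need a different table.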
