Re: character groupings in various languages

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 19 2003 - 00:25:53 EDT


    From: "Ben Dougall" <bend@freenet.co.uk>
    To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    Sent: Monday, May 19, 2003 12:18 AM
    Subject: Re: character groupings in various languages

    > On Saturday, May 17, 2003, at 02:47 pm, Philippe Verdy wrote:
    > > In my opinion, what you are looking for already exists (partly) in
    > > Unicode: this is exactly the definition of character properties.
    >
    > yup, character properties. Kenneth Whistler has pointed me towards
    > those, and the ucd - unicode character database.

    Don't look only in the UCD, because Unicode has also published other references, some of which are normative and others informative.
    For example, the UCA (Unicode Collation Algorithm) is normative and fixes the way a compliant collation can be done, but it does not explicitly mandate a particular order: it clearly allows many places for tailoring, and in some cases it does not describe or fix any ordering for some scripts, notably Han.

    So the "allkeys.txt" file linked in the UCA reference is mostly informative: none of its weights are normative (neither their absolute or relative values, nor even the distinction of weight levels). It was published as a framework to allow interchange of locale information built from small sets of tailoring rules applied to the default collation table (DUCET), instead of needing to define complete sets of weights for all Unicode characters for each locale.
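
    To make the idea of "default table plus a small tailoring" concrete, here is a minimal sketch in Python. It is NOT the UCA and does not read allkeys.txt: the function default_weights is a toy stand-in for a default table, and the names make_sort_key and SWEDISH_TAILORING are my own, chosen for illustration only. The tailoring assumes the usual Swedish convention that å, ä and ö sort after z.

```python
import unicodedata

def default_weights(ch):
    """Toy stand-in for a default table (an assumption, not DUCET):
    primary weight from the base letter, secondary from its marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    primary = ord(decomposed[0])
    secondary = sum(1 for c in decomposed[1:] if unicodedata.combining(c))
    return (primary, secondary)

def make_sort_key(tailoring=None):
    """Build a sort key from the toy default weights plus optional overrides."""
    tailoring = tailoring or {}

    def sort_key(word):
        # Work on NFC text so precomposed letters match the tailoring keys.
        text = unicodedata.normalize("NFC", word.lower())
        weights = [tailoring.get(ch, default_weights(ch)) for ch in text]
        # Compare all primary weights first, then all secondary weights.
        primaries = tuple(p for p, _ in weights)
        secondaries = tuple(s for _, s in weights)
        return (primaries, secondaries)

    return sort_key

# Hypothetical tailoring: only three rules, everything else is inherited.
SWEDISH_TAILORING = {
    "å": (ord("z") + 1, 0),
    "ä": (ord("z") + 2, 0),
    "ö": (ord("z") + 3, 0),
}

words = ["zebra", "äpple", "apa"]
default_order = sorted(words, key=make_sort_key())
swedish_order = sorted(words, key=make_sort_key(SWEDISH_TAILORING))
```

    The point of the sketch is only that the locale needs three rules rather than a complete table; a real collator would have to handle many more cases (contractions, expansions, more weight levels).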

    So the UCA reference only provides an extensive set of rules that may be needed to create correct collation orders consistent with the Unicode character model and text encoding model. All of these are needed for any compliant collator, but they are not necessarily sufficient.

    It is also true that the categories you are looking for are, strictly speaking, not related to ordering. Groups do not need to form a linear or a hierarchical classification; they may be better represented as a multidimensional classification in which each dimension is a property, either one already described in the Unicode normative or informative references, or one of your own.
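
    As a rough illustration of such a multidimensional classification, the sketch below uses Python's standard unicodedata module for three properties taken from the UCD and adds one user-defined dimension. The names profile, group_by and ENGLISH_VOWELS are hypothetical, and the "vowel" property is deliberately language-specific rather than anything Unicode defines.

```python
import unicodedata

# Hypothetical user-defined property: what one language treats as vowels.
ENGLISH_VOWELS = set("aeiou")

def profile(ch):
    """Return one value per dimension for a single character."""
    base = unicodedata.normalize("NFD", ch)[0].lower()
    return {
        "general_category": unicodedata.category(ch),    # from the UCD
        "combining_class": unicodedata.combining(ch),    # from the UCD
        "bidi_class": unicodedata.bidirectional(ch),     # from the UCD
        "is_vowel_in_english": base in ENGLISH_VOWELS,   # our own property
    }

def group_by(chars, *dimensions):
    """Group characters by any chosen combination of dimensions."""
    groups = {}
    for ch in chars:
        props = profile(ch)
        key = tuple(props[d] for d in dimensions)
        groups.setdefault(key, []).append(ch)
    return groups

# Example: the same characters grouped along two different axes.
by_category = group_by("aB1", "general_category")
by_vowel = group_by("aeb", "is_vowel_in_english")
```

    No single dimension implies an order here; the groups are just the cells of whatever combination of properties is selected.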

    Just one example: the classification into vowels and consonants only works for alphabetic scripts, not for syllabaries, and not even for lettered scripts like Hebrew, Arabic or the Brahmic scripts, because:
    1) in most cases vowels are treated as modifiers, like diacritics
    2) this classification is traditionally ignored even in almost all languages written with alphabets, which have a strong cultural preference for a traditional order
    3) the distinction between vowels, semivowels and consonants in alphabetic scripts (Latin, Greek, Cyrillic...) is not clear-cut, as it depends greatly on the language-specific usage of that script
    4) transliteration between scripts (or even between languages sharing the same "script" as defined by Unicode) can in many cases turn what is generally considered a vowel into a letter generally considered a consonant (for example the R and L letters of the Latin script correspond to vowel modifiers in Brahmic scripts, or Y can be considered a consonant and H a vowel...)

    So grouping characters has its own caveats, and requires tagging the text with language information, which is NOT encoded and normalized within Unicode. Unicode only defines "scripts", which sometimes do not match exactly the definition of a script for a particular language. In most cases the language-specific scripts are much more restrictive: a letter like "Thorn" or "Latin Alpha" will often be considered foreign to the "Latin" script as used in a particular language that does not use it and has no tradition of even reading it correctly. (This also applies to diacritics, like the caron, which many people are unable to interpret outside the original language, or the diaeresis, whose role varies greatly between languages.)

    Finally, some languages use letters found in a script only as a way to write a letter of their own, and do not consider the individual Unicode letters as separate letters. For example "ch" in Spanish or "c'h" in Breton is clearly considered a SINGLE letter, whose glyph simply resembles the glyphs of the letters used to write those languages, and this "grouping" of letters is very often reflected in the collation order.
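
    A small sketch of what "ch as a single letter" means for sorting, assuming the traditional Spanish alphabet in which "ch" follows "c" and "ll" follows "l". The list TRADITIONAL_SPANISH and the helpers split_letters and sort_key are hypothetical names for illustration; this is not how a real collator is implemented, and accented vowels are left out for brevity.

```python
# Assumed traditional Spanish alphabet, with "ch" and "ll" as single letters.
TRADITIONAL_SPANISH = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: i for i, letter in enumerate(TRADITIONAL_SPANISH)}
MAX_LEN = max(len(letter) for letter in TRADITIONAL_SPANISH)

def split_letters(word):
    """Split a word into language-level letters, longest match first."""
    word = word.lower()
    letters = []
    i = 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RANK:
                break
        # If nothing matched, chunk is the single character at position i.
        letters.append(chunk)
        i += len(chunk)
    return letters

def sort_key(word):
    """Rank each letter; unknown characters sort after the alphabet."""
    return [RANK.get(l, len(RANK) + ord(l[0])) for l in split_letters(word)]

words = ["cuna", "chico", "dama", "cosa"]
by_code_point = sorted(words)
by_letter = sorted(words, key=sort_key)
```

    Sorting by code point places "chico" before "cosa", while the letter-based key places every word starting with "c" before those starting with "ch".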



    This archive was generated by hypermail 2.1.5 : Mon May 19 2003 - 01:09:23 EDT