Re: Taiwan Aboriginal Languages and Unicode support

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 26 2006 - 06:52:17 CST

    From: "Arne Götje (高盛華)" <arne@linux.org.tw>
    >> I don't remember if there is a generic way to make a combining mark
    >> (such as an acute accent) apply to a group of two base letters (such as
    >> n g), but that is the way to solve this problem, not by encoding another
    >> precomposed combination.
    >
    > again: they are *not* two base letter but one 'nġ', where the dot gets
    > replaced with the accent. Same issue like the 'i' in European languages.
    >
    >> The analogy with dotless-i is not sound; there were numerous legacy
    >> character sets for Turkish that distinguished dotted-i from dotless-i,
    >> and Unicode had to maintain 1-to-1 convertibility with those character
    >> sets. The same situation does not apply to "ng".
    >>
    >>> 3. In Amis language the 'i' when it gets its acute, grave or
    >>> circumflex accent, it keeps the i-dot in place and the accent gets
    >>> stacked on top of the i-dot.
    >>> However, fonts handling European scripts will probably take the i-dot
    >>> away and replace it with the accent, rather than stacking the accent
    >>> on top of it.
    >>> Do we need to have a separate encoded 'i' for this different semantic
    >>> purpose? Or is there a better way to solve this issue?
    >>
    >> U+0069 U+0307 U+0301
    >> U+0069 U+0307 U+0300
    >> U+0069 U+0307 U+0302
    >>
    >> This is what Lithuanian does, IIRC.
    >
    > If it should be this way, then I propose that all software shall be
    > changed in the way, that when a base glyph has one ore more combining
    > accents, the whole sequence shall be treated as *one* character, so,
    > when deleting a combining accent all preceding characters up to the base
    > character and following combining accents, which belong to the same
    > sequence get deleted too. Otherwise text processing is a PITA. :(
    > (Cursor cannot positioned correctly when using the mouse, easy to miss
    > an combining accent when deleting another one.)
    > And make this rule a compatibility rule! If Software does not follow
    > this rule, it is not compatible to Unicode! Otherwise I don't know how
    > to convince software developers of the importance of this issue. This
    > would also be necessary for sorting algorithms. Either the accents get
    > ignored when sorting (like in Amis), or they will be sorted as separate
    > character entities, like in Paiwan.

    Searching and sorting are already described in Unicode under "collation".
    I see absolutely no difference here from the Spanish "ch" or the Breton "c’h", where the UCA already handles this "problem", as it does for many other digraphs found in lots of other languages!

    Editors are another "problem". It's not really a problem if "ng-dot" is entered as separate letters using multiple keystrokes; this already occurs in languages where this is completely unavoidable (Chinese written with Han is a good example!) Unicode is not there to define how input methods should be designed.

    And plain-text editors, in almost all cases, have absolutely no way to guess which language you are entering, unless there's a language mode selectable in the editor itself (in which case, the editor can already use the UCA specification to handle the language properly!), so it is normal that when you press backspace after "ng-dot" you get an "n": this is a local transformation of the plain-text, not a problem of its actual encoding. Press backspace a second time and think about all other languages that use digraphs/trigraphs...

    Please note that when entering text, the process is *interactive*; there's a user behind each keystroke! Each language has its own difficulties, but adding more "compatibility" encodings would just severely complicate the processing (because the software would still need to handle the encoding with separate abstract characters).

    You are really mixing the concept of abstract characters with the language-specific concept of letters. Unicode is not there to define how languages must be encoded; Unicode just encodes scripts, and scripts in Unicode are independent of the languages. This is a fundamental principle which has successfully avoided an explosion of the encoding into hundreds (thousands?) of language-specific "codepages" (one for each language!), and allowed a much larger coverage of the languages of the world with a compact encoding. The UCA algorithm has been designed to solve language-specific issues without adding more encoding.

    If you properly manage the collation algorithm, and you properly handle "ng-dot" as a single collation element, you WON'T find a match in "ng-dot" when searching for "n".
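
    To illustrate the point about collation elements, here is a minimal sketch in Python. It is NOT the UCA itself, only a toy tokenizer: the contraction table and the names collation_elements() and contains() are hypothetical, made up for this example. A real implementation would use a UCA tailoring for the language instead.

```python
import unicodedata

# Hypothetical contraction table (an assumption for this sketch, not UCA data):
# "ng" + COMBINING DOT ABOVE, and the traditional Spanish digraph "ch".
CONTRACTIONS = ("ng\u0307", "ch")


def collation_elements(text, contractions=CONTRACTIONS):
    """Split text into toy collation elements, longest contraction first."""
    # Work on NFD so that precomposed U+0121 (g with dot above) is
    # treated the same as g + U+0307.
    text = unicodedata.normalize("NFD", text)
    ordered = sorted(
        (unicodedata.normalize("NFD", c) for c in contractions),
        key=len,
        reverse=True,
    )
    elements = []
    i = 0
    while i < len(text):
        for c in ordered:
            if text.startswith(c, i):
                elements.append(c)
                i += len(c)
                break
        else:
            # No contraction starts here: one code point is one element.
            elements.append(text[i])
            i += 1
    return elements


def contains(haystack, needle):
    """Match on whole collation elements rather than on code points."""
    h = collation_elements(haystack)
    n = collation_elements(needle)
    return any(h[k:k + len(n)] == n for k in range(len(h) - len(n) + 1))


print(contains("tang\u0307a", "n"))         # False: "n" is inside one element
print(contains("tana", "n"))                # True
print(contains("tang\u0307a", "ng\u0307"))  # True
```

    Because the match is done element by element, a search for "n" does not hit the "n" inside "ng-dot", while plain "n" elsewhere is still found.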

    There's absolutely no compelling reason to encode "ng-dot" as a separate codepoint. Adding such a codepoint would set a damaging precedent that would violate the current rule of not adding more compatibility characters to Unicode if they break canonical equivalence and Unicode stability! There are other solutions that already work!

    The case of dot-above stacked with other accents is already described too, and the solution already does work! There are also fonts that handle the stacked composition properly. So it's just a matter of software upgrade (already available!) to get this font support.
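
    As a small check of the encoding side, the sketch below uses Python's standard unicodedata module to show that the sequences quoted above (i, then COMBINING DOT ABOVE, then an accent) survive NFC normalization unchanged, so the explicit dot is not lost in the text itself. The helper name code_points() is made up for this example; whether the dot and accent are actually drawn stacked still depends on the font and renderer.

```python
import unicodedata


def code_points(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# i + COMBINING DOT ABOVE, followed by acute, grave and circumflex.
for accent in ("\u0301", "\u0300", "\u0302"):
    seq = "i\u0307" + accent
    nfc = unicodedata.normalize("NFC", seq)
    print(code_points(nfc))

# Without the explicit dot above, NFC composes to the usual precomposed
# letter (here U+00ED, i with acute), which is a different string.
print(code_points(unicodedata.normalize("NFC", "i\u0301")))
```

    The first three lines printed are U+0069 U+0307 U+0301, U+0069 U+0307 U+0300 and U+0069 U+0307 U+0302; the last one is U+00ED.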
