Re: Taiwan Aboriginal Languages and Unicode support

From: Doug Ewell
Date: Tue Dec 26 2006 - 08:02:33 CST

    Arne Götje (高盛華) <arne at linux dot org dot tw> wrote:

    >> See the often-cited examples of "ch" in Spanish and Czech. The fact
    >> that two existing characters combine to make a single "letter" in an
    >> orthography does not justify encoding the combination as a separate
    >> character. Most of the existing examples where this was done in
    >> Unicode were to achieve some 1-to-1 convertibility goal in Unicode
    >> 1.0, and do not represent a precedent for future encoding.
    > no, this is not the same. the 'ġ' letter does not exist in the
    > alphabet, but 'nġ' is a separate letter and has to be treated as such.
    > For example: when searching for 'n' in a document it is *not*
    > appropriate that 'nġ' shows up.
    > Also when typing and deleting the 'nġ' letter, it has to be removed as
    > a whole.
    > For sorting issues: it is *not* appropriate for 'nġ' to be sorted
    > after 'n'. See the links I posted earlier.
    > So, this is clearly *not* a combination of two existing letters, but a
    > letter on its own.

    You and I are both correct: the *letter* "nġ" in Amis and Paiwan
    consists of the two *Unicode characters* U+006E and U+0121. There is
    not necessarily a 1-to-1 correspondence between "Unicode characters" and
    "letters in the alphabet used by a particular language."

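    To make the distinction concrete, here is a minimal sketch in Python
    (standard library only) that lists the Unicode characters behind the
    single letter. The helper name code_points is made up for this
    illustration.

```python
import unicodedata

def code_points(text):
    """Return a 'U+XXXX NAME' description for each Unicode character."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# One letter of the alphabet, stored as two Unicode characters.
letter = "n\u0121"

for description in code_points(letter):
    print(description)

# Canonical decomposition (NFD) splits U+0121 further into
# U+0067 followed by the combining dot above, U+0307.
decomposed = unicodedata.normalize("NFD", letter)
print(len(letter), len(decomposed))
```

    The same letter is therefore two characters in composed form and
    three in decomposed form, which is why "letter" and "character"
    have to be kept apart.
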
    All of the issues you described that involve searching, sorting, and
    user interface can be implemented without encoding "nġ" as a separate
    character.
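
    As a rough sketch of how searching and sorting could treat the
    sequence as one letter, the Python below splits text into
    orthographic letters first and only then compares. Everything here
    is an assumption for illustration: the function names are invented,
    and the letter order passed to sort_key is a placeholder, not the
    real Amis or Paiwan alphabet. A production system would more likely
    use a tailored collation rather than hand-written code like this.

```python
import unicodedata

# Sequences that count as single letters in the orthography (illustrative).
MULTI_CHAR_LETTERS = ["n\u0121"]

def letters(text, multi=MULTI_CHAR_LETTERS):
    """Split text into orthographic letters, longest match first."""
    text = unicodedata.normalize("NFC", text)
    candidates = sorted(multi, key=len, reverse=True)
    result, i = [], 0
    while i < len(text):
        for seq in candidates:
            if text.startswith(seq, i):
                result.append(seq)
                i += len(seq)
                break
        else:
            # No multi-character letter starts here: take one character.
            result.append(text[i])
            i += 1
    return result

def contains_letter(text, letter):
    """True if `letter` occurs as a whole letter, not as part of another."""
    return unicodedata.normalize("NFC", letter) in letters(text)

def sort_key(word, order):
    """Sort key based on a caller-supplied letter order (placeholder data)."""
    rank = {letter: i for i, letter in enumerate(order)}
    return [rank.get(letter, len(order)) for letter in letters(word)]
```

    With this, contains_letter("an\u0121a", "n") is false because the
    text holds the letter "nġ" and no separate "n", and the position of
    "nġ" in the sort order is whatever the supplied order says it is.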

    > again: they are *not* two base letters but one 'nġ', where the dot gets
    > replaced with the accent. Same issue as with the 'i' in European
    > languages.

    How do users in Amis and Paiwan type this letter on a typewriter or
    computer keyboard?

    >> This is what Lithuanian does, IIRC.
    > If it should be this way, then I propose that all software shall be
    > changed so that, when a base glyph has one or more combining
    > accents, the whole sequence shall be treated as *one* character, so,
    > when deleting a combining accent all preceding characters up to the
    > base character and following combining accents, which belong to the
    > same sequence get deleted too.

    That is already how proper Unicode-enabled software is supposed to work.
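
    A minimal sketch of that behavior in Python, assuming a simplified
    rule: a sequence is a base character plus any following characters
    with a nonzero canonical combining class. Full grapheme cluster
    segmentation is specified in UAX #29 and covers more cases (for
    example, marks whose combining class is zero), so treat this as an
    approximation. The function name is invented for the example.

```python
import unicodedata

def delete_last_sequence(text):
    """Remove the final base character together with its combining marks."""
    i = len(text)
    # Step back over trailing combining marks (nonzero combining class).
    while i > 0 and unicodedata.combining(text[i - 1]) != 0:
        i -= 1
    # Then drop the base character itself, if there is one.
    return text[:max(i - 1, 0)]
```

    For instance, deleting from "an" followed by g, U+0307 and U+0301
    removes the g and both accents in one step and leaves "an".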

    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
