Re: Taiwan Aboriginal Languages and Unicode support

From: Doug Ewell (
Date: Mon Dec 25 2006 - 22:10:19 CST

    Arne Götje (高盛華) <arne at linux dot org dot tw> wrote:

    > 1. instead of the letter 'g', they use the letter 'nġ'. This is a
    > separate letter and not a ligature. It gets sorted differently in Amis
    > and Paiwan languages and when type processing, it needs to be handled
    > as such.
    > My idea would be to encode this letter as a separate character, as it
    > has its own semantic. We can put it probably into one of the existing
    > Latin Extensions in Unicode.

    U+006E U+0121

    or, if both n and ġ are individual letters and can appear together with
    a different semantic from the one you describe, and if collating tables
    are tailored to take CGJ into account:

    U+006E U+034F U+0121
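
    As a rough illustration of the two spellings above, the sketch below
    builds both sequences with Python's standard unicodedata module and
    checks how they behave under normalization. The names PLAIN, WITH_CGJ
    and describe are made up for this example. It shows only that the two
    sequences remain distinct code point strings, so a tailored collation
    table could tell them apart; it does not implement any tailoring.

```python
import unicodedata

# Both spellings from the message above (constant names are illustrative).
PLAIN = "\u006E\u0121"            # n + g with dot above
WITH_CGJ = "\u006E\u034F\u0121"   # n + COMBINING GRAPHEME JOINER + g with dot above


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# NFC leaves both sequences as written; CGJ is not removed by normalization.
nfc_plain = unicodedata.normalize("NFC", PLAIN)
nfc_cgj = unicodedata.normalize("NFC", WITH_CGJ)

# NFD splits U+0121 into g + COMBINING DOT ABOVE (U+0307).
nfd_plain = unicodedata.normalize("NFD", PLAIN)

if __name__ == "__main__":
    for label, text in (("plain", PLAIN), ("with CGJ", WITH_CGJ), ("NFD", nfd_plain)):
        print(label)
        for line in describe(text):
            print("   ", line)
```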

    See the often-cited examples of "ch" in Spanish and Czech. The fact
    that two existing characters combine to make a single "letter" in an
    orthography does not justify encoding the combination as a separate
    character. Most of the existing examples where this was done in Unicode
    were to achieve some 1-to-1 convertibility goal in Unicode 1.0, and do
    not represent a precedent for future encoding.
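
    To make the "ch" comparison concrete, here is a minimal sketch of
    sorting with a multi-character letter handled at the collation level
    rather than the encoding level. It uses the traditional Spanish
    alphabet, in which "ch" sorts after "c" and "ll" after "l". ALPHABET,
    RANK and sort_key are hypothetical names for this example. A real
    implementation would use a tailored Unicode Collation Algorithm table,
    and the Amis or Paiwan letter order would have to be supplied by
    someone who knows those orthographies.

```python
# Traditional Spanish alphabet, with "ch" and "ll" as single letters.
ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "\u00F1", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
    "y", "z",
]
RANK = {letter: index for index, letter in enumerate(ALPHABET)}
LONGEST = max(len(letter) for letter in ALPHABET)


def sort_key(word):
    """Split word into alphabet letters, longest match first, and rank them."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        for size in range(LONGEST, 0, -1):
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in RANK:
                key.append(RANK[chunk])
                i += size
                break
        else:
            # Unknown character: sort it after every known letter.
            key.append(len(ALPHABET) + ord(text[i]))
            i += 1
    return key


words = ["dama", "chico", "cuna"]
by_code_point = sorted(words)
by_alphabet = sorted(words, key=sort_key)

if __name__ == "__main__":
    print(by_code_point)
    print(by_alphabet)
```

    The text stays encoded as ordinary c + h; only the comparison changes,
    which is the same approach suggested here for n + ġ.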

    See also the WG2 "Principles and Procedures" document, Annex G (page

    > 2. With the character 'nġ': in Amis this character, like all others,
    > can get an acute, grave or circumflex accent. While we can use
    > combining accent sequences to produce such characters, for the 'nġ'
    > the dot on the g needs to be replaced, similar like it does on the 'i'
    > in European languages.
    > I suppose we need to encode a letter 'dotless ng' for this, like we
    > have with the 'i'.

    I don't remember if there is a generic way to make a combining mark
    (such as an acute accent) apply to a group of two base letters (such as
    n g), but that is the way to solve this problem, not by encoding another
    precomposed combination.

    The analogy with dotless-i is not sound; there were numerous legacy
    character sets for Turkish that distinguished dotted-i from dotless-i,
    and Unicode had to maintain 1-to-1 convertibility with those character
    sets. The same situation does not apply to "ng".

    > 3. In Amis language the 'i' when it gets its acute, grave or
    > circumflex accent, it keeps the i-dot in place and the accent gets
    > stacked on top of the i-dot.
    > However, fonts handling European scripts will probably take the i-dot
    > away and replace it with the accent, rather than stacking the accent
    > on top of it.
    > Do we need to have a separate encoded 'i' for this different semantic
    > purpose? Or is there a better way to solve this issue?

    U+0069 U+0307 U+0301
    U+0069 U+0307 U+0300
    U+0069 U+0307 U+0302

    This is what Lithuanian does, IIRC.
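
    A small check of the three sequences above, using only Python's
    standard unicodedata module (the names SEQUENCES and survives_nfc are
    invented for this sketch). It confirms that the explicit COMBINING DOT
    ABOVE survives NFC normalization, so the text keeps the information a
    font needs to retain the dot. Whether a given font actually stacks the
    accent on the dot is a rendering question this code cannot test.

```python
import unicodedata

# i + COMBINING DOT ABOVE + accent, as listed in the message above.
SEQUENCES = {
    "acute": "\u0069\u0307\u0301",
    "grave": "\u0069\u0307\u0300",
    "circumflex": "\u0069\u0307\u0302",
}


def survives_nfc(text):
    """True if NFC normalization leaves the sequence unchanged."""
    return unicodedata.normalize("NFC", text) == text


# Without the explicit dot, i + acute composes to the precomposed U+00ED.
composed_without_dot = unicodedata.normalize("NFC", "\u0069\u0301")

if __name__ == "__main__":
    for name, text in SEQUENCES.items():
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
        print(name, code_points, "stable under NFC:", survives_nfc(text))
    print("i + acute under NFC:", f"U+{ord(composed_without_dot):04X}")
```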

    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14