Re: Level of Unicode support required for various languages

From: Andrew West (andrewcwest@gmail.com)
Date: Wed Oct 31 2007 - 14:10:10 CST


    On 31/10/2007, Kenneth Whistler <kenw@sybase.com> wrote:
    >
    > > U+6534 and U+6535 are non-unifiable components, so IDS sequences with
    > > 6534 should represent a different character than those sequences with
    > > 6535.
    >
    > U+6534 and U+6535 are non-unifiable by IRG unification rules,
    > but they are alternative forms of the same radical. This results
    > in double encodings of what are quite arguably the same abstract
    > character in a number of instances: 6571/6573, 657D/657F, 6585/6586
    > and so on.

    I agree.

    > And that calls into question what the intent of the
    > user of IDS is when choosing one or the other, and whether the
    > described characters using one or the other are semantically
    > distinct. What we can tell is that given the IRG unification rules,
    > and given sourced attestations of what is described by the IDS,
    > IRG would recommend separate encoding in Unicode, for consistency.
    > But that doesn't answer the question as to whether the described
    > entities are *actually* distinct and would be better described
    > as variants of the same character.

    It all depends on your perspective, and you can't guarantee that the
    recipient of an IDS sequence is going to share the same perspective as
    its creator :-(
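
    To make the two perspectives concrete, here is a minimal Python sketch
    that compares IDS sequences either exactly or after folding one radical
    form onto the other. The table RADICAL_FOLD and the helper names are
    hypothetical, invented for this illustration; the folding is not a
    mapping defined by Unicode or the IRG, and the 6534 sequence is just a
    made-up counterpart of the 6535 sequence quoted further down.

```python
# Hypothetical folding table (an assumption for illustration only):
# treat U+6535 as equivalent to U+6534 when comparing descriptions.
RADICAL_FOLD = {0x6535: 0x6534}


def from_hex(text):
    """Turn a string of space-separated hex values into code points."""
    return [int(token, 16) for token in text.split()]


def fold(ids, table):
    """Replace every code point that has an entry in the folding table."""
    return tuple(table.get(cp, cp) for cp in ids)


def same_exact(a, b):
    """The creator's view: sequences match only if identical."""
    return tuple(a) == tuple(b)


def same_folded(a, b, table=RADICAL_FOLD):
    """A recipient's view: sequences match after folding radical forms."""
    return fold(a, table) == fold(b, table)


seq_6535 = from_hex("2FF0 2FF3 4E36 6B79 706C 6535")
seq_6534 = from_hex("2FF0 2FF3 4E36 6B79 706C 6534")  # hypothetical counterpart

exact = same_exact(seq_6535, seq_6534)    # False: the last component differs
folded = same_folded(seq_6535, seq_6534)  # True: 6535 is folded onto 6534
```

    Neither answer is "right"; which comparison applies depends on whether
    the two sides agree on the folding, which is exactly what cannot be
    guaranteed.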

    > At least some of them, and in particular,
    >
    > 2FF0 2FF3 4E36 6B79 706C 6535
    >
    > are descriptions of a variant of the encoded character U+22F6F.

    This is a good example, which shows the weakness of the IDS system for
    matching abstract characters rather than just matching the same or
    unifiable glyphforms.
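
    For readers less familiar with how such a sequence is put together, the
    following Python sketch parses the quoted IDS into a tree. It assumes
    only the twelve description characters U+2FF0..U+2FFB, of which U+2FF2
    and U+2FF3 take three operands and the others two; the function names
    are made up for this example. All the resulting tree records is glyph
    structure, which is why it can only be matched shape against shape.

```python
# Arity of each Ideographic Description Character in U+2FF0..U+2FFB.
# Assumption: only these twelve operators are in use.
IDC_ARITY = {cp: 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY[0x2FF2] = 3  # left, middle, right
IDC_ARITY[0x2FF3] = 3  # top, middle, bottom


def parse_ids(code_points):
    """Parse a prefix-notation IDS into nested tuples (operator, operands...)."""

    def parse(pos):
        if pos >= len(code_points):
            raise ValueError("IDS ends before all operands are supplied")
        cp = code_points[pos]
        pos += 1
        arity = IDC_ARITY.get(cp)
        if arity is None:
            return cp, pos  # a plain component, not an operator
        operands = []
        for _ in range(arity):
            operand, pos = parse(pos)
            operands.append(operand)
        return (cp, *operands), pos

    tree, end = parse(0)
    if end != len(code_points):
        raise ValueError("trailing code points after a complete IDS")
    return tree


def from_hex(text):
    """Turn a string of space-separated hex values into code points."""
    return [int(token, 16) for token in text.split()]


# The sequence quoted above: a left-right split whose left half is a
# three-part vertical stack and whose right half is U+6535.
tree = parse_ids(from_hex("2FF0 2FF3 4E36 6B79 706C 6535"))
# tree == (0x2FF0, (0x2FF3, 0x4E36, 0x6B79, 0x706C), 0x6535)
```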

    > which in the current display font uses 3 dots at the bottom
    > of the left side of the character, but in other variants
    > uses 4 dots (i.e. U+706C). In fact, the glyph in the charts
    > for U+22F6F is very difficult to describe with an IDS,
    > because there is no good component for the 3 horizontal dots,
    > unless you want to resort to U+5C0F (or U+2E8C) as infelicitous
    > fallbacks, or to three dots: <2FF2, 4E36, 4E36, 4E36>.

    Indeed. Kawabata uses

    <2FF0 2FF3 4EA0 5915 &CDP-885E; 6535>
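
    A sequence like this mixes hex code points with an entity-style
    reference, so anything that reads it has to cope with both. A small
    Python sketch of one way to tokenize it follows; the token kinds "cp"
    and "ref" are made up for the example, and the reference is kept as an
    opaque name because nothing here says what it resolves to.

```python
import re

# Either an entity-style reference such as &CDP-885E; or a bare hex code point.
TOKEN = re.compile(r"&([A-Za-z0-9-]+);|([0-9A-Fa-f]{4,6})")


def tokenize(text):
    """Split an IDS written in hex notation into ("cp", int) / ("ref", name) tokens."""
    tokens = []
    for match in TOKEN.finditer(text):
        if match.group(1) is not None:
            tokens.append(("ref", match.group(1)))  # kept opaque, not resolved
        else:
            tokens.append(("cp", int(match.group(2), 16)))
    return tokens


tokens = tokenize("<2FF0 2FF3 4EA0 5915 &CDP-885E; 6535>")
# The fifth token is ("ref", "CDP-885E"); every other token is a code point.
```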

    > Oh, and U+22F7A and U+22F6F are variants of each other, as well.

    Which I assume are both variants of U+715E 煞 sha1.

    > And those are related to U+22F22, itself a variant of
    > U+6BBA sha1 'to kill', filed under a completely different
    > radical.

    > > I'm not quite sure what the point of the exercise is.
    >
    > To demonstrate that the whole process is non-trivial -- particularly
    > for the kinds of characters, especially variant forms, taboo
    > forms, personal names, and the like, that one would most
    > likely have to resort to IDS in order to describe. Taboo
    > forms, which remove a stroke, would tend to be particularly
    > problematical for a component-based description.

    Yes, indeed. Which is why some have called for an IDC "subtraction
    operator", so that, for example, U+4E4C 乌 could be described as 鸟[-]丶
    <9E1F - 4E36>. However, this could be ambiguous (which dot is to be
    subtracted, the top one or the one in the middle?).

    Andrew


