Re: Level of Unicode support required for various languages

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 31 2007 - 13:28:15 CST

    Andrew West rose to the challenge:

    > >
    > > O.k., challenge for the day:
    > >
    > > Which of the following IDS are encoded and which are not?
    > > Which are equal to which others?
    > > What do they mean?
    > >
    > > 2FF0 2FF3 4E36 6B79 706C 6534
    > > 2FF0 2FF3 4E36 6B79 706C 6535
    > > 2FF0 2FF3 4EA0 5915 706C 6534
    > > 2FF0 2FF3 4EA0 5915 706C 6535
    > > 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6534
    > > 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6535
    > > 2FF0 2FF1 4EA0 7CF9 6534
    > > 2FF0 2FF1 4EA0 7CF9 6535
    > > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6534
    > > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
    > > 2FF0 2FF1 4EA0 7CF8 6534
    > > 2FF0 2FF1 4EA0 7CF8 6535
    > > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6534
    > > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
    > > 2FF0 2FF3 4E36 4E00 7CF9 6534
    > > 2FF0 2FF3 4E36 4E00 7CF9 6535
    > > 2FF0 2FF3 4E36 4E00 7CF8 6534
    > > 2FF0 2FF3 4E36 4E00 7CF8 6535
    >
    > According to Vunzndi's excellent IDS lookup tool
    > <http://www.l10n-support.com/cgi-bin/search.cgi?> only
    >
    > 2FF0 2FF1 4EA0 7CF8 6535 = U-22F7A

    Correct.
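
    For readers unfamiliar with the notation: an IDS is written in
    prefix order, each operator followed by its operands. The
    following is a minimal sketch of a parser for the sequences above,
    assuming only the operators U+2FF0..U+2FFB, of which U+2FF2 and
    U+2FF3 take three operands and the rest take two. The names
    parse_ids and show are illustrative, not part of any existing tool.

```python
# Arity of the Ideographic Description Characters U+2FF0..U+2FFB:
# U+2FF2 and U+2FF3 take three operands, the others take two.
IDC_ARITY = {cp: 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY[0x2FF2] = 3
IDC_ARITY[0x2FF3] = 3


def parse_ids(text):
    """Parse space-separated hex code points into a nested tuple tree."""
    cps = [int(tok, 16) for tok in text.split()]

    def parse(pos):
        cp = cps[pos]  # an IndexError here means an operand is missing
        pos += 1
        if cp not in IDC_ARITY:
            return cp, pos  # leaf component
        children = []
        for _ in range(IDC_ARITY[cp]):
            child, pos = parse(pos)
            children.append(child)
        return (cp, *children), pos

    tree, end = parse(0)
    if end != len(cps):
        raise ValueError("trailing code points after a complete IDS")
    return tree


def show(tree):
    """Render a tree as nested text, operator first."""
    if isinstance(tree, int):
        return f"{tree:04X}"
    op, *kids = tree
    return f"{op:04X}(" + ", ".join(show(k) for k in kids) + ")"


print(show(parse_ids("2FF0 2FF1 4EA0 7CF8 6535")))
# 2FF0(2FF1(4EA0, 7CF8), 6535)
print(show(parse_ids("2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6534")))
# 2FF0(2FF1(2FF3(4E36, 4E00, 5915), 706C), 6534)
```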

    >
    > But clearly a number of the other IDS sequences you give are equivalent to this.

    Also correct.

    >
    > The glyph components <4E36 6B79 706C>, <4EA0 5915 706C> and <4E36 4E00
    > 5915 706C> are not equivalent to the <4EA0 7CF8> and so none of the
    > IDS sequences with these glyph component sequences should be
    > considered alternate representations of U-22F7A.

    I agree.

    > U+6534 and U+6535 are non-unifiable components, so IDS sequences with
    > 6534 should represent a different character than those sequences with
    > 6535.

    U+6534 and U+6535 are non-unifiable by IRG unification rules,
    but they are alternative forms of the same radical. This results
    in double encodings of what are quite arguably the same abstract
    character in a number of instances: 6571/6573, 657D/657F, 6585/6586
    and so on. And that calls into question what the intent of the
    user of IDS is when choosing one or the other, and whether the
    described characters using one or the other are semantically
    distinct. What we can tell is that given the IRG unification rules,
    and given sourced attestations of what is described by the IDS,
    IRG would recommend separate encoding in Unicode, for consistency.
    But that doesn't answer the question as to whether the described
    entities are *actually* distinct and would be better described
    as variants of the same character.

    > On the other hand, U+7CF8 and U+7CF9 are unifiable glyph variants, and
    > therefore which one is used in the IDS sequence is not significant for
    > character matching purposes.

    I agree.

    >
    > And the sequence <2FF1 4E36 4E00> is a decomposition [s.l.] of 4EA0,
    > and so IDS sequences with either <2FF1 4E36 4E00> or 4EA0 are
    > equivalent.

    Maybe.

    >
    > Therefore, in my opinion the following are alternate representations
    > of U-22F7A, ...
    >
    > 2FF0 2FF1 4EA0 7CF9 6535
    > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
    > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
    > 2FF0 2FF3 4E36 4E00 7CF9 6535
    > 2FF0 2FF3 4E36 4E00 7CF8 6535

    > and the other sequences you give are not correct
    > representations of U-22F7A (I don't think they represent encoded
    > characters, but I may be wrong):

    At least some of them, and in particular,

    2FF0 2FF3 4E36 6B79 706C 6535

    are descriptions of a variant of the encoded character U+22F6F,
    which in the current display font uses 3 dots at the bottom
    of the left side of the character, but in other variants
    uses 4 dots (i.e. U+706C). In fact, the glyph in the charts
    for U+22F6F is very difficult to describe with an IDS,
    because there is no good component for the 3 horizontal dots,
    unless you want to resort to U+5C0F (or U+2E8C) as infelicitous
    fallbacks, or to three dots: <2FF2, 4E36, 4E36, 4E36>.

    Oh, and U+22F7A and U+22F6F are variants of each other, as well.

    And those are related to U+22F22, itself a variant of
    U+6BBA sha1 'to kill', filed under a completely different
    radical.
     
    > I'm not quite sure what the point of the exercise is.

    To demonstrate that the whole process is non-trivial -- particularly
    for the kinds of characters, especially variant forms, taboo
    forms, personal names, and the like, that one would most
    likely have to resort to IDS in order to describe. Taboo
    forms, which remove a stroke, would tend to be particularly
    problematical for a component-based description.

    > We all know that
    > there may be multiple ways of representing the same character
    > using IDS sequences, but any process that is designed to work with IDS
    > sequences should normalize [s.l.] sequences so that alternate
    > representations are treated as identical, e.g. in this example
    > normalize 7CF9 to 7CF8 (unifiable glyph variants), and normalize <4E36
    > 4E00> to 4EA0 (normalize to the shortest possible sequence).

    Well, <4E36, 4E00> might normalize to 4EA0. But 4EA0 is written
    at least two ways -- one with a dian (as seen in the chart font)
    and one with a vertical stroke (as seen in older style fonts,
    including many commercial Japanese fonts). Sure the difference
    is stylistic and unifiable, but what if an end user of IDS is
    trying explicitly to *make* that distinction in describing a
    Han character?

    What is the shortest possible sequence for <2FF2, 4E36, 4E36, 4E36>?
    Is it U+5C0F or not?
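
    To make the difficulty concrete, here is a toy sketch of the
    normalization proposed in the quoted text, run over the eighteen
    challenge sequences. It assumes exactly three rules: fold 7CF9 to
    7CF8, recompose a vertical <4E36, 4E00> into 4EA0, and treat nested
    above-to-below operators (2FF1, 2FF3) as one flat stack. That last
    rule is an extra assumption needed to equate <2FF1 2FF1 a b c> with
    <2FF3 a b c>. The tables FOLD and RECOMPOSE and all function names
    are hypothetical.

```python
# Hypothetical rule tables; they encode only claims made in this thread.
IDC_ARITY = {cp: 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY[0x2FF2] = IDC_ARITY[0x2FF3] = 3
VERTICAL = {0x2FF1, 0x2FF3}             # above-to-below operators
FOLD = {0x7CF9: 0x7CF8}                 # unifiable glyph variants
RECOMPOSE = {(0x4E36, 0x4E00): 0x4EA0}  # vertical pair -> one component

CHALLENGE = """
2FF0 2FF3 4E36 6B79 706C 6534
2FF0 2FF3 4E36 6B79 706C 6535
2FF0 2FF3 4EA0 5915 706C 6534
2FF0 2FF3 4EA0 5915 706C 6535
2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6534
2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6535
2FF0 2FF1 4EA0 7CF9 6534
2FF0 2FF1 4EA0 7CF9 6535
2FF0 2FF1 2FF1 4E36 4E00 7CF9 6534
2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
2FF0 2FF1 4EA0 7CF8 6534
2FF0 2FF1 4EA0 7CF8 6535
2FF0 2FF1 2FF1 4E36 4E00 7CF8 6534
2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
2FF0 2FF3 4E36 4E00 7CF9 6534
2FF0 2FF3 4E36 4E00 7CF9 6535
2FF0 2FF3 4E36 4E00 7CF8 6534
2FF0 2FF3 4E36 4E00 7CF8 6535
"""


def parse(cps, pos=0):
    """Parse a prefix-order IDS into a nested tuple; return (tree, next_pos)."""
    cp = cps[pos]
    pos += 1
    if cp not in IDC_ARITY:
        return cp, pos
    kids = []
    for _ in range(IDC_ARITY[cp]):
        kid, pos = parse(cps, pos)
        kids.append(kid)
    return (cp, *kids), pos


def normalize(tree):
    """Apply the three assumed rules bottom-up and return a comparable key."""
    if isinstance(tree, int):
        return FOLD.get(tree, tree)
    op, *kids = tree
    kids = [normalize(k) for k in kids]
    if op not in VERTICAL:
        return (op, *kids)
    stack = []  # flatten nested vertical stacks into one list
    for k in kids:
        if isinstance(k, tuple) and k[0] == "V":
            stack.extend(k[1:])
        else:
            stack.append(k)
    out = []    # recompose adjacent pairs listed in RECOMPOSE
    for item in stack:
        if out and (out[-1], item) in RECOMPOSE:
            out[-1] = RECOMPOSE[(out[-1], item)]
        else:
            out.append(item)
    return out[0] if len(out) == 1 else ("V", *out)


def key(line):
    tree, _ = parse([int(tok, 16) for tok in line.split()])
    return normalize(tree)


classes = {}
for line in CHALLENGE.strip().splitlines():
    classes.setdefault(key(line), []).append(line.strip())

print(len(classes), "equivalence classes")
for line in classes[key("2FF0 2FF1 4EA0 7CF8 6535")]:
    print(line)
```

    Under these assumptions the eighteen sequences fall into six
    classes, and the class containing the IDS for U-22F7A holds that
    sequence plus the five alternates listed in the quoted text. The
    sketch also shows the cost of the rules: recomposing <4E36, 4E00>
    erases any deliberate dian-versus-vertical-stroke distinction, and
    nothing in the tables decides whether <2FF2, 4E36, 4E36, 4E36>
    should collapse to U+5C0F.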

    I'm just glad I'm not the one who has to write such an
    IDS normalization process for all of Han.

    --Ken


