Re: Level of Unicode support required for various languages

From: vunzndi@vfemail.net
Date: Wed Oct 31 2007 - 18:12:40 CST

    Quoting Kenneth Whistler <kenw@sybase.com>:

    > Andrew West rose to the challenge:
    >
    >> >
    >> > O.k., challenge for the day:
    >> >
    >> > Which of the following IDS are encoded and which are not?
    >> > Which are equal to which others?
    >> > What do they mean?
    >> >
    >> > 2FF0 2FF3 4E36 6B79 706C 6534
    >> > 2FF0 2FF3 4E36 6B79 706C 6535
    >> > 2FF0 2FF3 4EA0 5915 706C 6534
    >> > 2FF0 2FF3 4EA0 5915 706C 6535
    >> > 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6534
    >> > 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6535
    >> > 2FF0 2FF1 4EA0 7CF9 6534
    >> > 2FF0 2FF1 4EA0 7CF9 6535
    >> > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6534
    >> > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
    >> > 2FF0 2FF1 4EA0 7CF8 6534
    >> > 2FF0 2FF1 4EA0 7CF8 6535
    >> > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6534
    >> > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
    >> > 2FF0 2FF3 4E36 4E00 7CF9 6534
    >> > 2FF0 2FF3 4E36 4E00 7CF9 6535
    >> > 2FF0 2FF3 4E36 4E00 7CF8 6534
    >> > 2FF0 2FF3 4E36 4E00 7CF8 6535
    >>
    >> According to Vunzndi's excellent IDS lookup tool
    >> <http://www.l10n-support.com/cgi-bin/search.cgi?> only
    >>
    >> 2FF0 2FF1 4EA0 7CF8 6535 = U-22F7A
    >
    > Correct.
    >
    >>
    >> But clearly a number of the other IDS sequences you give are
    >> equivalent to this.
    >
    > Also correct.
    >
    >>
    >> The glyph components <4E36 6B79 706C>, <4EA0 5915 706C> and <4E36 4E00
    >> 5915 706C> are not equivalent to the <4EA0 7CF8> and so none of the
    >> IDS sequences with these glyph component sequences should be
    >> considered alternate representations of U-22F7A.
    >
    > I agree.
    >
    >> U+6534 and U+6535 are non-unifiable components, so IDS sequences with
    >> 6534 should represent a different character than those sequences with
    >> 6535.
    >
    > U+6534 and U+6535 are non-unifiable by IRG unification rules,
    > but they are alternative forms of the same radical. This results
    > in double encodings of what are quite arguably the same abstract
    > character in a number of instances: 6571/6573, 657D/657F, 6585/6586
    > and so on. And that calls into question what the intent of the
    > user of IDS is when choosing one or the other, and whether the
    > described characters using one or the other are semantically
    > distinct. What we can tell is that given the IRG unification rules,
    > and given sourced attestations of what is described by the IDS,
    > IRG would recommend separate encoding in Unicode, for consistency.
    > But that doesn't answer the question as to whether the described
    > entities are *actually* distinct and would be better described
    > as variants of the same character.
    >

    This is going into a different area, namely that of the cognate model
    vs. the cognate-abstract shape model used by Unicode. It would have been
    possible to construct a cognate-only model, assuming one had
    enough data about characters' usage; however, IMHO this would only
    work well if based on separate languages rather than on the present CJKV.
    It would take even more work than the present system to decide on new
    characters.

    >
    > I'm just glad I'm not the one who has to write such an
    > IDS normalization process for all of Han.
    >

    At present I am working on the question of when an IDS is reliable.
    100% reliability is not possible using IDS.
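
To make the equivalence question in the challenge concrete, here is a minimal sketch of a purely structural IDS comparison. It is not the lookup tool mentioned above and not any official normalization. The names `parse`, `normalize` and `DECOMPOSE` are made up for this illustration, and the single decomposition entry (U+4EA0 treated as U+4E36 above U+4E00) is an assumption taken from the way the challenge sequences are built. Nested vertical operators (2FF1, 2FF3) are flattened into one top-to-bottom stack, which is itself an assumption about what counts as "the same" layout.

```python
# Arity of the ideographic description characters that appear in this thread.
IDC_ARITY = {0x2FF0: 2, 0x2FF1: 2, 0x2FF3: 3}
# Operators that stack their operands top to bottom.
VERTICAL = {0x2FF1, 0x2FF3}

# Assumption for illustration only: U+4EA0 is treated as U+4E36 above U+4E00.
DECOMPOSE = {0x4EA0: (0x2FF1, 0x4E36, 0x4E00)}


def parse(ids):
    """Parse space-separated hex code points (prefix order) into a nested tuple."""
    tokens = [int(t, 16) for t in ids.split()]

    def node(i):
        if i >= len(tokens):
            raise ValueError("IDS ends before all operands are supplied")
        cp = tokens[i]
        i += 1
        if cp not in IDC_ARITY:
            return cp, i  # a plain component
        children = []
        for _ in range(IDC_ARITY[cp]):
            child, i = node(i)
            children.append(child)
        return (cp, *children), i

    tree, end = node(0)
    if end != len(tokens):
        raise ValueError("trailing code points after a complete IDS")
    return tree


def normalize(tree):
    """Expand known components and flatten nested vertical stacks into ("V", ...)."""
    if isinstance(tree, int):
        return normalize(DECOMPOSE[tree]) if tree in DECOMPOSE else tree
    op, *children = tree
    children = [normalize(c) for c in children]
    if op in VERTICAL:
        flat = []
        for c in children:
            if isinstance(c, tuple) and c[0] == "V":
                flat.extend(c[1:])  # merge a nested stack into this one
            else:
                flat.append(c)
        return ("V", *flat)
    return (op, *children)


def same_structure(a, b):
    """True if two IDS strings normalize to the same tree."""
    return normalize(parse(a)) == normalize(parse(b))
```

Under these assumptions, `same_structure("2FF0 2FF1 4EA0 7CF8 6535", "2FF0 2FF3 4E36 4E00 7CF8 6535")` holds, while swapping 6535 for 6534 or 7CF8 for 7CF9 on one side does not, since the sketch has no knowledge of variant or unifiable components. That missing knowledge is exactly where a structural comparison stops being reliable.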

    > --Ken
    >
    >
    >
    >



