Re: Level of Unicode support required for various languages

From: Andrew West (andrewcwest@gmail.com)
Date: Wed Oct 31 2007 - 05:19:40 CST


    On 31/10/2007, Kenneth Whistler <kenw@sybase.com> wrote:
    >
    > O.k., challenge for the day:
    >
    > Which of the following IDS are encoded and which are not?
    > Which are equal to which others?
    > What do they mean?
    >
    > 2FF0 2FF3 4E36 6B79 706C 6534
    > 2FF0 2FF3 4E36 6B79 706C 6535
    > 2FF0 2FF3 4EA0 5915 706C 6534
    > 2FF0 2FF3 4EA0 5915 706C 6535
    > 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6534
    > 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6535
    > 2FF0 2FF1 4EA0 7CF9 6534
    > 2FF0 2FF1 4EA0 7CF9 6535
    > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6534
    > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
    > 2FF0 2FF1 4EA0 7CF8 6534
    > 2FF0 2FF1 4EA0 7CF8 6535
    > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6534
    > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
    > 2FF0 2FF3 4E36 4E00 7CF9 6534
    > 2FF0 2FF3 4E36 4E00 7CF9 6535
    > 2FF0 2FF3 4E36 4E00 7CF8 6534
    > 2FF0 2FF3 4E36 4E00 7CF8 6535

    According to Vunzndi's excellent IDS lookup tool
    <http://www.l10n-support.com/cgi-bin/search.cgi?> only

    2FF0 2FF1 4EA0 7CF8 6535 = U-22F7A

    But clearly a number of the other IDS sequences you give are equivalent to this.

    The component sequences <4E36 6B79 706C>, <4EA0 5915 706C> and <4E36 4E00
    5915 706C> are not equivalent to <4EA0 7CF8>, and so none of the
    IDS sequences containing them should be
    considered alternate representations of U-22F7A.

    U+6534 and U+6535 are non-unifiable components, so IDS sequences with
    6534 should represent a different character than those sequences with
    6535.

    On the other hand, U+7CF8 and U+7CF9 are unifiable glyph variants, and
    therefore which one is used in the IDS sequence is not significant for
    character matching purposes.

    And the sequence <2FF1 4E36 4E00> is a decomposition [s.l.] of 4EA0,
    and so IDS sequences with either <2FF1 4E36 4E00> or 4EA0 are
    equivalent.

    Therefore, in my opinion the following are alternate representations
    of U-22F7A, and the other sequences you give are not correct
    representations of U-22F7A (I don't think they represent encoded
    characters, but I may be wrong):

    2FF0 2FF1 4EA0 7CF9 6535
    2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
    2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
    2FF0 2FF3 4E36 4E00 7CF9 6535
    2FF0 2FF3 4E36 4E00 7CF8 6535

    I'm not quite sure what the point of the exercise is. We all know
    that there may be multiple ways of representing the same character
    using IDS sequences, but any process that is designed to work with IDS
    sequences should normalize [s.l.] sequences so that alternate
    representations are treated as identical, e.g. in this example
    normalize 7CF9 to 7CF8 (unifiable glyph variants), and normalize <4E36
    4E00> to 4EA0 (normalize to the shortest possible sequence).
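
    As a rough illustration of the normalization described above, here is a
    minimal Python sketch. It is not an existing tool: the function names
    (parse, normalize_tree, normalize_ids) and the two rule tables are
    hypothetical and cover only the cases from this thread (7CF9 folded to
    7CF8, and 4E36 over 4E00 folded to 4EA0, including the case where the
    pair forms the top two slots of a 2FF3). It assumes the operators
    U+2FF0..U+2FFB, with 2FF2 and 2FF3 taking three operands and the rest
    taking two.

```python
# Ideographic Description Characters U+2FF0..U+2FFB and their arity.
TERNARY = {"2FF2", "2FF3"}
BINARY = {f"{cp:04X}" for cp in range(0x2FF0, 0x2FFC)} - TERNARY

# Illustrative rules from this thread only, not a complete data set.
VARIANTS = {"7CF9": "7CF8"}   # unifiable glyph variants
LID = ("4E36", "4E00")        # stacked top to bottom, these form 4EA0


def parse(tokens, pos=0):
    """Parse a prefix-notation IDS into nested tuples; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise ValueError("truncated IDS")
    tok = tokens[pos]
    if tok in BINARY or tok in TERNARY:
        arity = 3 if tok in TERNARY else 2
        pos += 1
        children = []
        for _ in range(arity):
            child, pos = parse(tokens, pos)
            children.append(child)
        return (tok, *children), pos
    return tok, pos + 1


def normalize_tree(node):
    """Fold variants and known decompositions, bottom-up."""
    if isinstance(node, str):
        return VARIANTS.get(node, node)
    op, *kids = node
    kids = [normalize_tree(k) for k in kids]
    if op == "2FF1" and tuple(kids) == LID:
        return "4EA0"
    if op == "2FF3" and tuple(kids[:2]) == LID:
        # <2FF3 4E36 4E00 X> is treated as <2FF1 4EA0 X>
        return ("2FF1", "4EA0", kids[2])
    return (op, *kids)


def flatten(node):
    """Turn the nested tuples back into a flat list of code point strings."""
    if isinstance(node, str):
        return [node]
    out = []
    for part in node:
        out.extend(flatten(part))
    return out


def normalize_ids(ids):
    """Normalize a space-separated hex IDS such as '2FF0 2FF1 4EA0 7CF9 6535'."""
    tokens = ids.upper().split()
    tree, end = parse(tokens)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete IDS")
    return " ".join(flatten(normalize_tree(tree)))
```

    Under these assumptions, two sequences match when normalize_ids gives
    the same string for both. Sequences ending in 6534 stay distinct from
    those ending in 6535, since no rule folds one into the other.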

    Andrew


