From: Andrew West (andrewcwest@gmail.com)
Date: Wed Oct 31 2007 - 05:19:40 CST
On 31/10/2007, Kenneth Whistler <kenw@sybase.com> wrote:
>
> O.k., challenge for the day:
>
> Which of the following IDS are encoded and which are not?
> Which are equal to which others?
> What do they mean?
>
> 2FF0 2FF3 4E36 6B79 706C 6534
> 2FF0 2FF3 4E36 6B79 706C 6535
> 2FF0 2FF3 4EA0 5915 706C 6534
> 2FF0 2FF3 4EA0 5915 706C 6535
> 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6534
> 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6535
> 2FF0 2FF1 4EA0 7CF9 6534
> 2FF0 2FF1 4EA0 7CF9 6535
> 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6534
> 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
> 2FF0 2FF1 4EA0 7CF8 6534
> 2FF0 2FF1 4EA0 7CF8 6535
> 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6534
> 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
> 2FF0 2FF3 4E36 4E00 7CF9 6534
> 2FF0 2FF3 4E36 4E00 7CF9 6535
> 2FF0 2FF3 4E36 4E00 7CF8 6534
> 2FF0 2FF3 4E36 4E00 7CF8 6535
According to Vunzndi's excellent IDS lookup tool
<http://www.l10n-support.com/cgi-bin/search.cgi?>, only one of these
sequences matches an encoded character:
2FF0 2FF1 4EA0 7CF8 6535 = U-22F7A
But clearly a number of the other IDS sequences you give are equivalent to this.
The glyph component sequences <4E36 6B79 706C>, <4EA0 5915 706C> and
<4E36 4E00 5915 706C> are not equivalent to <4EA0 7CF8>, and so none
of the IDS sequences containing them should be considered alternate
representations of U-22F7A.
U+6534 and U+6535 are non-unifiable components, so IDS sequences with
6534 should represent a different character than those sequences with
6535.
On the other hand, U+7CF8 and U+7CF9 are unifiable glyph variants, and
therefore which one is used in the IDS sequence is not significant for
character matching purposes.
And the sequence <2FF1 4E36 4E00> is a decomposition [s.l.] of 4EA0,
and so IDS sequences with either <2FF1 4E36 4E00> or 4EA0 are
equivalent.
Therefore, in my opinion the following are alternate representations
of U-22F7A, and the other sequences you give are not correct
representations of U-22F7A (I don't think they represent encoded
characters, but I may be wrong):
2FF0 2FF1 4EA0 7CF9 6535
2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
2FF0 2FF3 4E36 4E00 7CF9 6535
2FF0 2FF3 4E36 4E00 7CF8 6535
I'm not quite sure what the point of the exercise is. We all know that
there may be multiple ways of representing the same character
using IDS sequences, but any process that is designed to work with IDS
sequences should normalize [s.l.] sequences so that alternate
representations are treated as identical, e.g. in this example
normalize 7CF9 to 7CF8 (unifiable glyph variants), and normalize <2FF1
4E36 4E00> to 4EA0 (normalize to the shortest possible sequence).
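A minimal sketch of such a normalization, in Python, might look like the following. It is only an illustration under stated assumptions: the two equivalence tables contain just the mappings mentioned in this message (they are not taken from any published data), the function names are hypothetical, and the rule that a three-part stack <2FF3 A B C> can be read as <2FF1 <2FF1 A B> C> is an assumption inferred from the sequences listed above.

```python
# IDS operators that appear in this thread, with their operand counts.
ARITY = {0x2FF0: 2, 0x2FF1: 2, 0x2FF3: 3}

# Assumed equivalences, taken only from the message above.
VARIANTS = {0x7CF9: 0x7CF8}                        # unifiable glyph variants
COMPOSITIONS = {(0x2FF1, 0x4E36, 0x4E00): 0x4EA0}  # decomposition -> character


def parse(codes, pos=0):
    """Parse a prefix-notation IDS into a nested tuple; return (tree, next_pos)."""
    head = codes[pos]
    pos += 1
    if head not in ARITY:
        return head, pos                 # plain component, no operands
    children = []
    for _ in range(ARITY[head]):
        child, pos = parse(codes, pos)
        children.append(child)
    return (head, *children), pos


def normalize(tree):
    """Rewrite a parsed IDS into one canonical form under the assumed rules."""
    if isinstance(tree, int):
        return VARIANTS.get(tree, tree)
    op, *kids = tree
    kids = [normalize(k) for k in kids]
    if op == 0x2FF3:
        # Assumption: A/B/C stacked in three parts equals (A over B) over C.
        top = normalize((0x2FF1, kids[0], kids[1]))
        return normalize((0x2FF1, top, kids[2]))
    # Collapse a known decomposition to its single character, if any.
    return COMPOSITIONS.get((op, *kids), (op, *kids))


def canonical(ids_text):
    """Canonical form of an IDS written as space-separated hex code points."""
    codes = [int(tok, 16) for tok in ids_text.split()]
    tree, end = parse(codes)
    if end != len(codes):
        raise ValueError("trailing code points after a complete IDS")
    return normalize(tree)


encoded = canonical("2FF0 2FF1 4EA0 7CF8 6535")   # the sequence given for U-22F7A
same = canonical("2FF0 2FF3 4E36 4E00 7CF9 6535") == encoded
different = canonical("2FF0 2FF1 4EA0 7CF8 6534") == encoded
```

Under these assumptions `same` is true and `different` is false, since 6534 and 6535 are deliberately left unmapped. A real process would need complete variant and decomposition data rather than these two hand-written entries.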
Andrew