Re: Level of Unicode support required for various languages

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 31 2007 - 13:28:15 CST

Next message: Andrew West: "Re: Level of Unicode support required for various languages"

Previous message: Andrew West: "Re: Level of Unicode support required for various languages"
Maybe in reply to: Timothy Armes: "Level of Unicode support required for various languages"
Next in thread: Andrew West: "Re: Level of Unicode support required for various languages"
Reply: Andrew West: "Re: Level of Unicode support required for various languages"
Reply: Andrew West: "Re: Level of Unicode support required for various languages"
Reply: vunzndi@vfemail.net: "Re: Level of Unicode support required for various languages"

Andrew West rose to the challenge:

> >
> > O.k., challenge for the day:
> >
> > Which of the following IDS are encoded and which are not?
> > Which are equal to which others?
> > What do they mean?
> >
> > 2FF0 2FF3 4E36 6B79 706C 6534
> > 2FF0 2FF3 4E36 6B79 706C 6535
> > 2FF0 2FF3 4EA0 5915 706C 6534
> > 2FF0 2FF3 4EA0 5915 706C 6535
> > 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6534
> > 2FF0 2FF1 2FF3 4E36 4E00 5915 706C 6535
> > 2FF0 2FF1 4EA0 7CF9 6534
> > 2FF0 2FF1 4EA0 7CF9 6535
> > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6534
> > 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
> > 2FF0 2FF1 4EA0 7CF8 6534
> > 2FF0 2FF1 4EA0 7CF8 6535
> > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6534
> > 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
> > 2FF0 2FF3 4E36 4E00 7CF9 6534
> > 2FF0 2FF3 4E36 4E00 7CF9 6535
> > 2FF0 2FF3 4E36 4E00 7CF8 6534
> > 2FF0 2FF3 4E36 4E00 7CF8 6535
>
> According to Vunzndi's excellent IDS lookup tool
> <http://www.l10n-support.com/cgi-bin/search.cgi?> only
>
> 2FF0 2FF1 4EA0 7CF8 6535 = U-22F7A

Correct.

>
> But clearly a number of the other IDS sequences you give are equivalent to this.

Also correct.

>
> The glyph components <4E36 6B79 706C>, <4EA0 5915 706C> and <4E36 4E00
> 5915 706C> are not equivalent to the <4EA0 7CF8> and so none of the
> IDS sequences with these glyph component sequences should be
> considered alternate representations of U-22F7A.

I agree.

> U+6534 and U+6535 are non-unifiable components, so IDS sequences with
> 6534 should represent a different character than those sequences with
> 6535.

U+6534 and U+6535 are non-unifiable by IRG unification rules,
but they are alternative forms of the same radical. This results
in double encodings of what are quite arguably the same abstract
character in a number of instances: 6571/6573, 657D/657F, 6585/6586
and so on. And that calls into question what the intent of the
user of IDS is when choosing one or the other, and whether the
described characters using one or the other are semantically
distinct. What we can tell is that given the IRG unification rules,
and given sourced attestations of what is described by the IDS,
IRG would recommend separate encoding in Unicode, for consistency.
But that doesn't answer the question as to whether the described
entities are *actually* distinct and would be better described
as variants of the same character.
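
As a small illustration of the "double encoding" point, the sketch
below checks that the three pairs cited above are not folded together
by any of the standard Unicode normalization forms. It uses only
Python's standard unicodedata module; the helper name
folded_by_normalization is made up for this example, and the check
says nothing about whether the pairs are semantically distinct.

```python
import unicodedata

# The pairs cited in the message: same abstract character, arguably,
# but separately encoded because U+6534 and U+6535 are not unifiable.
PAIRS = [("\u6571", "\u6573"), ("\u657D", "\u657F"), ("\u6585", "\u6586")]


def folded_by_normalization(a, b):
    """True if any standard normalization form maps a and b to the same string."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(unicodedata.normalize(f, a) == unicodedata.normalize(f, b)
               for f in forms)


for a, b in PAIRS:
    print("U+%04X / U+%04X folded: %s"
          % (ord(a), ord(b), folded_by_normalization(a, b)))
```

Each pair is reported as not folded, so any process that wants to
treat them as variants has to bring its own mapping.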

> On the other hand, U+7CF8 and U+7CF9 are unifiable glyph variants, and
> therefore which one is used in the IDS sequence is not significant for
> character matching purposes.

I agree.

>
> And the sequence <2FF1 4E36 4E00> is a decomposition [s.l.] of 4EA0,
> and so IDS sequences with either <2FF1 4E36 4E00> or 4EA0 are
> equivalent.

Maybe.
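
To make the nesting visible, here is a minimal sketch of a reader for
the space-separated hex notation used in this thread. It assumes only
the four description characters that occur here (2FF0 and 2FF1 with
two operands, 2FF2 and 2FF3 with three); the names parse_ids and show
are hypothetical, and this is not a full IDS parser.

```python
# Operand counts for the Ideographic Description Characters used in this thread.
# U+2FF0 (left-right) and U+2FF1 (above-below) take two operands;
# U+2FF2 (left-middle-right) and U+2FF3 (above-middle-below) take three.
ARITY = {0x2FF0: 2, 0x2FF1: 2, 0x2FF2: 3, 0x2FF3: 3}


def parse_ids(text):
    """Parse a space-separated hex IDS (prefix notation) into nested tuples."""
    codes = [int(tok, 16) for tok in text.split()]
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(codes):
            raise ValueError("IDS ends before all operands are supplied")
        cp = codes[pos]
        pos += 1
        if cp in ARITY:
            # An operator is followed by exactly ARITY[cp] sub-descriptions.
            return (cp,) + tuple(node() for _ in range(ARITY[cp]))
        return cp

    tree = node()
    if pos != len(codes):
        raise ValueError("trailing code points after a complete IDS")
    return tree


def show(tree):
    """Render a parsed tree with hex code points, e.g. 2FF1(4E36, 4E00)."""
    if isinstance(tree, tuple):
        return "%04X(%s)" % (tree[0], ", ".join(show(t) for t in tree[1:]))
    return "%04X" % tree


print(show(parse_ids("2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535")))
print(show(parse_ids("2FF0 2FF1 4EA0 7CF9 6535")))
```

The first line prints 2FF0(2FF1(2FF1(4E36, 4E00), 7CF9), 6535) and the
second 2FF0(2FF1(4EA0, 7CF9), 6535): the two differ only in whether
the top-left slot holds 4EA0 or the sub-description 2FF1(4E36, 4E00).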

>
> Therefore, in my opinion the following are alternate representations
> of U-22F7A, ...
>
> 2FF0 2FF1 4EA0 7CF9 6535
> 2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535
> 2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535
> 2FF0 2FF3 4E36 4E00 7CF9 6535
> 2FF0 2FF3 4E36 4E00 7CF8 6535

> and the other sequences you give are not correct
> representations of U-22F7A (I don't think they represent encoded
> characters, but I may be wrong):

At least some of them, and in particular,

2FF0 2FF3 4E36 6B79 706C 6535

are descriptions of a variant of the encoded character U+22F6F,
which in the current display font uses 3 dots at the bottom
of the left side of the character, but in other variants
uses 4 dots (i.e. U+706C). In fact, the glyph in the charts
for U+22F6F is very difficult to describe with an IDS,
because there is no good component for the 3 horizontal dots,
unless you want to resort to U+5C0F (or U+2E8C) as infelicitous
fallbacks, or to three dots: <2FF2, 4E36, 4E36, 4E36>.

Oh, and U+22F7A and U+22F6F are variants of each other, as well.

And those are related to U+22F22, itself a variant of
U+6BBA sha1 'to kill', filed under a completely different
radical.

> I'm not quite sure what the point of the exercise is.

To demonstrate that the whole process is non-trivial -- particularly
for the kinds of characters, especially variant forms, taboo
forms, personal names, and the like, that one would most
likely have to resort to IDS in order to describe. Taboo
forms, which remove a stroke, would tend to be particularly
problematical for a component-based description.

> We all know that
> there may be multiple ways of representing the same character
> using IDS sequences, but any process that is designed to work with IDS
> sequences should normalize [s.l.] sequences so that alternate
> representations are treated as identical, e.g. in this example
> normalize 7CF9 to 7CF8 (unifiable glyph variants), and normalize <4E36
> 4E00> to 4EA0 (normalize to the shortest possible sequence).

Well, <4E36, 4E00> might normalize to 4EA0. But 4EA0 is written
at least two ways -- one with a dian (as seen in the chart font)
and one with a vertical stroke (as seen in older style fonts,
including many commercial Japanese fonts). Sure the difference
is stylistic and unifiable, but what if an end user of IDS is
trying explicitly to *make* that distinction in describing a
Han character?
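
For concreteness, a rough sketch of the kind of normalization being
proposed, covering only the cases raised in this thread. Everything
in it is an assumption made for illustration: the rule tables
UNIFIABLE, COMPOSE and BINARY_OF, the choice to rewrite a three-way
split as a left-nested two-way split, and the function names. It is
not an existing tool, and it deliberately folds exactly the
distinctions questioned above.

```python
ARITY = {0x2FF0: 2, 0x2FF1: 2, 0x2FF2: 3, 0x2FF3: 3}

# Hypothetical rule tables, filled in only for the cases in this thread.
UNIFIABLE = {0x7CF9: 0x7CF8}                   # glyph variant -> preferred form
COMPOSE = {(0x2FF1, 0x4E36, 0x4E00): 0x4EA0}   # sub-description -> component
# Assumption: a three-way split is rewritten as a left-nested two-way split.
BINARY_OF = {0x2FF3: 0x2FF1, 0x2FF2: 0x2FF0}


def parse(codes):
    """Consume one IDS (prefix notation) from an iterator of code points."""
    cp = next(codes)
    if cp in ARITY:
        return (cp,) + tuple(parse(codes) for _ in range(ARITY[cp]))
    return cp


def normalize(tree):
    """Apply the rule tables bottom-up to a parsed IDS tree."""
    if not isinstance(tree, tuple):
        return UNIFIABLE.get(tree, tree)
    op, *parts = tree
    parts = [normalize(p) for p in parts]
    if op in BINARY_OF:
        a, b, c = parts
        op = BINARY_OF[op]
        parts = [normalize((op, a, b)), c]
    node = (op, *parts)
    return COMPOSE.get(node, node)


def flatten(tree):
    """Yield the code points of a tree back in prefix order."""
    if isinstance(tree, tuple):
        for t in tree:
            yield from flatten(t)
    else:
        yield tree


def canonical(text):
    """Normalize a space-separated hex IDS and return it in the same notation."""
    tree = normalize(parse(iter([int(t, 16) for t in text.split()])))
    return " ".join("%04X" % cp for cp in flatten(tree))


ALTERNATES = [
    "2FF0 2FF1 4EA0 7CF8 6535",
    "2FF0 2FF1 4EA0 7CF9 6535",
    "2FF0 2FF1 2FF1 4E36 4E00 7CF9 6535",
    "2FF0 2FF1 2FF1 4E36 4E00 7CF8 6535",
    "2FF0 2FF3 4E36 4E00 7CF9 6535",
    "2FF0 2FF3 4E36 4E00 7CF8 6535",
]
print({canonical(s) for s in ALTERNATES})
```

Under these rules all six sequences collapse to the single string
2FF0 2FF1 4EA0 7CF8 6535, while the 6534 sequences and the
<4E36 6B79 706C> sequences stay distinct. The same tables also erase
any difference a user intended between the two ways of writing 4EA0,
and they give no answer for <2FF2, 4E36, 4E36, 4E36>.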

What is the shortest possible sequence for <2FF2, 4E36, 4E36, 4E36>?
Is it U+5C0F or not?

I'm just glad I'm not the one who has to write such an
IDS normalization process for all of Han.

--Ken


This archive was generated by hypermail 2.1.5 : Wed Oct 31 2007 - 13:30:32 CST