Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: vunzndi@vfemail.net
Date: Mon Oct 29 2007 - 06:17:59 CST

Next message: Philippe Verdy: "RE: thorn vs. y or th, eth and other similar letters/signs"

Previous message: Richard Ishida: "Version 6 of Unicode Converter now available"
In reply to: John H. Jenkins: "Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"
Next in thread: Jeroen Ruigrok van der Werven: "Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"

Quoting "John H. Jenkins" <jenkins@apple.com>:

> There are actually two different mechanisms incorporated into Unicode
> to allow some form of representation of unencoded ideographs. The
> first is the Ideographic Variation Indicator (U+303E), and the other is
> the Ideographic Description Sequence mechanism. Both of these are
> relatively crude graphically, although using IDSs you could probably
> come up with a reasonable visual representation of the shape intended
> most of the time. They are, however, ideal for embedding in text.
>
> There is also the CDL mechanism being worked on by Wenlin. This is
> XML-based and so is not really appropriate for embedding in plain text,
> but it is also capable of showing considerably greater flexibility in
> providing a precise visual representation of the intended shape.
>
> On the whole, however, the user community currently favors strongly the
> one ideograph-one Unicode character approach.
>

The strongest advocate for precomposed characters is China, but then
China would also prefer to have precomposed Tibetan.

As for average end users, their only concern is what works. An end
user who wishes to type presently encoded characters would probably
not notice a difference, while an end user who wants to type presently
unencoded characters that are simply a combination of already encoded
characters would immediately notice an improvement.

> The fundamental problem with a component-based approach to *encoding*
> (as opposed to representation) is the ambiguity involved. It is
> frequently possible to break down a character in more than one way. A
> simple example of this is the common character U+7AE0 (章), which could
> be represented using IDSs either as ⿱立早, ⿱音十, or ⿳立日十 (plus other
> possibilities caused by compatibility ideographs and encoded radicals).
> Trying to define a normalization for IDSs and allow for multiple
> spellings in searching or sorting would be a monumental task; this is
> one of the main reasons why component-based systems have never really
> gained momentum as a way to formally encode unencoded characters.
>

Normalising the IDS of a character like U+7AE0 isn't that difficult:
about 20 lines of clumsily written Perl script and a good knowledge of
Polish notation algebra would be enough (I have used this approach),
though of course everyone would need to normalise in the same way. The
heart of the problem, though, is that IDCs give only an approximation,
which if normalised would produce an accurate result for about 80% of
characters, including U+7AE0. For the remaining 20% one needs something
more than IDC/IDS. Though this is a monumental task, much of the work
has already been done in the course of using IDSs to check for
duplicates.
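
To illustrate the kind of normalisation meant here, below is a minimal
sketch in Python (not the Perl script mentioned above). It reads an IDS
as a prefix (Polish notation) expression, expands components through a
small decomposition table, and merges nested stacks that run in the
same direction. The table DECOMP, the function names, and the flattening
rule are all assumptions made for this example; any real scheme would
need an agreed table and agreed rules.

```python
# Sketch only: DECOMP and the flattening rule are illustrative assumptions.
TERNARY = {"\u2FF2", "\u2FF3"}                      # ⿲ ⿳ take three operands
BINARY = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TERNARY

# Hypothetical decomposition table: component -> its own IDS.
DECOMP = {"早": "⿱日十", "音": "⿱立日"}

# Map each stacking IDC to its axis (binary form) and back to the ternary form.
AXIS = {"⿱": "⿱", "⿳": "⿱", "⿰": "⿰", "⿲": "⿰"}
TERN = {"⿱": "⿳", "⿰": "⿲"}


def parse(ids, pos=0):
    """Parse a prefix-notation IDS into (tree, next_position)."""
    ch = ids[pos]
    pos += 1
    if ch in BINARY or ch in TERNARY:
        children = []
        for _ in range(3 if ch in TERNARY else 2):
            child, pos = parse(ids, pos)
            children.append(child)
        return (ch, children), pos
    return ch, pos                                   # a leaf component


def expand(node):
    """Recursively replace components that have an entry in DECOMP."""
    if isinstance(node, str):
        return expand(parse(DECOMP[node])[0]) if node in DECOMP else node
    op, children = node
    return (op, [expand(c) for c in children])


def flatten(node):
    """Merge nested stacks along the same axis into one child list."""
    if isinstance(node, str):
        return node
    op, children = node
    children = [flatten(c) for c in children]
    axis = AXIS.get(op)
    if axis is None:                                 # surround/overlay: keep as is
        return (op, children)
    flat = []
    for c in children:
        if not isinstance(c, str) and c[0] == axis:
            flat.extend(c[1])
        else:
            flat.append(c)
    return (axis, flat)


def serialize(node):
    """Emit a flattened tree as one fixed IDS spelling."""
    if isinstance(node, str):
        return node
    op, children = node
    if op in TERN and len(children) > 3:             # long stack: peel off two
        head = "".join(serialize(c) for c in children[:2])
        return TERN[op] + head + serialize((op, children[2:]))
    if op in TERN and len(children) == 3:
        op = TERN[op]
    return op + "".join(serialize(c) for c in children)


def normalize(ids):
    tree, _ = parse(ids)
    return serialize(flatten(expand(tree)))


for spelling in ("⿱立早", "⿱音十", "⿳立日十"):
    print(spelling, "->", normalize(spelling))
```

Under these assumptions the three spellings of U+7AE0 discussed above
all come out as the same string. The flattening step treats a stack of
a stack as equivalent to one three-part stack, which is precisely the
kind of approximation that holds for many characters but not for all.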

The technical difficulties of searching for composite characters etc.
are on a par with those of the many scripts in Unicode that use
composite characters: a pain to program, but doable.

When I have talked with Chinese publishers about IT difficulties, the
most common issue raised by far is how to add characters. The number
of characters needing to be added would be reduced to almost zero if a
composite model rather than a precomposed model were used.

Stability rules about canonical equivalence may well be the biggest obstacle.

Yours sincerely
John Knightley

> =====
> John H. Jenkins
> jenkins@apple.com



This archive was generated by hypermail 2.1.5 : Mon Oct 29 2007 - 06:23:17 CST