Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: John H. Jenkins (jenkins@apple.com)
Date: Sun Oct 28 2007 - 19:52:58 CST

Next message: Richard Ishida: "Version 6 of Unicode Converter now available"

Previous message: Mark E. Shoulson: "Re: thorn vs. y or th, eth and other similar letters/signs"
In reply to: Ed Trager: "Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"
Next in thread: vunzndi@vfemail.net: "Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"
Reply: vunzndi@vfemail.net: "Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"

There are actually two different mechanisms incorporated into Unicode
to allow some form of representation of unencoded ideographs. The
first is the Ideographic Variation Indicator (U+303E), and the other
is the Ideographic Description Sequence mechanism. Both of these are
relatively crude graphically, although using IDSs you could probably
come up with a reasonable visual representation of the shape intended
most of the time. They are, however, ideal for embedding in text.
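
As a rough illustration of how both mechanisms sit inline in ordinary
text, here is a minimal Python sketch. The choice of U+7AE0 as the base
character and the variable names are assumptions made only for the
example; the sketch simply builds the two kinds of sequence and lists
the code points involved.

```python
import unicodedata

# U+303E marks the following ideograph as a stand-in for an unencoded variant.
IVI = "\u303E"
variant_of_zhang = IVI + "\u7AE0"          # "some variant of U+7AE0"

# An Ideographic Description Sequence: U+2FF1 (above-to-below) + two components.
ids_for_zhang = "\u2FF1\u7ACB\u65E9"       # U+7ACB over U+65E9

def describe(text):
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

for sample in (variant_of_zhang, ids_for_zhang):
    print(sample, describe(sample))
```

Both sequences are just runs of ordinary characters, which is what
makes them easy to embed; how well they display depends entirely on the
font and rendering system.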

There is also the CDL mechanism being worked on by Wenlin. This is
XML-based and so is not really appropriate for embedding in plain
text, but it offers considerably greater flexibility in providing a
precise visual representation of the intended shape.

On the whole, however, the user community currently strongly favors
the one ideograph-one Unicode character approach.

The fundamental problem with a component-based approach to *encoding*
(as opposed to representation) is the ambiguity involved. It is
frequently possible to break down a character in more than one way. A
simple example of this is the common character U+7AE0 (章), which
could be represented using IDSs as ⿱音十, ⿱立早, or ⿳立日十
(plus other possibilities caused by compatibility ideographs and
encoded radicals). Trying to define a normalization for IDSs and
allow for multiple spellings in searching or sorting would be a
monumental task; this is one of the main reasons why component-based
systems have never really gained momentum as a way to formally encode
unencoded characters.
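
To make the ambiguity concrete, here is a toy Python sketch that parses
the three spellings above and reduces each to a top-to-bottom list of
leaf components. The decomposition table DECOMP is hand-written for
this example (it is not taken from any Unicode data file), the function
names are hypothetical, and the flattening only handles purely vertical
stacks. It is not a proposed normalization, just a demonstration that
even this trivial case needs extra data and rules before the spellings
compare equal.

```python
# Arity of the Ideographic Description Characters U+2FF0..U+2FFB.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # left-middle-right
IDC_ARITY["\u2FF3"] = 3  # top-middle-bottom

VERTICAL = ("\u2FF1", "\u2FF3")  # above-to-below, above-to-middle-and-below

# Hypothetical hand-written table: component -> its own IDS.
DECOMP = {"音": "⿱立日", "早": "⿱日十"}

def parse(ids, pos=0):
    """Parse one IDS starting at pos; return (tree, next_pos)."""
    ch = ids[pos]
    pos += 1
    if ch in IDC_ARITY:
        children = []
        for _ in range(IDC_ARITY[ch]):
            child, pos = parse(ids, pos)
            children.append(child)
        return (ch, children), pos
    return ch, pos  # a plain component is a leaf

def vertical_leaves(tree):
    """Flatten a vertical stack into its leaf components, top to bottom."""
    if isinstance(tree, str):
        if tree in DECOMP:
            return vertical_leaves(parse(DECOMP[tree])[0])
        return [tree]
    op, children = tree
    if op not in VERTICAL:
        raise ValueError("this toy handles only vertical stacks")
    leaves = []
    for child in children:
        leaves.extend(vertical_leaves(child))
    return leaves

spellings = ["⿱音十", "⿱立早", "⿳立日十"]
normalized = [vertical_leaves(parse(s)[0]) for s in spellings]
print(normalized)
```

The three spellings only collapse to the same list because the table
happens to contain the right entries and because vertical stacking is
associative. Side-by-side and enclosing layouts, compatibility
ideographs, and encoded radicals all break those assumptions.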

=====
John H. Jenkins
jenkins@apple.com


This archive was generated by hypermail 2.1.5 : Sun Oct 28 2007 - 19:55:15 CST