Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

Date: Mon Oct 29 2007 - 06:17:59 CST

  • Next message: Philippe Verdy: "RE: thorn vs. y or th, eth and other similar letters/signs"

    Quoting "John H. Jenkins" <>:

    > There are actually two different mechanisms incorporated into Unicode
    > to allow some form of representation of unencoded ideographs. The
    > first is the Ideographic Variation Indicator (U+303E), and the other is
    > the Ideographic Description Sequence mechanism. Both of these are
    > relatively crude graphically, although using IDSs you could probably
    > come up with a reasonable visual representation of the shape intended
    > most of the time. They are, however, ideal for embedding in text.
    > There is also the CDL mechanism being worked on by Wenlin. This is
    > XML-based and so is not really appropriate for embedding in plain text,
    > but it is also capable of showing considerably greater flexibility in
    > providing a precise visual representation of the intended shape.
    > On the whole, however, the user community currently favors strongly the
    > one ideograph-one Unicode character approach.

    The strongest advocate for precomposed characters is China, but then
    China would also prefer to have precomposed Tibetan.

    As to the average end user, their only concern is what works. The end
    user who wishes to type presently encoded characters would
    probably not notice a difference; the end user who wants to type
    presently unencoded characters which are simply a combination of
    already encoded characters would immediately notice an improvement.

    > The fundamental problem with a component-based approach to *encoding*
    > (as opposed to representation) is the ambiguity involved. It is
    > frequently possible to break down a character in more than one way. A
    > simple example of this is the common character U+7AE0 (章), which could
    > be represented using IDSs either as ⿱立早, ⿱音十, or ⿳立日十 (plus other
    > possibilities caused by compatibility ideographs and encoded radicals).
    > Trying to define a normalization for IDSs and allow for multiple
    > spellings in searching or sorting would be a monumental task; this is
    > one of the main reasons why component-based systems have never really
    > gained momentum as a way to formally encode unencoded characters.

    Normalising the IDS of a character like U+7AE0 isn't that difficult:
    about 20 lines of clumsily written Perl script and a good knowledge of
    Polish notation algebra would be enough (I have used this approach),
    though of course everyone would need to normalise in the same way. The
    heart of the problem, though, is that IDCs give only an approximation,
    which if normalised would produce an accurate result for about 80% of
    characters, including U+7AE0. For the remaining 20% one needs something
    more than IDC/IDS. Though a monumental task, much of the work has been
    done in the course of using IDS to check for duplicates.
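    A minimal sketch of the kind of normalisation described above, in Python
    rather than Perl, assuming IDSs in prefix (Polish) notation. The
    decomposition table, function names, and flattening rule here are my own
    illustrative choices for the U+7AE0 example, not any standard's:

```python
# Sketch: normalise an IDS by expanding components through a
# decomposition table, then flattening nested vertical stacks.
# DECOMP is a tiny illustrative table, not a real database.

ARITY = {"⿰": 2, "⿱": 2, "⿲": 3, "⿳": 3}   # subset of the IDCs

DECOMP = {
    "早": "⿱日十",   # assumed maximal decompositions, for the
    "音": "⿱立日",   # U+7AE0 example only
}

def parse(s, i=0):
    """Parse a prefix-notation IDS into a tree; returns (node, next_index)."""
    ch = s[i]
    if ch in ARITY:
        kids, j = [], i + 1
        for _ in range(ARITY[ch]):
            node, j = parse(s, j)
            kids.append(node)
        return (ch, kids), j
    return ch, i + 1

def expand(node):
    """Recursively replace components that have a known decomposition."""
    if isinstance(node, str):
        return expand(parse(DECOMP[node])[0]) if node in DECOMP else node
    op, kids = node
    return (op, [expand(k) for k in kids])

def flatten(node):
    """Merge nested vertical stacks: (⿱ a (⿱ b c)) becomes (⿳ a b c)."""
    if isinstance(node, str):
        return node
    op, kids = node
    kids = [flatten(k) for k in kids]
    if op in ("⿱", "⿳"):
        flat = []
        for k in kids:
            if isinstance(k, tuple) and k[0] in ("⿱", "⿳"):
                flat.extend(k[1])
            else:
                flat.append(k)
        if len(flat) == 2:
            return ("⿱", flat)
        if len(flat) == 3:
            return ("⿳", flat)
    return (op, kids)

def serialize(node):
    if isinstance(node, str):
        return node
    op, kids = node
    return op + "".join(serialize(k) for k in kids)

def normalise(ids):
    tree, _ = parse(ids)
    return serialize(flatten(expand(tree)))

# all three spellings of U+7AE0 collapse to the same form
print(normalise("⿱立早"), normalise("⿱音十"), normalise("⿳立日十"))
# → ⿳立日十 ⿳立日十 ⿳立日十
```

    A real implementation would need a full decomposition database, rules
    for all twelve IDCs, and handling of encoded radicals and compatibility
    ideographs; it is those remaining cases that make up the harder 20%.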

    The technical difficulties of searching for composite characters etc.
    are on a par with those of the many scripts in Unicode that use
    composite characters: a pain to program, but doable.

    When I have talked with Chinese publishers about IT difficulties, the
    most common issue raised by far is how to add characters, the number
    of which would be reduced to almost zero if a composite rather than a
    precomposed model were used.

    Stability rules about canonical equivalence may well be the biggest obstacle.

    Yours sincerely
    John Knightley

    > =====
    > John H. Jenkins

    This message sent through Virus Free Email

    This archive was generated by hypermail 2.1.5 : Mon Oct 29 2007 - 06:23:17 CST