Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: John H. Jenkins (
Date: Sun Oct 28 2007 - 19:52:58 CST

    There are actually two different mechanisms incorporated into Unicode
    to allow some form of representation of unencoded ideographs. The
    first is the Ideographic Variation Indicator (U+303E), and the other
    is the Ideographic Description Sequence mechanism. Both of these are
    relatively crude graphically, although using IDSs you could probably
    come up with a reasonable visual representation of the shape intended
    most of the time. They are, however, ideal for embedding in text.

    There is also the CDL mechanism being worked on by Wenlin. This is
    XML-based and so is not really appropriate for embedding in plain
    text, but it is also capable of showing considerably greater
    flexibility in providing a precise visual representation of the
    intended shape.

    On the whole, however, the user community currently strongly favors
    the one-ideograph, one-Unicode-character approach.

    The fundamental problem with a component-based approach to *encoding*
    (as opposed to representation) is the ambiguity involved. It is
    frequently possible to break down a character in more than one way. A
    simple example of this is the common character U+7AE0 (章), which
    could be represented using IDSs as ⿱音十, ⿱立早, or ⿳立日十
    (plus other possibilities caused by compatibility ideographs
    and encoded radicals). Trying to define a normalization for IDSs and
    allow for multiple spellings in searching or sorting would be a
    monumental task; this is one of the main reasons why component-based
    systems have never really gained momentum as a way to formally encode
    unencoded characters.
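
    To make the ambiguity concrete, here is a rough Python sketch of what
    even a toy normalization would have to do for the three spellings of
    U+7AE0 above. The DECOMP table and the function names are made up for
    illustration and are not part of Unicode or any library; the sketch
    only flattens top-to-bottom layouts (U+2FF1, U+2FF3) and treats every
    other layout as an opaque unit.

```python
# Ideographic Description Characters occupy U+2FF0..U+2FFB.
IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}
TERNARY = {"\u2FF2", "\u2FF3"}    # these two take three operands
VERTICAL = {"\u2FF1", "\u2FF3"}   # top-to-bottom layouts

# Hypothetical component table, just enough for this example.
DECOMP = {"音": "⿱立日", "早": "⿱日十"}


def parse(ids, pos=0):
    """Parse a prefix-notation IDS from pos; return (tree, next_pos)."""
    ch = ids[pos]
    if ch not in IDCS:
        return ch, pos + 1            # a leaf component
    arity = 3 if ch in TERNARY else 2
    pos += 1
    children = []
    for _ in range(arity):
        child, pos = parse(ids, pos)
        children.append(child)
    return (ch, children), pos


def flatten(node):
    """Expand known components and merge nested vertical layouts."""
    if isinstance(node, str):
        if node in DECOMP:
            subtree, _ = parse(DECOMP[node])
            return flatten(subtree)
        return [node]
    op, children = node
    if op in VERTICAL:
        parts = []
        for child in children:
            parts.extend(flatten(child))
        return parts
    # Any other layout is kept as a single opaque unit.
    return [(op, tuple(tuple(flatten(c)) for c in children))]


def vertical_key(ids):
    """Return a comparison key for an IDS under the toy rules above."""
    tree, end = parse(ids)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tuple(flatten(tree))


spellings = ["⿱音十", "⿱立早", "⿳立日十"]
keys = {vertical_key(s) for s in spellings}
print(len(set(spellings)), "distinct strings,", len(keys), "distinct key(s)")
```

    The three strings differ code point for code point, yet all reduce to
    the same key here. That only works because the table happens to cover
    the two components involved; a real normalization would need an
    agreed decomposition for every component, rules for every layout
    operator, and handling for compatibility ideographs and encoded
    radicals.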

    John H. Jenkins
