Re: CJK Ideograph Fragments

From: Mark Davis ☕
Date: Mon May 10 2010 - 10:07:00 CDT

    As I said, I would not rely too heavily on the accuracy of that data. Where
    there are ?, or NCRs, or truncated IDS sequences, it looks like the missing
    character can often be supplied by examining the character.
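
    As a rough illustration of how such entries could be flagged automatically,
    here is a minimal sketch. It assumes prefix-notation IDS strings built from
    the twelve Ideographic Description Characters U+2FF0..U+2FFB (U+2FF2 and
    U+2FF3 take three operands, the others two). The placeholder conventions
    (a literal "?" and "&...;" style references) and the function names are
    assumptions for illustration, not the format of any particular data file.

```python
import re

# Ideographic Description Characters U+2FF0..U+2FFB.
TERNARY = {"\u2FF2", "\u2FF3"}                       # three operands
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY  # two operands
NCR = re.compile(r"&[^;]+;")                         # assumed "&name;" reference style


def tokenize(ids):
    """Split an IDS string into tokens, keeping each "&...;" reference whole."""
    tokens = []
    pos = 0
    for match in NCR.finditer(ids):
        tokens.extend(ids[pos:match.start()])  # one token per code point
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(ids[pos:])
    return tokens


def classify(ids):
    """Return a short label describing the state of one IDS string."""
    tokens = tokenize(ids)
    if not tokens:
        return "empty"
    needed = 1  # number of operands still required to complete the sequence
    for tok in tokens:
        if needed == 0:
            return "ill-formed: trailing tokens"
        if tok in TERNARY:
            needed += 2   # fills one slot, opens three
        elif tok in BINARY:
            needed += 1   # fills one slot, opens two
        else:
            needed -= 1   # an ordinary component fills one slot
    if needed > 0:
        return "truncated"
    if any(tok == "?" or NCR.fullmatch(tok) for tok in tokens):
        return "well-formed, unresolved component"
    return "well-formed"


if __name__ == "__main__":
    # "&X-0001;" is a made-up reference used only for this demonstration.
    for sample in ["⿰木木", "⿰木", "⿰木?", "⿰木&X-0001;", "木木"]:
        print(sample, "->", classify(sample))
```

    Entries labelled "truncated" or "unresolved component" are the ones where a
    person would still have to examine the character to supply the missing piece.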


    — Il meglio è l’inimico del bene —

    On Mon, May 10, 2010 at 05:43, Uriah Eisenstein

    > Hello Mr. Davis and thanks for the lists,
    > I've found several different sources for character compositions (though
    > none of them seem to include Extension characters except for the generated
    > files you have posted!). While they all have missing information and
    > occasional mistakes, it is quite easy to find unencoded fragments in them;
    > these are usually marked with ? or something similar. I've been making a
    > list of fragments while working with cjklib, mentioned by Christoph; some
    > I've found later in Extension A or B, and for others I have character
    > examples that could be used for an initial proposal. I don't expect the set
    > of necessary components to be complete anytime soon, or at all, any more
    > than the entire set of Ideographs :)
    > Regards,
    > Uriah
    > On Sun, May 9, 2010 at 2:11 AM, Mark Davis ☕ wrote:
    >> FYI, I have a table of radicals at
    >> mapping them to Unified ideographs. Not yet complete (the X values are
    >> tentative, and I don't know if there are values for the ones marked "
    >> #VALUE!").
    >> I had also tried taking a look at the data at
    >> which Richard and John said was the best publicly available IDS
    >> data (although it has a GPL licence, which prevents many people from using
    >> it). While clearly a lot of work went into it, it is very flawed.
    >> - There are over 400 ill-formed IDS sequences.
    >> - There are 666 (coincidence?) characters that map to themselves
    >> (where you'd only expect that of "base" radicals).
    >> - About 5K characters are missing data.
    >> - There appears to be free variation between using CJK radicals and
    >> using the corresponding Unified CJK characters.
    >> - It uses many NCR components with cryptic IDs, instead of radicals or
    >> Unified CJK.
    >> - A cursory look shows a significant proportion of clear mistakes in
    >> the data (characters stacked vertically in the wrong order, for example).
    >> - Many characters cannot be recursively decomposed down to radicals.
    >> So I'm not sure how much use the available IDS data would be in terms of
    >> looking at necessary components.
    >> Mark
    >> FYI, I posted some generated files at, in case
    >> anyone is curious as to details.
    >> — Il meglio è l’inimico del bene —
    >> On Sat, May 8, 2010 at 13:40, Asmus Freytag wrote:
    >>> On 5/8/2010 11:44 AM, Uriah Eisenstein wrote:
    >>>> Well,
    >>>> I've gone through the policies of submitting new characters and scripts
    >>>> and they don't look encouraging :) But neither do they seem to reject the
    >>>> idea of character fragments out of hand, as opposed to the reverse case -
    >>>> characters which can be expressed using existing characters and combining
    >>>> marks. In fact, the CJK Radicals Supplement block and the Hangul Jamo both
    >>>> contain character fragments, in a way. But I suppose these already existed
    >>>> in national standards rather than introduced by Unicode.
    >>>> In any case, examples I've seen of proposals cite experts and provide
    >>>> font makers, neither of whom I have contact with. So I guess I'll drop it
    >>>> for now, and hope that if someone takes it up I'll see it on the mailing
    >>>> list.
    >>> While a font is ultimately required for a proposal to become adopted, it
    >>> shouldn't be a bar to formally raising the issue for initial consideration.
    >>> Once something is considered potentially acceptable, there's enough time to
    >>> come up with fonts (for the purpose of printing charts) before the
    >>> committees need to vote on final approval. Proposals can take years from
    >>> initial consideration to publication....
    >>> Your suggestion was that these fragments need to be enumerated for
    >>> various purposes in software and that having a standard enumeration is
    >>> beneficial. If you can document and support that assertion, I would
    >>> encourage you to put it on record.
    >>> Doing so would allow a discussion of whether a standard enumeration is
    >>> indeed useful enough to incur the cost of standardization.
    >>> In some ways, this would not be a run-of-the-mill character encoding
    >>> proposal, because you are not asserting that these fragments need encoding
    >>> for the purpose of directly expressing text. While that is the primary
    >>> purpose of character encoding, there are purposes that are ancillary to
    >>> this, that a universal character encoding such as Unicode must encompass.
    >>> There is certainly some precedent for character codes that aren't limited
    >>> to the primary purpose I mentioned, but, because they don't represent a
    >>> standard situation, one needs to carefully argue why such uses need to be
    >>> covered by standardization and if so, why doing that as character codes is
    >>> appropriate.
    >>> That is different from the more usual task to document that an entity
    >>> occurs in written or printed documents.
    >>> The problem is, unless you actually put down all the details in a
    >>> coherent proposal it's hard to judge correctly what the situation is. When
    >>> you raise the question informally, all anyone can tell you is that an
    >>> exceptional request is one that needs exceptional justification, which,
    >>> while certainly correct, doesn't exactly help you or anyone to evaluate
    >>> whether your proposal would meet the required level and type of
    >>> justification.
    >>> A./
    >>>> Thanks,
    >>>> Uriah
    >>>> On Sun, May 2, 2010 at 3:06 PM, Uriah Eisenstein wrote:
    >>>> Not exactly, but I suppose such Hanzi fragments could be used for
    >>>> similar purposes - e.g. looking up characters by components, where
    >>>> the available components may include non-character fragments. Some
    >>>> fragments may be useful for IME purposes, but probably not all.
    >>>> On Sat, May 1, 2010 at 8:57 PM, Edward Cherlin wrote:
    >>>> 2010/4/28 John H. Jenkins:
    >>>> > No. You could certainly write up a proposal and submit it to the UTC.
    >>>> > Should the UTC feel the idea has merit, it would then move it on to WG2
    >>>> > and/or the IRG.
    >>>> > The main problem here is that there is a very strong desire to limit
    >>>> > ideograph encoding to attested and documentable forms. Anything which does
    >>>> > not exist in actual texts is not likely to be well-regarded.
    >>>> I had the idea some years ago of writing up a proposal to encode the
    >>>> hanzi fragments used in Cangjie Shurufa IMEs. These fragments are used
    >>>> extensively in dozens of howto books on keyboarding in Cangjie. This
    >>>> includes the pieces (mostly real characters, with some radicals) used
    >>>> on keyboard labels, and the common forms they stand for. I didn't get
    >>>> any interest from the Cangjie development community or the authors of
    >>>> a book on Cangjie that I have, so I abandoned the idea.
    >>>> Uriah, is this the sort of thing you have in mind?
    >>>> > Similarly, the
    >>>> > UTC has a strong preference not to encode anything which isn't in actual
    >>>> > use. Proposals to encode characters because they would be useful if encoded
    >>>> > even though they aren't actually being used right now are generally looked
    >>>> > on with disfavor.
    >>>> >
    >>>> > On Apr 28, 2010 at 12:03 PM, Uriah Eisenstein wrote:
    >>>> >
    >>>> > Hello,
    >>>> > My question is about common components of CJK Ideographs which are not
    >>>> > encoded as independent Han characters (and perhaps indeed aren't). A good
    >>>> > example is the right-hand part of the character 漢 itself: it is a distinct
    >>>> > component appearing in multiple other characters, but is not encoded to the
    >>>> > best of my knowledge. The same goes for the top part of 鳥 and 島, the
    >>>> > surrounding part of 與 and 興 and several others. My question is whether there
    >>>> > are any plans or discussions for encoding these fragments in Unicode.
    >>>> >
    >>>> > (I haven't found anything about this in mailing list archives; I did find
    >>>> > statements that Unicode does not intend to provide any decomposition data of
    >>>> > Han characters :) And for good reasons. However, such fragments may well be
    >>>> > useful for third-party software dealing with 漢字 glyph generation, lookup by
    >>>> > components etc.)
    >>>> >
    >>>> > Thanks,
    >>>> > Uriah Eisenstein
    >>>> >
    >>>> >
    >>>> --
    >>>> Edward Mokurai (默雷/धर्ममेघशब्दगर्ज/ دھرممیگھشبدگر ج) Cherlin
    >>>> Silent Thunder is my name, and Children are my nation.
    >>>> The Cosmos is my dwelling place, the Truth my destination.

    This archive was generated by hypermail 2.1.5 : Mon May 10 2010 - 10:10:07 CDT