Re: CJK Ideograph Fragments

From: Mark Davis ☕
Date: Sat May 08 2010 - 18:11:18 CDT

    FYI, I have a table of radicals at
    mapping them to Unified ideographs. Not yet complete (the X values are
    tentative, and I don't know if there are values for the ones marked "#VALUE!").

    I had also tried taking a look at the data at
    which Richard and John said was the best publicly available IDS
    data (although it has a GPL licence, which prevents many people from using
    it). While clearly a lot of work went into it, it is very flawed.

       - There are over 400 ill-formed IDS sequences.
       - There are 666 (coincidence?) characters that map to themselves (where
       you'd only expect that of "base" radicals).
       - About 5K characters are missing data.
       - There appears to be free variation between using CJK radicals and using
       the corresponding Unified CJK characters.
       - It uses many NCR components with cryptic IDs, instead of radicals or
       Unified CJK.
       - A cursory look shows a significant proportion of clear mistakes in the
       data (characters stacked vertically in the wrong order, for example).
       - Many characters cannot be recursively decomposed down to radicals.
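
    For anyone wanting to repeat the first two checks on their own copy of
    the data, here is a minimal Python sketch. It is not the code behind the
    counts above. It assumes an IDS is written in prefix notation using the
    Ideographic Description Characters U+2FF0..U+2FFB, and that cryptic
    components appear as entity-style references of the form "&NAME;" (an
    assumption about the file format). The function names are made up for
    illustration.

```python
import re

# Ideographic Description Characters U+2FF0..U+2FFB.
TERNARY = {"\u2FF2", "\u2FF3"}  # these two operators take three operands
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY

# Assumption: an entity-style reference such as "&CDP-8B7C;" is one component.
TOKEN = re.compile(r"&[^;\s]+;|.", re.DOTALL)


def is_well_formed_ids(ids: str) -> bool:
    """Return True if `ids` is exactly one complete prefix-notation IDS."""
    if not ids:
        return False
    needed = 1  # operands still expected
    for token in TOKEN.findall(ids):
        if needed == 0:
            return False  # extra material after a complete sequence
        if token in TERNARY:
            needed += 2  # fills one slot, opens three
        elif token in BINARY:
            needed += 1  # fills one slot, opens two
        else:
            needed -= 1  # anything else is treated as a leaf component
    return needed == 0


def is_self_mapping(char: str, ids: str) -> bool:
    """Return True if a character is described only by itself."""
    return ids == char
```

    Counting outstanding operands is enough to catch truncated sequences
    and trailing components. It does not judge whether the components make
    sense for the character being described.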

    So I'm not sure how much use the available IDS data would be in terms of
    looking at necessary components.
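
    The "cannot be recursively decomposed" test can be sketched in the same
    spirit. This is only an illustration under simplifying assumptions: the
    table is a plain dict from character to IDS string, every component is
    a single code point (no entity references), and `bases` is whatever set
    of radicals one chooses to accept as leaves. The toy entries below are
    not taken from the data set.

```python
# Layout operators U+2FF0..U+2FFB are not components themselves.
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def decomposes_to(char, table, bases, _seen=frozenset()):
    """Return True if `char` reduces, via `table`, entirely to `bases`."""
    if char in bases:
        return True
    ids = table.get(char)
    if ids is None or ids == char or char in _seen:
        return False  # missing data, self-mapping, or a cycle
    seen = _seen | {char}
    return all(
        decomposes_to(part, table, bases, seen)
        for part in ids
        if part not in IDC
    )


# Hypothetical toy data for illustration only.
table = {
    "湖": "⿰氵胡",
    "胡": "⿰古月",
    "古": "⿱十口",
    "好": "⿰女子",
    "乙": "乙",
}
bases = {"氵", "月", "十", "口"}

print(decomposes_to("湖", table, bases))  # every part bottoms out in bases
print(decomposes_to("好", table, bases))  # 女 and 子 have no entry here
```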


    FYI, I posted some generated files at, in case
    anyone is curious as to details.

    — Il meglio è l’inimico del bene —

    On Sat, May 8, 2010 at 13:40, Asmus Freytag wrote:

    > On 5/8/2010 11:44 AM, Uriah Eisenstein wrote:
    >> Well,
    >> I've gone through the policies of submitting new characters and scripts
    >> and they don't look encouraging :) But neither do they seem to reject the
    >> idea of character fragments out of hand, as opposed to the reverse case -
    >> characters which can be expressed using existing characters and combining
    >> marks. In fact, the CJK Radicals Supplement block and the Hangul Jamo both
    >> contain character fragments, in a way. But I suppose these already existed
    >> in national standards rather than introduced by Unicode.
    >> In any case, examples I've seen of proposals cite experts and provide font
    >> makers, neither of whom I have contact with. So I guess I'll drop it for
    >> now, and hope that if someone takes it up I'll see it on the mailing list.
    > While a font is ultimately required for a proposal to become adopted, it
    > shouldn't be a bar to formally raising the issue for initial consideration.
    > Once something is considered potentially acceptable, there's enough time to
    > come up with fonts (for the purpose of printing charts) before the
    > committees need to vote on final approval. Proposals can take years from
    > initial consideration to publication....
    > Your suggestion was that these fragments need to be enumerated for various
    > purposes in software and that having a standard enumeration is beneficial.
    > If you can document and support that assertion, I would encourage you to put
    > it on record.
    > Doing so would allow a discussion of whether a standard enumeration is
    > indeed useful enough to incur the cost of standardization.
    > In some ways, this would not be a run-of-the-mill character encoding
    > proposal, because you are not asserting that these fragments need encoding
    > for the purpose of directly expressing text. While that is the primary
    > purpose of character encoding, there are purposes that are ancillary to
    > this, that a universal character encoding such as Unicode must encompass.
    > There is certainly some precedent for character codes that aren't limited
    > to the primary purpose I mentioned, but, because they don't represent a
    > standard situation, one needs to carefully argue why such uses need to be
    > covered by standardization and if so, why doing that as character codes is
    > appropriate.
    > That is different from the more usual task to document that an entity
    > occurs in written or printed documents.
    > The problem is, unless you actually put down all the details in a coherent
    > proposal it's hard to judge correctly what the situation is. When you raise
    > the question informally, all anyone can tell you is that an exceptional
    > request is one that needs exceptional justification, which, while certainly
    > correct, doesn't exactly help you or anyone to evaluate whether your
    > proposal would meet the required level and type of justification.
    > A./
    >> Thanks,
    >> Uriah
    >> On Sun, May 2, 2010 at 3:06 PM, Uriah Eisenstein wrote:
    >> Not exactly, but I suppose such Hanzi fragments could be used for
    >> similar purposes - e.g. looking up characters by components, where
    >> the available components may include non-character fragments. Some
    >> fragments may be useful for IME purposes, but probably not all.
    >> On Sat, May 1, 2010 at 8:57 PM, Edward Cherlin wrote:
    >> 2010/4/28 John H. Jenkins:
    >> > No. You could certainly write up a proposal and submit it to the UTC.
    >> > Should the UTC feel the idea has merit, it would then move it on to WG2
    >> > and/or the IRG.
    >> > The main problem here is that there is a very strong desire to limit
    >> > ideograph encoding to attested and documentable forms. Anything which does
    >> > not exist in actual texts is not likely to be well-regarded.
    >> I had the idea some years ago of writing up a proposal to encode the
    >> hanzi fragments used in Cangjie Shurufa IMEs. These fragments are used
    >> extensively in dozens of howto books on keyboarding in Cangjie. This
    >> includes the pieces (mostly real characters, with some radicals) used
    >> on keyboard labels, and the common forms they stand for. I didn't get
    >> any interest from the Cangjie development community or the authors of
    >> a book on Cangjie that I have, so I abandoned the idea.
    >> Uriah, is this the sort of thing you have in mind?
    >> > Similarly, the
    >> > UTC has a strong preference not to encode anything which isn't in actual
    >> > use. Proposals to encode characters because they would be useful if encoded
    >> > even though they aren't actually being used right now are generally looked
    >> > on with disfavor.
    >> >
    >> > On Apr 28, 2010, at 12:03 PM, Uriah Eisenstein wrote:
    >> >
    >> > Hello,
    >> > My question is about common components of CJK Ideographs which are not
    >> > encoded as independent Han characters (and perhaps indeed aren't). A good
    >> > example is the right-hand part of the character 漢 itself: it is a distinct
    >> > component appearing in multiple other characters, but is not encoded to the
    >> > best of my knowledge. The same goes for the top part of 鳥 and 島, the
    >> > surrounding part of 與 and 興 and several others. My question is whether there
    >> > are any plans or discussions for encoding these fragments in Unicode.
    >> >
    >> > (I haven't found anything about this in mailing list archives; I did find
    >> > statements that Unicode does not intend to provide any decomposition data of
    >> > Han characters :) And for good reasons. However, such fragments may well be
    >> > useful for third-party software dealing with 漢字 glyph generation, lookup by
    >> > components etc.)
    >> >
    >> > Thanks,
    >> > Uriah Eisenstein
    >> >
    >> >
    >> --
    >> Edward Mokurai (默雷/धर्ममेघशब्दगर्ज/ دھرممیگھشبدگر ج) Cherlin
    >> Silent Thunder is my name, and Children are my nation.
    >> The Cosmos is my dwelling place, the Truth my destination.

    This archive was generated by hypermail 2.1.5 : Sat May 08 2010 - 18:14:17 CDT