Re: CJK Ideograph Fragments

From: Mark Davis ☕ (mark@macchiato.com)
Date: Sat May 08 2010 - 18:11:18 CDT

    FYI, I have a table of radicals at
    https://spreadsheets.google.com/pub?key=0AqRLrRqNEKv-dHlVMzY0RFZ3MTFLZ0RldS1RNXN4Z3c&hl=en&output=html
    mapping them to Unified ideographs. Not yet complete (the X values are
    tentative, and I don't know if there are values for the ones marked
    "#VALUE!").
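
    For readers who want a starting point for such a table: a minimal
    sketch, assuming Python's standard unicodedata module, that derives
    radical-to-Unified mappings from Unicode compatibility decompositions.
    This is not how the spreadsheet above was built. The function name
    radical_to_unified and the variable names are hypothetical. Only some
    radicals carry such a decomposition (all of the Kangxi Radicals block,
    but very few code points in CJK Radicals Supplement), so the rest would
    still need hand-made entries.

```python
import unicodedata

# Block ranges from the Unicode code charts.
KANGXI_RADICALS = range(0x2F00, 0x2FD6)          # U+2F00..U+2FD5
CJK_RADICALS_SUPPLEMENT = range(0x2E80, 0x2EF4)  # U+2E80..U+2EF3

def radical_to_unified(ch):
    """Return the NFKC form of a radical, or None if Unicode defines no
    compatibility mapping for it (hypothetical helper)."""
    mapped = unicodedata.normalize("NFKC", ch)
    return mapped if mapped != ch else None

# Build the mapping for both radical blocks.
table = {
    chr(cp): radical_to_unified(chr(cp))
    for block in (KANGXI_RADICALS, CJK_RADICALS_SUPPLEMENT)
    for cp in block
}

# Radicals (and any unassigned code points in the range) with no
# mapping; these are the ones that need manual work.
unmapped = [r for r, u in table.items() if u is None]

print(len(table) - len(unmapped), "mapped;", len(unmapped), "need manual work")
```

    The unmapped list is where a hand-curated table is still needed.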

    I had also tried taking a look at the data at
    http://cvs.m17n.org/viewcvs/chise/ids/?sortdir=down&pathrev=kawabata#dirlist(IDS
    which Richard and John said was the best publicly available IDS
    data (although it has a GPL licence, which prevents many people from using
    it). While clearly a lot of work went into it, it is very flawed.

       - There are over 400 ill-formed IDS sequences.
       - There are 666 (coincidence?) characters that map to themselves (where
       you'd only expect that of "base" radicals).
       - About 5K characters are missing data.
       - There appears to be free variation between using CJK radicals and using
       the corresponding Unified CJK characters.
       - It uses many NCR components with cryptic IDs, instead of radicals or
       Unified CJK.
       - A cursory look shows a significant proportion of clear mistakes in the
       data (characters stacked vertically in the wrong order, for example).
       - Many characters cannot be recursively decomposed down to radicals.
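
    Several of the checks above can be sketched in a few lines. This is a
    minimal illustration, not the script that produced the counts above. It
    assumes the IDS grammar with the twelve description characters
    U+2FF0..U+2FFB, that every component is a single code point (the CHISE
    files also use entity references, which would need a tokenizer first),
    and a tiny made-up sample table. All function and variable names are
    hypothetical.

```python
# Ideographic Description Characters: U+2FF2 and U+2FF3 take three
# operands, the other ten in U+2FF0..U+2FFB take two.
TERNARY_IDCS = {"\u2FF2", "\u2FF3"}
BINARY_IDCS = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TERNARY_IDCS
ALL_IDCS = BINARY_IDCS | TERNARY_IDCS

def parse_ids(seq, pos=0):
    """Consume one IDS starting at pos; return the index just past it."""
    if pos >= len(seq):
        raise ValueError("truncated IDS")
    ch = seq[pos]
    if ch not in ALL_IDCS:
        return pos + 1                      # a leaf component
    arity = 3 if ch in TERNARY_IDCS else 2
    pos += 1
    for _ in range(arity):
        pos = parse_ids(seq, pos)           # each operand is itself an IDS
    return pos

def is_well_formed(seq):
    """True if seq is exactly one complete IDS with nothing left over."""
    try:
        return parse_ids(seq) == len(seq)
    except ValueError:
        return False

def decomposes_to(char, table, bases, seen=frozenset()):
    """True if char reduces recursively to members of bases."""
    if char in bases:
        return True
    ids = table.get(char)
    if ids is None or ids == char or char in seen:
        return False                        # missing, self-mapped, or cyclic
    leaves = [c for c in ids if c not in ALL_IDCS]
    return all(decomposes_to(c, table, bases, seen | {char}) for c in leaves)

# Made-up sample data: character -> IDS string.
sample = {
    "好": "⿰女子",
    "品": "⿱口⿰口口",
    "明": "⿰日",        # ill-formed: missing second operand
    "林": "⿰木木",      # 木 has no entry and is not a base
    "女": "女", "子": "子", "口": "口",
}
bases = {"女", "子", "口"}

ill_formed = [c for c, ids in sample.items() if not is_well_formed(ids)]
self_mapped = sorted(c for c, ids in sample.items() if ids == c)
stuck = [c for c in sample if not decomposes_to(c, sample, bases)]
```

    On real data, self_mapped entries that are not in the intended set of
    base radicals would be the suspicious ones.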

    So I'm not sure how much use the available IDS data would be in terms of
    looking at necessary components.

    Mark

    FYI, I posted some generated files at http://macchiato.com/ids/, in case
    anyone is curious as to details.

    — Il meglio è l’inimico del bene —

    On Sat, May 8, 2010 at 13:40, Asmus Freytag <asmusf@ix.netcom.com> wrote:

    > On 5/8/2010 11:44 AM, Uriah Eisenstein wrote:
    >
    >> Well,
    >> I've gone through the policies of submitting new characters and scripts
    >> and they don't look encouraging :) But neither do they seem to reject the
    >> idea of character fragments out of hand, as opposed to the reverse case -
    >> characters which can be expressed using existing characters and combining
    >> marks. In fact, the CJK Radicals Supplement block and the Hangul Jamo both
    >> contain character fragments, in a way. But I suppose these already existed
    >> in national standards rather than introduced by Unicode.
    >>
    >> In any case, examples I've seen of proposals cite experts and provide font
    >> makers, neither of whom I have contact with. So I guess I'll drop it for
    >> now, and hope that if someone takes it up I'll see it on the mailing list.
    >>
    > While a font is ultimately required for a proposal to become adopted, it
    > shouldn't be a bar to formally raising the issue for initial consideration.
    > Once something is considered potentially acceptable, there's enough time to
    > come up with fonts (for the purpose of printing charts) before the
    > committees need to vote on final approval. Proposals can take years from
    > initial consideration to publication....
    >
    > Your suggestion was that these fragments need to be enumerated for various
    > purposes in software and that having a standard enumeration is beneficial.
    > If you can document and support that assertion, I would encourage you to put
    > it on record.
    >
    > Doing so would allow a discussion of whether a standard enumeration is
    > indeed useful enough to incur the cost of standardization.
    >
    > In some ways, this would not be a run-of-the-mill character encoding
    > proposal, because you are not asserting that these fragments need encoding
    > for the purpose of directly expressing text. While that is the primary
    > purpose of character encoding, there are purposes that are ancillary to
    > this, that a universal character encoding such as Unicode must encompass.
    >
    > There is certainly some precedent for character codes that aren't limited
    > to the primary purpose I mentioned, but, because they don't represent a
    > standard situation, one needs to carefully argue why such uses need to be
    > covered by standardization and if so, why doing that as character codes is
    > appropriate.
    >
    > That is different from the more usual task to document that an entity
    > occurs in written or printed documents.
    >
    > The problem is, unless you actually put down all the details in a coherent
    > proposal it's hard to judge correctly what the situation is. When you raise
    > the question informally, all anyone can tell you is that an exceptional
    > request is one that needs exceptional justification, which, while certainly
    > correct, doesn't exactly help you or anyone to evaluate whether your
    > proposal would meet the required level and type of justification.
    >
    > A./
    >
    >>
    >> Thanks,
    >> Uriah
    >>
    >>
    >> On Sun, May 2, 2010 at 3:06 PM, Uriah Eisenstein <
    >> uriaheisenstein@gmail.com <mailto:uriaheisenstein@gmail.com>> wrote:
    >>
    >> Not exactly, but I suppose such Hanzi fragments could be used for
    >> similar purposes - e.g. looking up characters by components, where
    >> the available components may include non-character fragments. Some
    >> fragments may be useful for IME purposes, but probably not all.
    >>
    >>
    >> On Sat, May 1, 2010 at 8:57 PM, Edward Cherlin < echerlin@gmail.com
    >> <mailto:echerlin@gmail.com>> wrote:
    >>
    >> 2010/4/28 John H. Jenkins < jenkins@apple.com
    >> <mailto:jenkins@apple.com>>:
    >>
    >> > No. You could certainly write up a proposal and submit it
    >> to the UTC.
    >> > Should the UTC feel the idea has merit, it would then move
    >> it on to WG2
    >> > and/or the IRG.
    >> > The main problem here is that there is a very strong desire
    >> to limit
    >> > ideograph encoding to attested and documentable forms.
    >> Anything which does
    >> > not exist in actual texts is not likely to be well-regarded.
    >>
    >> I had the idea some years ago of writing up a proposal to
    >> encode the
    >> hanzi fragments used in Cangjie Shurufa IMEs. These fragments
    >> are used
    >> extensively in dozens of howto books on keyboarding in
    >> Cangjie. This
    >> includes the pieces (mostly real characters, with some
    >> radicals) used
    >> on keyboard labels, and the common forms they stand for. I
    >> didn't get
    >> any interest from the Cangjie development community or the
    >> authors of
    >> a book on Cangjie that I have, so I abandoned the idea.
    >>
    >> Uriah, is this the sort of thing you have in mind?
    >>
    >> > Similarly, the
    >> > UTC has a strong preference not to encode anything which
    >> isn't in actual
    >> > use. Proposals to encode characters because they would be
    >> useful if encoded
    >> > even though they aren't actually being used right now are
    >> generally looked
    >> > on with disfavor.
    >> >
    >> > On Apr 28, 2010, at 12:03 PM, Uriah Eisenstein wrote:
    >> >
    >> > Hello,
    >> > My question is about common components of CJK Ideographs
    >> which are not
    >> > encoded as independent Han characters (and perhaps indeed
    >> aren't). A good
    >> > example is the right-hand part of the character 漢 itself:
    >> it is a distinct
    >> > component appearing in multiple other characters, but is not
    >> encoded to the
    >> > best of my knowledge. The same goes for the top part of 鳥
    >> and 島, the
    >> > surrounding part of 與 and 興 and several others. My
    >> question is whether there
    >> > are any plans or discussions for encoding these fragments in
    >> Unicode.
    >> >
    >> > (I haven't found anything about this in mailing list
    >> archives; I did find
    >> > statements that Unicode does not intend to provide any
    >> decomposition data of
    >> > Han characters :) And for good reasons. However, such
    >> fragments may well be
    >> > useful for third-party software dealing with 漢字 glyph
    >> generation, lookup by
    >> > components etc.)
    >> >
    >> > Thanks,
    >> > Uriah Eisenstein
    >> >
    >> >
    >>
    >>
    >>
    >> --
    >> Edward Mokurai (默雷/धर्ममेघशब्दगर्ज/ دھرممیگھشبدگر ج) Cherlin
    >> Silent Thunder is my name, and Children are my nation.
    >> The Cosmos is my dwelling place, the Truth my destination.
    >> http://www.earthtreasury.org/
    >>
    >>
    >>
    >>
    >
    >



    This archive was generated by hypermail 2.1.5 : Sat May 08 2010 - 18:14:17 CDT