From: Mark Davis ☕ (firstname.lastname@example.org)
Date: Sat May 08 2010 - 18:11:18 CDT
FYI, I have a table of radicals at
mapping them to Unified ideographs. Not yet complete (the X values are
tentative, and I don't know if there are values for the ones marked "#VALUE!").
I had also tried taking a look at the data at
which Richard and John said was the best publicly available IDS
data (although it has a GPL licence, which prevents many people from using
it). While clearly a lot of work went into it, it is very flawed.
- There are over 400 ill-formed IDS sequences.
- There are 666 (coincidence?) characters that map to themselves (where
you'd only expect that of "base" radicals).
- About 5K characters are missing data.
- There appears to be free variation between using CJK radicals and using
the corresponding Unified CJK characters.
- It uses many NCR components with cryptic IDs, instead of radicals or
- A cursory look shows a significant proportion of clear mistakes in the
data (characters stacked vertically in the wrong order, for example).
- Many characters cannot be recursively decomposed down to radicals.
So I'm not sure how much use the available IDS data would be in terms of
looking at necessary components.
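For anyone who wants to run similar checks on their own copy of the data, here is a minimal sketch in Python of three of the tests listed above: well-formedness, self-mapping, and recursive decomposition. It is not the code that produced the generated files. The function names, the dict-based table, and the set of "leaf" components are all hypothetical, and it assumes only the twelve Ideographic Description Characters U+2FF0..U+2FFB, with each IDS held as a plain string of code points. It ignores any length limits on IDS, and entity-style component IDs would have to be tokenized first.

```python
# Ideographic Description Characters, U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; the other ten take two.
TERNARY = {"\u2FF2", "\u2FF3"}
BINARY = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TERNARY
IDCS = BINARY | TERNARY


def _parse(seq, i):
    """Consume one IDS starting at index i; return the index after it, or -1."""
    if i >= len(seq):
        return -1
    ch = seq[i]
    arity = 2 if ch in BINARY else 3 if ch in TERNARY else 0
    i += 1
    for _ in range(arity):
        i = _parse(seq, i)
        if i < 0:
            return -1
    return i


def is_well_formed(seq):
    """True if seq is exactly one complete IDS with nothing left over."""
    return _parse(seq, 0) == len(seq)


def maps_to_itself(char, seq):
    """True if the entry for char is just char (expected only for base parts)."""
    return seq == char


def decomposes_to(char, table, leaves, _seen=None):
    """True if char reduces, via table, to members of leaves only.

    table  -- hypothetical dict mapping a character to its IDS string
    leaves -- hypothetical set of components accepted as atomic (e.g. radicals)
    """
    seen = _seen or set()
    if char in leaves:
        return True
    seq = table.get(char)
    if seq is None or seq == char or char in seen:
        return False  # missing data, self-mapping non-leaf, or a cycle
    seen = seen | {char}
    return all(c in IDCS or decomposes_to(c, table, leaves, seen) for c in seq)
```

For example, `is_well_formed("⿰女子")` accepts a left-right pair, while `is_well_formed("⿰女")` rejects a sequence whose operator is missing an operand. Counting the entries that fail each predicate gives figures comparable in kind to the ones above, though the exact totals depend on how components are tokenized.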
FYI, I posted some generated files at http://macchiato.com/ids/, in case
anyone is curious as to details.
— Il meglio è l’inimico del bene —
On Sat, May 8, 2010 at 13:40, Asmus Freytag <email@example.com> wrote:
> On 5/8/2010 11:44 AM, Uriah Eisenstein wrote:
>> I've gone through the policies of submitting new characters and scripts
>> and they don't look encouraging :) But neither do they seem to reject the
>> idea of character fragments out of hand, as opposed to the reverse case -
>> characters which can be expressed using existing characters and combining
>> marks. In fact, the CJK Radicals Supplement block and the Hangul Jamo both
>> contain character fragments, in a way. But I suppose these already existed
>> in national standards rather than introduced by Unicode.
>> In any case, examples I've seen of proposals cite experts and provide font
>> makers, neither of whom I have contact with. So I guess I'll drop it for
>> now, and hope that if someone takes it up I'll see it on the mailing list.
> While a font is ultimately required for a proposal to become adopted, it
> shouldn't be a bar to formally raising the issue for initial consideration.
> Once something is considered potentially acceptable, there's enough time to
> come up with fonts (for the purpose of printing charts) before the
> committees need to vote on final approval. Proposals can take years from
> initial consideration to publication....
> Your suggestion was that these fragments need to be enumerated for various
> purposes in software and that having a standard enumeration is beneficial.
> If you can document and support that assertion, I would encourage you to put
> it on record.
> Doing so would allow a discussion of whether a standard enumeration is
> indeed useful enough to incur the cost of standardization.
> In some ways, this would not be a run-of-the-mill character encoding
> proposal, because you are not asserting that these fragments need encoding
> for the purpose of directly expressing text. While that is the primary
> purpose of character encoding, there are purposes that are ancillary to
> this, that a universal character encoding such as Unicode must encompass.
> There is certainly some precedent for character codes that aren't limited
> to the primary purpose I mentioned, but, because they don't represent a
> standard situation, one needs to carefully argue why such uses need to be
> covered by standardization and if so, why doing that as character codes is
> That is different from the more usual task to document that an entity
> occurs in written or printed documents.
> The problem is, unless you actually put down all the details in a coherent
> proposal it's hard to judge correctly what the situation is. When you raise
> the question informally, all anyone can tell you is that an exceptional
> request is one that needs exceptional justification, which, while certainly
> correct, doesn't exactly help you or anyone to evaluate whether your
> proposal would meet the required level and type of justification.
>> On Sun, May 2, 2010 at 3:06 PM, Uriah Eisenstein <
>> firstname.lastname@example.org <mailto:email@example.com>> wrote:
>> Not exactly, but I suppose such Hanzi fragments could be used for
>> similar purposes - e.g. looking up characters by components, where
>> the available components may include non-character fragments. Some
>> fragments may be useful for IME purposes, but probably not all.
>> On Sat, May 1, 2010 at 8:57 PM, Edward Cherlin < firstname.lastname@example.org
>> <mailto:email@example.com>> wrote:
>> 2010/4/28 John H. Jenkins < firstname.lastname@example.org
>> > No. You could certainly write up a proposal and submit it
>> to the UTC.
>> > Should the UTC feel the idea has merit, it would then move
>> it on to WG2
>> > and/or the IRG.
>> > The main problem here is that there is a very strong desire
>> to limit
>> > ideograph encoding to attested and documentable forms.
>> Anything which does
>> > not exist in actual texts is not likely to be well-regarded.
>> I had the idea some years ago of writing up a proposal to
>> encode the
>> hanzi fragments used in Cangjie Shurufa IMEs. These fragments
>> are used
>> extensively in dozens of howto books on keyboarding in
>> Cangjie. This
>> includes the pieces (mostly real characters, with some
>> radicals) used
>> on keyboard labels, and the common forms they stand for. I
>> didn't get
>> any interest from the Cangjie development community or the
>> authors of
>> a book on Cangjie that I have, so I abandoned the idea.
>> Uriah, is this the sort of thing you have in mind?
>> > Similarly, the
>> > UTC has a strong preference not to encode anything which
>> isn't in actual
>> > use. Proposals to encode characters because they would be
>> useful if encoded
>> > even though they aren't actually being used right now are
>> generally looked
>> > on with disfavor.
>> > On Apr 28, 2010, at 12:03 PM, Uriah Eisenstein wrote:
>> > Hello,
>> > My question is about common components of CJK Ideographs
>> which are not
>> > encoded as independent Han characters (and perhaps indeed
>> aren't). A good
>> > example is the right-hand part of the character 漢 itself:
>> it is a distinct
>> > component appearing in multiple other characters, but is not
>> encoded to the
>> > best of my knowledge. The same goes for the top part of 鳥
>> and 島, the
>> > surrounding part of 與 and 興 and several others. My
>> question is whether there
>> > are any plans or discussions for encoding these fragments in
>> > (I haven't found anything about this in mailing list
>> archives; I did find
>> > statements that Unicode does not intend to provide any
>> decomposition data of
>> > Han characters :) And for good reasons. However, such
>> fragments may well be
>> > useful for third-party software dealing with 漢字 glyph
>> generation, lookup by
>> > components etc.)
>> > Thanks,
>> > Uriah Eisenstein
>> Edward Mokurai (默雷/धर्ममेघशब्दगर्ज/ دھرممیگھشبدگر ج) Cherlin
>> Silent Thunder is my name, and Children are my nation.
>> The Cosmos is my dwelling place, the Truth my destination.