Re: IDS question

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Tue May 01 2001 - 19:12:45 EDT

Next message: Michael Everson: "Re: Tags and the Private Use Area"
Previous message: James Kass: "Re: UTF-8 on this list"

On Mon, 30 Apr 2001, Kenneth Whistler wrote:

> Thomas Chan asked:
> > I've recently been using Ideographic Description Sequences to describe
> > some Han characters that are not in Unicode 3.1, and I noticed that
> > U+3007 is not included in the set of "UnifiedIdeographs", despite having
>
> It was never considered to be part of the set of characters being
> dealt with by the IRG for unification, as far as I know. Instead, it
> was just treated as one more of the symbols that was mapped out of
> the various East Asian character standards. There apparently never was
> any unification issue for it, since no one would have encoded it twice
> in a legacy character set, and there are no traditional variations in
> its shape.

I've never seen a dictionary that recognized U+3007 as a non-symbol
either, and I only know of one possible argument for it being one, which I
mentioned in an Oct 3, 2000 posting in the "U+3007 not a Hanzi?" thread
(egroups.com seems not to have archived it). I won't repeat all the
details here, but this was the exhibit (second column from the right):
http://deall.ohio-state.edu/grads/chan.200/misc/xin_tangshu-76.3481.jpg

> > However, those aren't valid sequences. I realize the above two characters
> > are rather odd, but the likes of U+3AB3 and U+3AC8 would have faced the
> > same problem, since they also incorporate a circular component.
>
> There are other characters that might be difficult to describe using
> IDS. Many of the oddballs in Extension B could fall into this category.
> Just on the first chart, 20008, 20067, 20069, 20073, and so on might be
> hard to describe in terms of the IDC's, because of their odd pieces.

Just out of curiosity, I looked those four up. U+20067 and U+20069 are
guwen, ancient (variant) forms of zhong 'center' U+4E2D. They probably
never existed naturally except in xiaozhuan 'lesser seal' forms; what is
seen in dictionaries, Unicode charts, etc. is a mechanical conversion to
contemporary Chinese writing forms. They don't seem to belong to the
chronologically later layer of Han characters that IDS were meant to
describe--perhaps they are best described with U+303E IDEOGRAPHIC
VARIATION INDICATOR, or just considered a kind of ancient z-variant.

I couldn't find U+20008 documented as a guwen, but it seems like one, a
variant of qiu 'hill' U+4E18. Like U+20067 and U+20069, it's also in
plane 6 of CNS 11643-1992, which hints that the compilers of that
standard considered it a guwen.

I don't have a reference for U+20073, and the unihan.txt file indicates it
was found in the _Siku Quanshu_ (G-4K) collectanea. However, I think it
could be described with IDS, using overlaying (U+2FFB)--although that is
not very satisfactory.
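
To make the validity question concrete, here is a minimal sketch of a
checker for the IDS grammar as I understand it from Unicode 3.x: an IDS is
a unified ideograph, a radical, or an IDC followed by two (or, for U+2FF2
and U+2FF3, three) IDS. The code point ranges are my approximation of the
Unicode 3.1 repertoire, the function names are made up for illustration,
the standard's length restrictions are not checked, and the sample
sequences are hypothetical, not the ones from my original question.

```python
# Sketch of the Unicode 3.x IDS grammar (ranges are approximations).
BINARY_IDCS = {0x2FF0, 0x2FF1} | set(range(0x2FF4, 0x2FFC))  # incl. U+2FFB overlay
TERNARY_IDCS = {0x2FF2, 0x2FF3}

# Characters allowed as IDS operands: unified ideographs and radicals.
OPERAND_RANGES = [
    (0x4E00, 0x9FA5),    # CJK Unified Ideographs
    (0x3400, 0x4DB5),    # Extension A
    (0x20000, 0x2A6D6),  # Extension B
    (0x2E80, 0x2EF3),    # CJK Radicals Supplement
    (0x2F00, 0x2FD5),    # Kangxi Radicals
]


def is_operand(cp):
    """True if the code point may stand alone as an IDS operand."""
    return any(lo <= cp <= hi for lo, hi in OPERAND_RANGES)


def parse_ids(text, pos=0):
    """Return the index just past one IDS starting at pos, or -1 if invalid."""
    if pos >= len(text):
        return -1
    cp = ord(text[pos])
    if is_operand(cp):
        return pos + 1
    if cp in BINARY_IDCS:
        arity = 2
    elif cp in TERNARY_IDCS:
        arity = 3
    else:
        return -1  # e.g. U+3007, which is in neither set
    pos += 1
    for _ in range(arity):
        pos = parse_ids(text, pos)
        if pos < 0:
            return -1
    return pos


def is_valid_ids(text):
    """True if the whole string is exactly one well-formed IDS."""
    return parse_ids(text) == len(text)


# Hypothetical samples: an overlay of two unified ideographs is accepted,
# while the same overlay with U+3007 as an operand is rejected.
print(is_valid_ids("\u2FFB\u53E3\u4E28"))
print(is_valid_ids("\u2FFB\u53E3\u3007"))
```

Under these assumptions, any sequence that uses U+3007 as a component is
rejected, which is the problem I ran into.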

> Also, the use of IDC's was originally envisioned to include also a
> large number of "components", to complement the already encoded
> radicals, so that the common pattern of sticking a radical onto a
> component could be simply described in those instances where the
> component itself does not constitute a stand-alone character. Don't
> be surprised if China yet decides to submit hundreds of components for
> encoding, just to cover this kind of situation.

I think I might have seen some of these bits and pieces from GB
13000.1-93, which are in the PUA of fonts such as "MS Song", along with
pre-Unicode 3.0 IDC's, etc. HKSCS had some, too.

> However, I don't think the IDC's were intended to be a complete,
> closed mechanism for describing any ideograph ever encountered, no
> matter how bizarre (such as those for ideographs that just happened
> to be miscarved on a wood block at some point in history).

I don't have illusions about the limitations of IDS, and it's not the only
notation around, either. However, it is probably one of the more
well-known schemes due to its inclusion and description in Unicode, and
I'd prefer to use it when possible for that reason.

> > What would be the advisable way to handle these cases, besides
> > creating invalid IDS sequences, using the PUA, or giving a prose
> > description?
>
> My suggestion would be that you just give prose descriptions, and
> check in with the IRG that these are included in their sources for
> work on Vertical Extension C.

A prose description sounds fine to me. Thank you for your comments on
this matter.

Do I simply ask the IRG (who, specifically?) if they will include some/all
Han characters from such-and-such dictionary?

> For this particular instance, I suppose you could also apply to
> the UTC with a proposal to add U+3007 to the IDS syntax, to make
> these two descriptions "legal". I'm not sure it is worth the effort,
> however.

It's not terribly important, no--I'll pass. Besides possibly deviating
from the GB 13000.1-93 version of IDS, and the relatively unproductive
role that U+3007 would play in IDS, there are probably other weird
"shapes" like triangles (for U+3403, etc) that could also be proposed to
participate in IDS. It's simpler to just submit them to Ext C, isn't it?

Thomas Chan
tc31@cornell.edu


This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:16 EDT