Re: New Ideographs in Unicode 3.0 and Beyond

From: John Jenkins
Date: Wed Sep 08 1999 - 17:51:50 EDT

John Cowan writes:

> I have read John Jenkins's paper with the above title with
> great interest, and I have a few questions:
> 1) Was there a Unified Repertoire and Ordering 1.0? How did it
> differ from the URO 2.0 we have today?

It was essentially the same, if I recall. I don't believe that it's
documented anywhere, and I'm not sure how the old CJK-JRG's numbering system
worked. If I'm not mistaken, it is basically the URO as of the time the
CJK-JRG took it over.

> 2) Is it correct to say that the characters in Extension A
> belong to source character sets which were not considered in
> constructing the URO 2.0?

Not entirely; the main exception is CNS 11643-1992. Technically, the URO
is based on CNS 11643-1986, but there was some awareness of the 1992 version
of the standard in its final stages. For all practical purposes, the
T-source for the CJK Unified Ideographs block is a very restricted subset
of CNS 11643-1992 (all of planes 1 and 2, and part of plane 3), and the
T-source for Extension A is somewhat less restricted.

One of the unstated goals in Extension A was to extend the repertoire to
cover as much of CNS 11643-1992 as possible.

> 4) Nit: the "y" in "ye" is a substitute for thorn, not eth.

Thanks. This is what I get for working from memory.

> 5) The glyph shown for the Ideographic Variation Indicator
> is shown with an enclosing dotted rectangle. In Unicode 2.0, such
> glyphs appearing in the character tables were pseudo-glyphs
> for characters not to be rendered. How does Unicode 3.0
> make clear which dotted-rectangle glyphs are, and which are not,
> pseudo-glyphs?

As Ken says, the pseudo-glyphs have Latin letters in them. Personally, I'd
prefer a more obvious distinction. Maybe in 4.0.

> 6) The BNF grammar on page 14 implies that a single ideograph by itself
> is an IDS: surely this is not correct. If this grammar appears
> in any authoritative text, there's a problem!

The grammar is correct as written. This is the formal basis for not allowing
(non-trivial) IDSs to represent encoded ideographs: since each ideograph is
itself an IDS of length 1, the shortest (and therefore "only allowed") IDS
for that ideograph is the ideograph itself. Something like that.

Or, as Ken says, you can rework the grammar slightly.
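
To make the point concrete, here is a minimal sketch of a recognizer for a
grammar of this shape, in which an IDS is either a single ideograph (or
radical) or an operator followed by the right number of sub-IDSs. It is not
the grammar from the paper. It assumes the Ideographic Description Characters
U+2FF0..U+2FFB, with U+2FF2 and U+2FF3 taking three operands and the rest
two. The names is_ideograph, parse_ids and is_ids are invented for this
example, and the ideograph test is a rough approximation using only a few
block ranges.

```python
# Assumption: operators are U+2FF0..U+2FFB; U+2FF2 and U+2FF3 are trinary.
TRINARY_OPS = {"\u2ff2", "\u2ff3"}
BINARY_OPS = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TRINARY_OPS


def is_ideograph(ch):
    """Rough stand-in for 'unified ideograph or radical' (approximate ranges)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs block
        or 0x3400 <= cp <= 0x4DBF   # Extension A block
        or 0x2E80 <= cp <= 0x2FDF   # radicals blocks
    )


def parse_ids(s, i=0):
    """Consume one IDS starting at s[i]; return the index just past it."""
    if i >= len(s):
        raise ValueError("unexpected end of sequence")
    ch = s[i]
    if is_ideograph(ch):
        return i + 1                # base case: one ideograph is an IDS
    if ch in BINARY_OPS:
        arity = 2
    elif ch in TRINARY_OPS:
        arity = 3
    else:
        raise ValueError("not an ideograph or operator: %r" % ch)
    i += 1
    for _ in range(arity):          # each operand is itself an IDS
        i = parse_ids(s, i)
    return i


def is_ids(s):
    """True if the whole string is exactly one IDS under this sketch."""
    try:
        return parse_ids(s) == len(s)
    except ValueError:
        return False
```

Under this sketch, is_ids("\u4e00") is true by the base case alone, which is
the length-1 IDS discussed above, while two ideographs side by side with no
operator are rejected.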

> 7) Page 17 says that IDSes cannot exceed 16 characters.
> Does this refer to Unicode abstract characters (= ISO 10646
> characters), which may be 16-bit or a 32-bit surrogate pair,
> or to 16-bit codes? The Unicode Standard 2.0
> regrettably uses "character" in multiple senses.
> (This too may need clarification in some standard.)

Abstract characters.
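
The difference only shows up once an IDS contains a character outside the
BMP. A small sketch, assuming the limit of 16 quoted in the question and
using U+20000 purely as an example of a code point that needs a surrogate
pair in UTF-16 (the function names and MAX_IDS_LENGTH are invented for this
illustration):

```python
MAX_IDS_LENGTH = 16  # the limit quoted in the question above


def abstract_char_count(s):
    """Count characters; a Python 3 str is indexed by code point."""
    return len(s)


def utf16_unit_count(s):
    """Count 16-bit code units; a surrogate pair counts as two."""
    return len(s.encode("utf-16-be")) // 2


def within_limit(s):
    """Apply the limit to abstract characters, not 16-bit codes."""
    return abstract_char_count(s) <= MAX_IDS_LENGTH


# An operator, one character outside the BMP, and one BMP ideograph.
sample = "\u2ff0" + chr(0x20000) + "\u6728"
print(abstract_char_count(sample))  # 3
print(utf16_unit_count(sample))     # 4
```

Counting this way, sixteen non-BMP characters still fit within the limit,
even though they occupy thirty-two 16-bit codes.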

John H. Jenkins
