Re: New Ideographs in Unicode 3.0 and Beyond

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 08 1999 - 16:43:18 EDT


John Cowan asked:

> 5) The glyph shown for the Ideographic Variation Indicator
> is shown with an enclosing dotted rectangle. In Unicode 2.0, such
> glyphs appearing in the character tables were pseudo-glyphs
> for characters not to be rendered. How does Unicode 3.0
> make clear which dotted-rectangle glyphs are, and which are not,
> pseudo-glyphs?

All of the "pseudo-glyphs" contain Latin letters that abbreviate
the character names in some way. U+303E has a different dashed
box from the pseudo-glyphs, contains an "approximately but not
equal to" symbol rather than letters, and has a prominent note in
the names list stating that it is a visibly displayed graphic
character and not an invisible formatting control.

>
> 6) The BNF grammar on page 14 implies that a single ideograph by itself
> is an IDS: surely this is not correct. If this grammar appears
> in any authoritative text, there's a problem!

It is intentional -- not a mistake. Either a character itself
or a radical can be the shortest description of itself. That
lends substance to the preference to use the shortest description
possible. If a character is encoded, there is normally no reason to use
IDC's to write a longer IDS for it (except for limited didactic
purposes).

Implementations that intend to parse IDS's are not required to
slavishly implement exactly the BNF written in the book, which
is written to be the cleanest statement:

IDS ::= UnifiedIdeograph | Radical |
        BinaryDescriptionOperator IDS IDS |
        TrinaryDescriptionOperator IDS IDS IDS

If an implementation doesn't want to trigger special IDS processing
for every Han character, it can just implement:

IDS ::= BinaryDescriptionOperator IDSNode IDSNode |
        TrinaryDescriptionOperator IDSNode IDSNode IDSNode
IDSNode ::= IDS | UnifiedIdeograph | Radical

and trigger special processing when you hit one of the IDC operators.
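
As an illustration only, the second grammar above could be sketched as a
small recursive-descent recognizer in Python. The function names
(is_operand, parse_node, find_ids) are hypothetical, and the code point
ranges are my assumption of the Unicode 3.0 repertoire (IDCs at
U+2FF0..U+2FFB, unified ideographs at U+4E00..U+9FA5 and U+3400..U+4DB5,
radicals at U+2E80..U+2EF3 and U+2F00..U+2FD5); check them against the
character database before relying on them.

```python
# IDC operators; which are binary and which trinary is taken from the
# Unicode 3.0 names list (assumed here, verify against the standard).
BINARY_IDC = {"\u2ff0", "\u2ff1"} | {chr(c) for c in range(0x2FF4, 0x2FFC)}
TRINARY_IDC = {"\u2ff2", "\u2ff3"}


def is_operand(ch):
    """True for a UnifiedIdeograph or Radical (approximate 3.0 ranges)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FA5 or 0x3400 <= cp <= 0x4DB5
            or 0x2E80 <= cp <= 0x2EF3 or 0x2F00 <= cp <= 0x2FD5)


def parse_node(text, pos):
    """Parse one IDSNode at text[pos]; return the index just past it."""
    if pos >= len(text):
        raise ValueError("unexpected end of sequence")
    ch = text[pos]
    if ch in BINARY_IDC:
        arity = 2
    elif ch in TRINARY_IDC:
        arity = 3
    elif is_operand(ch):
        return pos + 1  # a bare ideograph or radical is a complete node
    else:
        raise ValueError(f"U+{ord(ch):04X} cannot appear in an IDS")
    pos += 1
    for _ in range(arity):
        pos = parse_node(text, pos)  # operands may themselves be IDSes
    return pos


def find_ids(text, start):
    """Return the end index of an IDS starting at text[start], or None.

    Special processing is triggered only when text[start] is an IDC,
    so ordinary Han characters pass through untouched.
    """
    if text[start] not in BINARY_IDC | TRINARY_IDC:
        return None
    return parse_node(text, start)
```

With this sketch, find_ids("\u2ff0\u6728\u6728", 0) consumes the operator
and both operands, while find_ids("\u6728", 0) returns None because a lone
ideograph never enters the IDS path.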

>
> 7) Page 17 says that IDSes cannot exceed 16 characters.
> Does this refer to Unicode abstract characters (= ISO 10646
> characters), which may be 16-bit or a 32-bit surrogate pair,
> or to 16-bit codes? The Unicode Standard 2.0
> regrettably uses "character" in multiple senses.
> (This too may need clarification in some standard.)
>

The count is 16 abstract characters, regardless of the encoding
form used.
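
To make the distinction concrete, here is a rough Python sketch that
counts the same sequence both ways. It treats each code point as one
abstract character, which is a simplifying assumption on my part, and the
helper names are hypothetical. The supplementary code point U+20000 is
used purely as an example of something that takes a surrogate pair in
UTF-16.

```python
MAX_IDS_LENGTH = 16  # the limit discussed above, in abstract characters


def code_point_count(ids):
    """Number of code points; Python 3 strings index by code point."""
    return len(ids)


def utf16_unit_count(ids):
    """Number of 16-bit code units the sequence occupies in UTF-16."""
    return len(ids.encode("utf-16-le")) // 2


def within_limit(ids, limit=MAX_IDS_LENGTH):
    """Check the length limit by characters, not by code units."""
    return code_point_count(ids) <= limit


# An IDC followed by one supplementary-plane operand and one BMP operand.
sample = "\u2ff0" + "\U00020000" + "\u6728"
print(code_point_count(sample))  # 3 characters
print(utf16_unit_count(sample))  # 4 code units (one surrogate pair)
```

A checker that counted 16-bit code units would reject some sequences that
are within the limit when counted by characters.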

--Ken


