RE: Ideographic Description

From: Kenneth Whistler
Date: Thu Sep 09 1999 - 19:00:29 EDT

Marco Cimarosti continued this discussion:

> I cannot help thinking that IDS could be useful not only to provide a
> human-readable "description" of rare ideographs, but also as a
> *machine-readable* alternative spelling (or "decomposition") of any CJK
> character, including the ones that are already coded in the Unified
> Ideographs section.
> I have an impression that, at an early stage of design, this orientation
> could have been the original idea behind the proposal of IDS.

Some might have thought that, but there are numerous Han character
decomposition schemes that have been proposed, most of which are
better designed for decomposition than the set of 12 IDC's currently
in 10646/Unicode. Prof. Hsieh in Taiwan has a particularly well-researched
and sound proposal, for example, that makes use of many fewer operators.

> One of John Jenkins' statements contributes (willingly or not:-) to this
> impression:
> "If there had been any requirement that Unicode conformance would imply
> parsing and dealing with IDSs, they never would have made it into the
> standard."
> As I read these words, I can nearly hear an echo of the clashes at some
> committee meeting!

There were in fact such discussions at the WG2 meeting. Basically they
were the background for clarifying what the proposed IDC characters
(from GBK) were, so that no one would be confused about their intended
usage once they became part of the 10646/Unicode standards. The UTC,
in particular, was adamantly opposed to having these characters brought
in for compatibility with GBK and *then* having them turned to use for
decomposition rather than just ideograph description. So the direction
you are trying to head with this is directly contrary to the unanimously
expressed intent of the UTC in assenting to the encoding of the IDC's.

> But there are several other aspects that make me think that IDS was designed
> primarily for rendering.
> First of all, why a prefix notation? I think that many people would agree
> that expressions using infix operators are much more human-readable.

You answer your own question below. Recursive infix operators are inherently
ambiguous as to the scope of their operands, requiring bracketing conventions
for clarification:

> For our human brain, "infix" expressions (like the every-day arithmetic) are
> quite intuitive:
> 3 + 2 - 1
> NOT this AND that
> eye beside dog over dog beside dog

This could be:

eye beside (dog over (dog beside dog))

or

(eye beside dog) over (dog beside dog)

or even

((eye beside dog) over dog) beside dog

It would be a bad design to use infix operators for plain text
description of ideographs, because of this ambiguity.
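
The ambiguity can be made concrete with a minimal sketch that enumerates every bracketing of the infix string. The function name `bracketings` and the word tokens are assumptions for illustration only; nothing here is part of any standard.

```python
def bracketings(tokens):
    """Yield every fully bracketed reading of an infix token list.

    tokens alternates operand, operator, operand, ...
    """
    if len(tokens) == 1:
        yield tokens[0]
        return
    # Treat each operator in turn as the top-level split point.
    for i in range(1, len(tokens), 2):
        for left in bracketings(tokens[:i]):
            for right in bracketings(tokens[i + 1:]):
                yield f"({left} {tokens[i]} {right})"


expr = "eye beside dog over dog beside dog".split()
readings = list(bracketings(expr))
for reading in readings:
    print(reading)
```

With four operands and three binary operators this yields five readings, of which the three listed above are a subset.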

> and even more intuitive than "postfix" expression (as used in Postscript or
> some old calculators):
> 3 2 + 1 -
> this NOT that AND
> eye dog dog dog beside over beside

The problem with a postfix expression for plain text is the backtracking
issue. Most text processes try to operate in the logically forward
direction, while limiting backtracking as much as possible because
of its implications for efficiency. Postfix expressions work when you
have stack-oriented processing (as in PostScript), but in plain text,
as you can see in your example, this just results in garden-path parsing:

eye dog
eye dog dog
eye dog dog dog
eye dog dog dog beside <== backup two & rebracket
eye dog (dog dog beside)
eye dog (dog dog beside) over <== backup four & rebracket
eye (dog (dog dog beside) over)
eye (dog (dog dog beside) over) beside <== backup six & rebracket
(eye (dog (dog dog beside) over) beside)
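
For comparison, the stack-oriented processing mentioned above can be sketched as follows. The toy operator vocabulary and the name `parse_postfix` are assumptions for illustration. The point is that every operand has to be held on the stack until its operator finally arrives.

```python
OPERATORS = {"beside", "over"}  # both binary in this toy vocabulary


def parse_postfix(tokens):
    """Return (bracketed_string, max_stack_depth) for a postfix token list."""
    stack = []
    max_depth = 0
    for tok in tokens:
        if tok in OPERATORS:
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} is missing operands")
            right = stack.pop()
            left = stack.pop()
            stack.append(f"({left} {right} {tok})")
        else:
            stack.append(tok)
        max_depth = max(max_depth, len(stack))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0], max_depth


result, depth = parse_postfix("eye dog dog dog beside over beside".split())
print(result)  # (eye (dog (dog dog beside) over) beside)
print(depth)   # 4
```

All four operands are pending before any structure is known, which is the plain-text equivalent of the backing up shown in the trace.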

> On the other hand, postfix and prefix notations are much more easily parsed
> by a computer (they are very performant syntaxes, especially the prefix
> notation, and by no means an "enormous overhead").

The prefix notation is the logical choice for plain text description,
all things considered.
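
A minimal sketch of why the prefix form is easy to process in the forward direction: each IDC announces how many operands follow, so a single recursive pass suffices. The arity table assumes the 12 IDCs at U+2FF0..U+2FFB, two of which take three operands; the function name `parse_ids` and the tuple tree shape are illustrative choices, not anything defined by the standard.

```python
# Arity of the 12 IDCs U+2FF0..U+2FFB: two are ternary, the rest binary.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3  # left, middle, right
ARITY["\u2FF3"] = 3  # above, middle, below


def parse_ids(text, pos=0):
    """Parse one description starting at text[pos]; return (tree, next_pos)."""
    if pos >= len(text):
        raise ValueError("truncated description")
    ch = text[pos]
    pos += 1
    if ch not in ARITY:
        return ch, pos  # an ordinary character is a leaf
    children = []
    for _ in range(ARITY[ch]):
        child, pos = parse_ids(text, pos)
        children.append(child)
    return (ch, *children), pos


# "eye beside (dog over (dog beside dog))" in prefix form:
# U+2FF0 = left-to-right, U+2FF1 = above-to-below,
# U+76EE = eye, U+72AC = dog.
sample = "\u2FF0\u76EE\u2FF1\u72AC\u2FF0\u72AC\u72AC"
tree, end = parse_ids(sample)
print(tree, end == len(sample))
```

No backtracking and no bracketing convention is needed; the parser never looks at a character twice.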

The "enormous overhead" that John Jenkins is talking about is not the
processing time required to deal with the trivial prefix BNF syntax
for IDS's, but the huge equivalence tables that would have to be
carried around to try to interpret all the equivalent relations for
all the possible alternative representations of the "same" character,
once you start allowing these descriptions to behave as decompositions.
And as John also pointed out, the devil is in the details. Once you
start decomposing the characters, you get into a tremendous mess dealing
with variants that may or may not be the "same" thing. The combinatorics
and heuristics quickly start to spiral out of control.
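
A toy sketch of how the equivalence problem grows, using an entirely hypothetical description table (the names `DESCRIPTIONS` and `spellings`, and the placeholder characters X, A, B, are assumptions for illustration). Every decomposable component may be left whole or expanded, and every variant analysis multiplies the count.

```python
from itertools import product

# Hypothetical one-level descriptions in prefix form. X, A and B stand
# for decomposable characters; lowercase names are atomic components.
DESCRIPTIONS = {
    "X": [("beside", "A", "B")],
    "A": [("over", "p", "q")],
    "B": [("over", "r", "s"), ("beside", "t", "u")],  # two variant analyses
}


def spellings(ch):
    """Return every prefix-notation string that could stand for ch."""
    results = {ch}  # the character may always be left unexpanded
    for op, *parts in DESCRIPTIONS.get(ch, []):
        for combo in product(*(sorted(spellings(p)) for p in parts)):
            results.add(" ".join([op, *combo]))
    return results


all_x = spellings("X")
for s in sorted(all_x):
    print(s)
print(len(all_x))  # 7
```

Even with three table entries and two levels of nesting, one character already has seven equivalent spellings that a conformant process would have to treat as the "same" thing.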

There are good reasons why no Han ideographic decomposition system
has ever had any success as an *encoding* for text. Such systems make
sense only as adjuncts to the character-oriented encoding--particularly
for keyboards and input methods.

> So, why has a hard-to-understand-but-easy-to-parse representation been
> preferred to a hard-to-parse-but-very-intuitive-to-read representation?

Explained above.

> Another fact that makes me think that IDS is more suited for rendering than
> for enjoying, is the great level of graphical detail provided by the IDCs.
> Consider this example: provided that humans are intelligent, and that
> Far-Eastern humans can read their own languages, what is the need for having
> 5 different IDCs for basically the same "surround" relation? That is: one
> would be sufficient, as any reader is intelligent enough to infer the 5 slightly
> different cases by the very shape of the surrounding component.

I agree completely with you on this. The UTC pointed this out, but what
it came down to is that the 12 characters *already* existed as they were
in GBK, and were used as defined in GBK implementations. Their encoding
in 10646/Unicode is a compatibility issue with GBK, and there was no good
reason not to make the full set available to allow transparent transcoding
with GBK implementations.

> I hope it is clear what I am trying to say: I am not meaning in any way that
> IDS is poorly designed or that it should be changed, but rather that it
> looks as if it was designed primarily for a different purpose.
> Or, putting it even more positively, that it is designed in such a way that
> it is *also* well fit for a different (and very interesting) purpose, beside
> the one for which it is currently intended.

Nope. Here is where I get off the bus.

> I am just thinking that, once the IDS will be in place, it could be used
> (maybe as a *higher-level protocol*), to achieve new and possibly useful
> applications:
> - IDS-based input methods,

There are already dozens of Chinese input methods. I see no particular
advantage in adding another based on IDS, which would not even be particularly

> - "modular" CJK font schemes (containing more rules but far fewer glyphs),

This won't work with IDS, for reasons others have described.

> - component-based searching ("find that word that contained the <dragon>
> radical somewhere").

This should be done by database lookup. This kind of information is
available in Unihan.txt (or could be extended to include other component
information). Such an approach is *far* *far* more efficient than cluttering
up the text representation itself with mostly useless information that would
make other text processes extraordinarily inefficient and that would itself
be difficult to search correctly. ('dragon' itself can be broken apart
and described in pieces -- how do I ensure that I have normalized the
text into the relevant chunks that I am searching for??)
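
The database-lookup approach can be sketched as an inverted index kept outside the text. The component table below is hand-written illustrative data, not taken from Unihan.txt, and the names `build_index` and `find_containing` are hypothetical. It covers only one level of composition; nested components would need the table to be closed transitively.

```python
from collections import defaultdict

# Illustrative hand-written data: character -> immediate components.
COMPONENTS = {
    "\u807E": ["\u9F8D", "\u8033"],  # U+807E = dragon over ear
    "\u8972": ["\u9F8D", "\u8863"],  # U+8972 = dragon over clothing
    "\u7027": ["\u6C35", "\u9F8D"],  # U+7027 = water beside dragon
    "\u660E": ["\u65E5", "\u6708"],  # U+660E = sun beside moon
}


def build_index(table):
    """Invert the table: component -> set of characters containing it."""
    index = defaultdict(set)
    for ch, parts in table.items():
        for part in parts:
            index[part].add(ch)
    return index


def find_containing(text, component, index):
    """Return (offset, char) for each character that is or contains component."""
    wanted = index.get(component, set()) | {component}
    return [(i, ch) for i, ch in enumerate(text) if ch in wanted]


index = build_index(COMPONENTS)
text = "\u660E\u807E\u7027"
hits = find_containing(text, "\u9F8D", index)  # search for dragon (U+9F8D)
print(hits)
```

The text itself stays as ordinary coded characters; only the side table knows about components, so other text processes are unaffected.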

> Or even, exiting the computing kingdom:
> - for a Braille representation of ideographs(!),

Chinese Braille already exists. There are books published in it.

> - as a new way to sort dictionaries.

Why? There are already multiple ways to sort Han character dictionaries.
Adding more abstruse methods of graphical sorting on top of the traditional
methods would serve what purpose?

> However, for such experiments to be possible, one more piece should be added
> to the game: a ***Unified Ideographs to IDS dictionary***.

Good luck. I'm not volunteering!

