> I hereby eat crow... Ken's explanation is 100% convincing.
> The main lesson I have learned here is that thinking about industry standards
> requires much more attention to detail than is customary for an average
> application programmer like me.
> Among the rest, I hadn't even imagined that IDS could have been taken from
> a pre-existing standard, and that therefore compatibility issues had to be
> considered.
> I also did not properly consider (even though I had been warned) the
> possibility of such a huge number of variants. But, after Ken's mail, I
> tried and listed all the possible IDSs for a few very common components,
> multiplied the number of variants for the ideographs that contain these
> components, and I obtained a number that was more appropriate for
> astronomy than for writing systems... and this just considering a very
> very small subset of the whole...
> OK: I hope that other people had thoughts similar to mine, so that this
> short discussion could have at least some didactic value.
> Marco Cimarosti
> P.S. I still think that a "Unified Ideographs to IDS dictionary" could be
> a useful thing, and a starting point for some interesting *applicative*
> development, but I realize that it would by no means be a useful part of
> Unicode itself.
> -----Original Message-----
> From: firstname.lastname@example.org [SMTP:email@example.com]
> Sent: 1999 September 10, Friday 00.39
> To: Unicode List
> Cc: firstname.lastname@example.org; email@example.com
> Subject: RE: Ideographic Description
> Marco Cimarosti continued this discussion:
> > I cannot help thinking that IDS could be useful not only to provide a
> > human-readable "description" of rare ideographs, but also as a
> > *machine-readable* alternative spelling (or "decomposition") of any CJK
> > character, including the ones that are already coded in the
> > Ideographs section.
> > I have an impression that, at an early stage of design, this
> > could have been the original idea behind the proposal of IDS.
> Some might have thought that, but there are numerous Han character
> decomposition schemes that have been proposed, most of which are
> better designed for decomposition than the set of 12 IDC's currently
> in 10646/Unicode. Prof. Hsieh in Taiwan has a particularly
> and sound proposal, for example, that makes use of many fewer
> > One of John Jenkins' statements contributes (willingly or not:-) to this
> > impression:
> > "If there had been any requirement that Unicode conformance would
> > parsing and dealing with IDSs, they never would have made it into the
> > standard."
> > As I read these words, I can nearly hear an echo of the clashes at the
> > committee meeting!
> There were in fact such discussions at the WG2 meeting. Basically they
> were the background for clarifying what the proposed IDC characters
> (from GBK) were, so that no one would be confused about their
> usage once they became part of the 10646/Unicode standards. The UTC,
> in particular, was adamantly opposed to having these characters
> in for compatibility with GBK and *then* having them turned to use
> decomposition rather than just ideograph description. So the
> you are trying to head with this is directly contrary to the
> expressed intent of the UTC in assenting to the encoding of the
> > But there are several other aspects that make me think that IDS was designed
> > primarily for rendering.
> > First of all, why a prefix notation? I think that many people would agree
> > that expressions using infix operators are much more
> You answer your own question below. Recursive infix operators are
> ambiguous as to the scope of their operands, requiring bracketing
> for clarification:
> > For our human brain, "infix" expressions (like the every-day arithmetic) are
> > quite intuitive:
> > 3 + 2 - 1
> > NOT this AND that
> > eye beside dog over dog beside dog
> This could be:
> eye beside (dog over (dog beside dog))
> (eye beside dog) over (dog beside dog)
> or even
> ((eye beside dog) over dog) beside dog
> It would be a bad design to use infix operators for plain text
> description of ideographs, because of this ambiguity.
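The ambiguity can be made concrete with a small sketch. It assumes, purely for illustration, that "beside" and "over" are binary operators with no precedence or associativity rules; the function name `bracketings` is hypothetical and not part of any standard.

```python
def bracketings(tokens):
    """Yield every full parenthesization of an infix token list.

    `tokens` alternates operand, operator, operand, ... and every
    operator is treated as binary, with no precedence rules.
    """
    if len(tokens) == 1:
        yield tokens[0]
        return
    # Try each operator position in turn as the top-level split.
    for i in range(1, len(tokens), 2):
        for left in bracketings(tokens[:i]):
            for right in bracketings(tokens[i + 1:]):
                yield f"({left} {tokens[i]} {right})"


expr = "eye beside dog over dog beside dog".split()
readings = list(bracketings(expr))
for reading in readings:
    print(reading)
```

With four operands the sketch yields five distinct readings, which include the three listed above; without brackets a plain-text process has no way to pick one.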
> > and even more intuitive than "postfix" expressions (as used in PostScript or
> > some old calculators):
> > 3 2 + 1 -
> > this NOT that AND
> > eye dog dog dog beside over beside
> The problem with a postfix expression for plain text is the
> issue. Most text processes try to operate in the logically forward
> direction, while limiting backtracking as much as possible because
> of its implications for efficiency. Postfix expressions work when you
> have stack-oriented processing (as in PostScript), but in plain text,
> as you can see in your example, this just results in garden-path
> eye dog
> eye dog dog
> eye dog dog dog
> eye dog dog dog beside <== backup two & rebracket
> eye dog (dog dog beside)
> eye dog (dog dog beside) over <== backup four & rebracket
> eye (dog (dog dog beside) over)
> eye (dog (dog dog beside) over) beside <== backup six & rebracket
> (eye (dog (dog dog beside) over) beside)
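For comparison, here is a minimal sketch of the stack-oriented processing that makes postfix workable. It again assumes "beside" and "over" are binary operators; `parse_postfix` and `OPERATORS` are illustrative names only.

```python
OPERATORS = {"beside", "over"}  # assumed binary operators, for illustration


def parse_postfix(tokens):
    """Bracket a postfix token list using an explicit operand stack."""
    stack = []
    for tok in tokens:
        if tok in OPERATORS:
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} is missing an operand")
            right = stack.pop()
            left = stack.pop()
            # Only now, on seeing the operator, can the operands be grouped.
            stack.append(f"({left} {right} {tok})")
        else:
            stack.append(tok)
    if len(stack) != 1:
        raise ValueError("leftover operands: not a single description")
    return stack[0]


result = parse_postfix("eye dog dog dog beside over beside".split())
print(result)
```

The stack holds all four operands before the first operator arrives, so nothing about the structure is known until the very end of the sequence. A process that scans forward through plain text without such a stack has to back up and rebracket, as in the trace above.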
> > On the other hand, postfix and prefix notations are much more easily parsed
> > by a computer (they are very performant syntaxes, especially the
> > notation, and by no means an "enormous overhead").
> The prefix notation is the logical choice for plain text
> all things considered.
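A rough sketch of why prefix suits forward-only processing: each operator announces how many operands follow, so a single left-to-right pass with no backtracking recovers the structure. The arity table below is an assumption for the toy "beside"/"over" vocabulary used in this thread, not the actual IDC set.

```python
ARITY = {"beside": 2, "over": 2}  # assumed operators and arities


def parse_prefix(tokens, pos=0):
    """Parse one description starting at tokens[pos].

    Returns (tree, next_pos); a tree is either a component name or a
    tuple (operator, operand, operand, ...).
    """
    if pos >= len(tokens):
        raise ValueError("description ends before all operands are supplied")
    tok = tokens[pos]
    pos += 1
    if tok not in ARITY:
        return tok, pos  # a plain component
    operands = []
    for _ in range(ARITY[tok]):
        operand, pos = parse_prefix(tokens, pos)
        operands.append(operand)
    return (tok, *operands), pos


# Prefix form of: eye beside (dog over (dog beside dog))
tokens = "beside eye over dog beside dog dog".split()
tree, end = parse_prefix(tokens)
print(tree, end)
```

The parser also knows exactly where the description ends (`end`), which matters when a description is embedded in running text.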
> The "enormous overhead" that John Jenkins is talking about is not the
> processing time required to deal with the trivial prefix BNF syntax
> for IDS's, but the huge equivalence tables that would have to be
> carried around to try to interpret all the equivalent relations for
> all the possible alternative representations of the "same"
> once you start allowing these descriptions to behave as
> And as John also pointed out, the devil is in the details. Once you
> start decomposing the characters, you get into a tremendous mess
> with variants that may or may not be the "same" thing. The
> and heuristics quickly start to spiral out of control.
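The growth of the equivalence tables can be illustrated with a toy count. Everything in the table below is invented for the example: the component names, the variant names, and the number of alternatives per component are hypothetical, and the tree uses the same toy "beside"/"over" operators as the thread.

```python
from itertools import product

# Hypothetical: descriptions that might all be accepted as the "same"
# component. Real variant data would be far larger and far messier.
ALTERNATIVES = {
    "dog": ["dog", "dog-variant"],
    "eye": ["eye", "eye-variant-1", "eye-variant-2"],
}


def spellings(tree):
    """List every prefix description that would have to be treated as equal."""
    if isinstance(tree, str):
        return ALTERNATIVES.get(tree, [tree])
    op, *operands = tree
    choices = [spellings(operand) for operand in operands]
    # Every combination of operand spellings is a distinct description.
    return [" ".join((op, *combo)) for combo in product(*choices)]


tree = ("beside", "eye", ("over", "dog", ("beside", "dog", "dog")))
all_spellings = spellings(tree)
print(len(all_spellings))  # 3 * 2 * 2 * 2 = 24
```

Even with two or three alternatives per component, one small character already has 24 descriptions that an equivalence-aware process would need to match; the count multiplies with every additional component and variant.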
> There are good reasons why no Han ideographic decomposition system
> has ever had any success as an *encoding* for text. Such systems make
> sense only as adjuncts to the character-oriented
> for keyboards and input methods.
> > So, why has a hard-to-understand-but-easy-to-parse representation been
> > preferred to a hard-to-parse-but-very-intuitive-to-read
> Explained above.
> > Another fact that makes me think that IDS is more suited for rendering than
> > for enjoying, is the great level of graphical detail provided by the IDCs.
> > Consider this example: provided that humans are intelligent, and
> > Far-Eastern humans can read their own languages, what is the need for having
> > 7 different IDCs for basically the same "surround" relation? That is:
> > IDEOGRAPHIC DESCRIPTION CHARACTER FULL SURROUND
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT
> > A single "* IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND" would have been
> > sufficient, as any reader is intelligent enough to infer the 7
> > different cases by the very shape of the surrounding component.
> I agree completely with you on this. The UTC pointed this out, but what
> it came down to is that the 12 characters *already* existed as they
> in GBK, and were used as defined in GBK implementations. Their
> in 10646/Unicode is a compatibility issue with GBK, and there was no
> reason not to make the full set available to allow transparent
> with GBK implementations.
> > I hope it is clear what I am trying to say: I do not mean in any way that
> > IDS is poorly designed or that it should be changed, but rather that it
> > looks as if it was designed primarily for a different purpose.
> > Or, putting it even more positively, that it is designed in such a way that
> > it is *also* well fit for a different (and very interesting) purpose, beside
> > the one for which it is currently intended.
> Nope. Here is where I get off the bus.
> > I am just thinking that, once IDS is in place, it could be used
> > (maybe as a *higher-level protocol*), to achieve new and possible
> > applications:
> > - IDS-based input methods,
> There are already dozens of Chinese input methods. I see no
> advantage in adding another based on IDS, which would not even be
> > - "modular" CJK font schemes (containing more rules but far fewer
> This won't work with IDS, for reasons others have described.
> > - component-based searching ("find that word that contained the
> > radical somewhere").
> This should be done by database lookup. This kind of information is
> available in Unihan.txt (or could be extended to include other
> information). Such an approach is *far* *far* more efficient than
> up the text representation itself with mostly useless information that would
> make other text processes extraordinarily inefficient and that would
> be difficult to search correctly. ('dragon' itself can be broken down
> and described in pieces -- how do I ensure that I have normalized the
> text into the relevant chunks that I am searching for??)
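A minimal sketch of the lookup approach, assuming a side table that maps each character to its components. The table here is hand-written for illustration and does not reflect the format or field names of Unihan.txt; `build_index` and `find_containing` are hypothetical helpers.

```python
from collections import defaultdict

# Hand-written illustrative table: character -> components.
COMPONENTS = {
    "聾": ["龍", "耳"],
    "襲": ["龍", "衣"],
    "龐": ["广", "龍"],
    "明": ["日", "月"],
}


def build_index(components):
    """Invert the table: component -> set of characters containing it."""
    index = defaultdict(set)
    for char, parts in components.items():
        for part in parts:
            index[part].add(char)
    return index


INDEX = build_index(COMPONENTS)


def find_containing(text, component):
    """Return (offset, character) for each character built on `component`."""
    wanted = INDEX.get(component, set())
    return [(i, ch) for i, ch in enumerate(text)
            if ch == component or ch in wanted]


hits = find_containing("明月襲人龍", "龍")
print(hits)
```

The text itself stays as ordinary one-character-per-ideograph data; the component knowledge lives in the table, so the search is a set-membership test per character and no normalization of the text is needed.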
> > Or even, exiting the computing kingdom:
> > - for a Braille representation of ideographs(!),
> Chinese Braille already exists. There are books published in it.
> > - as a new way to sort dictionaries.
> Why? There are already multiple ways to sort Han character
> Adding more abstruse methods of graphical sorting on top of the
> methods would serve what purpose?
> > However, for such experiments to be possible, one more piece should be added
> > to the game: a ***Unified Ideographs to IDS dictionary***.
> Good luck. I'm not volunteering!