RE: Ideographic Description

From: Marco.Cimarosti@icl.com
Date: Fri Sep 10 1999 - 05:37:41 EDT


> I hereby eat crow... Ken's explanation is 100% convincing.
>
> The main lesson I have learned here is that thinking about industry
> standards requires much more attention to detail than is customary for
> an average application programmer like me.
>
> Among other things, I hadn't even imagined that IDS could have been taken
> from a pre-existing standard, and that compatibility issues therefore had
> to be considered.
>
> I also had not properly considered (even though I had been warned) the
> possibility of such a huge number of variants. But, after Ken's mail, I
> tried listing all the possible IDSs for a few very common components and
> multiplying out the number of variants for the ideographs that contain
> these components. I obtained a number more appropriate for astronomy than
> for writing systems... and this considering just a very, very small
> subset of the whole...
>
> OK: I hope that other people had thoughts similar to mine, so that this
> short discussion could have at least some didactic value.
>
> Regards.
> Marco Cimarosti
>
> P.S. I still think that a "Unified Ideographs to IDS dictionary" could be
> a useful thing, and a starting point for some interesting *application*
> development, but I realize that it would by no means be a useful part of
> Unicode itself.
>
> -----Original Message-----
> From: kenw@sybase.com [SMTP:kenw@sybase.com]
> Sent: 1999 September 10, Friday 00.39
> To: Unicode List
> Cc: unicode@unicode.org; kenw@sybase.com
> Subject: RE: Ideographic Description
>
> Marco Cimarosti continued this discussion:
>
> > I cannot help thinking that IDS could be useful not only to provide a
> > human-readable "description" of rare ideographs, but also as a
> > *machine-readable* alternative spelling (or "decomposition") of any CJK
> > character, including the ones that are already coded in the Unified
> > Ideographs section.
> >
> > I have an impression that, at an early stage of design, this
> > orientation could have been the original idea behind the proposal of
> > IDS.
>
> Some might have thought that, but there are numerous Han character
> decomposition schemes that have been proposed, most of which are
> better designed for decomposition than the set of 12 IDC's currently
> in 10646/Unicode. Prof. Hsieh in Taiwan has a particularly
> well-researched and sound proposal, for example, that makes use of many
> fewer operators.
>
> >
> > One of John Jenkins' statements contributes (willingly or not:-) to
> > this impression:
> >
> > "If there had been any requirement that Unicode conformance would
> > imply parsing and dealing with IDSs, they never would have made it into
> > the standard."
> >
> > As I read these words, I can nearly hear an echo of the clashes at some
> > committee meeting!
>
> There were in fact such discussions at the WG2 meeting. Basically they
> were the background for clarifying what the proposed IDC characters
> (from GBK) were, so that no one would be confused about their intended
> usage once they became part of the 10646/Unicode standards. The UTC,
> in particular, was adamantly opposed to having these characters brought
> in for compatibility with GBK and *then* having them turned to use for
> decomposition rather than just ideograph description. So the direction
> you are trying to head with this is directly contrary to the unanimously
> expressed intent of the UTC in assenting to the encoding of the IDC's.
>
> >
> > But there are several other aspects that make me think that IDS was
> > designed primarily for rendering.
> >
> > First of all, why a prefix notation? I think that many people would
> > agree that expressions using infix operators are much more
> > human-readable.
>
> You answer your own question below. Recursive infix operators are
> inherently ambiguous as to the scope of their operands, requiring
> bracketing conventions for clarification:
>
> >
> > For our human brain, "infix" expressions (like the every-day
> > arithmetic) are quite intuitive:
> > 3 + 2 - 1
> > NOT this AND that
> > eye beside dog over dog beside dog
>
> This could be:
>
> eye beside (dog over (dog beside dog))
>
> or
>
> (eye beside dog) over (dog beside dog)
>
> or even
>
> ((eye beside dog) over dog) beside dog
>
> It would be a bad design to use infix operators for plain text
> description of ideographs, because of this ambiguity.
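The ambiguity can be made concrete by enumerating every way the infix
example can be bracketed. What follows is only a minimal sketch: the
function name bracketings is made up for illustration, and "beside" and
"over" are treated as plain binary operators with no precedence rules.

```python
def bracketings(tokens):
    """Yield every fully bracketed reading of an infix token list.

    tokens alternates operands and binary operators:
    [operand, operator, operand, operator, ..., operand].
    """
    if len(tokens) == 1:
        yield tokens[0]
        return
    # Try each operator in turn as the outermost (last applied) one.
    for i in range(1, len(tokens), 2):
        for left in bracketings(tokens[:i]):
            for right in bracketings(tokens[i + 1:]):
                yield f"({left} {tokens[i]} {right})"


readings = list(bracketings("eye beside dog over dog beside dog".split()))
for reading in readings:
    print(reading)
```

With four operands and no precedence convention the sketch finds five
readings; the three spelled out above are among them.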
>
> >
> > and even more intuitive than "postfix" expressions (as used in
> > Postscript or some old calculators):
> > 3 2 + 1 -
> > this NOT that AND
> > eye dog dog dog beside over beside
>
> The problem with a postfix expression for plain text is the backtracking
> issue. Most text processes try to operate in the logically forward
> direction, while limiting backtracking as much as possible because
> of its implications for efficiency. Postfix expressions work when you
> have stack-oriented processing (as in PostScript), but in plain text,
> as you can see in your example, this just results in garden-path
> processing:
>
> eye
> eye dog
> eye dog dog
> eye dog dog dog
> eye dog dog dog beside <== backup two & rebracket
> eye dog (dog dog beside)
> eye dog (dog dog beside) over <== backup four & rebracket
> eye (dog (dog dog beside) over)
> eye (dog (dog dog beside) over) beside <== backup six & rebracket
> (eye (dog (dog dog beside) over) beside)
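For comparison, here is a rough sketch of the stack-oriented processing
that makes postfix workable in something like PostScript. The function
name reduce_postfix and the fixed operator set are assumptions made for
the illustration only. Each operator still has to reach back over
operands that were already read, which is the rebracketing shown in the
trace above; the stack only makes that reach-back cheap.

```python
def reduce_postfix(tokens, operators=("beside", "over")):
    """Bracket a postfix expression using an explicit operand stack."""
    stack = []
    for tok in tokens:
        if tok in operators:
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} is missing operands")
            # The operator applies to the two most recent items, so they
            # are pulled back off the stack and rebracketed as one unit.
            right = stack.pop()
            left = stack.pop()
            stack.append(f"({left} {right} {tok})")
        else:
            stack.append(tok)
    if len(stack) != 1:
        raise ValueError("expression did not reduce to a single unit")
    return stack[0]


result = reduce_postfix("eye dog dog dog beside over beside".split())
print(result)
```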
>
> >
> > On the other hand, postfix and prefix notations are much more easily
> > parsed by a computer (they are very performant syntaxes, especially the
> > prefix notation, and by no means an "enormous overhead").
>
> The prefix notation is the logical choice for plain text description,
> all things considered.
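To show why prefix order suits forward-only processing, here is a small
recursive-descent sketch. It assumes the 12 IDCs sit at U+2FF0..U+2FFB,
with U+2FF2 and U+2FF3 taking three operands and the rest taking two;
the function name parse_ids and the tuple representation are
illustrative choices, not anything defined by the standard.

```python
# Assumed arities: U+2FF2 and U+2FF3 are ternary, the other ten are binary.
BINARY = set("\u2FF0\u2FF1\u2FF4\u2FF5\u2FF6\u2FF7\u2FF8\u2FF9\u2FFA\u2FFB")
TERNARY = set("\u2FF2\u2FF3")


def parse_ids(text, pos=0):
    """Parse one description starting at pos; return (tree, next_pos).

    The operator comes first, so the parser knows how many operands to
    expect before it reads them and never has to back up.
    """
    if pos >= len(text):
        raise ValueError("description ended before all operands were read")
    ch = text[pos]
    arity = 3 if ch in TERNARY else 2 if ch in BINARY else 0
    if arity == 0:
        return ch, pos + 1  # an ordinary character is a leaf
    pos += 1
    operands = []
    for _ in range(arity):
        node, pos = parse_ids(text, pos)
        operands.append(node)
    return (ch, *operands), pos


# "eye beside (dog over (dog beside dog))" written with real IDCs.
tree, end = parse_ids("\u2FF0目\u2FF1犬\u2FF0犬犬")
print(tree, end)
```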
>
> The "enormous overhead" that John Jenkins is talking about is not the
> processing time required to deal with the trivial prefix BNF syntax
> for IDS's, but the huge equivalence tables that would have to be
> carried around to try to interpret all the equivalent relations for
> all the possible alternative representations of the "same" character,
> once you start allowing these descriptions to behave as decompositions.
> And as John also pointed out, the devil is in the details. Once you
> start decomposing the characters, you get into a tremendous mess dealing
> with variants that may or may not be the "same" thing. The combinatorics
> and heuristics quickly start to spiral out of control.
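A toy count gives a feel for how the alternatives multiply. Everything
in this sketch is hypothetical: the component names A to G and the
decomposition table are invented, and real variant handling is far
messier than a simple product.

```python
def count_descriptions(char, decompositions):
    """Count the distinct ways char could be written out.

    A character may be left as itself, or replaced by any listed
    decomposition whose parts are in turn written out in any of their
    own ways.
    """
    total = 1  # the character written as itself
    for _operator, parts in decompositions.get(char, []):
        ways = 1
        for part in parts:
            ways *= count_descriptions(part, decompositions)
        total += ways
    return total


# Hypothetical table: A has two competing analyses, B and C one each.
TABLE = {
    "A": [("beside", ["B", "C"]), ("over", ["D", "E"])],
    "B": [("over", ["F", "G"])],
    "C": [("beside", ["F", "F"])],
}

print(count_descriptions("A", TABLE))
```

Even this three-entry table gives six spellings for A, and an
equivalence table would need to relate all of them to one another.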
>
> There are good reasons why no Han ideographic decomposition system
> has ever had any success as an *encoding* for text. Such systems make
> sense only as adjuncts to the character-oriented encoding--particularly
> for keyboards and input methods.
>
> >
> > So, why has a hard-to-understand-but-easy-to-parse representation been
> > preferred to a hard-to-parse-but-very-intuitive-to-read representation?
>
> Explained above.
>
> >
> > Another fact that makes me think that IDS is more suited for rendering
> > than for enjoying, is the great level of graphical detail provided by
> > the IDCs.
> >
> > Consider this example: provided that humans are intelligent, and that
> > Far-Eastern humans can read their own languages, what is the need for
> > having 7 different IDCs for basically the same "surround" relation?
> > That is:
> > IDEOGRAPHIC DESCRIPTION CHARACTER FULL SURROUND
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT
> > IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT
> >
> > A single "* IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND" would have been
> > sufficient, as any reader is intelligent enough to infer the 7 slightly
> > different cases by the very shape of the surrounding component.
>
> I agree completely with you on this. The UTC pointed this out, but what
> it came down to is that the 12 characters *already* existed as they were
> in GBK, and were used as defined in GBK implementations. Their encoding
> in 10646/Unicode is a compatibility issue with GBK, and there was no good
> reason not to make the full set available to allow transparent
> transcoding with GBK implementations.
>
> >
> > I hope it is clear what I am trying to say: I do not mean in any way
> > that IDS is poorly designed or that it should be changed, but rather
> > that it looks as if it was designed primarily for a different purpose.
> >
> > Or, putting it even more positively, that it is designed in such a way
> > that it is *also* well fit for a different (and very interesting)
> > purpose, beside the one for which it is currently intended.
>
> Nope. Here is where I get off the bus.
>
> >
> > I am just thinking that, once IDS is in place, it could be used
> > (maybe as a *higher-level protocol*), to achieve new and possibly
> > useful applications:
> > - IDS-based input methods,
>
> There are already dozens of Chinese input methods. I see no particular
> advantage in adding another based on IDS, which would not even be
> particularly effective.
>
> > - "modular" CJK font schemes (containing more rules but far fewer
> > glyphs),
>
> This won't work with IDS, for reasons others have described.
>
> > - component-based searching ("find that word that contained the
> > <dragon> radical somewhere").
>
> This should be done by database lookup. This kind of information is
> available in Unihan.txt (or could be extended to include other component
> information). Such an approach is *far* *far* more efficient than
> cluttering up the text representation itself with mostly useless
> information that would make other text processes extraordinarily
> inefficient and that would itself be difficult to search correctly.
> ('dragon' itself can be broken apart and described in pieces -- how do I
> ensure that I have normalized the text into the relevant chunks that I
> am searching for??)
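The lookup approach might look roughly like the sketch below. The
component table is a tiny hand-written stand-in and is not read from
Unihan.txt, and the names COMPONENTS, build_index and find_words are
invented for the example. The text stays as ordinary characters; only
the side table knows about components.

```python
from collections import defaultdict

# Hypothetical hand-written component data (not taken from Unihan.txt).
COMPONENTS = {
    "聾": ["龍", "耳"],  # dragon over ear
    "籠": ["竹", "龍"],  # bamboo over dragon
    "襲": ["龍", "衣"],  # dragon over clothing
    "明": ["日", "月"],  # sun beside moon
}


def build_index(components):
    """Invert the table: component -> set of characters containing it."""
    index = defaultdict(set)
    for char, parts in components.items():
        for part in parts:
            index[part].add(char)
    return index


def find_words(words, component, index):
    """Return the words that contain the component, alone or inside
    another character, using only table lookups."""
    wanted = index.get(component, set()) | {component}
    return [word for word in words if any(ch in wanted for ch in word)]


index = build_index(COMPONENTS)
hits = find_words(["聾人", "明日", "籠", "龍"], "龍", index)
print(hits)
```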
>
> >
> > Or even, exiting the computing kingdom:
> > - for a Braille representation of ideographs(!),
>
> Chinese Braille already exists. There are books published in it.
>
> > - as a new way to sort dictionaries.
>
> Why? There are already multiple ways to sort Han character dictionaries.
> Adding more abstruse methods of graphical sorting on top of the
> traditional methods would serve what purpose?
>
> >
> > However, for such experiments to be possible, one more piece should be
> > added to the game: a ***Unified Ideographs to IDS dictionary***.
> >
>
> Good luck. I'm not volunteering!
>
> --Ken


