RE: Ideographic Description

From: Marco.Cimarosti@icl.com
Date: Thu Sep 09 1999 - 13:23:38 EDT

Next message: Michael Everson: "RE: Ideographic Description"
Previous message: Kenneth Whistler: "Mixtec tones (was: Re: orthographic characters for glottal stop)"
Maybe in reply to: Marco.Cimarosti@icl.com: "Ideographic Description"
Next in thread: Michael Everson: "RE: Ideographic Description"

Hallo!

Thank you very much for the fast and informative answers that many list
members gave to my questions about IDS.

People interested in the same topic may find, at the end of this mail, a
compact summary of my questions and the answers I have received so far.

I could not stop myself from annoying you with a little more thinking about
this IDS thing, also in light of the new information that I just learned.

I cannot help thinking that IDS could be useful not only to provide a
human-readable "description" of rare ideographs, but also as a
*machine-readable* alternative spelling (or "decomposition") of any CJK
character, including the ones that are already coded in the Unified
Ideographs section.

I have an impression that, at an early stage of design, this orientation
could have been the original idea behind the proposal of IDS.

One of John Jenkins' statements contributes (willingly or not:-) to this
impression:

"If there had been any requirement that Unicode conformance would imply
parsing and dealing with IDSs, they never would have made it into the
standard."

As I read these words, I can nearly hear an echo of the clashes at some
committee meeting!

And the reason for the battle is easily understood: implementers of
CJK-oriented solutions are already so burdened by complex issues (huge
fonts, complicated input methods, conversion from a crowd of character sets,
a variety of different encoding schemes, etc.) that they have absolutely no
reason to welcome further complications (expression parsing, ligatures,
context-sensitive shapes, "2-dimensional kerning", etc.).

But there are several other aspects that make me think that IDS was designed
primarily for rendering.

First of all, why a prefix notation? I think that many people would agree
that expressions using infix operators are much more human-readable.

For our human brain, "infix" expressions (like the every-day arithmetic) are
quite intuitive:
        3 + 2 - 1
        NOT this AND that
        eye beside dog over dog beside dog

much more intuitive than the "prefix" expressions of IDS:
        - + 3 2 1
        AND NOT this that
        beside eye over dog beside dog dog

and even more intuitive than "postfix" expressions (as used in PostScript or
some old calculators):
        3 2 + 1 -
        this NOT that AND
        eye dog dog dog beside over beside

On the other hand, postfix and prefix notations are much more easily parsed
by a computer: no parentheses or precedence rules are needed, so both can be
processed in a single pass (especially the prefix notation), and parsing
them is by no means an "enormous overhead".
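
To give a rough idea of how little machinery prefix parsing needs, here is a
minimal Python sketch. The names parse_ids and IDC_ARITY are invented for
this illustration; the only facts it relies on are that the description
characters occupy U+2FF0..U+2FFB and that U+2FF2 and U+2FF3 take three
operands while the others take two. It does no validation beyond detecting
a truncated sequence.

```python
# Arity of each Ideographic Description Character (U+2FF0..U+2FFB).
# U+2FF2 and U+2FF3 take three operands; the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def parse_ids(text, pos=0):
    """Parse one IDS starting at text[pos]; return (tree, next_pos).

    A tree is either a single component character or a tuple
    (operator, operand, operand, ...).
    """
    if pos >= len(text):
        raise ValueError("IDS ends before all operands were supplied")
    ch = text[pos]
    pos += 1
    if ch not in IDC_ARITY:
        return ch, pos  # a plain component: nothing more to read
    operands = []
    for _ in range(IDC_ARITY[ch]):
        operand, pos = parse_ids(text, pos)
        operands.append(operand)
    return (ch, *operands), pos


# "beside eye over dog beside dog dog" from the example above.
tree, end = parse_ids("\u2FF0\u76EE\u2FF1\u72AC\u2FF0\u72AC\u72AC")
```

Each character is read exactly once and never revisited, which is the
property the paragraph above is pointing at.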

The prefix notation of IDS is particularly handy for coding the "docking" of
graphic elements in a fixed-size space. In fact, when our hypothetical
renderer parses the <beside> token, it already has all the information it
needs to allocate the remaining space to the 2 (or 3) elements that follow
(including the context information needed to choose among different possible
shapes, or to decide ad-hoc kerning) with little need for redundant database
look-ups or backward reparsing.
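
The "docking" idea can be sketched as follows. This is only an illustration
under strong assumptions: the function name layout is invented, operators
are spelled as the words "beside" and "over", and every operator splits its
box into equal halves, which no real font would do. What it shows is that
the box for each operand is known at the moment the operator is read.

```python
def layout(tokens, box=(0.0, 0.0, 1.0, 1.0)):
    """Consume tokens in prefix order and assign each component a box.

    tokens is an iterator of strings; box is (x, y, width, height).
    Returns a list of (component, box) pairs.
    Equal splits are an assumption made only for this sketch.
    """
    x, y, w, h = box
    token = next(tokens)
    if token == "beside":  # left to right: split the width in two
        return (layout(tokens, (x, y, w / 2, h))
                + layout(tokens, (x + w / 2, y, w / 2, h)))
    if token == "over":  # above to below: split the height in two
        return (layout(tokens, (x, y, w, h / 2))
                + layout(tokens, (x, y + h / 2, w, h / 2)))
    return [(token, box)]  # a component fills the box it was given


placed = layout(iter("beside eye over dog beside dog dog".split()))
for name, rect in placed:
    print(name, rect)
```

The token stream is consumed strictly front to back, with no look-ahead and
no second pass.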

On the other hand, when infix expressions contain more than one operator,
they become ambiguous and need to be disambiguated by parentheses (unless
complex precedence and associativity rules exist):
        (3 + 2) - 1
        (NOT this) AND that
        <eye> <beside> '(' <dog> <over> '(' <dog> <beside> <dog> ')' ')'

This makes infix expressions slightly longer, and much more complex to be
parsed. But, if the scope of IDS was just to provide a *human*-readable
representation, this wouldn't have been an issue.
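
The relation between the two notations can be made concrete with a small
Python sketch (prefix_to_infix and OPERATORS are names invented here, and
only the two binary operators used in the examples are handled). It rewrites
a prefix token stream as a fully parenthesized infix string:

```python
OPERATORS = {"beside", "over"}  # both binary in this sketch


def prefix_to_infix(tokens, top=True):
    """Rewrite a prefix token stream as a parenthesized infix string."""
    token = next(tokens)
    if token not in OPERATORS:
        return token
    left = prefix_to_infix(tokens, top=False)
    right = prefix_to_infix(tokens, top=False)
    text = f"{left} {token} {right}"
    # Nested sub-expressions need parentheses to stay unambiguous.
    return text if top else f"({text})"


prefix = "beside eye over dog beside dog dog"
infix = prefix_to_infix(iter(prefix.split()))
print(infix)
```

Going in the opposite direction, from infix back to a tree, is where the
extra work lies: the parser has to match parentheses or apply precedence
rules before it knows what an operator applies to.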

So, why has a hard-to-understand-but-easy-to-parse representation been
preferred to a hard-to-parse-but-very-intuitive-to-read representation?

Another fact that makes me think that IDS is better suited to rendering than
to human reading is the great level of graphical detail provided by the IDCs.

Consider this example: given that humans are intelligent, and that
Far-Eastern humans can read their own languages, what is the need for having
7 different IDCs for basically the same "surround" relation? That is:
        IDEOGRAPHIC DESCRIPTION CHARACTER FULL SURROUND
        IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE
        IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW
        IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT
        IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT
        IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT
        IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT

A single "* IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND" would have been
sufficient, as any reader is intelligent enough to infer the 7 slightly
different cases from the very shape of the surrounding component.
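
For reference, the seven surround operators occupy U+2FF4..U+2FFA, and their
names can be read from Python's standard unicodedata module. The sketch
below also shows what collapsing them into one generic operator would look
like; the generalize function and the "<surround>" marker are made up for
illustration and are not part of any standard.

```python
import unicodedata

PREFIX = "IDEOGRAPHIC DESCRIPTION CHARACTER "

# U+2FF4..U+2FFA are the seven "surround" operators listed above.
surround = {}
for cp in range(0x2FF4, 0x2FFB):
    name = unicodedata.name(chr(cp))
    surround[chr(cp)] = name[len(PREFIX):]


def generalize(ids):
    """Replace every specific surround IDC with a generic marker.

    "<surround>" is a made-up placeholder, not a real character.
    """
    return "".join("<surround>" if ch in surround else ch for ch in ids)


for ch, short_name in surround.items():
    print(f"U+{ord(ch):04X} {short_name}")
```

A renderer given only the generic marker would have to look up the shape of
the surrounding component to decide where the inner one goes, which is
exactly the inference that a human reader makes without effort.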

Again, this design seems to me much more oriented to computers (stupid and
blind) than to humans (intelligent and sighted).

I hope it is clear what I am trying to say: I do not mean in any way that
IDS is poorly designed or that it should be changed, but rather that it
looks as if it was designed primarily for a different purpose.

Or, putting it even more positively, that it is designed in such a way that
it is *also* well fit for a different (and very interesting) purpose, besides
the one for which it is currently intended.

I am just thinking that, once IDS is in place, it could be used
(maybe as a *higher-level protocol*) to achieve new and possibly useful
applications:
- IDS-based input methods,
- "modular" CJK font schemes (containing more rules but far fewer glyphs),
- component-based searching ("find that word that contained the <dragon>
radical somewhere").

Or even, leaving the realm of computing:
- for a Braille representation of ideographs(!),
- as a new way to sort dictionaries.

However, for such experiments to be possible, one more piece should be added
to the game: a ***Unified Ideographs to IDS dictionary***.
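
To make the idea concrete, here is a toy version of such a dictionary in
Python, together with a component search built on top of it. The two entries
are taken from the example quoted in the summary at the end of this mail;
everything else (the names DECOMPOSITION, components and find_containing,
and the table layout) is an assumption made for this sketch.

```python
# Toy decomposition table built only from the example quoted in the
# summary below; a real table would need tens of thousands of entries.
DECOMPOSITION = {
    "\u406D": "\u2FF0\u76EE\u730B",              # beside 'eye' 'three dogs'
    "\u730B": "\u2FF1\u72AC\u2FF0\u72AC\u72AC",  # over 'dog' beside 'dog' 'dog'
}


def components(ch, table=DECOMPOSITION):
    """Return the set of all components reachable from ch, recursively."""
    found = {ch}
    for part in table.get(ch, ""):
        if 0x2FF0 <= ord(part) <= 0x2FFB:
            continue  # skip the description operators themselves
        found |= components(part, table)
    return found


def find_containing(component, table=DECOMPOSITION):
    """List the characters in the table that contain the component."""
    return sorted(ch for ch in table if component in components(ch, table))


# Which characters contain 'dog' (U+72AC) somewhere inside them?
hits = find_containing("\u72AC")
```

With only two entries this proves nothing about feasibility, but it shows
that component-based searching reduces to a simple table walk once the
dictionary exists.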

Replying to my question (7), John Cowan wisely says that nothing stops me
from coming up with such a thing, if I want to. He is certainly right in the
sense that I cannot expect someone else to cook my meal for me (hoping the
Italian idiom makes sense in English).

Unluckily, after a rough estimate, I am afraid that I don't have enough free
man-days in my expected lifetime to do such a thing. So, I will abuse this
list to cast my bait: if anyone listening wishes to spend part of their
time helping with such a thing, or should any employer be crazy enough to pay
me a salary for splitting kanji, well: my e-mail address is at the top of
this message...

Regards, and thanks again.
Marco Cimarosti, Italy

Q&A summary:
---------------------------------------------------------------------

>> 1) What is IDS really for? Why has this feature been introduced
>> in ISO-10646?

> K. Bracey: It's to allow you some way to describe the character
> you want even though it has not (yet) been encoded properly in the
> UCS.

> J. Cowan: As you say, so that ideographs that cannot be coded
> directly can at least be described. There is no hope of coding
> every existing ideograph, because there is no authoritative list of
> all ideographs that have ever been used anywhere (new documents are
> periodically dug up, literally, that are written on tortoise shell
> or bone), and new ideographs are constantly, though slowly, being
> created.

> J. Jenkins: They are intended to provide a means of including
> unencoded ideographs in text. The presumption is that there will
> never be a time when *all* possible ideographs in actual use
> (present or past) will be formally encoded, and some mechanism is
> required to handle the missing ones.

>> 2) Will these additions be integrated in Unicode as well?

> K. Bracey, J. Cowan, J. Jenkins: Yes, in Unicode 3.0.

>> 3) Document [1] explicitly states that an IDS "describes the
>> ideograph in the abstract form. It is not interpreted as a
>> composed character and does not have any rendering implication."
>> -- OK: pretty rendering of IDSs is not *required* of conformant
>> applications, but is it *forbidden*?

> K. Bracey: It's not required, and it's not forbidden. A composed
> ideograph of the correct form would just be a kind of ligature, and
> implementations are free to represent sequences of glyphs however
> they like. For a particular application you might have a font with
> the glyph in that substituted it in as a ligature for an IDS
> describing it.

> J. Cowan: I don't see how it could be *forbidden*.

> J. Jenkins: Unicode requires that IDCs must have some visual
> appearance. Applications may choose to parse the IDS and render
> appropriately, but it isn't recommended.

>> 4) Would it be conformant to use an IDS in place of a character
>> already encoded within CJK Unified Ideographs?

> K. Bracey: No. Actually, it's the sort of conformance
> requirement that would be impossible to enforce, but it would
> certainly be frowned upon mightily.

> J. Cowan: As long as you understand that you have something
> different, not equivalent in any sense (ordinary Unicode processes
> will not recognize the identity).

> J. Jenkins: No. Unicode adds to 10646 the formal requirement that
> an IDS be as short as possible, which would mean that using an IDS
> to describe an already encoded ideograph is non-conformant.
> Even more explicitly, Unicode says that this is forbidden.

> Ken W.: With deference to John Jenkins' expertise in this
> area, I need to soften his reply somewhat. The Unicode Standard
> does add some formal requirements regarding the use of
> Ideographic Description Characters (IDC) to create Ideographic
> Description Sequences (IDS). The requirements include the length
> restriction of 16 Unicode scalar values (i.e. encoded abstract
> characters), and the backscan length restriction of 6 unified
> ideographs in a row. Taken together, these also constrain the
> recursion depth of an IDS. However, it is only *suggested* and not
> *required* that an IDS be as short as possible:
> "... As a rule, it is best to use the natural
> radical-phonetic division for an ideograph if it has one and to use
> as short a description sequence as possible, but there is no
> requirement that these rules be followed. Beyond that, the shortest
> possible IDS is preferred."
> This is mostly a common sense and legibility issue. If the point of
> the IDS is to *describe* an unencoded ideograph, using a short
> sequence with the most built-up pieces available in the standard is
> clearly preferable to recursing all the way down to create an
> overanalyzed and less comprehensible sequence.
> If I want to describe an unencoded character that has the eye
> radical next to the three dogs phonetic, it is better that I
> describe it as:
> 2FF0 76EE 730B
> [beside 'eye' 'three dogs']
> rather than as:
> 2FF0 76EE 2FF1 72AC 2FF0 72AC 72AC
> [beside 'eye' over 'dog' beside 'dog' 'dog']
> and better yet, if I discover that this ideograph actually *is*
> encoded in Vertical Extension A, I am best off just using:
> 406D
> But as John Cowan pointed out, as long as you know you are dealing
> with "something else", and not a strong equivalence, you are free
> to use the longer form. In fact, the didactic example I have just
> cited here shows why there are instances when you *must* use the
> longer form to explain the point!
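
The two limits mentioned in this reply (at most 16 code points per IDS, and
at most 6 ideographs in a row) are easy to check mechanically. The Python
sketch below is one possible reading of them, not the wording of the
standard: it assumes that "in a row" means a run of consecutive characters
that are not description operators, and it does not check that the sequence
is otherwise well formed. The names is_idc and within_limits are invented.

```python
def is_idc(ch):
    """True for the description operators U+2FF0..U+2FFB."""
    return 0x2FF0 <= ord(ch) <= 0x2FFB


def within_limits(ids, max_length=16, max_run=6):
    """Check the two limits mentioned in the reply above.

    Assumes "in a row" means a run of consecutive non-IDC characters.
    """
    if len(ids) > max_length:
        return False
    run = 0
    for ch in ids:
        run = 0 if is_idc(ch) else run + 1
        if run > max_run:
            return False
    return True


short = "\u2FF0\u76EE\u730B"                      # 3 code points
long_ = "\u2FF0\u76EE\u2FF1\u72AC\u2FF0\u72AC\u72AC"  # 7 code points
print(within_limits(short), within_limits(long_))
```

Both forms of the example pass, so the choice between them is a matter of
preference for the shorter description, as the reply says, rather than of
these formal limits.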

>> 5) What if one only uses Description Components (DC) from the
>> new "Kangxi Radicals" and "CJK Radicals Supplement": would it be
>> possible to build valid IDSs for *all* the encoded CJK Unified
>> Ideographs using only these elements?

> K. Bracey: Good question. I'm pretty certain the answer is that
> you could encode most of the ideographs, but far from all.
> Certainly many of the fine details of the characters would be lost.

> J. Cowan: Probably not. The traditional analysis uses 214
> radicals (the KangXi set) and about 1000 phonetics.

> J. Jenkins: No. Not by a long shot. Most of the common phonetic
> elements of ideographs don't occur in either radical block.
> It should be pointed out that Unicode considers radicals and
> ideographs semantically distinct (although that distinction is
> blurred in the case of IDSs).

>> 6) Some of the Kangxi radicals (especially those with stroke
>> number >= 10) could be expressed with an IDS, using simpler
>> components. Would this be considered conformant to make an IDS
>> that "decomposes" a Kangxi radical?

> K. Bracey, J. Jenkins: No.

> J. Cowan: Conformant to what? IDSes are compact *descriptions*,
> the equivalent of writing "(Insert an ideograph here that looks
> like a *foo* above a *bar*)".

>> 7) Will ISO/IEC ever publish a list of IDSs for existing CJK
>> Unified Ideographs? (I.e. a sort of decomposition mapping file)?

> K. Bracey: I hope not.

> J. Cowan: I sort of doubt it, but there is nothing stopping
> *you* from doing so.

> J. Jenkins: No.

[Moreover, about my old idea of a reduced (8-bit?) character set for CJK
characters.]

> K. Bracey: Nice in theory, crummy in practice. Each character
> would require so many component parts to encode you would end up
> with something about 3 times as large as the standard UTF-16
> representation, even if you could have 1 byte per component, and
> the renderer would just end up having to have a map from each
> IDS -> correct glyph. You're not going to get typographically
> acceptable results by generating the glyphs algorithmically from
> the IDS.

> J. Cowan: In that case, the stroke level probably makes more
> sense than the component level. Stroke writing order is
> standardized, and probably 40-50 strokes would do it all.

> J. Jenkins: It has been tried before. There are a number of
> problems.
> 1) There is too much ambiguity. Any scheme sufficiently
> powerful to reduce the set of some 80,000 to 100,000 required
> ideographs to a set of 256 root forms plus combining controls would
> also fall afoul of the various alternate shapes that a single
> ideograph can take, plus ambiguities in the process of breaking
> ideographs into pieces.
> 2) There is *enormous* overhead in trying to render IDSs.
> There is enormous overhead in even trying to parse them for the
> sake of cursor movement and line breaking. We don't even want to
> talk about working on semantic equivalents for searching/replacing
> or the ramifications of collation.
> Unicode minimizes this overhead by stating that none of this
> need be done. If there had been any requirement that Unicode
> conformance would imply parsing and dealing with IDSs, they never
> would have made it into the standard.

---------------------------------------------------------------------
End of summary.

