Re: Ideographic Description

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 08 1999 - 17:11:24 EDT


Mario Cimarosti asked:

> 4) Would it be conformant to use an IDS in place of a character already
> encoded within CJK Unified Ideographs?

John Cowan answered:

> As long as you understand that you have something different, not
> equivalent in any sense (ordinary Unicode processes will not recognize
> the identity).

John Jenkins answered:

> No. Unicode adds to 10646 the formal requirement that an IDS be as short as
> possible, which would mean that using an IDS to describe an already encoded
> ideograph is non-conformant. Even more explicitly, Unicode says that this
> is forbidden.

Kevin Bracey answered:

> No. Actually, it's the sort of conformance requirement that would be
> impossible to enforce, but it would certainly be frowned upon mightily.

With deference to John Jenkins' expertise in this area, I need to soften
his reply somewhat. The Unicode Standard does add some formal requirements
regarding the use of Ideographic Description Characters (IDC) to create
Ideographic Description Sequences (IDS). The requirements include the
length restriction of 16 Unicode scalar values (i.e. encoded abstract
characters), and the backscan length restriction of 6 unified ideographs
in a row. Taken together, these also constrain the recursion depth of
an IDS. However, it is only *suggested* and not *required* that an IDS
be as short as possible:

"... As a rule, it is best to use the natural radical-phonetic division
for an ideograph if it has one and to use as short a description
sequence as possible, but there is no requirement that these rules be
followed. Beyond that, the shortest possible IDS is preferred."

This is mostly a common sense and legibility issue. If the point of the
IDS is to *describe* an unencoded ideograph, using a short sequence with
the most built-up pieces available in the standard is clearly preferable
to recursing all the way down to create an overanalyzed and less comprehensible
sequence.
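
The two hard limits mentioned above (at most 16 scalar values, at most 6 ideographs in a row) can be checked mechanically, while the "as short as possible" preference cannot be checked without knowing what is already encoded. What follows is only a minimal sketch, not the Standard's own algorithm. The names check_ids and parse_hex are hypothetical, every character that is not one of U+2FF0..U+2FFB is treated as a component, and "in a row" is read as a run of consecutive non-IDC characters; all of these are assumptions made for illustration.

```python
# Sketch of a checker for the two hard IDS limits discussed in this message.
# Assumptions: any non-IDC character counts as a component, and "in a row"
# means a run of consecutive non-IDC characters.

# U+2FF2 and U+2FF3 take three operands; the other IDCs take two.
TERNARY_IDCS = {chr(0x2FF2), chr(0x2FF3)}
BINARY_IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY_IDCS

MAX_LENGTH = 16  # scalar values per IDS
MAX_RUN = 6      # components in a row


def parse_hex(text):
    """Turn a string such as '2FF0 76EE 730B' into the characters it names."""
    return "".join(chr(int(token, 16)) for token in text.split())


def check_ids(ids):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if len(ids) > MAX_LENGTH:
        problems.append("longer than 16 scalar values")

    needed = 1   # operands still required to complete the (prefix) sequence
    run = 0      # current run of consecutive components
    longest = 0  # longest run seen so far
    for ch in ids:
        if needed == 0:
            problems.append("trailing characters after a complete sequence")
            break
        if ch in BINARY_IDCS:
            needed += 1  # fills one slot, opens two
            run = 0
        elif ch in TERNARY_IDCS:
            needed += 2  # fills one slot, opens three
            run = 0
        else:
            needed -= 1  # a component fills one slot
            run += 1
            longest = max(longest, run)

    if needed > 0:
        problems.append("incomplete sequence")
    if longest > MAX_RUN:
        problems.append("more than 6 components in a row")
    return problems


if __name__ == "__main__":
    print(check_ids(parse_hex("2FF0 76EE 730B")))
    print(check_ids(parse_hex("2FF0 76EE 2FF1 72AC 2FF0 72AC 72AC")))
```

Both the short and the long description of the same character pass such a check, which is the point: the limits are requirements, and the preference for the shorter form is only a recommendation.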

If I want to describe an unencoded character that has the eye radical next
to the three dogs phonetic, it is better that I describe it as:

2FF0 76EE 730B [beside 'eye' 'three dogs']

rather than as:

2FF0 76EE 2FF1 72AC 2FF0 72AC 72AC [beside 'eye' over 'dog' beside 'dog' 'dog']

and better yet, if I discover that this ideograph actually *is* encoded
in Vertical Extension A, I am best off just using:

406D

But as John Cowan pointed out, as long as you know you are dealing with
"something else", and not a strong equivalence, you are free to use
the longer form. In fact, the didactic example I have just cited here
shows why there are instances when you *must* use the longer form to
explain the point!
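
To make John Cowan's "ordinary Unicode processes will not recognize the identity" concrete, here is a small sketch using Python's standard unicodedata module. It only tests the four normalization forms as one example of an "ordinary process"; the helper names from_hex and same_after_normalization are hypothetical and exist only for this illustration.

```python
import unicodedata


def from_hex(text):
    """Turn a string such as '2FF0 76EE 730B' into the characters it names."""
    return "".join(chr(int(token, 16)) for token in text.split())


def same_after_normalization(a, b):
    """True if a and b become identical under any of the four normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in forms
    )


# The three spellings used in the example above.
short_ids = from_hex("2FF0 76EE 730B")
long_ids = from_hex("2FF0 76EE 2FF1 72AC 2FF0 72AC 72AC")
encoded = from_hex("406D")

if __name__ == "__main__":
    # None of these pairs compares equal: normalization does not
    # collapse an IDS into the encoded ideograph, or one IDS into another.
    print(same_after_normalization(short_ids, encoded))
    print(same_after_normalization(long_ids, encoded))
    print(same_after_normalization(short_ids, long_ids))
```

Each comparison comes out false, so a process that relies on normalization or binary comparison will treat the three spellings as three unrelated strings.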

--Ken


