Re: Roadmap for the future allocation of UCS characters w/r

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 09 1997 - 18:35:30 EST


> From unicode@Unicode.ORG Thu Jan 9 14:35 PST 1997
> Date: Thu, 9 Jan 97 14:29:10 -0800
> Subject: Re: Roadmap for the future allocation of UCS characters w/r
> >
> >Right now the UTC is recommending that the IRG manage the allocation of Han
> >characters in Plane 2.
> >
> >Lisa Moore
>
>
> Thanks Lisa... It's been a long time since I heard from you.
>
> I understand this, but it would be nice if, within Plane 2, or within any
> chunk eventually allocated in the BMP (Plane 0), the allocation were made
> in chunks corresponding to radical/strokecount segments. After all, we
> know the number of radicals, we know the order of magnitude of the number
> of characters in the most comprehensive Hanzi dictionaries, and we could
> even make provision for even more characters to be created.
>
>

I disagree: this approach is unlikely to work for the Han character
additions. True, the radical order is already fixed by the 20,902 in the
URO. But how are the strokecount subsegments to be allocated? To give an
idea of the magnitude of the problem, consider the "wood" radical, #75, a
typical productive radical. Here is the count of URO characters in each
strokecount group (additional strokes : characters) for the wood radical:

         0 :   2    10 :  85    20 :  3
         1 :   7    11 :  79    21 :  6
         2 :  19    12 :  79    22 :  2
         3 :  39    13 :  47    23 :  0
         4 :  67    14 :  32    24 :  2
         5 : 100    15 :  29
         6 :  94    16 :  15
         7 :  88    17 :  14
         8 : 108    18 :   9
         9 :  96    19 :   4

Now consider the "ji4 = broom?" radical, #58, a typical unproductive radical:

         0 : 2    10 : 2    20 : 0
         1 : 0    11 : 0    21 : 0
         2 : 2    12 : 0    22 : 0
         3 : 2    13 : 2    23 : 2
         4 : 0    14 : 0
         5 : 2    15 : 2
         6 : 1    16 : 0
         7 : 1    17 : 0
         8 : 1    18 : 0
         9 : 1    19 : 0

The distribution of strokecounts is roughly bell-shaped, with a long tail
to the right (strokecount goes as high as 32 for the phonetic portion of a
character), and no doubt it would be possible to analyze the entire set
and come up with a distribution curve. But a predictive distribution
"safe" enough for all future characters to fit in correctly would be
complex to generate. There would have to be a minimum floor of space for
nearly every cross-product of radical and strokecount--witness the
unexpected appearance of 2 characters with the ji4 radical and 23
additional strokes. You could not merely scale up a distribution curve and
come up with the right answer for the space needed, especially at the
tails of the curve in both directions, where unusual combinations will no
doubt occur.
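
To make the scaling point concrete, here is a quick illustrative sketch in
Python, using the wood-radical counts from the table above. The 1/4 scale
factor is an arbitrary assumption, not a real projection:

    # Illustrative sketch: scale the URO wood-radical (#75) strokecount
    # distribution to predict slots for a hypothetical future extension.
    # The counts come from the table above; the 1/4 scale factor is an
    # arbitrary assumption.

    wood_counts = {
        0: 2, 1: 7, 2: 19, 3: 39, 4: 67, 5: 100, 6: 94, 7: 88,
        8: 108, 9: 96, 10: 85, 11: 79, 12: 79, 13: 47, 14: 32,
        15: 29, 16: 15, 17: 14, 18: 9, 19: 4, 20: 3, 21: 6,
        22: 2, 23: 0, 24: 2,
    }

    SCALE = 0.25  # hypothetical: next extension ~1/4 the size of the URO

    for strokes in sorted(wood_counts):
        predicted = int(wood_counts[strokes] * SCALE)
        if predicted == 0:
            # The scaled curve allocates no slots at all here, so any
            # newly discovered character with this strokecount (like the
            # two 23-stroke ji4-radical characters) would not fit.
            print(f"{strokes:2d} strokes: 0 slots predicted")

The scaled curve collapses to zero at the tails, which is exactly where
the unexpected combinations turn up.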

The only safe ways to preallocate the arrangement would be either to
enumerate all the candidate characters (accounting for unification), or to
provide a very large "safety margin" against the chance, for example, of
discovering 6 more broom-radical characters with 23 strokes in the
phonetic. The former would effectively involve doing all the work required
to come up with the candidate unified set for encoding in the first place.

The second approach, which I gather is more or less what Alain is proposing
(with only the size of the safety margin in question), would result in a
very sparse encoding. The disadvantage is that it leads to inefficient
table handling, both for conversions and for fonts.
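
To illustrate the table-handling point (a sketch only; all the ranges and
sizes here are made up):

    # Two lookup strategies for a hypothetical code-point -> glyph-index
    # mapping. All ranges and sizes are invented for illustration.

    DENSE_BASE = 0x20000          # hypothetical start of a coherent chunk
    dense_table = [0] * 7000      # one slot per character, no wasted space

    def lookup_dense(cp):
        # Single subtraction and index: cheap for conversions and fonts.
        return dense_table[cp - DENSE_BASE]

    sparse_table = {}             # cp -> glyph index, scattered code points

    def lookup_sparse(cp):
        # Per-entry hashing and storage overhead, or alternatively a
        # mostly empty flat array covering the whole preallocated matrix.
        return sparse_table.get(cp, 0)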

There is a further disadvantage. The current approach takes the 20,902 as
the initial set, soon to be joined by another coherent chunk of 6,585 in
the Vertical Extension A; that makes it relatively easy for a font or an
application to declare which set(s) it will be supporting. Vertical
Extension A support will probably be added as a coherent delta. The next
"chunk" coming in from the IRG will probably also number in the multiple
thousands, and will be perceived (and supported) as a kind of third level
beyond the URO and the Vertical Extension A. But if additional Han
characters are salted into a very sparse preallocated sorted matrix, the
coherence of the additional sets will be diffuse. That would be likely to
encourage small additions of a few characters at a time, and that, in
turn, could leave no clear sense of which fonts and applications support
which set.
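
The coherent-chunk model also reduces "what do I support?" to a short list
of range tests. A minimal sketch of mine, using the URO range and, for
Extension A, the block range it was eventually assigned (an assumption as
of this writing):

    # Declaring coverage as a handful of contiguous ranges. U+4E00..U+9FA5
    # is the URO; U+3400..U+4DB5 is the range Extension A was eventually
    # assigned (assumed here for illustration).

    SUPPORTED_RANGES = [
        (0x4E00, 0x9FA5),   # URO: the original 20,902
        (0x3400, 0x4DB5),   # Vertical Extension A
    ]

    def supports(cp):
        return any(lo <= cp <= hi for lo, hi in SUPPORTED_RANGES)

    # A sparse preallocated matrix offers no such compact declaration:
    # coverage becomes a long enumeration of scattered code points.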
        

> It is desirable that allocation not be made randomly for such a huge
> character set. For small numbers of characters it would not matter for
> ordering purposes at all, but for this one it does; otherwise we risk
> future nightmares when making default ordering tables (by radical and/or
> strokecount) in ISO/IEC 14651 for the extra Chinese characters that will
> have to be inserted into the nicely ordered set of Han characters in the
> current UCS/UNICODE.

As Keld points out, there is no easy way to sort Han characters. The
20,902 in the URO make it look easy, but there is already the block of
compatibility Han characters in the BMP to account for, and the Vertical
Extension A, when it arrives, will have to be interdigitated with the URO
to produce a proper radical/strokecount order--whether it is placed in the
BMP or off the BMP, and whether it is encoded in a tight chunk or in a
sparse radical/strokecount matrix.

If ISO/IEC 14651 is set up in such a way that establishing a standard
default order for Han characters can be so disrupted by the encoding of an
additional Han character somewhere "out of order" that Alain speaks of
"future nightmares" for that standard, would it not make more sense to
insist on publication of a defined radical/strokecount for each Han
character in 10646 (as an annex to 10646?) and then have 14651 simply
designate the default sort order for Han to be by the radical/strokecount
defined in 10646? That information, if published in machine-readable form,
would enable any implementer to adjust a radical/strokecount collation
table almost automatically as new Han character sets are added to 10646.
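
As a sketch of how such machine-readable data could be consumed (my
illustration; the file name and exact format are hypothetical, though this
is essentially the shape the Unihan database's kRSUnicode field later
took):

    # Build a default radical/strokecount sort key from a machine-readable
    # table with one line per character:
    #     <code point in hex> <TAB> <radical>.<additional strokes>
    # The file name and exact format are hypothetical.

    def load_rs_table(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                cp_hex, rs = line.strip().split("\t")
                radical, strokes = rs.split(".")
                table[int(cp_hex, 16)] = (int(radical), int(strokes))
        return table

    def han_sort_key(cp, table):
        # Radical first, then additional strokes, then code point as a
        # final tiebreaker; characters encoded later and elsewhere still
        # land in the right place, since the key ignores code-point order.
        radical, strokes = table.get(cp, (999, 99))  # unknowns sort last
        return (radical, strokes, cp)

    # Usage: chars.sort(key=lambda cp: han_sort_key(cp, rs_table))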

--Ken Whistler


