L2/09-094

From: Kenneth Whistler 
Title: Response to clarification of implicit weights for ideographs
in UCA
Action: For consideration by the UTC
Ref: L2/09-090 http://www.unicode.org/L2/L2009/09090-uca-weight.html


Introduction

In response to the suggestions Mark Davis has made regarding
the UCA categorization of ideographs for assigning implicit
weights, I have some specific feedback, as listed below.

Both of the suggestions in these feedback items refer to
the existing text in Section 7.1.3 of UCA. 

=================================================================

Feedback Item #1:

I suggest that the statement of the derivation of implicit weights
in UCA Section 7.1.3 Implicit Weights (starting from the paragraph
"To derive the collation elements..."), be updated to the following
text for clarity of expression:

============================ revised text ======================

To derive the collation elements, the value of the code point
is used to calculate two numbers, by bit shifting and bit
masking. The bit operations are chosen so that the resultant
numbers have the desired ranges for constructing implicit
weights. The first number is calculated by taking the code
point expressed as a 32-bit binary integer CP and bit shifting it
right by 15 bits. Because code points range from U+0000 to
U+10FFFF, the result will be a number in the range 0 to 0x21
(= 33 decimal). This number is then added to the special
value BASE.

AAAA = BASE + (CP >> 15);

Now mask CP to keep only its bottom 15 bits, and OR a 1 into
bit 15 (0x8000), so that the resultant value is non-zero.

BBBB = (CP & 0x7FFF) | 0x8000;

AAAA and BBBB are interpreted as unsigned 16-bit integers. The implicit
weight mapping given to the code point is then constructed as:

[.AAAA.0020.0002.][.BBBB.0000.0000.]

[[ And the rest of the text is unaffected by this particular
clarification. ]]

========================== end revision =========================
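As an illustration only (not part of the proposed UCA text), the
derivation in the revised text can be sketched in a few lines of
Python. The function names implicit_weights and format_elements are
hypothetical, and the BASE value is assumed to be supplied by the
caller from the table in Section 7.1.3.

```python
def implicit_weights(cp, base):
    """Return (AAAA, BBBB) for code point cp and a caller-supplied BASE."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    aaaa = base + (cp >> 15)       # top bits of CP (0..0x21), offset by BASE
    bbbb = (cp & 0x7FFF) | 0x8000  # bottom 15 bits of CP, bit 15 forced on
    return aaaa, bbbb


def format_elements(cp, base):
    """Render the two collation elements in the notation used in UCA."""
    aaaa, bbbb = implicit_weights(cp, base)
    return "[.%04X.0020.0002.][.%04X.0000.0000.]" % (aaaa, bbbb)


# U+4E00 with BASE = FB40, and U+20000 with BASE = FB80
print(format_elements(0x4E00, 0xFB40))   # [.FB40.0020.0002.][.CE00.0000.0000.]
print(format_elements(0x20000, 0xFB80))  # [.FB84.0020.0002.][.8000.0000.0000.]
```

The second example shows why the 1 is ORed into bit 15: the bottom
15 bits of U+20000 are all zero, and without the OR the primary
weight BBBB would be 0000.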

Feedback Item #2:

On the basic issue raised by Mark's document, I concur that
the exact set of characters involved in the table defining
the values for BASE needs to be more precisely specified.
However, I disagree with the presumption stated in the
document that interpretation of "CJK Ideograph" and
"CJK Ideograph Extension A/B" is by blocks. That presumption
implies sweeping up unassigned code points into the default
weighting given to (Unified) CJK Ideographs, which I do not
think was the intent, back when this was added to UCA
in 2001.

I have been unable (in finite time) to recover the full legislative
history behind the change to Revision 9 of UCA. Revision 9
was published 2002-07-16, but it was based on L2/01-446,
dated 2001-11-06, and in *that* document, the updated text
for the Implicit Weights section of UCA appears verbatim as
in the current version of UCA, with only the note:

"[made the bottom of 7.1.2 a separate section, and generalized it]"

I *think*, but am not positive, that this resulted from an
August, 2001 UTC chalk talk about the problem involved.
It is summarized only by:

[88-C3] Consensus: The UTC approves the following changes to collation:

  * Give CJK binary ordering
  ...
  
and my hand-written notes didn't provide any further clarification of
that particular item at the time.

At any rate, what Mark is now proposing to do is to
ratify the interpretation of the ICU implementation that:

FB40 CJK Ideograph =

  [[:block=CJK_Unified_Ideographs:][:block=CJK_Compatibility_Ideographs:]]
  
FB80 CJK Ideograph Extension A/B =

  [[:block=CJK_Unified_Ideographs_Extension_A:]
   [:block=CJK_Unified_Ideographs_Extension_B:]]
   
including any unassigned code points in those ranges.

And by corollary:

FBC0 Any other code point: would *not* apply to unassigned code
points in the CJK block ranges.

The alternative interpretation, which I think is the correct one,
is that the intent of these BASE assignments was always just
for *assigned* Unified CJK Ideographs, and that the ICU implementation
took a shortcut: it judged it safer simply to assume that nothing
but CJK Ideographs would be assigned in those blocks, so that it
wouldn't have to revisit range changes.

Under my interpretation, the meanings of the BASE assignments
(for Unicode 5.1) are:

FB40 CJK Ideograph =

  4E00..9FC3, FA0E..FA2D
  
  [note that FA0E..FA2D is the range of the "IBM 32", but as Mark
  indicates for the CJK Compatibility Characters block, the way
  UCA works, it could just as well be expressed as ranges including
  assigned CJK compatibility characters (with canonical equivalences),
  since those are all bled away first, anyway. The important thing
  is that it *cannot* be expressed as the full block range,
  i.e. F900..FAFF, without picking up unassigned code points.]
  
FB80 CJK Extension A/B =

  3400..4DB5, 20000..2A6D6
  
FBC0 Any other code point: would then apply to *all* unassigned code
points.

The textual evidence for that (other than my recollection of how
this all came about) consists of:

1. The UTC Consensus wording: "Give CJK binary ordering"

2. The DUCET Layout Table in Section 3.2 of UCA, which has a
   section for "implicit" which clarifies as follows:
   
     - CJK & CJK compatibility (those not decomposed)
     - CJK Extension A & B
     - Unassigned and others given implicit weights
     
   The crucial point here being that "unassigned" is here
   explicitly not included in the categories for CJK ideographs,
   and these 3 entries clearly are intended explicitly to
   echo the table of BASE values in Implicit Weights.
   
3. In Section 6.3.4 Reducing the Repertoire, a method for
   making one's table smaller by treating unsupported code
   points "as if they were unassigned" then points to
   Derived Collation Elements, where the implicit weights are
   calculated for unassigned characters. That also, IMO,
   strengthens my interpretation.
   
My clear preference here would be to adopt the second
interpretation, i.e., that the BASE value assignments are
for *assigned* (Unified) CJK Ideographs, to give them a
well-defined binary ordering, and do *not* imply mixing
the handling of unassigned code points in the CJK Ideograph
ranges.
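
To make the difference between the two interpretations concrete,
here is a rough Python sketch of BASE selection under each. The
table and function names (BLOCK_RANGES, ASSIGNED_RANGES,
select_base) are hypothetical. The block boundaries are the
Unicode block ranges for the four blocks named above, and the
assigned ranges are the Unicode 5.1 values given above. The sketch
assumes that code points with explicit DUCET entries or canonical
decompositions have already been handled before this step.

```python
# Interpretation 1: BASE chosen by whole block, unassigned code points included.
BLOCK_RANGES = {
    0xFB40: [(0x4E00, 0x9FFF), (0xF900, 0xFAFF)],    # URO, CJK Compatibility Ideographs
    0xFB80: [(0x3400, 0x4DBF), (0x20000, 0x2A6DF)],  # Extension A, Extension B
}

# Interpretation 2: BASE chosen only for assigned ideographs (Unicode 5.1).
ASSIGNED_RANGES = {
    0xFB40: [(0x4E00, 0x9FC3), (0xFA0E, 0xFA2D)],
    0xFB80: [(0x3400, 0x4DB5), (0x20000, 0x2A6D6)],
}

OTHER_BASE = 0xFBC0  # "Any other code point"


def select_base(cp, table):
    """Return the BASE value for cp under the given range table."""
    for base, ranges in table.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return base
    return OTHER_BASE


# U+9FC4 is unassigned in Unicode 5.1 but lies inside the URO block.
print("%04X" % select_base(0x9FC4, BLOCK_RANGES))     # FB40
print("%04X" % select_base(0x9FC4, ASSIGNED_RANGES))  # FBC0
```

The two tables agree for every assigned ideograph; they differ only
for unassigned code points inside the block ranges, such as U+9FC4
or U+2A6D7.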

The advantage of Mark's interpretation is that block ranges
are less likely to shift than the ranges of assigned
characters within blocks, so that an implementation of
UCA will need less updating to deal with circumstances
like the dribbles of unified CJK ideographs we have added
to the URO. Note, however, that any UCA implementation *will*
need to be updated to deal with future CJK block additions,
which is why Mark is suggesting extending "CJK Extension A & B"
to include all of Plane 2.

The advantage of my interpretation is that it is correct. ;-)
Seriously, however, it means that unassigned characters that
happen to lie within CJK ideographic blocks are treated
identically to unassigned characters in any other block,
which is a good thing. Also, Mark's proposal doesn't work
through what the implications are for the imminent addition
of other CJK-like large ideographic blocks, which are *not*
going to be allocated on Plane 2, but which will need
parallel treatment as for CJK for derivation of implicit
weights.

In the interest of full disclosure, I believe that folks should
know that both of us are motivated in our positions by
defense of existing implementations. Mark is arguing for
retrofitting the current ICU implementation interpretation
and making it de jure in the UCA standard. My position
reflects what is currently implemented and deployed in the
Sybase implementation of UCA. I don't know what other
implementations of UCA (or ISO 14651) have chosen on this
topic, but because whoever loses in this argument is going
to have to update deployed implementations, I suggest we
make at least some effort to find out.

On the other hand, the good news is that other than the
details of the implementations involved, nobody will be
much affected as end users one way or the other by how
this turns out, as there is little dependent on the exact
fallback implicit weights given to unassigned code points
by a UCA implementation.