L2/09-094

From: Kenneth Whistler
Title: Response to clarification of implicit weights for ideographs in UCA
Action: For consideration by the UTC
Ref: L2/09-090
     http://www.unicode.org/L2/L2009/09090-uca-weight.html

Introduction

In response to the suggestions Mark Davis has made regarding the UCA categorization of ideographs for assigning implicit weights, I have some specific feedback, as listed below. Both of the suggestions in these feedback items refer to the existing text in Section 7.1.3 of UCA.

=================================================================

Feedback Item #1:

I suggest that the statement of the derivation of implicit weights in UCA Section 7.1.3 Implicit Weights (starting from the paragraph "To derive the collation elements..."), be updated to the following text for clarity of expression:

============================ revised text ======================

To derive the collation elements, the value of the code point is used to calculate two numbers, by bit shifting and bit masking. The bit operations are chosen so that the resultant numbers have the desired ranges for constructing implicit weights. The first number is calculated by taking the code point expressed as a 32-bit binary integer CP and bit shifting it right by 15 bits. Because code points range from U+0000 to U+10FFFF, the result will be a number in the range 0 to 0x21 (= 33 decimal). This number is then added to the special value BASE.

    AAAA = BASE + (CP >> 15);

Now mask off the bottom 15 bits of CP. OR a 1 into bit 15, so that the resultant value is non-zero.

    BBBB = (CP & 0x7FFF) | 0x8000;

AAAA and BBBB are interpreted as unsigned 16-bit integers. The implicit weight mapping given to the code point is then constructed as:

    [.AAAA.0020.0002.][.BBBB.0000.0000.]

[[ And the rest of the text is unaffected by this particular clarification.
]]

========================== end revision =========================

Feedback Item #2:

On the basic issue raised by Mark's document, I concur that the exact set of characters involved in the table defining the values for BASE needs to be more precisely specified.

However, I disagree with the presumption stated in the document that interpretation of "CJK Ideograph" and "CJK Ideograph Extension A/B" is by blocks. That presumption implies sweeping up unassigned code points into the default weighting given to (Unified) CJK Ideographs, which I do not think was the intent, back when this was added to UCA in 2001.

I have been unable (in finite time) to recover the full legislative history behind the change to Revision 9 of UCA. Revision 9 was published 2002-07-16, but it was based on L2/01-446, dated 2001-11-06, and in *that* document, the updated text for the Implicit Weights section of UCA appears verbatim as in the current version of UCA, with only the note:

    "[made the bottom of 7.1.2 a separate section, and generalized it]"

I *think*, but am not positive, that this resulted from an August, 2001 UTC chalk talk about the problem involved. It is summarized only by:

    [88-C3] Consensus: The UTC approves the following changes to collation:
      * Give CJK binary ordering
      ...

and my hand-written notes didn't provide any further clarification of that particular item at the time.

At any rate, what Mark is now proposing to do is to ratify the interpretation of the ICU implementation that:

    FB40 CJK Ideograph =
        [[:block=CJK_Unified_Ideographs:][:block=CJK_Compatibility_Ideographs:]]

    FB80 CJK Ideograph Extension A/B =
        [[:block=CJK_Unified_Ideographs_Extension_A:]
         [:block=CJK_Unified_Ideographs_Extension_B:]]

including any unassigned code points in those ranges. And by corollary:

    FBC0 Any other code point:

would *not* apply to unassigned code points in the CJK block ranges.
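
For illustration only, the derivation in Feedback Item #1 combined with the block-based reading of BASE just described might be sketched as follows. This is not taken from ICU or from any other implementation. The function names are hypothetical, Python is used only because the bit expressions carry over unchanged, and the block boundaries are assumptions taken from the Unicode block definitions rather than from L2/09-090.

```python
# Sketch of the implicit-weight derivation under the block-based reading.
# Assumption: block boundaries below come from the Unicode block
# definitions; all function names are hypothetical.

CJK_BLOCKS = [(0x4E00, 0x9FFF), (0xF900, 0xFAFF)]       # Unified, Compatibility
EXT_AB_BLOCKS = [(0x3400, 0x4DBF), (0x20000, 0x2A6DF)]  # Extension A, Extension B


def base_by_block(cp):
    """Pick BASE by whole block, unassigned code points included."""
    if any(lo <= cp <= hi for lo, hi in CJK_BLOCKS):
        return 0xFB40
    if any(lo <= cp <= hi for lo, hi in EXT_AB_BLOCKS):
        return 0xFB80
    return 0xFBC0


def implicit_weights(cp, base):
    """Return (AAAA, BBBB) for a code point and a given BASE."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    aaaa = base + (cp >> 15)       # cp >> 15 lies in 0..0x21
    bbbb = (cp & 0x7FFF) | 0x8000  # bit 15 is set, so BBBB is never zero
    return aaaa, bbbb


def collation_elements(cp, base):
    """Format the two implicit collation elements in UCA notation."""
    aaaa, bbbb = implicit_weights(cp, base)
    return "[.%04X.0020.0002.][.%04X.0000.0000.]" % (aaaa, bbbb)


if __name__ == "__main__":
    for cp in (0x4E00, 0x9FC4, 0x20000, 0x10FFFF):
        print("U+%04X" % cp, collation_elements(cp, base_by_block(cp)))
```

Under this reading U+9FC4, which is unassigned in Unicode 5.1, still takes the FB40 base because it lies inside the CJK Unified Ideographs block.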

The alternative interpretation, which I think is the correct one, is that the intent of these BASE assignments was always just for *assigned* Unified CJK Ideographs, and that the ICU implementation was a shortcut that assumed nothing but CJK Ideographs would ever be assigned in those blocks, so that it wouldn't have to revisit range changes.

Under my interpretation, the meanings of the BASE assignments (for Unicode 5.1) are:

    FB40 CJK Ideograph = 4E00..9FC3, FA0E..FA2D

    [note that FA0E..FA2D is the range of the "IBM 32", but as Mark indicates for the CJK Compatibility Characters block, the way UCA works, it could just as well be expressed as ranges including assigned CJK compatibility characters (with canonical equivalences), since those are all bled away first, anyway. The important thing is that it *cannot* be expressed as the full block range, i.e. F900..FAFF, without picking up unassigned code points.]

    FB80 CJK Extension A/B = 3400..4DB5, 20000..2A6D6

    FBC0 Any other code point:

would then apply to *all* unassigned code points.

The textual evidence for that (other than my recollection of how this all came about) consists of:

1. The UTC Consensus wording: "Give CJK binary ordering"

2. The DUCET Layout Table in Section 3.2 of UCA, which has a section for "implicit" which clarifies as follows:

     - CJK & CJK compatibility (those not decomposed)
     - CJK Extension A & B
     - Unassigned and others given implicit weights

   The crucial point here being that "unassigned" is here explicitly not included in the categories for CJK ideographs, and these 3 entries clearly are intended explicitly to echo the table of BASE values in Implicit Weights.

3. In Section 6.3.4 Reducing the Repertoire, a method for making one's table smaller by treating unsupported code points "as if they were unassigned" then points to Derived Collation Elements, where the implicit weights are calculated for unassigned characters.
   That also, IMO, strengthens my interpretation.

My clear preference here would be to adopt the second interpretation, i.e., that the BASE value assignments are for *assigned* (Unified) CJK Ideographs, to give them a well-defined binary ordering, and do *not* imply mixing in the handling of unassigned code points in the CJK Ideograph ranges.

The advantage of Mark's interpretation is that block ranges are less likely to shift than the ranges of assigned characters within blocks, so that an implementation of UCA will need less updating to deal with circumstances like the dribbles of unified CJK ideographs we have added to the URO. Note, however, that any UCA implementation *will* need to be updated to deal with future CJK block additions, which is why Mark is suggesting extending "CJK Extension A & B" to include all of Plane 2.

The advantage of my interpretation is that it is correct. ;-) Seriously, however, it means that unassigned characters that happen to lie within CJK ideographic blocks are treated identically to unassigned characters in any other block, which is a good thing. Also, Mark's proposal hasn't thought through what the implications are for the imminent addition of other CJK-like large ideographic blocks, which are *not* going to be allocated on Plane 2, but which will need parallel treatment as for CJK for derivation of implicit weights.

In the interest of full disclosure, I believe that folks should know that both of us are motivated in our positions by defense of existing implementations. Mark is arguing for retrofitting the current ICU implementation interpretation and making it de jure in the UCA standard. My position reflects what is currently implemented and deployed in the Sybase implementation of UCA. I don't know what other implementations of UCA (or ISO 14651) have chosen on this topic, but because whoever loses in this argument is going to have to update deployed implementations, I suggest we make at least some effort to find out.

On the other hand, the good news is that other than the details of the implementations involved, nobody will be much affected as end users one way or the other by how this turns out, as there is little dependent on the exact fallback implicit weights given to unassigned code points by a UCA implementation.
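
For anyone checking an implementation against the two readings, the difference might be sketched as below. This is an illustration under stated assumptions, not the ICU or Sybase code. The table and function names are hypothetical, the block boundaries are assumed from the Unicode block definitions, and the assigned ranges are the Unicode 5.1 values quoted under Feedback Item #2.

```python
# The two readings of the BASE table as (first, last, BASE) ranges.
# Assumptions: block boundaries come from the Unicode block definitions;
# assigned ranges are the Unicode 5.1 values quoted in this document.
# All names here are hypothetical.

BLOCK_READING = [
    (0x4E00, 0x9FFF, 0xFB40),    # CJK Unified Ideographs block
    (0xF900, 0xFAFF, 0xFB40),    # CJK Compatibility Ideographs block
    (0x3400, 0x4DBF, 0xFB80),    # Extension A block
    (0x20000, 0x2A6DF, 0xFB80),  # Extension B block
]

ASSIGNED_READING = [
    (0x4E00, 0x9FC3, 0xFB40),    # assigned unified ideographs
    (0xFA0E, 0xFA2D, 0xFB40),    # the "IBM 32" range
    (0x3400, 0x4DB5, 0xFB80),    # assigned Extension A
    (0x20000, 0x2A6D6, 0xFB80),  # assigned Extension B
]


def base_for(cp, table):
    """Return BASE for cp under one reading; FBC0 is the fallback."""
    for first, last, base in table:
        if first <= cp <= last:
            return base
    return 0xFBC0


def differs(cp):
    """True if the two readings assign different BASE values to cp.

    Only meaningful for code points that actually reach the implicit
    weight step; compatibility ideographs with canonical decompositions
    are handled before that step and never get here.
    """
    return base_for(cp, BLOCK_READING) != base_for(cp, ASSIGNED_READING)


if __name__ == "__main__":
    for cp in (0x4E00, 0x9FC4, 0x4DB6, 0xFADA, 0x2A6D7, 0x0378):
        print("U+%04X block=%04X assigned=%04X" % (
            cp, base_for(cp, BLOCK_READING), base_for(cp, ASSIGNED_READING)))
```

In this sketch the two tables disagree only for unassigned code points inside the CJK blocks, such as U+9FC4 or U+2A6D7. Assigned ideographs, and unassigned code points elsewhere, get the same BASE under either reading.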