L2/05-302

Date: October 13, 2005
Title: Collation Upper- and Lower-Bound Characters
Source: Ken Whistler
Action: For discussion by the UTC

Background

In L2/04-319, Mark Davis suggested designating two (existing) characters as having special weights in the context of the Unicode Collation Algorithm. Details can be seen in L2/04-319, but in essence:

  LB (Lower-bound): Given a (non-ignorable) primary weight so that it sorts above all ignorable characters, but below all other non-ignorable characters.

  UB (Upper-bound): Given a primary weight greater than that of any other character.

Both of these concepts have valid applications in a database context.

LB can be used as a delimiter when concatenating multiple fields into strings that can then be given a single key for comparison. This has the effect of merged field comparison, as documented in:

  http://www.unicode.org/reports/tr10/#Interleaved_Levels

UB can be used to implement substring range comparison. As stated in L2/04-319, if you want to do a SELECT statement on the range "Bag" to "Biz", but you want to pick up all entries that *begin* with the substring "Biz", and not just "Biz" itself (e.g. "Biza", "Bizness", "Bizlbub", etc.), then you can use "Biz" + UB as your upper range comparator string. This works because "Biz" + UB will compare higher than any other string starting with "Biz".

L2/04-319 suggested several possible candidates for LB and UB:

  LB: U+0002 START OF TEXT
      U+001F UNIT SEPARATOR
      U+009C STRING TERMINATOR

  UB: U+0003 END OF TEXT
      U+0004 END OF TRANSMISSION
      U+0019 END OF MEDIUM

And there are plenty of other possible candidate characters that see little actual use and which could be given special weighting behavior such as this for the UCA.

L2/04-319 did not work out the details of exactly how weights would be assigned in DUCET for LB and UB, or even whether they would have to be assigned in DUCET per se. Given the structure of weight assignments in DUCET, it would even be possible to take a given version of DUCET, calculate the relevant values, and assign weights for LB and UB at runtime. However, L2/04-319 did propose that LB and UB would need to be guaranteed to be untailorable, because the whole point is that LB and UB be guaranteed to always be lower than (or greater than, respectively) all other non-ignorable character weights. If you don't constrain their tailoring, it would be trivial to create a tailoring that breaks those relationships.

L2/04-319 concluded:

  "Currently ICU uses API to get the functionality of LB, and uses a private use character for UB (U+10FFFF). This works perfectly fine for us. But the above requirements are quite general, so we believe that it would be generally useful to have two distinguished characters with the same meaning across different implementations."

[Picking a nit: U+10FFFF is not a private-use character, but a noncharacter.]

Having mulled this over for a while, I see some problems in the proposal that lead me to believe that proceeding with the proposed approach -- designating two (existing) characters as LB and UB and weighting them accordingly in the DUCET -- is not advisable.

Problem 1: Exactly what to do for weighting

The problem here is variable weight assignment in DUCET. Abbreviating from the current table, allkeys.txt, to highlight the significant transition points, we have:

  0000  ; [.0000.0000.0000.0000] # [0000] NULL (in 6429)
  ...
  0009  ; [*0201.0020.0002.0009] # HORIZONTAL TABULATION (in 6429)
  ...
  0020  ; [*0209.0020.0002.0020] # SPACE
  0021  ; [*0253.0020.0002.0021] # EXCLAMATION MARK
  ...
  10A47 ; [*0F34.0020.0002.10A47] # KHAROSHTHI NUMBER ONE THOUSAND
  ...
  0332  ; [.0000.0021.0002.0332] # COMBINING LOW LINE
  ...
  20EB  ; [.0000.017B.0002.20EB] # COMBINING LONG DOUBLE SOLIDUS OVERLAY
  02D0  ; [.0F35.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON
  ...
  2FA1D ; [.FB85.0020.0002.2A600][.A600.0000.0000.2A600] # CJK COMPATIBILITY IDEOGRAPH-2FA1D; QQC

The elements marked with a "*" are the variable collation elements. Depending on the options chosen for key generation (Blanked versus Shifted, for example), their primary weights may or may not end up zeroed out. *0201 is the lowest assigned variable collation element weight, and *0F34 is the highest (in the current table). 0200 would be the lowest possible primary weight in the table, but it is currently not used. U+02D0 is the lowest *non*-ignorable character given a primary weight, with its primary weight 0F35 starting just above the highest primary weight for the variable collation elements. And from there, primary weights range all the way up into the very high constructed double-element implicit weights for Han characters (and unassigned code points, which do not appear in the table).

The problem is this: there are two incompatible potential choices for weighting LB, and neither of them would work for all choices of key generation options for the variable-weighted characters.

Choice A: Give LB the weight 0F35 and shift all subsequent weights in the table up by one. This seems to be what was implied by the proposal in L2/04-319. It would work for the Blanked or Shifted key generation options, but if the Non-ignorable option were chosen, the LB weight would no longer be the lowest primary, and comparisons would fail for all of the characters with primary weights in the range 0201..0F34.

Choice B: Give LB the weight 0200. This would, indeed, work for all the key generation options, since even if you choose Non-ignorable, 0200 is guaranteed to be a lower primary than that for the lowest variable collation element, namely *0201 for TAB. The problem, however, is that this breaks another design element of the table, one deliberately put in place to enable table memory optimizations: all primary weight values for "ignorable" characters (namely the variable collation elements) are guaranteed to be lower than the lowest non-ignorable primary weight. Since LB has to have a non-ignorable weight, that relationship would be broken, and implementations which optimize based on the range guarantees would be broken as well.

There might be some other way to handle the weighting that would preserve both the lowest-primary guarantee for LB and the range optimization guarantee, but I am having trouble finding it. And before attempting to hack further at the table to engineer a solution, I have to wonder whether it is really necessary to find one.

ICU and other implementations handle the merged field problem via API, rather than by using a formally designated character for this function. Applications have traditionally handled the multi-field comparison problem for years by simple concatenation schemes. Concatenating field values with a delimiter character, typically some punctuation mark that sorts lower than letters and digits, gets you the right answer in most cases, as long as you choose a character that sorts lower than any data you actually allow in the data fields. You can even do the field delimitation with one of the control codes that was designed for data field separation in the first place, U+001F UNIT SEPARATOR.
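As an illustration (not part of the original proposal), here is a minimal Java sketch of that concatenation scheme, using the standard java.text.Collator API. It assumes that the chosen delimiter -- U+001F here -- receives a low, non-zero primary weight in the implementation's table; the names and locale are arbitrary examples.

  import java.text.CollationKey;
  import java.text.Collator;
  import java.util.Locale;

  public class MergedFieldCompare {

      // Join field values with a delimiter so that a single sort key
      // covers the whole record. The scheme is correct only if the
      // delimiter sorts lower, at the primary level, than any character
      // allowed in the field data.
      static String mergeFields(char delimiter, String... fields) {
          return String.join(String.valueOf(delimiter), fields);
      }

      public static void main(String[] args) {
          Collator coll = Collator.getInstance(Locale.US);

          // Records of (lastName, firstName). With a suitably low-weighted
          // delimiter, ("Smith", "Zoe") sorts before ("Smithson", "Al"),
          // because the delimiter terminates "Smith" before the 's' of
          // "Smithson" is reached.
          CollationKey a = coll.getCollationKey(
                  mergeFields('\u001F', "Smith", "Zoe"));
          CollationKey b = coll.getCollationKey(
                  mergeFields('\u001F', "Smithson", "Al"));

          System.out.println(a.compareTo(b) < 0 ? "a < b" : "a >= b");
      }
  }

If an implementation instead treats U+001F as completely ignorable, the comparison degenerates to comparing the concatenated letters ("SmithZoe" versus "SmithsonAl"), which is exactly the failure that a guaranteed LB weight is meant to rule out.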
The argument that the *character* for this has to be interoperable strikes me as a weak one, because the usefulness of LB lies in processing structured data in database contexts. It is a transient delimitation used in comparing multiple data fields, rather than actual *content* which would be entered or queried back from the data. And I don't see it catching on as an alternative to the existing protocols for flattening out fielded data with delimiters for interchange as plain text.

The weight to assign for UB is also a bit of a problem. The largest values normally assigned by the algorithm come from the use of implicit weights for Han characters and unassigned code points (and noncharacters). The highest would be for U+10FFFF:

  U+10FFFF => [.FBE1.0020.0002][.FFFF.0000.0000]

So you could guarantee that UB would sort higher than that by choosing a primary weight > FBE1. However, the BASE values for implicit weights are not required to be the exact values given in Section 7.1.3 of the UCA, and it is tricky to figure out how to keep the UB weight stably highest while allowing implementations to fiddle with BASE and implicit weight generation, unless UB were simply assigned FFFF as a primary weight and FFFF were otherwise disallowed in DUCET or in weights directly derived from it.

The easiest thing to do is exactly what ICU does: just use the *non*character U+10FFFF internally as your implementation's UB function. Unless you have done something funky in implementing the UCA, that will give you your highest key value automatically. And since, just as for LB, there is very little evidence that UB needs to function as an interchangeable character in plain text, it seems preferable to me simply to recommend in the text of the UCA that if an implementation needs UB functionality, the noncharacter U+10FFFF will work just fine. (In fact, in many implementations U+FFFF will work as well -- and it has the advantage that it works both for multi-level comparisons and for binary comparisons. A short illustrative sketch appears at the end of this document.)

Problem 2: Nontailorability

Claiming that whatever characters we choose for LB and UB would then have to become untailorable is a problem, because it introduces an architecturally new concept into the table and into the UCA. To date, the algorithm has not tried to impose formal constraints on which characters can be tailored. Introducing such a concept at this point is particularly troubling, because we would also have to figure out how to carry it over into ISO 14651, or countenance desynchronizing the two standards. There is no mechanism in 14651, either, for declaring that some particular character weights in the table cannot be changed by a tailoring.

Conclusion

Given these problems, I think the reasonable thing to do is to give up on designating specific Unicode characters for LB and UB, and instead to explain the LB and UB *functions* in the text of the UCA, with some implementation suggestions regarding how to approach implementing those functions. I don't think we need actual weights in the DUCET table for this.
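For illustration (again, not a normative proposal), here is a minimal Java sketch of the UB range-query technique described under Problem 1, using the standard java.text.Collator API and U+FFFF as the sentinel. It assumes the implementation weights U+FFFF above all assigned characters, as UCA implicit weighting would; the names and locale are arbitrary examples.

  import java.text.Collator;
  import java.util.List;
  import java.util.Locale;

  public class UpperBoundRange {
      public static void main(String[] args) {
          Collator coll = Collator.getInstance(Locale.US);

          // Sentinel appended to the high end of the range. U+FFFF is a
          // noncharacter, so it cannot occur in well-formed interchanged
          // data; given the weighting assumption above, "Biz" + sentinel
          // compares above every string that begins with "Biz".
          String lo = "Bag";
          String hi = "Biz" + '\uFFFF';

          // Emulates: SELECT ... WHERE name BETWEEN 'Bag' AND 'Biz' + UB
          List<String> names =
                  List.of("Bag", "Best", "Biz", "Biza", "Bizness", "Bjorn");
          for (String name : names) {
              if (coll.compare(lo, name) <= 0 && coll.compare(name, hi) <= 0) {
                  // Expected under the assumption above:
                  // Bag, Best, Biz, Biza, Bizness (but not Bjorn)
                  System.out.println(name);
              }
          }
      }
  }

The same loop works with binary comparison (String.compareTo) in place of the collator, since U+FFFF is also the highest code unit in UTF-16 order -- which is the advantage of U+FFFF noted above.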