L2/04-319

Source: Mark Davis
Subject: Collation Upper and Lower Bounds
Date: Mon, 2 Aug 2004 15:11:55 -0700
==========================

For collation, we have found that it is useful to have two fixed characters that keep the following properties across any tailoring. That is, they would be guaranteed to be untailorable, and would always have these properties:

LB: a character that sorts above all ignorable characters and below all other non-ignorable characters. Thus it has a non-zero primary weight that is less than that of any other character with a non-zero primary weight.

UB: a character that sorts above all other characters. Thus it has a primary weight that is greater than any other character's.

So why have these characters? There are two different usages.

For LB, if someone wants to sort multiple database fields in a single key, the LB is used to distinguish where one field stops and another begins. There is more detail on this in http://www.unicode.org/reports/tr10/#Interleaved_Levels, with examples.

For UB, if someone wants to do a select on a range between two strings, say "Jones" to "Johnson", and wants to include all strings that "begin" with Johnson, then the user can pass in "Johnson" followed by UB as the upper end of the range. It's not needed on the lower end, because a string sorts before any longer string that begins with it.

Control characters that are either unused or would not occur in the middle of plain text might be the best candidates for these. Here are some suggestions:

LB:
U+0002 START OF TEXT
U+001F UNIT SEPARATOR
U+009C STRING TERMINATOR

UB:
U+0003 END OF TEXT
U+0004 END OF TRANSMISSION
U+0019 END OF MEDIUM

Currently ICU uses an API to get the functionality of LB, and uses a private use character for UB (U+10FFFF). This works perfectly fine for us.
But the above requirements are quite general, so we believe that it would be generally useful to have two distinguished characters with the same meaning across different implementations.
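
The two usages can be illustrated with a minimal sketch. It does not use ICU or real UCA weights. It assumes a toy single-level collation in which each character has exactly one primary weight, and it picks U+0002 and U+0003 from the suggestions above as stand-ins for LB and UB. The names primary_weight, sort_key, merged_key and select_range are hypothetical, as is the lower bound "Johns" used in the range example.

```python
# Toy single-level collation, for illustration only (not ICU or UCA weights).
LB = "\u0002"  # assumed lower-bound character (one of the suggestions above)
UB = "\u0003"  # assumed upper-bound character (one of the suggestions above)

LB_WEIGHT = 1             # lowest non-zero primary weight
UB_WEIGHT = 0x10FFFF + 3  # higher than any weight assigned below


def primary_weight(ch):
    """Return a toy primary weight: LB lowest, UB highest, others between."""
    if ch == LB:
        return LB_WEIGHT
    if ch == UB:
        return UB_WEIGHT
    return ord(ch) + 2


def sort_key(text):
    """Build a sort key as a tuple of primary weights."""
    return tuple(primary_weight(ch) for ch in text)


def merged_key(fields):
    """One key for several fields, with LB marking each field boundary."""
    return sort_key(LB.join(fields))


def select_range(strings, lower, upper_prefix):
    """Return strings from `lower` up to everything starting with `upper_prefix`."""
    low = sort_key(lower)
    high = sort_key(upper_prefix + UB)
    return [s for s in strings if low <= sort_key(s) <= high]


# Usage 1: a single merged key orders records the same way as
# comparing them field by field.
records = [("ab", "c"), ("a", "bc"), ("a", "d")]
by_merged_key = sorted(records, key=merged_key)
by_fields = sorted(records, key=lambda r: tuple(sort_key(f) for f in r))

# Usage 2: appending UB to the upper end keeps strings that begin with it.
names = ["Johns", "Johnson", "Johnsonville", "Johnston", "Jones"]
selected = select_range(names, "Johns", "Johnson")

print(by_merged_key == by_fields)
print(selected)
```

In this sketch, plain concatenation would give ("ab", "c") and ("a", "bc") the same key, while the LB boundary sorts ("a", "bc") first because LB weighs less than "b". In the range example, an upper end of "Johnson" without UB would drop "Johnsonville", since it is longer than the bound and so sorts after it.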