L2/04-319

Source: Mark Davis
Subject: Collation Upper and Lower Bounds
Date: Mon, 2 Aug 2004 15:11:55 -0700

==========================

For collation, we have found it useful to have two fixed characters that
have the following properties across any tailoring. That is, they would be
guaranteed to be untailorable, and would always have these properties:

LB: is a character that sorts above all ignorable characters and below all
other non-ignorable characters. Thus it has a non-zero primary weight that
is less than any other character's (of those characters having non-zero
primary weights).

UB: is a character that sorts above all other characters. Thus it has a
primary weight that is greater than any other character's.
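
The two definitions can be read as invariants on primary weights. The
following is a minimal sketch in Python over a toy weight table, not real
UCA/DUCET data. The use of U+0002 for LB and U+0003 for UB, the set of
ignorable characters, and the names primary_weight and sort_key are all
assumptions made for illustration.

```python
# Toy primary weights, NOT real UCA/DUCET data.
LB = "\u0002"  # assumed lower-bound character
UB = "\u0003"  # assumed upper-bound character
IGNORABLES = {"\u00ad", "\u200d"}  # treated as ignorable in this toy table


def primary_weight(ch: str) -> int:
    """Return a toy primary weight; 0 means the character is ignorable."""
    if ch in IGNORABLES:
        return 0
    if ch == LB:
        return 1                # smallest non-zero weight
    if ch == UB:
        return 0x110000 + 2     # larger than any other weight
    return ord(ch) + 2          # every other character lies strictly between


def sort_key(s: str) -> tuple:
    """Primary-level key: the weights in order, with ignorables dropped."""
    return tuple(w for w in map(primary_weight, s) if w != 0)


# LB sorts below every other non-ignorable character, UB above all of them.
samples = ["a", "z", " ", "\U0010FFFD"]
assert all(
    0 < primary_weight(LB) < primary_weight(c) < primary_weight(UB)
    for c in samples
)
```

A real implementation would have to enforce the same ordering at the point
where tailorings are applied, so that no tailoring rule can move either
character.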

So why have these characters? There are two different usages. For LB, if
someone wants to sort multiple database fields in a single key, the LB is
used to distinguish where one field stops and another begins. There is more
detail on this in http://www.unicode.org/reports/tr10/#Interleaved_Levels,
with examples.
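
As a rough illustration of the field-separator use, the sketch below sorts
two-field records by a single merged key. It uses a toy, primary-level-only
weight function rather than real collation data; the choice of U+0002 for
LB, the sample records, and the names merged_key and naive_key are
assumptions made for the example.

```python
LB = "\u0002"  # assumed lower-bound character; any untailorable LB would do


def primary_weight(ch: str) -> int:
    """Toy primary weight; LB gets the lowest non-zero weight."""
    return 1 if ch == LB else ord(ch) + 2


def sort_key(s: str) -> tuple:
    """Toy case-insensitive, primary-level-only sort key."""
    return tuple(primary_weight(ch) for ch in s.casefold())


def merged_key(fields) -> tuple:
    """One key for several fields, joined with LB to mark field boundaries."""
    return sort_key(LB.join(fields))


def naive_key(fields) -> tuple:
    """Plain concatenation with no separator, shown only for contrast."""
    return sort_key("".join(fields))


records = [("Smithe", "Al"), ("Smith", "Jo")]

# Reference order: compare field by field.
by_fields = sorted(records, key=lambda r: [sort_key(f) for f in r])
by_merged = sorted(records, key=merged_key)
by_naive = sorted(records, key=naive_key)

print(by_merged)  # same order as by_fields: "Smith" before "Smithe"
print(by_naive)   # differs: "smitheal" sorts before "smithjo"
```

Because LB weighs less than any character that can appear inside a field,
the end of a shorter field always sorts before the continuation of a longer
one, which is what makes the merged key agree with field-by-field
comparison.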

For UB, if someone wants to do a select on a range between two strings, say
"Johnson" to "Jones", and wants to include all strings that "begin" with
Jones, then the user can pass in "Jones<UB>" as the upper end of the range.
It's not needed on the lower end, because a string always sorts before any
longer string that begins with it.
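
The effect of appending UB to the upper end of a range can be sketched as
follows. This is a toy, primary-level-only comparison, not real collation
data; the choice of U+0003 for UB, the sample names, the endpoints
"Johnson" and "Jones" (given in the order they sort under the toy
weights), and the function select_range are assumptions made for
illustration.

```python
UB = "\u0003"  # assumed upper-bound character; any untailorable UB would do


def primary_weight(ch: str) -> int:
    """Toy primary weight; UB outranks every other character."""
    return 0x110000 if ch == UB else ord(ch)


def sort_key(s: str) -> tuple:
    """Toy case-insensitive, primary-level-only sort key."""
    return tuple(primary_weight(ch) for ch in s.casefold())


def select_range(names, lower, upper):
    """Return the names whose keys fall in [lower, upper], in sorted order."""
    lo, hi = sort_key(lower), sort_key(upper)
    return sorted((n for n in names if lo <= sort_key(n) <= hi), key=sort_key)


names = ["Jackson", "Johnson", "Johnston", "Jones", "Jonesboro", "Jonson"]

plain = select_range(names, "Johnson", "Jones")
with_ub = select_range(names, "Johnson", "Jones" + UB)

print(plain)    # ['Johnson', 'Johnston', 'Jones']
print(with_ub)  # ['Johnson', 'Johnston', 'Jones', 'Jonesboro']
```

Without UB, "Jonesboro" falls outside the range because it is longer than
the upper endpoint. With UB appended, every string that begins with the
upper endpoint compares below it, while strings such as "Jonson" that
merely sort after it remain excluded.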

Control characters that are either unused or would not occur in the middle
of plain text might be the best candidates for these two characters. Here
are some suggestions:

LB:
U+0002 START OF TEXT
U+001F UNIT SEPARATOR
U+009C STRING TERMINATOR

UB:
U+0003 END OF TEXT
U+0004 END OF TRANSMISSION
U+0019 END OF MEDIUM

Currently ICU uses an API to get the functionality of LB, and uses a
noncharacter for UB (U+10FFFF). This works perfectly fine for us. But the
above requirements are quite general, so we believe that it would be
generally useful to have two distinguished characters with the same meaning
across different implementations.