L2/05-302

Date: October 13, 2005
Title: Collation Upper- and Lower-Bound Characters
Source: Ken Whistler
Action: For discussion by the UTC

Background

In L2/04-319, Mark Davis suggested designating two (existing) characters as having special weights in the context of the Unicode Collation Algorithm. Details can be seen in L2/04-319, but in essence:

  LB (Lower-bound): Given a (non-ignorable) primary weight so that it sorts above all ignorable characters, but below all other non-ignorable characters.

  UB (Upper-bound): Given a primary weight greater than that of any other character.

Both of these concepts have valid applications in a database context.

LB can be used as a delimiter when concatenating multiple fields into strings that can then be given a single key for comparison. This has the effect of merged field comparison, as documented in:

  http://www.unicode.org/reports/tr10/#Interleaved_Levels

UB can be used to implement substring range comparison. As stated in L2/04-319, if you want to do a SELECT statement on the range "Bag" to "Biz", but you want to pick up all entries that *begin* with the substring "Biz", and not just "Biz" itself (e.g. "Biza", "Bizness", "Bizlbub", etc.), then you can use "Biz" + UB as your upper range comparator string. This works because "Biz" + UB will compare higher than any other string starting with "Biz".

L2/04-319 suggested several possible candidates for LB and UB:

  LB: U+0002 START OF TEXT
      U+001F UNIT SEPARATOR
      U+009C STRING TERMINATOR

  UB: U+0003 END OF TEXT
      U+0004 END OF TRANSMISSION
      U+0019 END OF MEDIUM

And there are plenty of other possible candidate characters that see little actual use and which could be given special weighting behavior such as this for the UCA.

L2/04-319 did not work out the details of exactly how weights would be assigned in DUCET for LB and UB, or even whether they would have to be assigned in DUCET per se. Given the structure of weight assignments in DUCET, it would even be possible to take a given version of DUCET, calculate the relevant values, and assign weights for LB and UB at runtime. However, L2/04-319 did propose that LB and UB would need to be guaranteed to be untailorable, because the whole point is that LB and UB be guaranteed to always be lower than (or greater than, respectively) all other non-ignorable character weights. If you don't constrain their tailoring, it would be trivial to create a tailoring that breaks those relationships.

L2/04-319 concluded:

  "Currently ICU uses API to get the functionality of LB, and uses a private use character for UB (U+10FFFF). This works perfectly fine for us. But the above requirements are quite general, so we believe that it would be generally useful to have two distinguished characters with the same meaning across different implementations."

[Picking a nit: U+10FFFF is not a private-use character, but a noncharacter.]

Having mulled this over for a while, I see some problems in the proposal that lead me to believe that proceeding with the proposed approach -- designating two (existing) characters as LB and UB and weighting them accordingly in the DUCET -- is not advisable.

Problem 1: Exactly what to do for weighting

The problem here is variable weight assignment in DUCET. Abbreviating from the current table, allkeys.txt, to highlight the significant transition points, we have:

  0000  ; [.0000.0000.0000.0000] # [0000] NULL (in 6429)
  ...
  0009  ; [*0201.0020.0002.0009] # HORIZONTAL TABULATION (in 6429)
  ...
  0020  ; [*0209.0020.0002.0020] # SPACE
  0021  ; [*0253.0020.0002.0021] # EXCLAMATION MARK
  ...
  10A47 ; [*0F34.0020.0002.10A47] # KHAROSHTHI NUMBER ONE THOUSAND
  ...
  0332  ; [.0000.0021.0002.0332] # COMBINING LOW LINE
  ...
  20EB  ; [.0000.017B.0002.20EB] # COMBINING LONG DOUBLE SOLIDUS OVERLAY
  02D0  ; [.0F35.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON
  ...
  2FA1D ; [.FB85.0020.0002.2A600][.A600.0000.0000.2A600] # CJK COMPATIBILITY IDEOGRAPH-2FA1D; QQC

The elements marked with a "*" are the variable collation elements. Depending on the options chosen for key generation (Blanked versus Shifted, for example), their primary weights may or may not end up zeroed out. *0201 is the lowest assigned variable collation element weight, and *0F34 is the highest (in the current table). 0200 would be the lowest possible primary weight in the table, but it is currently not used. U+02D0 is the lowest *non*-ignorable character given a primary weight, with its primary weight 0F35 starting just above the highest primary weight for the variable collation elements. And from there, primary weights range all the way up into the very high constructed double-element implicit weights for Han characters (and unassigned code points, which do not appear in the table).

The problem is this: there are two incompatible potential choices for weighting LB, and neither of them would work for all choices of key generation options for the variable-weighted characters.

Choice A: Give LB the weight 0F35 and shift all subsequent weights in the table up by one. This seems to be what was implied by the proposal in L2/04-319. It would work for the Blanked or Shifted key generation options, but if the Non-ignorable option were chosen, the LB weight would no longer be the lowest primary, and comparisons would fail for all of the characters with primary weights in the range 0201..0F34.

Choice B: Give LB the weight 0200. This would, indeed, work for all the key generation options, since even if you choose Non-ignorable, 0200 is guaranteed to be a lower primary than that for the lowest variable collation element, namely *0201 for TAB. The problem, however, is that this breaks another design element of the table, one deliberately put in place to enable table memory optimizations: all primary weight values for "ignorable" characters (namely the variable collation elements) are guaranteed to be lower than the lowest non-ignorable primary weight. Since LB has to have a non-ignorable weight, that relationship would be broken, and implementations which optimize based on the range guarantees would be broken as well.

There might be some other way to handle the weighting that would preserve both the lowest-primary guarantee for LB and the range optimization guarantee, but I am having trouble finding it. And before attempting to hack further at the table to engineer a solution, I have to wonder whether it is really necessary to find one.

ICU and other implementations handle the merged field problem via API, rather than by using a formally designated character for this function. Applications have traditionally handled the multi-field comparison problem for years by simple concatenation schemes. Concatenating field values with a delimiter character, typically some punctuation mark that sorts lower than letters and digits, gets you the right answer in most cases, as long as you choose a character that sorts lower than any data you actually allow in the data fields. You can even do the field delimitation with one of the control codes that was designed for data field separation in the first place, U+001F UNIT SEPARATOR.
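As an illustration (not part of the original proposal), here is a minimal Java sketch of that concatenation scheme, using the standard java.text.Collator API. It assumes that the chosen delimiter -- U+001F here -- receives a low, non-zero primary weight in the implementation's table; the names and locale are arbitrary examples.

  import java.text.CollationKey;
  import java.text.Collator;
  import java.util.Locale;

  public class MergedFieldCompare {

      // Join field values with a delimiter so that a single sort key
      // covers the whole record. The scheme is correct only if the
      // delimiter sorts lower, at the primary level, than any character
      // allowed in the field data.
      static String mergeFields(char delimiter, String... fields) {
          return String.join(String.valueOf(delimiter), fields);
      }

      public static void main(String[] args) {
          Collator coll = Collator.getInstance(Locale.US);

          // Records of (lastName, firstName). With a suitably low-weighted
          // delimiter, ("Smith", "Zoe") sorts before ("Smithson", "Al"),
          // because the delimiter terminates "Smith" before the 's' of
          // "Smithson" is reached.
          CollationKey a = coll.getCollationKey(
                  mergeFields('\u001F', "Smith", "Zoe"));
          CollationKey b = coll.getCollationKey(
                  mergeFields('\u001F', "Smithson", "Al"));

          System.out.println(a.compareTo(b) < 0 ? "a < b" : "a >= b");
      }
  }

If an implementation instead treats U+001F as completely ignorable, the comparison degenerates to comparing the concatenated letters ("SmithZoe" versus "SmithsonAl"), which is exactly the failure that a guaranteed LB weight is meant to rule out.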
The argument that the *character* for this has to be interoperable strikes me as a weak one, because the usefulness of LB lies in processing structured data in database contexts. It is a transient delimitation used in comparing multiple data fields, rather than actual *content* which would be entered or queried back from the data. And I don't see it catching on as an alternative to the existing protocols for flattening out fielded data with delimiters for interchange as plain text.

The weight to assign for UB is also a bit of a problem. The largest values normally assigned by the algorithm come from the use of implicit weights for Han characters and unassigned code points (and noncharacters). The highest would be for U+10FFFF:

  U+10FFFF => [.FBE1.0020.0002][.FFFF.0000.0000]

So you could guarantee that UB would sort higher than that by choosing a primary weight > FBE1. However, the BASE values for implicit weights are not required to be the exact values given in Section 7.1.3 of the UCA, and it is tricky to figure out how to keep the UB weight stably highest while allowing implementations to fiddle with BASE and implicit weight generation, unless UB were simply assigned FFFF as a primary weight and FFFF were otherwise disallowed in DUCET or in weights directly derived from it.

The easiest thing to do is exactly what ICU does: just use the *non*character U+10FFFF internally as your implementation's UB function. Unless you have done something funky in implementing the UCA, that will give you your highest key value automatically. And since, just as for LB, there is very little evidence that UB needs to function as an interchangeable character in plain text, it seems preferable to me simply to recommend in the text of the UCA that if an implementation needs UB functionality, the noncharacter U+10FFFF will work just fine. (In fact, in many implementations U+FFFF will work as well -- and it has the advantage that it works both for multi-level comparisons and for binary comparisons. A short illustrative sketch appears at the end of this document.)

Problem 2: Nontailorability

Claiming that whatever characters we choose for LB and UB would then have to become untailorable is a problem, because it introduces an architecturally new concept into the table and into the UCA. To date, the algorithm has not tried to impose formal constraints on which characters can be tailored. Introducing such a concept at this point is particularly troubling, because we would also have to figure out how to carry it over into ISO 14651, or countenance desynchronizing the two standards. There is no mechanism in 14651, either, for declaring that some particular character weights in the table cannot be changed by a tailoring.

Conclusion

Given these problems, I think the reasonable thing to do is to give up on designating specific Unicode characters for LB and UB, and instead to explain the LB and UB *functions* in the text of the UCA, with some implementation suggestions regarding how to approach implementing those functions. I don't think we need actual weights in the DUCET table for this.
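For illustration (again, not a normative proposal), here is a minimal Java sketch of the UB range-query technique described under Problem 1, using the standard java.text.Collator API and U+FFFF as the sentinel. It assumes the implementation weights U+FFFF above all assigned characters, as UCA implicit weighting would; the names and locale are arbitrary examples.

  import java.text.Collator;
  import java.util.List;
  import java.util.Locale;

  public class UpperBoundRange {
      public static void main(String[] args) {
          Collator coll = Collator.getInstance(Locale.US);

          // Sentinel appended to the high end of the range. U+FFFF is a
          // noncharacter, so it cannot occur in well-formed interchanged
          // data; given the weighting assumption above, "Biz" + sentinel
          // compares above every string that begins with "Biz".
          String lo = "Bag";
          String hi = "Biz" + '\uFFFF';

          // Emulates: SELECT ... WHERE name BETWEEN 'Bag' AND 'Biz' + UB
          List<String> names =
                  List.of("Bag", "Best", "Biz", "Biza", "Bizness", "Bjorn");
          for (String name : names) {
              if (coll.compare(lo, name) <= 0 && coll.compare(name, hi) <= 0) {
                  // Expected under the assumption above:
                  // Bag, Best, Biz, Biza, Bizness (but not Bjorn)
                  System.out.println(name);
              }
          }
      }
  }

The same loop works with binary comparison (String.compareTo) in place of the collator, since U+FFFF is also the highest code unit in UTF-16 order -- which is the advantage of U+FFFF noted above.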