In a message dated 2001-05-26 16:00:47 Pacific Daylight Time,
> The issue is this: Unicode's three encoding forms don't sort in the same
> way when sorting is done using that most basic and
> valid-in-almost-no-locales-but-easy-and-quick approach of simply comparing
> binary values of code units. The three give these results:
> UTF-8: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
> UTF-16: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
First, everyone take a breath and say it out loud: "UTF-16 is a hack."
There, doesn't that feel better? Whether it is necessary, beneficial, or
unavoidable is beside the point. Using pairs of 16-bit "surrogates" together
with an additive offset to refer to a code point above U+FFFF may be a clever solution
to the problem, but it is still a hack, especially when those surrogate
values fall in the middle of the range of normal 16-bit values as they do.
UTF-8 and UTF-32 should absolutely not be similarly hacked to maintain some
sort of bizarre "compatibility" with the binary sorting order of UTF-16.
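For anyone who wants to see the arithmetic behind the surrogate mechanism, here is a minimal sketch in Python. The function names are my own, chosen for illustration. It assumes the standard UTF-16 layout: subtract 0x10000, then split the remaining 20 bits across a high surrogate in D800..DBFF and a low surrogate in DC00..DFFF.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000             # remove the additive offset, leaving 20 bits
    high = 0xD800 + (v >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a high/low surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    for cp in (0x10000, 0x1F600, 0x10FFFF):
        high, low = to_surrogates(cp)
        print(f"U+{cp:04X} -> {high:04X} {low:04X}")
```

Every high surrogate falls between D800 and DBFF, which is below E000, and that is the whole reason the supplementary planes land in the middle of the 16-bit range when code units are compared.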
Anyone who is using the binary sorting order of UTF-16, and thus concludes
that (pardon the use of 10646 terms here) Planes 1 through 16 should be
sorted after U+D7FF but before U+E000, is really missing the point of proper
collation. I would state the case even more strongly than Peter, to say that
such a collation order is valid in NO locale at all.
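The three binary orderings under discussion are easy to reproduce. The following is only a sketch in Python: the sample characters and helper names are mine, and it compares sequences of code units (bytes for UTF-8, 16-bit units for UTF-16, whole code points for UTF-32) rather than doing anything resembling real collation.

```python
# One character from each range: below the surrogates, above them in the BMP,
# and a supplementary character from Plane 1.
samples = ["\uD7FF", "\uE000", "\uFFFF", "\U00010000"]


def utf8_units(s):
    """UTF-8 code units are simply the encoded bytes."""
    return list(s.encode("utf-8"))


def utf16_units(s):
    """UTF-16 code units: decode the big-endian byte stream two bytes at a time."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def utf32_units(s):
    """UTF-32 code units are the code points themselves."""
    return [ord(c) for c in s]


def binary_order(key):
    """Sort the samples by their code unit sequences and label each result."""
    return [f"U+{ord(s):04X}" for s in sorted(samples, key=key)]


if __name__ == "__main__":
    print("UTF-8: ", binary_order(utf8_units))
    print("UTF-16:", binary_order(utf16_units))
    print("UTF-32:", binary_order(utf32_units))
```

Only the UTF-16 run places U+10000 between U+D7FF and U+E000; the UTF-8 and UTF-32 runs both keep it last, in code point order.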
If developers expect to sort Unicode text in any meaningful way, they should
be using the Unicode Collation Algorithm (UTS #10). Using strict code point
order as a basis for sorting is generally not appropriate, and applying the
UTF-16 transformation as a further basis for sorting only compounds the error.
UTC should not, and almost certainly will not, endorse such a proposal on the
part of the database vendors.