Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: DougEwell2@cs.com
Date: Mon May 28 2001 - 06:30:13 EDT

Next message: DougEwell2@cs.com: "Unicode-based Cyrillic-Latin transliteration table"
Previous message: Marco Cimarosti: "Question about UTR#24"
Maybe in reply to: Peter_Constable@sil.org: "ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)"
Next in thread: Carl W. Brown: "RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)"
Reply: Carl W. Brown: "RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)"

In a message dated 2001-05-26 16:00:47 Pacific Daylight Time,
Peter_Constable@sil.org writes:

> The issue is this: Unicode's three encoding forms don't sort in the same
> way when sorting is done using that most basic and
> valid-in-almost-no-locales-but-easy-and-quick approach of simply comparing
> binary values of code units. The three give these results:
>
> UTF-8: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
> UTF-16: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)

First, everyone take a breath and say it out loud: "UTF-16 is a hack."
There, doesn't that feel better? Whether it is necessary, beneficial, or
unavoidable is beside the point. Using pairs of 16-bit "surrogates" together
with an additive offset to refer to a code point above U+FFFF may be a clever
solution to the problem, but it is still a hack, especially when those
surrogate values fall in the middle of the range of normal 16-bit values as
they do.
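
For readers who have not seen the arithmetic, here is a minimal sketch of the
surrogate mechanism in Python. The function names are made up for this
illustration; the constants (0x10000, 0xD800, 0xDC00) are the ones the UTF-16
encoding form defines.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high, low) UTF-16 surrogate pair. Hypothetical helper name."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000              # subtract the offset: 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it stands for."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = to_surrogates(0x10000)          # first code point of Plane 1
```

Here to_surrogates(0x10000) gives (0xD800, 0xDC00), so the very first
supplementary character already begins with a code unit that sits between
0xD7FF and 0xE000.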

UTF-8 and UTF-32 should absolutely not be similarly hacked to maintain some
sort of bizarre "compatibility" with the binary sorting order of UTF-16.
Anyone who uses the binary sorting order of UTF-16 and thus concludes that
(pardon the use of 10646 terms here) Planes 1 through 16 should be sorted
after U+D7FF but before U+E000 is really missing the point of proper
collation. I would state the case even more strongly than Peter and say that
such a collation order is valid in NO locale at all.
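
The difference in binary order that Peter lists is easy to reproduce. The
sketch below sorts three sample characters by their encoded bytes. It assumes
the big-endian variants of UTF-16 and UTF-32, so that comparing bytes gives
the same result as comparing code units; the sample characters and the helper
name are chosen for this illustration only.

```python
# One character just below the surrogate range, one just above it in the
# BMP, and one supplementary character (the first code point of Plane 1).
samples = ["\uD7FF", "\uE000", "\U00010000"]


def binary_order(encoding):
    """Sort the samples by their encoded bytes; return them as U+XXXX labels."""
    ordered = sorted(samples, key=lambda s: s.encode(encoding))
    return ["U+%04X" % ord(s) for s in ordered]


utf8_order = binary_order("utf-8")       # ['U+D7FF', 'U+E000', 'U+10000']
utf16_order = binary_order("utf-16-be")  # ['U+D7FF', 'U+10000', 'U+E000']
utf32_order = binary_order("utf-32-be")  # ['U+D7FF', 'U+E000', 'U+10000']
```

Only UTF-16 moves the supplementary character ahead of U+E000; UTF-8 and
UTF-32 both agree with plain code point order.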

If developers expect to sort Unicode text in any meaningful way, they should
be using the Unicode Collation Algorithm (UTS #10). Using strict code point
order as a basis for sorting is generally not appropriate, and applying the
UTF-16 transformation as a further basis for sorting only compounds the error.
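
A small illustration of why code point order is a poor basis even before
UTF-16 enters the picture. Python compares strings by code point, so a plain
sort shows the effect directly. The word list is invented for this example,
and no Unicode Collation Algorithm implementation is shown here, since the
Python standard library does not include one.

```python
# Sample words invented for this illustration; \u00e9 is e with acute accent.
words = ["apple", "Zebra", "\u00e9cole", "zoo"]

# sorted() on str compares code point by code point.
by_code_point = sorted(words)
# → ['Zebra', 'apple', 'zoo', 'école']
```

Every uppercase ASCII letter sorts before every lowercase one, and the
accented word lands after "zoo", which matches no dictionary order a user
would expect.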

UTC should not, and almost certainly will not, endorse such a proposal on the
part of the database vendors.

-Doug Ewell
Fullerton, California

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT