Subject: Binary order in different UTFs.
Author: Asmus Freytag
Date: 5/20/01

Someone suggested fixing UTF-16.

Let's propose a UTF-16S where the surrogates are in the correct place for binary comparison. I.e. the high surrogates would need to be in the range FC00..FFFF instead of D800..DBFF.

To conserve code space, the characters that are now in DC00..FFFF would be placed in D800..FBFF. Given the type of characters that are in the range DC00..FFFF hardly anyone might notice ;-).

To map characters from UTF-16 to UTF-16S you proceed as follows

typedef unsigned short utf_16;

utf_16 ch;

if (ch >= DC00)
ch -= 0400;
else if (ch >= D800)
ch |= FC00;

Now this can be done on the fly for an 'almost' binary comparison:

enum result { LESSER = -1, SAME, GREATER };

enum result compare_utf16s(utf_16 ch1, utf_16 ch2)

if (ch1 == ch2)
return SAME;

if (ch1 >= D800)
if (ch1 >= DC00)
if (ch2 >= DC00)
return ch1 > ch2 ? GREATER : LESSER;
else if (ch2 >= D800)
return LESSER;
return GREATER;
if ((ch2 & FC00) == D800)
return ch1 > ch2 ? GREATER : LESSER;
return GREATER;
return ch1 > ch2 ? GREATER : LESSER;

I have left out the C-style 0x from the hex numbers for readability.

This should compare the same as UTF-32 in binary comparison. For the overwhelming majority of characters (weighted by usage) this requires one additional test per comparison. That seems such a reasonable cost that it's hardly worth introducing UTF-16S as an actual encoding form. When comparing strings that have identical initial substrings, the characteristics get even better as the test in the loop would be unaffected. For example, replace the first if above by this code:

while(ch1 && (ch1 == ch2))
advance both inputs;

if (!ch1 && !ch2)
return SAME;

... continue as above

The test in the loop is independent of whether this is a true binary comparison on UTF-16 or an on-the-fly comparison of UTF-16 as if it
was UTF-16S.

compare_utf16s does not attempt to do anything particularly meaning- ful about unpaired surrogates, as it should not be the task of such a low level routine to validate its input on that level.


PS: Here's the comparison matrix spelled out

ch1 / ch2 0-D7FF D800-DBFF DC00-FFFF



Here are the bit patterns spelled out

D800 = 1101 1000 0000 0000
DBFF = 1101 1011 1111 1111

1101 10xx xxxx xxxx
MASK 1111 1100 0000 0000 = FC00