UTF-32s

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Tue May 29 2001 - 11:28:40 EDT


Billancourt, 1 April 2001,

I was thinking about this while reading the thread about UTF-8s.
If the binary order of UTF-16 is of such prime interest that the
(numerous) users of UTF-8 should slightly modify their code
to co-operate with UTF-16-based database engines, by
accepting UTF-8s rather than UTF-8 on input (which is a minor
annoyance), and sending UTF-8s rather than UTF-8 for the 4-byte
sequences (again, this is rather easy to achieve, thanks to the
easy-to-notice barrier), then I believe the (rarer) users of
UTF-32 should be prepared to modify their code when
the problem surfaces for them too (clearly, at the moment it
doesn't).
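For concreteness, the 4-byte-sequence treatment alluded to above can be
sketched as follows (a minimal illustration, assuming UTF-8s writes a
supplementary character as its two UTF-16 surrogates, each in ordinary
3-byte UTF-8 form; the function names are mine):

```python
def utf8s_encode_supplementary(cp: int) -> bytes:
    """Encode a supplementary code point (U+10000..U+10FFFF) as UTF-8s:
    the UTF-16 surrogate pair, each surrogate in 3-byte UTF-8 form."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    hi = 0xD800 | (v >> 10)      # high surrogate
    lo = 0xDC00 | (v & 0x3FF)    # low surrogate

    def three(u: int) -> bytes:  # plain 3-byte UTF-8 of a 16-bit value
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])

    return three(hi) + three(lo)
```

The "easy-to-notice barrier" is visible here: every such 6-byte sequence
begins ED A0..AF, a lead pair that never occurs in plain UTF-8.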

So I suggest correcting the problem before it comes up.
And I would like to propose UTF-32s.

Since there is a lot of unused space in UTF-32, it is easy to solve
the problem: you just need to "shift" the "incorrectly sorted"
characters into the "correct" place.

A first solution would be to specify that every character from planes
1-16 be encoded in UTF-32s as a pair of 32-bit values,
the first one of the form 0000D8xx..0000DBxx, and
the second of the form 0000DCxx..0000DFxx. Of course, the
relationship is the same as with UTF-16.
The advantage of this "solution" is that it is then trivial to map
from UTF-32s to UTF-16 and vice versa.
The main problem, however, is that it loses the principal
characteristic of UTF-32: the fact that characters are of fixed
length. This is clearly unacceptable (?)
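Under that pair-based scheme, the mapping to 32-bit values would be (a
sketch, assuming the usual UTF-16 surrogate arithmetic; the function
name is mine):

```python
def utf32s_pair(cp: int) -> list[int]:
    """First-solution UTF-32s: planes 1-16 become a pair of 32-bit
    values holding the UTF-16 surrogates; the BMP is unchanged."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0x0000D800 | (v >> 10),   # first value: 0000D8xx..0000DBxx
            0x0000DC00 | (v & 0x3FF)]  # second value: 0000DCxx..0000DFxx
```

The trivial round trip with UTF-16 is plain to see: the two 32-bit
values are exactly the two UTF-16 code units, zero-extended.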

So instead, I propose to shift the characters U+E000 to U+FFFD
toward the positions 0011E000..0011FFFD.
Yes, it is clearly a hack, and it does add some complexity for
BMP characters while doing nothing for the other ones, which
are supposed to be less useful. However, it is quite easy to
convert the data (the "most" difficult part is the conversion from
plain UTF-32 to UTF-32s, because it needs a cap-and-floor
comparison to detect the characters in the U+E000 to U+FFFD
range).
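The proposed shift can be sketched in a few lines (the shift constant
0x110000 follows from E000 landing at 0011E000; the function names are
mine):

```python
def utf32_to_utf32s(cp: int) -> int:
    """Shift U+E000..U+FFFD up past the planes, to 0011E000..0011FFFD."""
    if 0xE000 <= cp <= 0xFFFD:   # the cap-and-floor comparison
        return cp + 0x110000
    return cp

def utf32s_to_utf32(v: int) -> int:
    """Inverse direction: a single floor check suffices."""
    if v >= 0x0011E000:
        return v - 0x110000
    return v
```

Note that after the shift, U+E000..U+FFFD compare greater than every
plane 1-16 character, which is precisely the UTF-16 binary order.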

Now, the astute reader will certainly have remarked that one can
conceive a variant of UTF-8s which is only 4 bytes long (instead
of 6) for the characters encoded with surrogates, while still
preserving the sacred binary order of UTF-16: just apply the
standard UTF-8 algorithm, but taking the UTF-32s code as input.
As a result:

        codepoints            UTF-32s              UTF-8s'
    U+0000 ..   007F   00000000 .. 0000007F   00..7F
    U+0080 ..   07FF   00000080 .. 000007FF   C2..DF + 80..BF
    U+0800 ..   D7FF   00000800 .. 0000D7FF   E0..ED + 80..BF + 80..BF
   U+10000 ..  3FFFF   00010000 .. 0003FFFF   F0 + 90..BF + 80..BF + 80..BF
   U+40000 ..  7FFFF   00040000 .. 0007FFFF   F1 + 80..BF + 80..BF + 80..BF
   U+80000 ..  BFFFF   00080000 .. 000BFFFF   F2 + 80..BF + 80..BF + 80..BF
   U+C0000 ..  FFFFF   000C0000 .. 000FFFFF   F3 + 80..BF + 80..BF + 80..BF
  U+100000 .. 10FFFD   00100000 .. 0010FFFD   F4 + 80..8F + 80..BF + 80..BF
    U+E000 ..   FFFD   0011E000 .. 0011FFFD   F4 + 9E..9F + 80..BF + 80..BF
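These rows fall out of feeding the shifted value through the ordinary
UTF-8 algorithm (a sketch; the shift constant 0x110000 and the function
name are mine):

```python
def utf8s_prime(cp: int) -> bytes:
    """UTF-8s': standard UTF-8 applied to the UTF-32s value of cp."""
    v = cp + 0x110000 if 0xE000 <= cp <= 0xFFFD else cp
    if v < 0x80:                                      # 1 byte
        return bytes([v])
    if v < 0x800:                                     # 2 bytes
        return bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
    if v < 0x10000:                                   # 3 bytes
        return bytes([0xE0 | (v >> 12),
                      0x80 | ((v >> 6) & 0x3F),
                      0x80 | (v & 0x3F)])
    return bytes([0xF0 | (v >> 18),                   # 4 bytes
                  0x80 | ((v >> 12) & 0x3F),
                  0x80 | ((v >> 6) & 0x3F),
                  0x80 | (v & 0x3F)])
```

For instance, U+E000 comes out as F4 9E 80 80, which sorts after every
plane 1-16 sequence (at most F4 8F BF BD), matching the UTF-16 binary
order.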

Antoine



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT