OK, the eXperimental Transformation Format goes thus (I didn't make it clear
enough):
C0, G0, C1 and NBSP (0xA0), i.e. bytes 0x00..0xA0, stay the same: a single byte.
All BMP characters from U+00A1 onwards are encoded in three bytes, the
first of which is in the range C2..FE, the other two in A1..C1.
Thus U+00A1 = 0xC2 0xA1 0xA1
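
To make the layout concrete, here is a minimal Python sketch of the three-byte
form. Only the byte ranges and the U+00A1 example are given above, so the
base-33 positional layout (lead byte as the most significant digit) and the
names xtf_encode_bmp / xtf_decode_bmp are assumptions for illustration, not a
fixed definition. The ranges do leave enough room: 61 lead values times 33
times 33 trail values gives 66,429 sequences, against 65,375 code points from
U+00A1 to U+FFFF.

```python
# ASSUMPTION: offsets from U+00A1 are laid out as a base-33 number whose top
# digit goes into the lead byte. Only the byte ranges come from the post.
LEAD_MIN = 0xC2         # lead bytes run C2..FE (61 values)
TRAIL_MIN = 0xA1        # trail bytes run A1..C1 (33 values)
TRAIL_COUNT = 33
FIRST_MULTIBYTE = 0xA1  # first code point that needs three bytes


def xtf_encode_bmp(cp: int) -> bytes:
    """Encode one BMP code point (or UTF-16 code unit) in the assumed layout."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only BMP values are handled here")
    if cp < FIRST_MULTIBYTE:
        return bytes([cp])  # C0, G0, C1 and NBSP pass through unchanged
    offset = cp - FIRST_MULTIBYTE
    hi, rest = divmod(offset, TRAIL_COUNT * TRAIL_COUNT)
    mid, lo = divmod(rest, TRAIL_COUNT)
    return bytes([LEAD_MIN + hi, TRAIL_MIN + mid, TRAIL_MIN + lo])


def xtf_decode_bmp(data: bytes) -> int:
    """Inverse of xtf_encode_bmp for a single one- or three-byte sequence."""
    if len(data) == 1 and data[0] <= 0xA0:
        return data[0]
    if len(data) != 3:
        raise ValueError("expected a 1-byte or 3-byte sequence")
    lead, t1, t2 = data
    if not (0xC2 <= lead <= 0xFE and 0xA1 <= t1 <= 0xC1 and 0xA1 <= t2 <= 0xC1):
        raise ValueError("byte outside the C2..FE / A1..C1 ranges")
    offset = ((lead - LEAD_MIN) * TRAIL_COUNT + (t1 - TRAIL_MIN)) * TRAIL_COUNT
    cp = FIRST_MULTIBYTE + offset + (t2 - TRAIL_MIN)
    if cp > 0xFFFF:
        raise ValueError("sequence is past U+FFFF in this layout")
    return cp
```

Under this assumed layout U+FFFF comes out as FE A2 A2, so the lead byte never
has to leave C2..FE.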
Advantages:
1. ASCII compatibility
2. C1 compatibility
3. Can be reduced to a 7-bit SI/SO scheme with no control code overlap, thus
being a UTF-7 without the real UTF-7's chief disadvantage: no way to resync.
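
The 7-bit reduction could look roughly like the Python sketch below. It assumes
the obvious approach: wrap each run of three-byte sequences in SO (0x0E) and SI
(0x0F) and clear the high bit, so lead bytes land in 42..7E and trail bytes in
21..41, all printable. The function names are hypothetical, and how the
single-byte C1 codes and NBSP (0x80..0xA0) would be carried in 7 bits is not
specified in this post, so the sketch simply rejects them, along with literal
SO/SI bytes in the input.

```python
SO, SI = 0x0E, 0x0F  # Shift Out / Shift In control codes


def to_7bit(data: bytes) -> bytes:
    """Reduce 8-bit encoded bytes to 7 bits using an ASSUMED SO/SI wrapping."""
    out = bytearray()
    shifted = False
    for b in data:
        if b in (SO, SI) or 0x80 <= b <= 0xA0 or b > 0xFE:
            raise ValueError(f"byte 0x{b:02X} is not covered by this sketch")
        high = b >= 0xA1              # part of a three-byte sequence
        if high != shifted:
            out.append(SO if high else SI)
            shifted = high
        out.append(b & 0x7F)          # C2..FE -> 42..7E, A1..C1 -> 21..41
    if shifted:
        out.append(SI)                # always end in the unshifted state
    return bytes(out)


def from_7bit(data: bytes) -> bytes:
    """Undo to_7bit: restore the high bit on everything between SO and SI."""
    out = bytearray()
    shifted = False
    for b in data:
        if b == SO:
            shifted = True
        elif b == SI:
            shifted = False
        else:
            out.append(b | 0x80 if shifted else b)
    return bytes(out)
```

Because the reduced lead range 42..7E and trail range 21..41 do not overlap, a
reader dropped into the middle of a shifted run can still find the next
sequence boundary.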
Disadvantages:
1. No simple way of filling bits like UTF-8's 110xxxxx 10xxxxxx. I suppose
this brings us back to UTF-1's modulo complexities...
2. 3 bytes for all Unicode characters above U+00A0.
3. UTF-16 surrogate piggybacking - 6 bytes per outside-BMP codepoint. Really
yucky, but those characters are rare.
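
The size costs in disadvantages 2 and 3 can be checked with a few lines of
Python. This sketch only counts bytes (1 up to U+00A0, 3 for the rest of the
BMP, and 3 per UTF-16 surrogate for anything beyond), so it does not depend on
how bits are packed inside the three bytes; surrogate_pair and xtf_length are
illustrative names, not part of any proposal.

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def xtf_length(text: str) -> int:
    """Byte count of text under the scheme described above."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp <= 0xA0:
            total += 1                            # C0, G0, C1, NBSP
        elif cp <= 0xFFFF:
            total += 3                            # rest of the BMP
        else:
            total += 3 * len(surrogate_pair(cp))  # two surrogates, 3 bytes each
    return total


if __name__ == "__main__":
    for sample in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
        print(f"U+{ord(sample):04X}: this scheme {xtf_length(sample)} bytes, "
              f"UTF-8 {len(sample.encode('utf-8'))} bytes")
```

Latin-1 letters such as U+00E9 are where the scheme loses most against UTF-8
(3 bytes versus 2); for CJK in the BMP the two tie at 3 bytes.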