XTF-3 Description, Advantages/Drawbacks

From: Shlomi Tal (shlompi@hotmail.com)
Date: Wed Jun 19 2002 - 03:37:49 EDT

OK, the eXperimental Transformation Format goes thus (I didn't make it clear

C0, G0, C1 and NBSP (0xA0), i.e. bytes 00..A0, stay the same: a single byte.
All Unicode characters from U+00A1 onwards are encoded in three bytes, the
first of which is in the range C2..FE, the other two in A1..C1.

Thus U+00A1 = 0xC2 0xA1 0xA1
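
A minimal sketch of how the mapping might be computed. The post fixes only the
byte ranges and the single example U+00A1 = C2 A1 A1; the base-33 digit order
used here (most significant digit in the lead byte) is an assumption, as are
the function names. It is consistent with the stated ranges: 61 lead values
times 33 times 33 trail values gives 66429 sequences, enough for the 65375
code units from U+00A1 to U+FFFF, and U+FFFF lands exactly on lead byte FE.

```python
# Sketch only: byte ranges come from the post; the digit order is assumed.
LEAD_BASE = 0xC2                # lead bytes C2..FE (61 values)
TRAIL_BASE = 0xA1               # trail bytes A1..C1
TRAIL_COUNT = 0xC1 - 0xA1 + 1   # 33 values per trail byte
FIRST_MULTIBYTE = 0xA1          # U+00A1 is the first three-byte character


def xtf3_encode_bmp(cp: int) -> bytes:
    """Encode one 16-bit code unit (hypothetical helper)."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("expected a value in 0..0xFFFF")
    if cp <= 0xA0:
        # C0, G0, C1 and NBSP pass through as a single byte.
        return bytes([cp])
    n = cp - FIRST_MULTIBYTE
    hi, rest = divmod(n, TRAIL_COUNT * TRAIL_COUNT)
    mid, lo = divmod(rest, TRAIL_COUNT)
    return bytes([LEAD_BASE + hi, TRAIL_BASE + mid, TRAIL_BASE + lo])


def xtf3_decode_bmp(data: bytes) -> int:
    """Decode exactly one character produced by xtf3_encode_bmp."""
    if len(data) == 1 and data[0] <= 0xA0:
        return data[0]
    if len(data) != 3:
        raise ValueError("expected 1 or 3 bytes")
    lead, t1, t2 = data
    if not (0xC2 <= lead <= 0xFE and 0xA1 <= t1 <= 0xC1 and 0xA1 <= t2 <= 0xC1):
        raise ValueError("byte out of range")
    n = ((lead - LEAD_BASE) * TRAIL_COUNT + (t1 - TRAIL_BASE)) * TRAIL_COUNT
    cp = FIRST_MULTIBYTE + n + (t2 - TRAIL_BASE)
    if cp > 0xFFFF:
        raise ValueError("sequence beyond U+FFFF")
    return cp


print(xtf3_encode_bmp(0x00A1).hex(" "))  # c2 a1 a1, matching the example above
```

The two divmod calls are the "modulo" cost: there is no bit-shifting shortcut
because 33 is not a power of two.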


Advantages:

1. ASCII compatibility
2. C1 compatibility
3. Can be reduced to 7-bit SI/SO scheme with no control code overlap, thus
being a UTF-7 without the real UTF-7's chief disadvantage of no sync.
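
The sync property in point 3 follows from the byte ranges in the description:
single bytes (00..A0), trail bytes (A1..C1) and lead bytes (C2..FE) do not
overlap, so a reader dropped at an arbitrary offset can tell where the next
character starts. A rough sketch for the 8-bit form only; the 7-bit SI/SO
variant is not specified in the post and is not modelled here, and the
function names are made up for illustration.

```python
def xtf3_byte_role(b: int) -> str:
    """Classify a byte by the ranges given in the description."""
    if b <= 0xA0:
        return "single"   # C0, G0, C1, NBSP
    if b <= 0xC1:
        return "trail"    # second or third byte of a three-byte character
    if b <= 0xFE:
        return "lead"     # first byte of a three-byte character
    return "invalid"      # FF is not used


def xtf3_next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos."""
    # Only trail bytes can sit in the middle of a character, so skip them.
    while pos < len(data) and xtf3_byte_role(data[pos]) == "trail":
        pos += 1
    return pos


stream = b"A" + bytes([0xC2, 0xA1, 0xA1]) + b"B"
print(xtf3_next_boundary(stream, 2))  # 4: skips the two trail bytes, lands on "B"
```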


Drawbacks:

1. No simple way of filling bits like UTF-8's 110xxxxx 10xxxxxx. I suppose
this brings us back to UTF-1's modulo complexities...

2. 3 bytes for all BMP characters above U+00A0 (UTF-8 needs only 2 up to
U+07FF).

3. UTF-16 surrogate piggybacking - 6 bytes per outside-BMP codepoint. Really
yucky, but those characters are rare.
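
To illustrate drawback 3, here is a sketch of the surrogate piggybacking: a
code point outside the BMP is first split into a UTF-16 surrogate pair, and
each surrogate is then encoded as an ordinary three-byte character. The
surrogate arithmetic is standard UTF-16; the three-byte mapping (base 33, most
significant digit first) is an assumption that merely agrees with the byte
ranges and the U+00A1 example, and all names are hypothetical.

```python
def _encode_unit(u: int) -> bytes:
    """Encode one 16-bit unit; digit order within the 3 bytes is assumed."""
    if u <= 0xA0:
        return bytes([u])
    hi, rest = divmod(u - 0xA1, 33 * 33)   # 33 trail values: A1..C1
    mid, lo = divmod(rest, 33)
    return bytes([0xC2 + hi, 0xA1 + mid, 0xA1 + lo])


def to_utf16_units(cp: int) -> list[int]:
    """Split a code point into one or two UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def xtf3_encode(cp: int) -> bytes:
    """Encode any code point, using surrogates above U+FFFF."""
    return b"".join(_encode_unit(u) for u in to_utf16_units(cp))


print(len(xtf3_encode(0x00E9)))    # 3 bytes (UTF-8 uses 2 for U+00E9)
print(len(xtf3_encode(0x10000)))   # 6 bytes (UTF-8 uses 4 for U+10000)
```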

