RE: Proposing UTF-21/24

From: Ruszlan Gaszanov (ruszlan@ather.net)
Date: Sun Jan 21 2007 - 07:41:56 CST

    Frank Ellermann wrote:

    > Some of your arguments like "won't need a BOM anymore" don't make
    > sense for me...

    Well, since conversion between UTF-21/24 and UTF-32 (and UTF-16 for BMP characters) is trivial - much more so than with UTF-8 -
    some application designers might prefer to use the same byte order for UTF-21/24 as they use for UTF-16/32 in order to make
    processing faster. Hence we might end up with BE/LE varieties of UTF-21/24 and have to deal with the BOM issue. That is why the
    error detection mechanisms I proposed for the UTF-24 varieties also allow automatic byte order detection.
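
    For illustration, here is a rough C sketch of how the parity bit could double as a byte order check: a 3-byte unit "looks valid"
    in a given byte order if its stored parity bit matches the XOR of the 21 data bits, so scanning a handful of units under both
    candidate orders and keeping the order that passes consistently detects the byte order without a BOM. The exact placement of the
    parity and marker bits is assumed here (parity in the most significant bit of the leading byte, the other two high bits clear),
    and the function name is made up for the example.

    #include <stdint.h>

    /* Check one 3-byte unit, already assembled in a candidate byte order. */
    static int utf24a_unit_valid(uint32_t unit)
    {
        uint32_t p = unit & 0x7F7F7F;               /* the 21 data bits */
        p ^= p >> 16; p ^= p >> 8;
        p ^= p >> 4;  p ^= p >> 2;  p ^= p >> 1;    /* fold down to a single parity bit */
        return ((unit >> 23) & 1) == (p & 1)        /* stored parity matches computed parity */
            && (unit & 0x008080) == 0;              /* assumed-fixed marker bits are clear */
    }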

    > One disadvantage of your scheme, unlike UTF-8 it can't be directly
    > expressed in CharmapML, the parity bit destroys simple patterns,
    > and an enumeration of 2**21 (minus surrogates) code points won't
    > fly.

    Well, all proposed UTF-24 varieties, while useful for long-term storage and interchange, might not be very well suited for actual
    text processing in their pure form, since the presence of parity bits (in A and B) or resequenced combinations (in B and C) would
    make some otherwise trivial tasks computation-intensive. However, algorithmic conversion of UTF-24A to either UTF-21A or UTF-32 is
    trivial:

    utf21a = utf24a & 0x7F7F7F
    utf32 = (utf24a & 0x7F) | ((utf24a & 0x7F00) >> 1) | ((utf24a & 0x7F0000) >> 2)

    Recalculating the parity bit (XORing the 21 data bits with each other) when converting back to UTF-24A is not a very
    computation-intensive task either.
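
    In C, the conversions above and the parity recomputation would look roughly like this (the placement of the parity bit in the
    most significant bit of the leading byte is an assumption made for the example, and the helper names are illustrative):

    #include <stdint.h>

    /* UTF-24A -> UTF-21A: strip the three non-data bits. */
    static uint32_t utf24a_to_utf21a(uint32_t utf24a)
    {
        return utf24a & 0x7F7F7F;
    }

    /* UTF-24A -> UTF-32: pack the three 7-bit groups into one 21-bit value. */
    static uint32_t utf24a_to_utf32(uint32_t utf24a)
    {
        return (utf24a & 0x7F) | ((utf24a & 0x7F00) >> 1) | ((utf24a & 0x7F0000) >> 2);
    }

    /* XOR the 21 data bits of a code point with each other. */
    static uint32_t parity21(uint32_t cp)
    {
        cp ^= cp >> 16; cp ^= cp >> 8; cp ^= cp >> 4; cp ^= cp >> 2; cp ^= cp >> 1;
        return cp & 1;
    }

    /* UTF-32 -> UTF-24A: spread the 21 bits back over three bytes and
     * recalculate the parity bit (placed in bit 23 by assumption). */
    static uint32_t utf32_to_utf24a(uint32_t cp)
    {
        uint32_t u = (cp & 0x7F) | ((cp & 0x3F80) << 1) | ((cp & 0x1FC000) << 2);
        return u | (parity21(cp) << 23);
    }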

    Therefore it would make much more sense to use either UTF-32 or UTF-21A for internal processing (each code unit would still occupy
    a 32-bit dword on 32/64-bit architectures) while storing and interchanging data in UTF-24A format - similarly to 7-bit ASCII,
    where the parity bit was cleared for internal processing and then recalculated for storage/transmission, which is why we don't
    take it into account when making conversion tables for US-ASCII.
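
    For instance, a decoder along these lines would read the stored UTF-24A stream three bytes at a time and hand the application
    plain UTF-32 code units (big-endian storage and the helper name from the sketch above are assumptions; parity-failure handling
    is left out):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode a big-endian UTF-24A byte stream into UTF-32 code units for
     * internal processing; returns the number of code units written. */
    static size_t utf24a_decode(const uint8_t *in, size_t nbytes, uint32_t *out)
    {
        size_t n = 0;
        for (size_t i = 0; i + 3 <= nbytes; i += 3) {
            uint32_t unit = ((uint32_t)in[i] << 16)
                          | ((uint32_t)in[i + 1] << 8)
                          |  (uint32_t)in[i + 2];
            out[n++] = utf24a_to_utf32(unit);       /* helper from the sketch above */
        }
        return n;
    }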

    Conversions from/to UTF-21B, UTF-24B and UTF-24C require a bit more processing, but those are special-purpose encoding schemes for
    restricted environments, and their conversion algorithms are no more complex (if not less complex) than those of other encoding
    schemes with a similar purpose, like Java-UTF-8 and UTF-7.

    Note that UTF-24A retains the fixed-length property of UTF-32 (while requiring less space) and provides a built-in error detection
    mechanism like UTF-8 (while generally requiring less processing and even beating it in terms of space consumption for East Asian
    texts). Although UTF-24A can't beat UTF-16 in terms of either processing or space requirements for BMP-only texts, it might be
    much more attractive for texts making extensive use of characters outside the BMP, since we wouldn't have to deal with surrogates.

    Ruszlan


