RE: Proposing UTF-21/24

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 22 2007 - 15:59:39 CST


    Ruszlan said:

    > Uh... actually my main point was to devise a fixed-length encoding
    > for Unicode that wouldn't waste 1 octet out of 4 and could make
    > some use of the remaining 3 spare bits.

    And I don't think anyone is disputing that you have done that.
    But...

    "Wasting" 1 octet out of 4 is a non-issue. If text storage is
    the concern, as Mark already pointed out, then either UTF-8 or
    UTF-16 is going to be more efficient than UTF-21A or UTF-24A.
    Since Unicode text consists of the same or fewer bytes in either UTF-8
    or UTF-16 for all but the most contrived of texts, it will also be more
    efficient for interchange -- which is the real bottleneck, rather
    than raw storage space per se.

    UTF-21/24 is only a compression win when compared against UTF-32,
    but implementers *already* have better options in UTF-8 or UTF-16,
    if they are counting bytes.
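
    (To make the byte counting concrete -- a minimal sketch, where a
    flat 3-octets-per-character count stands in for UTF-21A/UTF-24A:

        # Compare storage for an ASCII-heavy and a CJK-heavy sample.
        samples = {"ASCII": "Hello, world!", "CJK": "\u4E00\u4E8C\u4E09\u56DB"}
        for name, s in samples.items():
            utf8  = len(s.encode("utf-8"))
            utf16 = len(s.encode("utf-16-le"))
            utf24 = 3 * len(s)        # any fixed 3-octet-per-character form
            print(name, utf8, utf16, utf24)

        # prints: ASCII 13 26 39
        #         CJK 12 8 12

    Either UTF-8 or UTF-16 matches or beats the 3-octet form in both
    cases; only supplementary-plane characters take 4 bytes in each,
    and texts dominated by those are the "contrived" case above.)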

    The counter is that UTF-21/24 is fixed-width -- unlike UTF-8
    and UTF-16 -- which is true. But that benefit is an
    illusion, because it isn't a true processing form. Characters
    in UTF-21/24 aren't integral values, but sequences of 3 bytes
    that have to be unpacked for interpretation. Effectively you
    have to turn them into UTF-32 for processing *anyway*. Modern
    processor architectures use 32-bit or 64-bit registers, *not*
    24-bit registers, so you aren't doing anybody any favors by
    using a "packed" form for characters that have to be unpacked
    into integral values to get decent processing characteristics
    in the first place.
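
    (A minimal sketch of what that unpacking amounts to, assuming the
    big-endian, 7-bits-per-octet layout used in the worked example
    further down; the function name is mine, purely for illustration:

        def utf21a_unit_to_scalar(b0, b1, b2):
            # Reassemble one 3-octet code unit into an ordinary integer --
            # i.e. the UTF-32 scalar value -- before anything can be done
            # with the character.
            return (b0 << 14) | (b1 << 7) | b2

        assert utf21a_unit_to_scalar(0x01, 0x1C, 0x00) == 0x4E00   # U+4E00

    Every per-character operation pays that reassembly first, which is
    exactly the conversion to UTF-32 it was supposed to make unnecessary.)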

    Finally, "mak[ing] some use of the remaining 3 spare bits" I
    consider to be a cardinal error in character encoding design.
    Using bits in a character encoding form for anything other than
    the representation of the integral values of the encoded characters
    is a guarantee of additional complexity, errors, and opportunities
    for data corruption in handling text data. Putting "parity bits"
    into character streams is just bad design. Data integrity
    for streaming data should be handled instead by
    *data* protocols that handle the problem generically for *all*
    data, including any embedded character data.

    > I wouldn't say that cutting storage requirements by 25% is insignificant.

    Except that your comparison is against the strawman of storing all
    Unicode text as raw UTF-32 data, which is almost never the case.

    > And consider convenience of fixed-length format for many text-processing
    > tasks - you can enumerate characters by simply enumerating the octets,
    > without having to perform additional checks for multibyte sequences or
    > surrogate pairs each time. This could save a lot of computation on
    > long texts.

    This argument won't fly, because it presumes that text stored
    that way will simultaneously be appropriate for *other*
    text-processing tasks, which is manifestly not the case. You can always
    create special-purpose formats that perform well for some particular
    algorithmic requirement (here: enumerating characters). But such formats
    fail as general-purpose formats for text if they introduce countervailing
    complexity in other tasks, as UTF-21/24 would.
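
    (A small sketch of that trade-off, under the same assumed
    7-bits-per-octet layout; the helper name is mine:

        def utf21a_nth_char(data, n):
            # The claimed win: fixed stride, no multibyte/surrogate checks.
            b0, b1, b2 = data[3*n : 3*n + 3]
            return (b0 << 14) | (b1 << 7) | b2

        # The countervailing loss: octet values mean nothing on their own.
        # A naive byte-level search for an ASCII comma (0x2C) gets a false
        # hit inside the unrelated character U+162C, encoded <00 2C 2C>.
        buf = bytes([0x00, 0x2C, 0x2C])             # one character: U+162C
        assert 0x2C in buf                          # "found" a comma...
        assert utf21a_nth_char(buf, 0) == 0x162C    # ...but there is none

    The fixed stride is easy to get; it is everything else -- searching,
    matching, interchange -- that picks up the complexity.)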

    > Again, if space requirement is a big issue and fixed-length properties
    > are not required, UTF-21A data can easily be converted to the
    > variable-length format proposed by Frank Ellermann, and then,
    > just as easily converted back when fixed-length is preferred over
    > saving space.

    And this is frankly no gain over doing UTF-8/UTF-32 conversions, with
    the net result of better compression *and* ASCII compatibility for
    the UTF-8 form and cleaner processing for the UTF-32 form.

    > Well, UTF-24A should be regarded as an extension of UTF-21A that
    > provides a built-in error detection mechanism if required.

    It isn't (required).

    > Once validated, UTF-24A data can be processed as UTF-21A by
    > simply ignoring the most significant bits of each octet. After
    > the text was modified, inserted characters would be easy to detect
    > by simply checking the most significant bit of the most significant
    > octet, so parity will have to be recalculated only for those code
    > units.

    And that is the kind of mixing of levels that no text processing
    algorithm should get involved in.

    > Again, if data integrity and byte order is not a concern, the
    > text can be converted to UTF-21A by simply resetting all 8th bits
    > to 0.

    At which point, UTF-21A loses all self-synchronization properties,
    making it worse than UTF-8 or UTF-16 in that regard. You get
    a fixed-width but *multi*-byte encoding with two possible
    incorrect registrations of character edges. So a byte error in
    handling can destroy the *entire* text.

    <U+4E00, U+4E8C, U+4E09, U+56DB>
      Chinese for "yi1, er4, san1, si4" '1, 2, 3, 4'

    UTF-21A -->

    <01 1C 00 01 1D 0C 01 1C 09 01 2D 5B>

    If you lost the 2nd byte by an error, then the resulting
    byte sequence:

    <01 00 01 1D 0C 01 1C 09 01 2D 5B>

    would reconvert to:

    <U+4001, U+74601, U+70481, ???>

    in other words, an unrelated Chinese character, two unassigned
    code points on plane 7, and a conversion error
    for the last two bytes. This is on top of the fact that the
    UTF-21A string would have other troubles in much char-oriented
    string handling, because of the embedded null bytes and other
    octets whose values have nothing to do with the ASCII characters
    they happen to match. (You could, of course, fix that by setting
    the high bit to 1 for all octets instead, giving you yet another
    flavor of UTF-21A, but anyway...)

    Compare UTF-8 -->

    <E4 B8 80 E4 BA 8C E4 B8 89 E5 9B 9B> (same number of bytes, notice)

    If you lost the 2nd byte by an error, then the resulting
    byte sequence:

    <E4 80 E4 BA 8C E4 B8 89 E5 9B 9B>

    would reconvert to:

    <???, U+4E8C, U+4E09, U+56DB>

    Because UTF-8 is self-synchronizing, the loss of data is
    localized and doesn't propagate down the string.
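
    (For anyone who wants to reproduce the comparison above -- a sketch
    under the same assumed layout; the utf21a_* helpers are mine:

        def utf21a_encode(text):
            out = bytearray()
            for cp in map(ord, text):
                out += bytes([(cp >> 14) & 0x7F, (cp >> 7) & 0x7F, cp & 0x7F])
            return bytes(out)

        def utf21a_decode(data):
            usable = len(data) - len(data) % 3   # a trailing partial unit is an error
            return [(data[i] << 14) | (data[i+1] << 7) | data[i+2]
                    for i in range(0, usable, 3)]

        text = "\u4E00\u4E8C\u4E09\u56DB"

        damaged = bytearray(utf21a_encode(text)); del damaged[1]     # lose the 2nd byte
        print([hex(cp) for cp in utf21a_decode(damaged)])
        # ['0x4001', '0x74601', '0x70481'] plus two stray trailing bytes:
        # every character in the string is destroyed

        damaged8 = bytearray(text.encode("utf-8")); del damaged8[1]  # lose the 2nd byte
        print(damaged8.decode("utf-8", errors="replace"))
        # replacement character(s) for the damaged unit, then the last
        # three characters come through intact

    The same one-byte loss that wipes out the whole UTF-21A string costs
    UTF-8 exactly one character.)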

     
    --Ken


