Re: UTF-12!

From: vanisaac@boil.afraid.org
Date: Sun Feb 27 2011 - 15:13:18 CST


    From: Thomas Cropley (tomcropley@gmail.com)
    --------------------------------------------------------------------------------
    Many thanks to everybody for their comments on UTF-c, especially to Philippe
    Verdy. I have been reading them all with much interest.

    First of all I would like to clarify what motivated me to develop UTF-c.
    Some time ago I read that only about half the pages on the internet were
    encoded in UTF-8, and this got me wondering why. But then I imagined myself
    as, say, a native Greek speaker who knew that the characters of my native
    language could be encoded in one byte using a Windows or ISO code-page,
    instead of two bytes in UTF-8. In that situation I would choose to use a
    code-page encoding most of the time and only use UTF-8 if it was really
    needed. It was obvious that
    it would be preferable to have as few character encodings as possible, so
    the next step was to see if one encoding could handle the full Unicode
    character set, and yet be as efficient, or almost as efficient, as
    one-byte-per-character code-pages. In other words I was trying to combine the
    advantages of UTF-8 with Windows/ISO/ISCII etc. code-pages.

    My first attempt to solve this problem, which I called UTF-i, was a stateful
    encoding that changed state whenever the first character of a non-ASCII
    alphabetic script was encountered (hereafter I will call a character from a
    non-ASCII alphabetic script a paged-character). It didn't require any
    special switching codes because the first paged-character in a sequence was
    encoded in long form (i.e. two or three bytes) and only the following
    paged-characters were encoded in one byte. When a paged-character from a
    different page was encountered, it would be encoded in long form, and the
    page state variable would change. The ASCII page was always active so there
    was no need to switch pages for punctuation or spaces. So in effect it was a
    dynamic code-page switching encoding. It had the advantage of not requiring
    the use of C0 control characters or a file prefix (magic number, pseudo-BOM,
    signature). It didn't take me long to reject UTF-i though, because although
    it would have been suitable for storage and transmission purposes, trying to
    write software like an editor or browser for a stateful encoding like UTF-i
    would be a nightmare.
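
    As an aside, a minimal sketch of the dynamic page-switching idea may make
    the forward decode loop clearer. The exact byte layout of UTF-i is not
    spelled out above, so everything concrete here (the two-byte long form, the
    64-code-point pages, the name decode_utfi) is an assumption made purely for
    illustration:

        # Toy sketch only -- the real UTF-i byte layout is not defined here.
        # Assumed layout:
        #   0x00-0x7F                 ASCII; the ASCII page is always active
        #   0xC0-0xFF then 0x80-0xBF  "long form": a 12-bit code point, which
        #                             also moves the page register to its block
        #   0x80-0xBF on its own      "short form": low 6 bits added to the
        #                             current page
        def decode_utfi(data: bytes) -> str:
            out = []
            page = None          # start of the current 64-code-point page, if any
            i = 0
            while i < len(data):
                b = data[i]
                if b < 0x80:                     # ASCII, page state unchanged
                    out.append(chr(b))
                    i += 1
                elif b >= 0xC0:                  # long form: switches the page
                    cp = ((b & 0x3F) << 6) | (data[i + 1] & 0x3F)
                    page = cp & ~0x3F
                    out.append(chr(cp))
                    i += 2
                else:                            # short form: uses the page register
                    if page is None:
                        raise ValueError("short form before any page was set")
                    out.append(chr(page | (b & 0x3F)))
                    i += 1
            return "".join(out)

    Note that the loop only ever moves forward, which is the property relied on
    in the next paragraph.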

    It now occurs to me that I may have been too hasty in my rejection of UTF-i.
    If I were writing an application like a text editor or browser which could
    handle text encoded in multiple formats, the only sensible approach would be
    to convert the encoded text to an easily processed form (such as a 16-bit
    Unicode encoding) when the file is read in, and to convert back again when
    it is saved. It seems to me that Microsoft has adopted this approach.
    So the stateful encoded text only needs to be scanned in the forward
    direction from beginning to end, and thus most of the difficulties are
    avoided. The only inconvenience I can see is for text searching applications
    which scan stored files for text strings.
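
    Under that approach the stateful encoding only ever exists at the file
    boundary. A small sketch of the load/convert/edit/convert/save flow,
    assuming a hypothetical pair of helpers decode_utfi()/encode_utfi() (the
    toy decoder sketched earlier plus a matching encoder), with the language's
    native string type standing in for the 16-bit internal form:

        def edit_file(path, transform):
            with open(path, "rb") as f:
                text = decode_utfi(f.read())    # one forward pass at load time
            text = transform(text)              # editing works on the internal form
            with open(path, "wb") as f:
                f.write(encode_utfi(text))      # stateful bytes regenerated on save

        # e.g. edit_file("notes.txt", str.upper)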

    The other issue which needs to be addressed is how well the text encodings
    handle bit errors. When I designed UTF-c I assumed that bit errors were
    extremely rare because modern storage devices and transmission protocols
    use extensive error detection and correction. Since then I have done some
    reading and it seems I may have been wrong. Apparently, present-day
    manufacturers of consumer electronics hardware try to save a few dollars by
    not including error detection for memory. There also seem to be widely
    varying estimates of what error rate you can expect. This is important
    because it would not be prudent to design a less efficient encoding that
    tolerates bit errors well if bit errors are extremely rare; on the other
    hand, if errors are reasonably common, then the text encoding needs to be
    able to localize the damage.

    Finally I would prefer not to associate the word "compression" with UTF-c or
    UTF-i. To most people compression schemes are inherently complicated, so I
    would rather describe them as "efficient text encodings". Describing SCSU
    and BOCU-1 as compression schemes may be the reason for the lack of
    enthusiasm for those encodings.

    I have uploaded some sample UTF-c pages to this site
    http://web.aanet.com.au/tec if you are interested in testing whether your
    server and browser can accept them.

                  Tom
    --------------------------------------------------------------------------------

    Well heck, we might as well just define UTF-12:

    for UTF-32:
    0000 0000 000q qzzz yyyy yywx xxxx xxxx

    where uuu=qqzzz-1000
    and vvv=qqzzz-1101

    UTF-12:
    00wx xxxx xxxx U+0000..U+0400
    010z zzyy yyyy-10wx xxxx xxxx U+0400..U+7FFFF
    011u uuyy yyyy-10wx xxxx xxxx U+80000..U+CFFFF
    110v vvyy yyyy-10wx xxxx xxxx U+D0000..U+10FFFF
    111x xxxx xxxx U+0400..U+10FFFF; same first fifteen bits as previous 2-"byte" character

    First two bits:

    00=single "byte" character
    01=first "byte" of multi-byte character < U+D0000 (third bit '1' indicates > U+80000)
    110=first "byte" of plane 13+ two byte character
    10=last "byte" of multi-byte character
    111=second "byte" of character in same 512 code point range as previous > U+0400 character

    Advantages:

    Stores supplementary planes with 24 bits instead of the 32 needed in UTF-16, UTF-32, UTF-8, etc.
    Same size as UTF-8 (no more than 24 bits) for Plane 0 > U+0800, but more efficient (12 vs 16 bits) for U+007F..U+0400. For texts written in most alphabetic/abjad/brahmi scripts, there is near parity to Basic Latin through the use of "repeat" byte codes and the presence of common punctuation and diacritics < U+0400.

    Each "byte" indicates the size of the character and which "byte" of the character it is, unlike UTF-8, where 10xx xxxx can be the second, third, or fourth byte and you must look back to a 11xx xxxx byte to determine whether you must look forward to complete the character.

    For the foreseeable future, 011x X X will indicate unassigned planes, and the first "byte" of multi-byte characters can be quickly scanned for any code points in an unassigned plane.

    1101 X X unambiguously indicates one of the private use planes.

    Disadvantages:

    Who the heck has 12 bit bytes?

    The continuation "byte" requires backtracking an arbitrary distance to establish the baseline range of the character, and it introduces two encodings for any character > U+0400, depending on the previous character outside the first 1K code points.

    -Van

    PS, if anyone responds to this seriously, please learn about sarcasm.


