UTF-c, UTF-i

From: Thomas Cropley (tomcropley@gmail.com)
Date: Sat Feb 26 2011 - 21:37:12 CST


    Many thanks to everybody for their comments on UTF-c, especially to Philippe
    Verdy. I have been reading them all with much interest.

    First of all I would like to clarify what motivated me to develop UTF-c.
    Some time ago I read that only about half the pages on the internet were
    encoded in UTF-8, and this got me wondering why. But then I imagined myself
    as, say, a native Greek speaker who knew that the characters of my language
    could be encoded in one byte using a Windows or ISO code-page, instead of
    two bytes in UTF-8. In that situation I would choose to use a code-page
    encoding most of the time and only use UTF-8 if it was really needed. It
    was obvious that it would be preferable to have as few character encodings
    as possible, so the next step was to see if one encoding could handle the
    full Unicode character set, and yet be as efficient, or almost as
    efficient, as one-byte-per-character code-pages. In other words I was
    trying to combine the advantages of UTF-8 with those of Windows/ISO/ISCII
    etc. code-pages.


    My first attempt to solve this problem, which I called UTF-i, was a stateful
    encoding that changed state whenever the first character of a non-ASCII
    alphabetic script was encountered (hereafter I will call a character from a
    non-ASCII alphabetic script a paged-character). It didn't require any
    special switching codes because the first paged-character in a sequence was
    encoded in long form (i.e. two or three bytes) and only the following
    paged-characters were encoded in one byte. When a paged-character from a
    different page was encountered, it would be encoded in long form, and the
    page state variable would change. The ASCII page was always active so there
    was no need to switch pages for punctuation or spaces. So in effect it was a
    dynamic code-page switching encoding. It had the advantage of not requiring
    the use of C0 control characters or a file prefix (magic number, pseudo-BOM,
    signature). It didn't take me long to reject UTF-i though, because although
    it would have been suitable for storage and transmission purposes, trying to
    write software like an editor or browser for a stateful encoding like UTF-i
    would be a nightmare.
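
    To make the page-switching idea concrete, here is a minimal sketch in
    Python. It is not the actual UTF-i byte layout, which is not spelled out
    in this message. The sketch assumes, purely for illustration, that a page
    is a block of 64 consecutive code points, that the long form is simply the
    character's UTF-8 sequence, that the one-byte short form uses the values
    0x80-0xBF, and that every non-ASCII character counts as a paged-character.
    The function names are hypothetical.

```python
def encode_utfi_sketch(text: str) -> bytes:
    """Encode text with a toy page-switching scheme (not the real UTF-i)."""
    out = bytearray()
    page = None  # current 64-code-point page; None until a long form is seen
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)                  # the ASCII page is always active
        elif page == cp >> 6:
            out.append(0x80 | (cp & 0x3F))  # short form: offset within the page
        else:
            out += ch.encode("utf-8")       # long form (assumed to be UTF-8)
            page = cp >> 6                  # ...which also switches the page
    return bytes(out)


def decode_utfi_sketch(data: bytes) -> str:
    """Decode the toy scheme; it must scan forward from the start of the data."""
    chars = []
    page = None
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # ASCII, independent of the page
            chars.append(chr(b))
            i += 1
        elif b < 0xC0:                      # short form: needs the page state
            if page is None:
                raise ValueError("short form before any long form")
            chars.append(chr((page << 6) | (b & 0x3F)))
            i += 1
        else:                               # long form: a UTF-8 sequence
            length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            ch = data[i:i + length].decode("utf-8")
            chars.append(ch)
            page = ord(ch) >> 6
            i += length
    return "".join(chars)


sample = "αβγ δ"
encoded = encode_utfi_sketch(sample)
assert decode_utfi_sketch(encoded) == sample
print(len(encoded), len(sample.encode("utf-8")))  # prints "6 9"
```

    Only the first Greek letter is written in long form; the space does not
    disturb the page state, so the letter after it is still one byte. The
    decoder also shows the drawback: a short-form byte cannot be interpreted
    without knowing the page set by an earlier long form.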

    It now occurs to me that I may have been too hasty in my rejection of UTF-i.
    If I were writing an application like a text editor or browser which could
    handle text encoded in multiple formats, the only sensible approach would be
    to convert the encoded text to an easily processed form (such as a 16-bit
    Unicode encoding) when the file is read in, and to convert back again when
    the file is saved. It seems to me that Microsoft has adopted this approach.
    So the stateful encoded text only needs to be scanned in the forward
    direction from beginning to end, and thus most of the difficulties are
    avoided. The only inconvenience I can see is for text searching applications
    which scan stored files for text strings.


    The other issue which needs to be addressed is how well the text encodings
    handle bit errors. When I designed UTF-c I assumed that bit errors were
    extremely rare because modern storage devices and transmission protocols use
    extensive error detection and correction. Since then I have done some
    reading and it seems I may have been wrong. Apparently present-day
    manufacturers of consumer electronics try to save a few dollars by not
    including error detection for memory. There also seem to be widely varying
    estimates of what error rate you can expect. This is important because it
    would not be prudent to design a less efficient encoding that tolerates
    bit errors well if bit errors are extremely rare. On the other hand, if
    errors are reasonably common, then the text encoding needs to be able to
    localize the damage.
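
    As an illustration of what localizing the damage means, the following
    Python sketch flips a single bit in a short UTF-8 string and decodes the
    result. The helper name flip_bit and the sample text are my own choices
    for the example, and it shows only one particular error, not a general
    measurement.

```python
def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating a bit error."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


original = "αβγδ"
encoded = original.encode("utf-8")     # eight bytes, two per Greek letter
damaged = flip_bit(encoded, 2, 0)      # corrupt the lead byte of the 2nd letter

# errors="replace" substitutes U+FFFD for any byte sequence that is invalid.
recovered = damaged.decode("utf-8", errors="replace")

# The letters before and after the damaged one are still intact.
assert recovered[0] == "α"
assert recovered.endswith("γδ")
assert recovered != original
```

    In UTF-8 each character is decoded independently of its neighbours, so
    the error stays with the character it hit. In a stateful encoding such as
    UTF-i, an error in a long-form character would presumably also select the
    wrong page, and every one-byte character after it would be misread until
    the next long-form character appeared.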


    Finally I would prefer not to associate the word "compression" with UTF-c or
    UTF-i. To most people compression schemes are inherently complicated, so I
    would rather describe them as "efficient text encodings". Describing SCSU
    and BOCU-1 as compression schemes may be the reason for the lack of
    enthusiasm for those encodings.


    I have uploaded some sample UTF-c pages to this site
    http://web.aanet.com.au/tec if you are interested in testing whether your
    server and browser can accept them.

