From: Shawn Steele (Shawn.Steele@microsoft.com)
Date: Sun Feb 27 2011 - 23:23:44 CST
Interoperability trumps size by far for web content. Half the content isn't UTF-8 because of legacy issues, which have nothing to do with size. The markup of an HTML page is typically ASCII, and any image is far bigger than the page content.
Sent: Sunday, February 27, 2011 1:36 PM
Subject: Re: UTF-12!
From: Thomas Cropley (email@example.com)
Many thanks to everybody for their comments on UTF-c, especially to Philippe
Verdy. I have been reading them all with much interest.
First of all I would like to clarify what motivated me to develop UTF-c.
Some time ago I read that only about half the pages on the internet were
encoded in UTF-8, and this got me wondering why. But then I imagined myself
as, say, a native Greek speaker who knew that characters of my language could be
encoded in one byte using a Windows or ISO code-page, instead of two bytes
in UTF-8. In that situation I would choose to use a code-page encoding most
of the time and only use UTF-8 if it was really needed. It was obvious that
it would be preferable to have as few character encodings as possible, so
the next step was to see if one encoding could handle the full Unicode
character set, and yet be as efficient or almost as efficient as one byte
per character code-pages. In other words I was trying to combine the
advantages of UTF-8 with Windows/ISO/ISCII etc. code-pages.
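
The size difference described above can be checked directly. The following is a small sketch, not part of the original message; the sample string and the choice of the `cp1253` and `iso8859_7` codecs are assumptions made for illustration.

```python
# Compare the encoded size of a short run of Greek letters in UTF-8
# and in two single-byte Greek code pages (illustrative choices).
text = "αβγδε"  # five lowercase Greek letters, U+03B1..U+03B5

utf8_len = len(text.encode("utf-8"))      # two bytes per letter
cp1253_len = len(text.encode("cp1253"))   # Windows Greek code page
iso_len = len(text.encode("iso8859_7"))   # ISO 8859-7 Greek

print(utf8_len, cp1253_len, iso_len)
```

For text that also contains ASCII punctuation and spaces the ratio is smaller than two to one, since those characters take one byte in all three encodings.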
My first attempt to solve this problem, which I called UTF-i, was a stateful
encoding that changed state whenever the first character of a non-ASCII
alphabetic script was encountered (hereafter I will call a character from a
non-ASCII alphabetic script a paged-character). It didn't require any
special switching codes because the first paged-character in a sequence was
encoded in long form (ie. two or three bytes) and only the following
paged-characters were encoded in one byte. When a paged-character from a
different page was encountered, it would be encoded in long form, and the
page state variable would change. The ASCII page was always active so there
was no need to switch pages for punctuation or spaces. So in effect it was a
dynamic code-page switching encoding. It had the advantage of not requiring
the use of C0 control characters or a file prefix (magic number, pseudo-BOM,
signature). It didn't take me long to reject UTF-i though, because although
it would have been suitable for storage and transmission purposes, trying to
write software like an editor or browser for a stateful encoding like UTF-i
would be a nightmare.
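
To make the idea of dynamic page switching concrete, here is a minimal sketch of a stateful encoding in the spirit of UTF-i. It is not the actual UTF-i byte layout, which the message does not give. The assumptions are: a page is an aligned block of 64 code points, the long form is simply the character's UTF-8 sequence (so it can be up to four bytes, not only two or three), and the short form is a single byte in the range 0x80..0xBF. The names `utfi_encode`, `utfi_decode` and `PAGE_MASK` are hypothetical.

```python
PAGE_MASK = 0x3F  # assumption: a page is an aligned block of 64 code points


def utfi_encode(text: str) -> bytes:
    """Encode text with a hypothetical UTF-i-like stateful scheme."""
    out = bytearray()
    page = None  # base code point of the currently active page
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)  # the ASCII page is always active
        elif page is not None and (cp & ~PAGE_MASK) == page:
            out.append(0x80 | (cp & PAGE_MASK))  # short form: one byte
        else:
            out += ch.encode("utf-8")  # long form (assumed to be UTF-8)
            page = cp & ~PAGE_MASK     # the long form switches the page
    return bytes(out)


def utfi_decode(data: bytes) -> str:
    """Decode by scanning forward only, tracking the active page."""
    chars = []
    page = None
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            chars.append(chr(b))
            i += 1
        elif b < 0xC0:
            if page is None:
                raise ValueError("short form before any page was selected")
            chars.append(chr(page | (b & PAGE_MASK)))
            i += 1
        else:
            # A UTF-8 lead byte tells us how long the long form is.
            n = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            ch = data[i:i + n].decode("utf-8")
            chars.append(ch)
            page = ord(ch) & ~PAGE_MASK
            i += n
    return "".join(chars)


sample = "αβγ, αβγ"
encoded = utfi_encode(sample)
print(len(encoded), len(sample.encode("utf-8")))
```

The decoder shows why random access is awkward: a byte in the range 0x80..0xBF cannot be interpreted without knowing which page was selected by the most recent long form, so the text has to be scanned from the start.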
It now occurs to me that I may have been too hasty in my rejection of UTF-i.
If I were writing an application like a text editor or browser which could
handle text encoded in multiple formats, the only sensible approach would be
to convert the encoded text to an easily processed form (such as a 16-bit
Unicode encoding) when the file is read in, and to convert back again when
the file is saved. It seems to me that Microsoft has adopted this approach.
So the stateful encoded text only needs to be scanned in the forward
direction from beginning to end, and thus most of the difficulties are
avoided. The only inconvenience I can see is for text searching applications
which scan stored files for text strings.
The other issue which needs to be addressed is how well the text encodings
handle bit errors. When I designed UTF-c I assumed that bit errors were
extremely rare because modern storage devices and transmission protocols use
extensive error detection and correction. Since then I have done some
reading and it seems I may have been wrong. Apparently present-day
manufacturers of consumer electronics try to save a few dollars by not
including error detection for memory. There also seem to be widely varying
estimates of what error rate you can expect. This is important because, if
bit errors are extremely rare, it would not be prudent to give up efficiency
for an encoding that tolerates them well; if, on the other hand, errors are
reasonably common, then the text encoding needs to be able to localize the
damage.
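
As an illustration of what localizing the damage looks like, the sketch below flips one bit in a UTF-8 byte string and decodes the result. It uses UTF-8 rather than UTF-c or UTF-i, whose byte layouts are not given here; the helper `flip_bit` and the sample text are assumptions made for the example.

```python
def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (simulated bit error)."""
    damaged = bytearray(data)
    damaged[index] ^= 1 << bit
    return bytes(damaged)


original = "αβγδ".encode("utf-8")
damaged = flip_bit(original, 2, 7)  # corrupt the lead byte of the second letter

# errors="replace" substitutes U+FFFD for anything that cannot be decoded.
recovered = damaged.decode("utf-8", errors="replace")
print(recovered)
```

Because each UTF-8 character carries its own lead byte, the letters before and after the damaged one still decode correctly. In a stateful encoding, a bit error in a byte that selects the page could instead affect every short-form character that follows until the next page switch.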
Finally I would prefer not to associate the word "compression" with UTF-c or
UTF-i. To most people compression schemes are inherently complicated, so I
would rather describe them as "efficient text encodings". Describing SCSU
and BOCU-1 as compression schemes may be the reason for the lack of
enthusiasm for those encodings.
I have uploaded some sample UTF-c pages to this site
http://web.aanet.com.au/tec if you are interested in testing whether your
server and browser can accept them.
Well heck, we might as well just define UTF-12:
0000 0000 000q qzzz yyyy yywx xxxx xxxx
00wx xxxx xxxx U+0000..U+0400
010z zzyy yyyy-10wx xxxx xxxx U+0400..U+7FFFF
011u uuyy yyyy-10wx xxxx xxxx U+80000..U+CFFFF
110v vvyy yyyy-10wx xxxx xxxx U+D0000..U+10FFFF
111x xxxx xxxx U+0400..U+10FFFF; same first fifteen bits as previous 2-"byte" character
First two bits:
00=single "byte" character
01=first "byte" of multi-byte character < U+D0000 (third bit '1' indicates > U+80000)
110=first "byte" of plane 13+ two byte character
10=last "byte" of multi-byte character
111=second "byte" of character in same 512 code point range as previous > U+0400 character
Stores supplementary planes with 24 bits instead of the 32 needed in UTF-16, UTF-32, UTF-8, etc.
Same size as UTF-8 (no more than 24 bits) for Plane 0 > U+0800 but more efficient (12 vs 16 bits) for U+007F..U+0400. For texts written in most alphabetic/abjad/brahmi scripts, there is near parity to Basic Latin through the use of "repeat" byte codes and presence of common punctuation and diacritics < U+0400.
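
The size claims above can be tabulated with a few lines of code. This is only a sketch under stated assumptions: it reads the layout table as 12 bits below U+0400 and 24 bits from U+0400 upward, and it ignores the one-unit "repeat" form. The function names `utf8_bits` and `utf12_bits` are hypothetical.

```python
def utf8_bits(cp: int) -> int:
    """Bits needed to encode one code point in UTF-8."""
    if cp < 0x80:
        return 8
    if cp < 0x800:
        return 16
    if cp < 0x10000:
        return 24
    return 32


def utf12_bits(cp: int) -> int:
    """Bits per code point in the UTF-12 table, ignoring the repeat form."""
    return 12 if cp < 0x400 else 24


samples = [("A", 0x41), ("é", 0xE9), ("α", 0x3B1), ("€", 0x20AC), ("😀", 0x1F600)]
for label, cp in samples:
    print(f"U+{cp:04X} {label}: UTF-8 {utf8_bits(cp)} bits, UTF-12 {utf12_bits(cp)} bits")
```

Read this way, the table also implies two cases the message does not spell out: ASCII costs 12 bits instead of 8, and U+0400..U+07FF (which includes Cyrillic) costs 24 bits without the repeat form, against 16 in UTF-8.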
Each byte indicates the size of the character, and which byte of the character it is, unlike UTF-8 - 10xx xxxx can be the second, third, or fourth byte, and you must look back to a 11xx xxxx byte to determine whether you must look forward to complete the character.
For the foreseeable future, 011x X X will indicate unassigned planes, and the first byte of multi-byte characters can be quickly scanned for any code points in an unassigned plane.
1101 X X unambiguously indicates one of the private use planes.
Who the heck has 12 bit bytes?
Continuing byte requires backtracking an arbitrary distance to establish baseline range of character and introduces two encodings for any character > U+0400, depending on the previous character outside the first 1k code points.
PS, if anyone responds to this seriously, please learn about sarcasm.
This archive was generated by hypermail 2.1.5 : Sun Feb 27 2011 - 23:30:35 CST