From: Peter Kirk (peterkirk@qaya.org)
Date: Fri May 20 2005 - 06:41:35 CDT
On 20/05/2005 02:36, Dean Snyder wrote:
> ...
>
>I can, for example, see a future when 32 bit characters are the minimum
>standard and all hardware dealing with text has the same endianness -
>the current default, big endian ;-) In such environments, multiple text
>encoding forms and schemes and BOMs will be superfluous.
>
>
>
In an environment in which all text is represented by 32-bit entities,
endianness is also superfluous, or meaningless, and the fighters of
Lilliput can lay down their weapons at last.
...
>>>Probably the single most important, and extremely simple, step to a
>>>better encoding would be to force all encoded characters to be 4 bytes.
>>>
>>>
>>Naive in the extreme. You do realize, of course, that the entire
>>structure of the internet depends on protocols that manipulate
>>8-bit characters, with mandated direction to standardize their
>>Unicode support on UTF-8?
>>
>>
Actually, much of the Internet infrastructure can still deal only with
7-bit characters, as we have been discussing on another thread. In order
to carry 8-bit data, whether legacy encoded or UTF-8, across the
Internet, it is apparently necessary to insert a low level "Quoted
Printable" encoding layer to recode any bytes with the top bit set as
three characters, leading to gross inefficiency in transmission of
anything other than ASCII text - any UTF-8 encoded Unicode character
from U+0080 upwards is transmitted as between six and twelve bytes in this
encoding. If we can tolerate this kind of extra layer to carry 8-bit
character based data on a 7-bit medium, surely we can tolerate a similar
layer to carry 32-bit character data on a 7-bit or 8-bit medium, for a
transitional period until the Internet or its successor is upgraded to
support 32-bit data at its lowest levels. And it should be possible to
devise a suitable encoding which is far more efficient than UTF-8 over
"Quoted Printable". Of course UTF-7 and UTF-8 are themselves suitable
encodings, but here I am treating them as content transfer encodings
rather than as character sets.
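
The six-to-twelve-byte figure for UTF-8 over "Quoted Printable" can be checked with a minimal sketch in Python, using the standard quopri module. The helper name qp_size and the sample characters are assumptions chosen only for illustration; this is a rough measurement, not a model of any particular mail system.

```python
import quopri


def qp_size(ch: str) -> int:
    """Size in bytes of one character sent as UTF-8 wrapped in Quoted-Printable.

    qp_size is a hypothetical helper for this illustration only.
    """
    return len(quopri.encodestring(ch.encode("utf-8")))


# Sample code points: ASCII, then 2-, 3- and 4-byte UTF-8 sequences.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8_len = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {utf8_len} UTF-8 byte(s) -> {qp_size(ch)} byte(s) after QP")
```

Each byte with the top bit set becomes a three-character "=XX" escape, so a two-byte UTF-8 sequence costs six bytes and a four-byte sequence costs twelve, while plain ASCII passes through unchanged.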
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/