Re: ASCII and Unicode lifespan

From: Peter Kirk
Date: Fri May 20 2005 - 06:41:35 CDT

    On 20/05/2005 02:36, Dean Snyder wrote:

    > ...
    >I can, for example, see a future when 32 bit characters are the minimum
    >standard and all hardware dealing with text has the same endianness -
    >the current default, big endian ;-) In such environments, multiple text
    >encoding forms and schemes and BOMs will be superfluous.

    In an environment in which all text is represented by 32-bit entities,
    endianness is also superfluous, or meaningless, and the fighters of
    Lilliput can lay down their weapons at last.


    >>>Probably the single most important, and extremely simple, step to a
    >>>better encoding would be to force all encoded characters to be 4 bytes.
    >>Naive in the extreme. You do realize, of course, that the entire
    >>structure of the internet depends on protocols that manipulate
    >>8-bit characters, with mandated direction to standardize their
    >>Unicode support on UTF-8?

    Actually, much of the Internet infrastructure can still deal only with
    7-bit characters, as we have been discussing on another thread. In order
    to carry 8-bit data, whether legacy encoded or UTF-8, across the
    Internet, it is apparently necessary to insert a low-level
    "Quoted-Printable" encoding layer which recodes every byte with the top
    bit set as three characters. This leads to gross inefficiency in the
    transmission of anything other than ASCII text: any UTF-8 encoded
    Unicode character from U+0080 upwards occupies two to four bytes, each
    with the top bit set, and so is transmitted as between six and twelve
    bytes in this encoding. If we can tolerate this kind of extra layer to
    carry 8-bit character based data on a 7-bit medium, surely we can
    tolerate a similar layer to carry 32-bit character data on a 7-bit or
    8-bit medium, for a transitional period until the Internet or its
    successor is upgraded to support 32-bit data at its lowest levels. And
    it should be possible to devise an encoding which is far less wasteful
    than UTF-8 over "Quoted-Printable". Well, of course UTF-7 and UTF-8 are
    suitable encodings, but I mean them here as content transfer encodings
    rather than as character sets.
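
    The six-to-twelve-byte figure can be checked with a minimal sketch,
    assuming Python 3 and its standard quopri module. The helper name
    qp_cost is hypothetical and exists only for this illustration; it
    encodes a single character, so Quoted-Printable soft line breaks do
    not come into play.

```python
import quopri


def qp_cost(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, byte count after Quoted-Printable).

    qp_cost is a hypothetical helper for illustration only.
    """
    utf8 = ch.encode("utf-8")
    # Quoted-Printable rewrites each byte with the top bit set as "=XX",
    # i.e. three ASCII characters per original byte.
    qp = quopri.encodestring(utf8)
    return len(utf8), len(qp)


# One sample character for each UTF-8 length from one to four bytes.
for ch in ("A", "\u00e9", "\u20ac", "\U00010000"):
    utf8_len, qp_len = qp_cost(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} UTF-8 byte(s), {qp_len} byte(s) on the wire")
```

    ASCII passes through unchanged, while every other character is
    tripled in size on top of its UTF-8 length.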

    Peter Kirk (personal) (work)
