Re: ASCII and Unicode lifespan

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 19 2005 - 16:02:16 CDT


    From: "Kenneth Whistler" <kenw@sybase.com>
    > Dean Snyder suggested:
    >> Stateful mechanisms
    >
    > For bidirectional text, yes.
    >
    > But all extant schemes for the representation of bidirectional
    > text involve stateful mechanisms. Would you care to supplant
    > the last decade's work by the bidirectional committee and
    > suggest a non-stateful mechanism that meets the same requirements
    > for the representation of bidirectional text?

    The only way I see to avoid stateful mechanisms with bidirectional scripts
    would have been to use the visual left-to-right order throughout the
    encoding. Needless to say, this still does not work well, because of soft
    end-of-lines in the middle of paragraphs, or because the whole RTL
    paragraph must be written in the opposite direction.

    >> No support for a clean division between text and meta-text
    >
    > Would you care to suggest replacements for such widely
    > implemented W3C standards as HTML and XML?

    Maybe he suggests that Unicode should encode non-characters for the purpose
    of delimiting textual and non-textual parts.

    >> Legacy sludge
    >
    > This is the point on which I (and a number of other Unicode
    > participants) are most likely to agree with you. The legacy
    > sludge in Unicode was the cost of doing business, frankly.
    > Legacy compatibility was what made the standard successful,
    > because it could and can interoperate with the large number of bizarre
    > experiments in character encoding which preceded it.

    Thanks, I also appreciate the fact that Unicode and ISO/IEC 10646 can
    coexist peacefully with all the many legacy encodings. Without that,
    conversions would have been a nightmare, and as unpredictable as
    conversions between past legacy charsets. It means that almost all legacy
    charsets can be converted very simply to Unicode (the reverse is not
    necessarily true, of course), Unicode acting as a compatible superset of
    almost all these legacy charsets.
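
    For what it's worth, here is a minimal sketch (my own illustration, not
    part of any standard or existing converter) of why single-byte legacy
    charsets convert so simply: one table per charset, applied byte by byte,
    with no state at all. The few mappings shown are taken from ISO 8859-15.

        # Toy excerpt of a single-byte legacy mapping (a few ISO 8859-15
        # positions that differ from Latin-1); a real table would cover
        # all 256 byte values.
        LEGACY_TO_UNICODE = {
            0xA4: 0x20AC,  # EURO SIGN
            0xBC: 0x0152,  # LATIN CAPITAL LIGATURE OE
            0xBD: 0x0153,  # LATIN SMALL LIGATURE OE
        }

        def decode_single_byte(data: bytes) -> str:
            # Bytes absent from the exception table keep their Latin-1 value here.
            return ''.join(chr(LEGACY_TO_UNICODE.get(b, b)) for b in data)

        print(decode_single_byte(b'100 \xa4'))  # -> '100 €'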

    (But in practice it's hard to convert ISO 2022 to Unicode without using
    stateful converters that know all the referenced charsets; and there are
    some difficulties in converting other Teletext standard charsets that
    encode combining characters BEFORE the base character, in a way similar to
    dead keys on European keyboards: the converter needs some lookahead to
    reverse the order of the encoded characters.)
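
    To illustrate that lookahead, here is a small sketch (mine, with a
    hypothetical is_combining predicate) that reorders combining marks which a
    legacy stream sends before their base character into the Unicode order
    (base character first):

        def reorder_prefixed_marks(chars, is_combining):
            pending = []  # combining marks received before their base character
            for ch in chars:
                if is_combining(ch):
                    pending.append(ch)       # buffer the mark (the lookahead)
                else:
                    yield ch                 # emit the base character first...
                    yield from pending       # ...then the buffered marks
                    pending.clear()
            yield from pending               # stray marks at the end of the stream

        # Hypothetical legacy stream: COMBINING ACUTE ACCENT sent before 'e'.
        legacy = ['\u0301', 'e']
        print(''.join(reorder_prefixed_marks(legacy, lambda c: c == '\u0301')))  # 'é'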

    >> >How will the "something better" solve these problems without
    >> >introducing new ones?
    >>
    >> Subsequent encoding efforts will be better because they will have
    >> learned from the mistakes of earlier encoders ;-)

    I hope this will not be a revolution, but mostly corrections to the
    character model, and a better definition of canonical equivalence (if such
    a concept is still needed in the new standard, i.e. if there remain several
    equivalent ways to encode the same abstract characters or grapheme
    clusters).

    >> Probably the single most important, and extremely simple, step to a
    >> better encoding would be to force all encoded characters to be 4 bytes.
    >
    > Naive in the extreme. You do realize, of course, that the entire
    > structure of the internet depends on protocols that manipulate
    > 8-bit characters, with mandated direction to standardize their
    > Unicode support on UTF-8?

    I suppose he is speaking about UTF-16, which may indeed be deprecated at
    some point. I also doubt that UTF-8 will be deprecated soon, given that it
    has no difficulties such as endianness problems (more or less solved for
    UTF-16 by using a BOM).

    I would expect that only one form of UTF-32 will remain (most probably
    little-endian, given that most processors produced today are
    little-endian, except the Motorola/Apple/IBM PowerPC; but I wonder whether
    PPC is not already prepared to work natively with little-endian numbers,
    for example with an endian-mode control bit set by the OS, as I don't know
    its architecture and assembly language).

    > The most serious mistake I see in the architecture resulted from
    > the need to assign surrogates at D800..DFFF, instead of F800..FFFF.
    > But it wasn't "hubris" that led to the prior assignment of
    > a bunch of compatibility characters at FE30..FFEF -- just a lack
    > of foresight about the eventual form of the surrogate mechanism.

    And what about the non-characters at xFFFE and xFFFF? Would you have
    assigned surrogates there? Then how would we have solved the endianness
    "problem" for UTF-16 and UTF-32 if xFFFE and xFFFF were not already
    non-characters, allowing the detection of the BOM?
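
    As a small sketch of that argument (my own code, nothing normative):
    precisely because xFFFE is a non-character, the first two bytes of a
    UTF-16 stream that begins with a BOM identify the byte order without
    ambiguity.

        def utf16_byte_order(data: bytes) -> str:
            if data[:2] == b'\xfe\xff':
                return 'big-endian'     # U+FEFF read in big-endian order
            if data[:2] == b'\xff\xfe':
                # Read big-endian this would be U+FFFE, a non-character,
                # so it can only be a little-endian BOM.
                return 'little-endian'
            return 'unknown (no BOM)'

        print(utf16_byte_order('\ufefftext'.encode('utf-16-be')))  # big-endian
        print(utf16_byte_order('\ufefftext'.encode('utf-16-le')))  # little-endian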

    My opinion is that UTF-16 will not survive in the long term (unlike UTF-8
    and UTF-32), once all processors are at least 32-bit, including in small
    mobile devices and utility appliances. So surrogates will no longer be
    needed...

    We will still need a BOM for UTF-32 only (coded 00 00 FE FF or FF FE 00 00),
    as long as big-endian architectures remain. But there is still no place to
    put surrogates at the end of the 16-bit code unit space.
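
    Likewise for UTF-32, a quick sketch (again only mine) of detecting the byte
    order from the two sequences quoted above:

        def utf32_byte_order(data: bytes) -> str:
            if data[:4] == b'\x00\x00\xfe\xff':
                return 'big-endian'
            if data[:4] == b'\xff\xfe\x00\x00':
                return 'little-endian'
            return 'unknown (no BOM)'

        print(utf32_byte_order('\ufeffA'.encode('utf-32-be')))  # big-endian
        print(utf32_byte_order('\ufeffA'.encode('utf-32-le')))  # little-endian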

    But I'm quite sure that something like UTF-24, with a fixed (little)
    endianness (with the BOM unneeded and illegal, and possibly with ignored
    trailing bytes used only for data alignment in internal memory), may become
    popular for the serialization of Unicode text on protocol streams. It would
    be simpler and faster to decode, and to allocate and store with predictable
    sizes, than UTF-8, which uses variable sizes.
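
    This "UTF-24" is only my speculation here, not an existing Unicode encoding
    form; but to show how trivial such a fixed-width serialization would be,
    here is a minimal sketch using 3-byte, little-endian code units:

        def encode_utf24le(text: str) -> bytes:
            out = bytearray()
            for ch in text:
                # Every Unicode scalar value fits in 21 bits, so 3 bytes suffice.
                out += ord(ch).to_bytes(3, 'little')
            return bytes(out)

        def decode_utf24le(data: bytes) -> str:
            # Fixed-size units: no state, no lookahead, predictable sizes.
            return ''.join(chr(int.from_bytes(data[i:i+3], 'little'))
                           for i in range(0, len(data), 3))

        sample = 'A\u20ac\U0001d11e'  # ASCII, BMP and supplementary characters
        print(encode_utf24le(sample).hex(' '))  # 41 00 00 ac 20 00 1e d1 01
        assert decode_utf24le(encode_utf24le(sample)) == sample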


