Re: ASCII and Unicode lifespan

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 19 2005 - 14:54:03 CDT

    Dean Snyder suggested:

    > Here, off the top of my head, are some problems with Unicode which,
    > cumulatively, could prove its undoing:
    >
    > Needless complexity

    Complex, indubitably.

    But would you care to document the claim that the complexity
    is "needless"?

    > Stateful mechanisms

    For bidirectional text, yes.

    But all extant schemes for the representation of bidirectional
    text involve stateful mechanisms. Would you care to supplant
    the last decade's work by the bidirectional committee and
    suggest a non-stateful mechanism that meets the same requirements
    for the representation of bidirectional text?
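
    To make "stateful" concrete: the bidirectional algorithm's explicit
    controls open and close embedding levels that every character between
    them inherits. Here is a minimal Python sketch of the bookkeeping a
    renderer has to carry -- mine, not anything from the UBA reference
    code:

        # U+202B RIGHT-TO-LEFT EMBEDDING opens an embedding level;
        # U+202C POP DIRECTIONAL FORMATTING closes it again.
        RLE, PDF = "\u202B", "\u202C"

        text = ("The title is " + RLE + "\u05E9\u05DC\u05D5\u05DD 123"
                + PDF + " in Hebrew.")

        depth = 0
        for ch in text:
            if ch == RLE:
                depth += 1      # a right-to-left embedding is now open
            elif ch == PDF:
                depth -= 1      # the most recent embedding was closed
            # any character seen while depth > 0 lives in that RTL context
        print("embeddings balanced:", depth == 0)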

    > No support for a clean division between text and meta-text

    Would you care to suggest replacements for such widely
    implemented W3C standards as HTML and XML?

    > Errors in actual content

    Well, there's that. But any list longer than 30 items generally
    has at least 1 error in it.

    Generations of Chinese scholars have spent 2500 years trying
    to get "the" list of Chinese characters correct. Never have,
    never will.

    > Legacy sludge

    This is the point on which I (and a number of other Unicode
    participants) are most likely to agree with you. The legacy
    sludge in Unicode was the cost of doing business, frankly.
    Legacy compatibility was what made the standard successful,
    because it could and can interoperate with the large number of bizarre
    experiments in character encoding which preceded it.

    At some point, probably measured more in decades than in years,
    the importance of all that legacy sludge will drop to the
    level of irrelevance except for dedicated archivists and
    digital archaeologists. When that happens, some bright,
    young generation is going to say, "Hey, we could clean all
    of that sludge out of Unicode and have a much more
    consistent and easier to implement character encoding
    standard. Whadya think? Should we try making it happen?"
    And chances are, they *will* make it happen, eventually.

    > Irreversibility

    Irreversibility is the nature of standards. Nothing is more
    harmful to a standard -- particularly a widely implemented
    standard -- than trying to retract things from it that have
    already been implemented. That is a fast track to fractionation
    into incompatible, non-interworking, de facto variants of the
    standard.

    > >How will the "something better" solve these problems without
    > >introducing new ones?
    >
    > Subsequent encoding efforts will be better because they will have
    > learned from the mistakes of earlier encoders ;-)

    Sure, but that doesn't answer Doug's question. You have simply
    *assumed* here that subsequent encoding efforts wouldn't end
    up introducing new problems.

    First of all, it should be obvious that any introduction of a
    new universal encoding will result in its own new "legacy"
    problem for figuring out how to deal with (by then) multi-petabytes
    of Unicode data, and with globally distributed software that manipulates
    text encoded in Unicode.

    > Probably the single most important, and extremely simple, step to a
    > better encoding would be to force all encoded characters to be 4 bytes.

    Naive in the extreme. You do realize, of course, that the entire
    structure of the internet depends on protocols that manipulate
    8-bit characters, with a mandated direction to standardize their
    Unicode support on UTF-8?
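
    A toy comparison -- my own Python sketch, not from any protocol
    specification -- shows why UTF-8 slots into those octet-oriented
    protocols and a fixed 4-byte form would not:

        s = "GET /index.html HTTP/1.1"

        utf8 = s.encode("utf-8")        # byte-for-byte identical to ASCII
        utf32 = s.encode("utf-32-be")   # every character padded to 4 bytes

        print(utf8 == s.encode("ascii"))  # True: UTF-8 is ASCII-transparent
        print(len(utf8), len(utf32))      # 24 vs 96: 4x penalty on plain ASCII
        print(b"\x00" in utf32)           # True: embedded NULs, which 8-bit
                                          # protocols and C string APIs choke on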

    > >How will it meet the challenge of transcoding untold amounts
    > >of "legacy" Unicode data?
    >
    > Transcoding Unicode data into some new standard could at least be done
    > in ways similar to the ways pre-Unicode data is being transcoded into
    > Unicode now - an almost trivial pursuit.

    An "almost trivial pursuit" that employs hundreds of fulltime
    programmers, often working on very intractable problems.
    And these "trivial" problems don't go away. Every time somebody
    else decides that some standard isn't "irreversible" and needs
    to be fixed or extended, it creates another class of conversion
    problems to be dealt with to keep information technology chugging
    away. The latest nightmare has been dealing with GB 18030.
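
    For a sense of what the call site of such a conversion looks like,
    here is a deliberately over-simple Python sketch; the "trivial" lines
    hide the years of work sitting in the mapping tables behind them:

        legacy = "\u4E2D\u6587\u20AC".encode("gb18030")  # bytes in the legacy encoding
        text = legacy.decode("gb18030")                  # "transcode" into Unicode
        assert text.encode("gb18030") == legacy          # and back, losslessly -- here
        # The hard part is everything this example dodges: mislabeled data,
        # lossy mappings, vendor extensions, and characters with no target
        # to map to.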

    > But I do
    > believe that hubris, intolerable in such matters, has unfortunately led
    > to short-sighted mistakes in both the architecture and content of
    > Unicode, mistakes Unicode is saddled with in perpetuity.

    Mistakes in content we can argue about, I suppose.

    But how has "hubris" led to "short-sighted mistakes in ... the
    architecture"?

    The most serious mistake I see in the architecture resulted from
    the need to assign surrogates at D800..DFFF, instead of F800..FFFF.
    But it wasn't "hubris" that led to the prior assignment of
    a bunch of compatibility characters at FE30..FFEF -- just a lack
    of foresight about the eventual form of the surrogate mechanism.
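
    For anyone who hasn't stared at the mechanism: UTF-16 carves a
    supplementary code point into two code units drawn from those reserved
    ranges, which is why the ranges had to be free of prior assignments.
    A short Python sketch of the arithmetic:

        def surrogate_pair(cp):
            """Split a supplementary code point into UTF-16 surrogates."""
            assert 0x10000 <= cp <= 0x10FFFF
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)     # high surrogate, D800..DBFF
            low = 0xDC00 + (cp & 0x3FF)    # low surrogate, DC00..DFFF
            return high, low

        print([hex(u) for u in surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']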
     
    > As just one example of the kind of architectural change that could drive
    > new encoding schemes, one could propose an encoding design that self-
    > references its own mutability, thereby redefining "stability" to include
    > not only extensibility but also reversibility. This would be
    > accomplished by dedicating as version indicators, e.g., 7 of the 32 bits
    > in every 4 byte character.

    Whew! You started off your list of problems that may prove the undoing
    of Unicode with "needless complexity". And the first architectural
    change you suggest is putting version indication stamps in 7 bits of
    32-bit characters?! Any software engineer I know would hoot such
    a proposal off the stage for introducing needless complexity into
    string processing. Sorry, but that one is a nonstarter.
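
    In case that verdict needs a demonstration, here is roughly what the
    scheme does to even the simplest string operation. The proposal doesn't
    say which 7 bits, so this Python sketch assumes the top 7 of each
    32-bit unit; everything about it is hypothetical:

        VERSION_MASK = 0x01FFFFFF   # assumed layout: low 25 bits hold the character

        def versioned_equal(a, b):
            """Equality on 32-bit units after stripping the version stamps."""
            return len(a) == len(b) and all(
                (x & VERSION_MASK) == (y & VERSION_MASK) for x, y in zip(a, b)
            )

        v3 = [(3 << 25) | 0x41]    # "A" stamped with version 3
        v5 = [(5 << 25) | 0x41]    # the same "A" stamped with version 5
        print(v3 == v5, versioned_equal(v3, v5))   # False True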

    --Ken


