Re: ASCII and Unicode lifespan

From: Dean Snyder
Date: Thu May 19 2005 - 20:36:43 CDT

    Kenneth Whistler wrote at 12:54 PM on Thursday, May 19, 2005:

    >Dean Snyder suggested:
    >> Here, off the top of my head, are some problems with Unicode which,
    >> cumulatively, could prove its undoing:
    >> Needless complexity
    >Complex, indubitably.
    >But would you care to document the claim that the complexity
    >is "needless"?

    You're leaving out of your quote Doug's original question:

    "Now, in keeping with this, what problems does Unicode present that will
    lead to its replacement by something better?"

    His question was directed toward the future.

    I never claimed that all the problems I listed were bad decisions at the
    times they were made; in fact, I believe that many of them were
    appropriate in light of the engineering, political, and business
    contexts existent AT THE TIME they were made. But that says little about
    the continuing validity of those decisions into the FUTURE when the
    engineering, political, and business contexts have changed - and they will.

    I can, for example, see a future when 32 bit characters are the minimum
    standard and all hardware dealing with text has the same endianness -
    the current default, big endian ;-) In such environments, multiple text
    encoding forms and schemes and BOMs will be superfluous.
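A minimal sketch of the redundancy in question, using only Python's standard codecs. The sample character (U+1D11E) is an arbitrary choice for illustration; the point is that one abstract character currently has several byte-level spellings plus byte order marks, while a single fixed-width, fixed-endianness form would have exactly one.

```python
import codecs

# U+1D11E MUSICAL SYMBOL G CLEF: a code point outside the Basic Multilingual Plane.
ch = "\U0001D11E"

# The same abstract character in five encoding schemes (no BOM written).
forms = {
    name: ch.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

for name, data in forms.items():
    print(f"{name:<10} {len(data)} bytes  {data.hex()}")

# Byte order marks that a decoder may have to sniff before reading any text.
boms = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
}

# With one 32-bit big-endian form, the stored unit is simply the code point.
unit = int.from_bytes(forms["utf-32-be"], "big")
print(hex(unit))
```

All five byte sequences decode to the same character; only the UTF-32 forms store the code point directly as one unit.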

    >> Stateful mechanisms
    >For bidirectional text, yes.
    >But all extant schemes for the representation of bidirectional
    >text involve stateful mechanisms. Would you care to supplant
    >the last decade's work by the bidirectional committee and
    >suggest a non-stateful mechanism that meets the same requirements
    >for the representation of bidirectional text?

    Any text directionality not directly associated with the characters
    themselves belongs to the realm of markup.
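The distinction being drawn here can be sketched with Python's `unicodedata` module, which exposes each character's bidirectional class. Ordinary letters carry their own directionality, whereas the explicit embedding controls open and close a state that spans other characters. The helper `uses_explicit_state` is a hypothetical name introduced for illustration, and this shows only the distinction, not the full bidirectional algorithm.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
# Each one opens or closes a directional state covering other characters.
EXPLICIT_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_classes(text):
    """Return the bidirectional class of every character in text."""
    return [unicodedata.bidirectional(c) for c in text]


def uses_explicit_state(text):
    """Hypothetical helper: True if text contains explicit directional controls."""
    return any(unicodedata.bidirectional(c) in EXPLICIT_CONTROLS for c in text)


# Latin followed by Hebrew: direction comes from the letters themselves.
plain = "abc \u05d0\u05d1\u05d2"

# The same Hebrew run wrapped in RIGHT-TO-LEFT EMBEDDING ... POP DIRECTIONAL FORMATTING.
embedded = "abc \u202b\u05d0\u05d1\u05d2\u202c"

print(bidi_classes(plain))
print(uses_explicit_state(plain), uses_explicit_state(embedded))
```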

    >> No support for a clean division between text and meta-text
    >Would you care to suggest replacements for such widely
    >implemented W3C standards as HTML and XML?

    I already have.

    Just look at the mess of stateful magic escape sequence mechanisms
    needed when one is dealing with meta-text embedded in text embedded in
    meta-text ... This could all be obviated if there were a single bit in
    every encoded character with the assigned semantics of text vs.
    meta-text - a completely robust and unambiguous distinction associated with
    every character (something I threw out for comment when I first joined
    this list over five years ago).
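A minimal sketch of the single-bit idea, assuming 32-bit code units in which the top bit marks meta-text. The bit position and the names `META_BIT`, `tag`, and `split` are assumptions made for illustration; no existing encoding works this way.

```python
META_BIT = 1 << 31        # hypothetical: top bit set means "meta-text"
CODE_MASK = META_BIT - 1  # the low 31 bits hold the code point


def tag(s, meta=False):
    """Encode a string as 32-bit units, flagging each one as text or meta-text."""
    flag = META_BIT if meta else 0
    return [ord(c) | flag for c in s]


def split(units):
    """Separate a unit stream into (text, meta_text) with no escape parsing."""
    text, meta = [], []
    for u in units:
        target = meta if u & META_BIT else text
        target.append(chr(u & CODE_MASK))
    return "".join(text), "".join(meta)


# Markup delimiters are flagged as meta-text; the literal "<" in the content
# is ordinary text and needs no escape sequence.
stream = tag("<b>", meta=True) + tag("a < b") + tag("</b>", meta=True)

text, meta = split(stream)
print(repr(text))
print(repr(meta))
```

Because every unit declares its own role, the content can contain the same characters the markup uses without any quoting layer.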

    >> Errors in actual content
    >Well, there's that. But any list longer than 30 items generally
    >has at least 1 error in it.
    >Generations of Chinese scholars have spent 2500 years trying
    >to get "the" list of Chinese characters correct. Never have,
    >never will.

    What do you mean by "correct"? Complete or accurate, or both?

    And anyway, have Chinese scholars made any progress? And if so should
    the results be standardized?

    In other words, what's your point? "Don't introduce new information or
    corrections into standards"?

    >> Legacy sludge
    >This is the point on which I (and a number of other Unicode
    >participants) are most likely to agree with you. The legacy
    >sludge in Unicode was the cost of doing business, frankly.
    >Legacy compatibility was what made the standard successful,
    >because it could and can interoperate with the large number of bizarre
    >experiments in character encoding which preceded it.

    But my point was that this will not be perpetuated into the foreseeable
    future.

    >At some point, probably measured more in decades than in years,
    >the importance of all that legacy sludge will drop to the
    >level of irrelevance except for dedicated archivists and
    >digital archaeologists. When that happens, some bright,
    >young generation is going to say, "Hey, we could clean all
    >of that sludge out of Unicode and have a much more
    >consistent and easier to implement character encoding
    >standard. Whadya think? Should we try making it happen?"
    >And chances are, they *will* make it happen, eventually.

    That's all I'm saying - except I am not limiting the corrections to
    legacy encoding problems.

    >> Irreversibility
    >Irreversibility is the nature of standards. Nothing is more
    >harmful to a standard -- particularly a widely implemented
    >standard -- than trying to retract things from it that have
    >already been implemented. That is a fast track to fractionation
    >into incompatible, non-interworking, de facto variants of the
    >standard.

    But irreversibility is a fast track to obsolescent, inadequate,
    ill-working, kludge-ridden, and/or unnecessarily complex standards. I
    notice, for example, that you did not mention SGML in your list of XML
    and HTML above.

    In other words, the internal stability you are endorsing is a sure
    recipe for external instability. Put another way - inflexibility leads
    to breakage when the forces are strong enough. This is why I am for
    designing flexibility into certain standards themselves. Basically, one
    would be replacing the revolutionary design/freeze/break/re-design
    cycle by accommodating re-design itself into an orderly, controlled, and
    evolutionary cycle. As I have already suggested, I believe the best
    compromise between internal and external stability to be versioning, or
    "controlled, documented instability" ;-)

    >> Probably the single most important, and extremely simple, step to a
    >> better encoding would be to force all encoded characters to be 4 bytes.
    >Naive in the extreme. You do realize, of course, that the entire
    >structure of the internet depends on protocols that manipulate
    >8-bit characters, with mandated direction to standardize their
    >Unicode support on UTF-8?

    Simply put - you're not looking as far into the future as I am. (But you
    do seem to be doing so in one of your paragraphs above.)

    Engineering is commonly a struggle between the ideal and the practical,
    with today's ideal often becoming tomorrow's practical.

    >> As just one example of the kind of architectural change that could drive
    >> new encoding schemes, one could propose an encoding design that self-
    >> references its own mutability, thereby redefining "stability" to include
    >> not only extensibility but also reversibility. This would be
    >> accomplished by dedicating as version indicators, e.g., 7 of the 32 bits
    >> in every 4 byte character.
    >Whew! You started off your list of problems that may prove the undoing
    >of Unicode with "needless complexity". And the first architectural
    >change you suggest is putting version indication stamps in 7 bits of
    >32 bit characters?! Any software engineer I know would hoot such
    >a proposal off the stage for introducing needless complexity into
    >string processing. Sorry, but that one is a nonstarter.

    We'll see.

    Versioning obviously involves complexity, but it is not needless
    complexity. In fact it solves the terrible problem of irreversibility and
    subsequent breaking, and is the best compromise between stability and

    In such an architecture most of the complexity would be neatly
    encapsulated in the different versions - "Their system supports only
    version 3 of MultiCode; our system supports versions 1 to 4" or "Our
    utility transcodes between all four versions of MultiCode."

    I believe the net effect would actually be less complexity overall,
    especially for those implementers that support only the most recent
    version of such a standard (relying on others for support of older versions).
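A minimal sketch of what per-character versioning might look like, assuming the 7 version bits occupy the top of each 32-bit unit and the remaining 25 bits hold the character code. That layout, the `remap` table, and all function names are hypothetical; the proposal itself only specifies "7 of the 32 bits".

```python
VERSION_BITS = 7
CODE_BITS = 32 - VERSION_BITS      # 25 bits left for the character code
CODE_MASK = (1 << CODE_BITS) - 1


def pack(version, code):
    """Combine a version number and a character code into one 32-bit unit."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in 7 bits")
    if not 0 <= code <= CODE_MASK:
        raise ValueError("code does not fit in 25 bits")
    return (version << CODE_BITS) | code


def unpack(unit):
    """Return (version, code) for a 32-bit unit."""
    return unit >> CODE_BITS, unit & CODE_MASK


def supported(units, versions):
    """True if every unit belongs to a version this system supports."""
    return all(unpack(u)[0] in versions for u in units)


def transcode(units, target, remap):
    """Re-stamp units as `target`; `remap` is a hypothetical table mapping
    (old_version, old_code) to the code in the target version."""
    out = []
    for unit in units:
        version, code = unpack(unit)
        out.append(pack(target, remap.get((version, code), code)))
    return out


v3_text = [pack(3, ord(c)) for c in "AB"]
print(supported(v3_text, {3}), supported(v3_text, {4}))

# Suppose version 4 moved code 0x42 to 0x100 (an invented correction).
v4_text = transcode(v3_text, 4, {(3, 0x42): 0x100})
print([unpack(u) for u in v4_text])
```

A system that handles only the newest version needs just `unpack` and a version check; the mapping tables live in the transcoding utility.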


    Dean A. Snyder

    Assistant Research Scholar
    Manager, Digital Hammurabi Project
    Computer Science Department
    Whiting School of Engineering
    218C New Engineering Building
    3400 North Charles Street
    Johns Hopkins University
    Baltimore, Maryland, USA 21218

    office: 410 516-6850
    cell: 717 817-4897
