From: Dean Snyder (dean.snyder@jhu.edu)
Date: Thu May 19 2005 - 20:36:43 CDT
Kenneth Whistler wrote at 12:54 PM on Thursday, May 19, 2005:
>Dean Snyder suggested:
>
>> Here, off the top of my head, are some problems with Unicode which,
>> cumulatively, could prove its undoing:
>> Needless complexity
>
>Complex, indubitably.
>But would you care to document the claim that the complexity
>is "needless"?
You're leaving out of your quote Doug's original question:
"Now, in keeping with this, what problems does Unicode present that will
lead to its replacement by something better?"
His question was directed toward the future.
I never claimed that all the problems I listed stem from decisions that
were bad at the times they were made; in fact, I believe that many of
those decisions were appropriate in light of the engineering, political,
and business contexts that existed AT THE TIME. But that says little
about the continuing validity of those decisions into the FUTURE, when
the engineering, political, and business contexts have changed - and
they will.
I can, for example, see a future when 32-bit characters are the minimum
standard and all hardware dealing with text has the same endianness -
the current default, big-endian ;-) In such environments, multiple text
encoding forms and schemes and BOMs will be superfluous.
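To make the point about multiple encoding forms and schemes concrete,
here is a minimal Python sketch showing how one and the same code point
is serialized five different ways today. The function name
serialized_forms and the choice of U+20AC as the sample character are
only illustrative assumptions; the codec names are the standard ones
Python ships with.

```python
import codecs

# The serialization schemes currently in use for one and the same code point.
SCHEMES = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

def serialized_forms(ch: str) -> dict:
    """Return the byte sequence each scheme produces for a single character."""
    return {name: ch.encode(name) for name in SCHEMES}

forms = serialized_forms("\u20ac")  # U+20AC EURO SIGN (illustrative choice)
for name, data in forms.items():
    print(f"{name:10} {data.hex(' ')}")

# A BOM exists only because a reader cannot otherwise tell LE from BE.
print("utf-32 BOMs:", codecs.BOM_UTF32_BE.hex(" "), "vs", codecs.BOM_UTF32_LE.hex(" "))
```

With a single fixed-width form and a single byte order, all five rows
would collapse into one and the BOM line would have nothing to report.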
>> Stateful mechanisms
>
>For bidirectional text, yes.
>
>But all extant schemes for the representation of bidirectional
>text involve stateful mechanisms. Would you care to supplant
>the last decade's work by the bidirectional committee and
>suggest a non-stateful mechanism that meets the same requirements
>for the representation of bidirectional text?
Any text directionality not directly associated with the characters
themselves belongs to the realm of markup.
>> No support for a clean division between text and meta-text
>
>Would you care to suggest replacements for such widely
>implemented W3C standards as HTML and XML?
I already have.
Just look at the mess of stateful, magic escape-sequence mechanisms
needed when one is dealing with meta-text embedded in text embedded in
meta-text ... This could all be obviated if there were a single bit in
every encoded character with the assigned semantics of text vs.
meta-text - a completely robust and unambiguous distinction associated
with every character (something I threw out for comment when I first
joined this list over five years ago).
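As a rough illustration of the single-bit idea, here is a minimal
Python sketch. Everything in it is hypothetical: the choice of the top
bit of a 32-bit unit as the flag (META_BIT), and the helper names tag,
text_only, and meta_only. It is not part of Unicode or of any existing
encoding; it only shows that, under these assumptions, text and
meta-text can be separated without any escaping or parsing.

```python
# Hypothetical layout: bit 31 of each 32-bit unit marks meta-text.
META_BIT = 1 << 31

def tag(chars: str, meta: bool) -> list:
    """Encode a string as 32-bit units, setting META_BIT on meta-text."""
    flag = META_BIT if meta else 0
    return [ord(c) | flag for c in chars]

def text_only(units: list) -> str:
    """Recover the text by dropping every unit flagged as meta-text."""
    return "".join(chr(u) for u in units if not u & META_BIT)

def meta_only(units: list) -> str:
    """Recover the meta-text by keeping only the flagged units."""
    return "".join(chr(u & ~META_BIT) for u in units if u & META_BIT)

# The text contains a literal "<", yet needs no "&lt;"-style escape,
# because the flag bit, not the character itself, marks the markup.
stream = tag("<b>", True) + tag("a < b", False) + tag("</b>", True)
print(text_only(stream))   # a < b
print(meta_only(stream))   # <b></b>
```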
>> Errors in actual content
>
>Well, there's that. But any list longer than 30 items generally
>has at least 1 error in it.
>
>Generations of Chinese scholars have spent 2500 years trying
>to get "the" list of Chinese characters correct. Never have,
>never will.
What do you mean by "correct"? Complete or accurate, or both?
And anyway, have Chinese scholars made any progress? And if so should
the results be standardized?
In other words, what's your point? "Don't introduce new information or
corrections into standards"?
>> Legacy sludge
>
>This is the point on which I (and a number of other Unicode
>participants) are most likely to agree with you. The legacy
>sludge in Unicode was the cost of doing business, frankly.
>Legacy compatibility was what made the standard successful,
>because it could and can interoperate with the large number of bizarre
>experiments in character encoding which preceded it.
But my point was that this will not be perpetuated into the foreseeable
future.
>At some point, probably measured more in decades than in years,
>the importance of all that legacy sludge will drop to the
>level of irrelevance except for dedicated archivists and
>digital archaeologists. When that happens, some bright,
>young generation is going to say, "Hey, we could clean all
>of that sludge out of Unicode and have a much more
>consistent and easier to implement character encoding
>standard. Whadya think? Should we try making it happen?"
>And chances are, they *will* make it happen, eventually.
That's all I'm saying - except I am not limiting the corrections to
legacy encoding problems.
>> Irreversibility
>
>Irreversibility is the nature of standards. Nothing is more
>harmful to a standard -- particularly a widely implemented
>standard -- than trying to retract things from it that have
>already been implemented. That is a fast track to fractionation
>into incompatible, non-interworking, de facto variants of the
>standard.
But irreversibility is a fast track to obsolescent, inadequate,
ill-working, kludge-ridden, and/or unnecessarily complex standards. I
notice, for example, that you did not mention SGML in your list of XML
and HTML above.
In other words, the internal stability you are endorsing is a sure
recipe for external instability. Put another way - inflexibility leads
to breakage when the forces are strong enough. This is why I am for
designing flexibility into certain standards themselves. Basically, one
would be replacing the revolutionary design/freeze/break/re-design cycle
with an orderly, controlled, and evolutionary cycle that accommodates
re-design itself. As I have already suggested, I believe the best
compromise between internal and external stability to be versioning, or
"controlled, documented instability" ;-)
>> Probably the single most important, and extremely simple, step to a
>> better encoding would be to force all encoded characters to be 4 bytes.
>
>Naive in the extreme. You do realize, of course, that the entire
>structure of the internet depends on protocols that manipulate
>8-bit characters, with mandated direction to standardize their
>Unicode support on UTF-8?
Simply put - you're not looking as far into the future as I am. (But you
do seem to be doing so in one of your paragraphs above.)
Engineering is commonly a struggle between the ideal and the practical,
with today's ideal often becoming tomorrow's practical.
>> As just one example of the kind of architectural change that could drive
>> new encoding schemes, one could propose an encoding design that self-
>> references its own mutability, thereby redefining "stability" to include
>> not only extensibility but also reversibility. This would be
>> accomplished by dedicating as version indicators, e.g., 7 of the 32 bits
>> in every 4 byte character.
>
>Whew! You started off your list of problems that may prove the undoing
>of Unicode with "needless complexity". And the first architectural
>change you suggest is putting version indication stamps in 7 bits of
>32 bit characters?! Any software engineer I know would hoot such
>a proposal off the stage for introducing needless complexity into
>string processing. Sorry, but that one is a nonstarter.
We'll see.
Versioning obviously involves complexity, but it is not needless
complexity. In fact, it solves the terrible problem of irreversibility
and subsequent breaking, and is the best compromise between stability
and flexibility.
In such an architecture, most of the complexity would be neatly
encapsulated in the different versions - "Their system supports only
version 3 of MultiCode; our system supports versions 1 to 4", or "Our
utility transcodes between all four versions of MultiCode."
I believe the net effect would actually be less complexity overall,
especially for those implementers that support only the most recent
version of such a standard (relying on others for support of older versions).
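For what it is worth, the per-character version stamp is cheap to
sketch. The following Python is a minimal illustration under stated
assumptions, not a worked-out design: placing the 7 version bits at the
top of the 32-bit unit (leaving 25 bits for the code) is my arbitrary
choice for the example, and the names pack, unpack, and
unsupported_versions are hypothetical.

```python
# Hypothetical layout: top 7 bits = version, low 25 bits = character code.
VERSION_BITS = 7
VERSION_SHIFT = 32 - VERSION_BITS          # 25
CODE_MASK = (1 << VERSION_SHIFT) - 1       # low 25 bits

def pack(version: int, code: int) -> int:
    """Combine a version number and a character code into one 32-bit unit."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in 7 bits")
    if not 0 <= code <= CODE_MASK:
        raise ValueError("code does not fit in 25 bits")
    return (version << VERSION_SHIFT) | code

def unpack(unit: int) -> tuple:
    """Split a 32-bit unit back into (version, code)."""
    return unit >> VERSION_SHIFT, unit & CODE_MASK

def unsupported_versions(units, supported) -> list:
    """List the versions present in the text that a system cannot handle."""
    return sorted({unpack(u)[0] for u in units} - set(supported))

# "Their system supports only version 3; our system supports versions 1 to 4."
text = [pack(3, ord(c)) for c in "abc"] + [pack(4, 0x20AC)]
print(unsupported_versions(text, {3}))           # [4]
print(unsupported_versions(text, range(1, 5)))   # []
```

A system that supports only the latest version needs just the
unsupported_versions check and can hand anything else to a transcoder.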
Respectfully,
Dean A. Snyder
Assistant Research Scholar
Manager, Digital Hammurabi Project
Computer Science Department
Whiting School of Engineering
218C New Engineering Building
3400 North Charles Street
Johns Hopkins University
Baltimore, Maryland, USA 21218
office: 410 516-6850
cell: 717 817-4897
www.jhu.edu/digitalhammurabi/
http://users.adelphia.net/~deansnyder/