Re: ASCII and Unicode lifespan

From: John H. Jenkins (
Date: Thu May 19 2005 - 10:23:33 CDT

    On May 18, 2005, at 8:55 PM, Alexander Kh. wrote:

    > That I realize. Especially when it is Microsoft who's paying most of
    > the bill - I totally foresee that their systems will be based on what
    > they paid for. However, many people still pay for traffic, and
    > switching from a local encoding to Unicode will mean double the
    > traffic right away. However, if using a state-machine approach,
    > encodings can be changed on the fly by using a special escape code.
    > That's one way of getting the benefits of both approaches, not to
    > mention the fact that local encodings are more well thought out in
    > design.

    What you're talking about already exists. It's called ISO 2022 and
    it was (comparatively) a failure.

    The advantages you cite for multiple encodings are real, on the
    whole. Unicode *does* add to storage and other overhead, it *does*
    consume resources, and it *does* add to the complexity of systems.
    If the user intends nothing but brain-dead English, then going beyond
    ASCII really is unnecessary.
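
    To put rough numbers on the storage overhead, the following Python
    sketch compares byte counts for the same text in a legacy 8-bit
    encoding, UTF-8, and UTF-16. The sample strings and the choice of
    KOI8-R as the "local encoding" are assumptions made for illustration.

```python
# Sample strings are illustrative assumptions, not from the original thread.
samples = {
    "English": ("Hello, world", "ascii"),
    "Russian": ("Привет, мир", "koi8_r"),
}


def encoded_sizes(text: str, legacy: str) -> dict:
    """Byte counts for one string in a legacy encoding, UTF-8, and UTF-16."""
    return {
        legacy: len(text.encode(legacy)),
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }


for name, (text, legacy) in samples.items():
    print(name, encoded_sizes(text, legacy))
```

    For pure ASCII text, UTF-8 costs nothing extra; for the Cyrillic
    sample it comes close to the doubling mentioned in the quoted message.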


    There are also reasons why Unicode has succeeded in ways that ISO
    2022 has not.

    1) State engines require keeping track of state. Unicode has the
    advantage that you can begin parsing text in the middle and be able
    to find your way relatively quickly. Encodings where state must be
    tracked mean you can't do this; you need to scan all the way back to
    the beginning (potentially) to get your state information.
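
    A minimal Python sketch of this difference, assuming UTF-8 as the
    Unicode encoding form and ISO-2022-JP as the stateful encoding. The
    helper name `next_char_start` and the sample string are hypothetical
    and chosen only for illustration.

```python
def next_char_start(data: bytes, offset: int) -> int:
    """Return the first index >= offset that begins a UTF-8 character.

    UTF-8 continuation bytes always have the form 0b10xxxxxx, so in
    well-formed data at most three bytes need to be skipped.
    """
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


text = "日本語 text"  # arbitrary sample string

# UTF-8: land in the middle of the first character, then resynchronize
# by looking only at the bytes near the starting point.
utf8 = text.encode("utf-8")
start = next_char_start(utf8, 1)
recovered = utf8[start:].decode("utf-8")
print(recovered)

# ISO-2022-JP: skip the leading 3-byte escape sequence and the decoder
# can no longer tell that the following bytes are Japanese.
jis = text.encode("iso2022_jp")
misread = jis[3:].decode("iso2022_jp", errors="replace")
print(misread)
```

    In the UTF-8 case only the one partially skipped character is lost.
    In the ISO-2022-JP case the Japanese run is silently misinterpreted
    until the next escape sequence, and nothing in the bytes themselves
    signals the error.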

    2) For system and application developers, the complexity does not go
    away. My company does business throughout the world, and so we have
    to be prepared for our software and the software written for our
    system to work with all the writing systems of the world. Indeed,
    even if we didn't do business all over the world, the problem
    wouldn't go away. Wandering around Hong Kong, for example, where one
    would think
    that Han and English were enough, I see signs in Nepalese and Thai.
    I don't even care to list the number of languages on signage and in
    newspapers in the US.

    Having an ISO 2022-type approach means that not only do I have to
    keep track of all the complexity that Unicode requires but I must
    *also* deal with the additional headache of the bookkeeping
    associated with the multiple encodings (converting data back and
    forth, among other things) *and* the bookkeeping of maintaining the
    state information. If I'm writing a word processor, it means I have
    to be prepared for the document to switch character sets halfway
    through.

    In other words, you don't save effort at all. A state-based multiple-
    encoding world is considerably *more* difficult for the programmer.
    All you save is disk storage and transmission bandwidth, and in
    today's world, that's really not an enormous cost anymore.

    3) For users of minority and rare languages and scripts, the fact
    that there has to be additional effort to create and maintain
    software which supports their particular needs means that their needs
    are never met. (So far as I know, nobody ever implemented ISO 2022
    in all its glory; they just had a specific market they wanted to
    focus on and stuck there.) Large companies aren't willing to invest
    that effort for small markets, so there isn't support at the system
    level, and shoe-horning support into the system by a third party is
    difficult if not impossible. (I know whereof I speak, having written
    the Deseret Language Kit for Mac OS 8.) With the Unicode approach,
    since you get every script and language for free, additional scripts
    and languages can be supported via add-ons with minimal effort. Even
    third-party add-ons will work in most cases with relatively little
    effort.

    John H. Jenkins
