Re: Unicode & space in programming & l10n

From: Steve Summit (scs@eskimo.com)
Date: Sun Sep 17 2006 - 21:16:47 CDT

    Pardon me for making what may sound like a cavalier and
    irresponsible argument, and for restating several of the same
    points Mark Davis made, and for generally proceeding in a manner
    that, I know, won't be convincing to the die-hard skeptics, but:
    worrying about any alleged space "inefficiency" of Unicode sounds
    like the worst kind of false economy. This is not, after all,
    1960, or 1972, or even 1990.

    Today, hardly anyone does anything with plain text. Everyone
    uses HTML, or XML, or Microsoft Word .doc, or PDF. All of these
    formats bloat the byte count -- sometimes quite spectacularly --
    beyond what a hypothetical flat-ASCII representation would
    consume, yet few are worrying about this. (To be sure, there are
    some naysayers and handwringers and foot-draggers here, too, but
    the marketplace has generally ignored them, and nothing seems to
    have come to a screeching halt in the face of all those popular
    yet bulkier formats.)
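
    Just to put a toy number on that bloat (my own illustration,
    nothing rigorous), here is one sentence measured bare and then
    inside a minimal HTML skeleton, in Python; real HTML, XML,
    .doc, and PDF overheads are of course far larger:

        # Compare the byte count of a sentence as bare ASCII with
        # the same sentence inside a minimal HTML wrapper. The
        # strings are made up purely for illustration.
        plain = "Hardly anyone does anything with plain text."
        html = "<html><body><p>" + plain + "</p></body></html>"

        print(len(plain.encode("ascii")))  # bytes of bare text
        print(len(html.encode("ascii")))   # bytes once markup is added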

    And it's not just that we've moved past plain text to fancy text:
    we've moved past text to graphics, and audio, and video. A few
    years ago iPods and other MP3 players were storing absurdly large
    amounts of music in absurdly small volumes. Today they're
    storing video, too. Given a device that's tricked out with enough
    storage to hold useful amounts of video, the amount of *text* it
    can store is for all intents and purposes infinite. (Last night
    I downloaded Wikipedia -- all of its text -- to my laptop.
    Hardly made a dent.)

    So even if there were no good reason for it, no one would
    (or should) be complaining if, for one reason or another, text
    were a mere factor of 2 bigger than it used to be, when
    everything else
    (the aggregate size of the other data we're trying to store, and
    the capacities of the devices we're storing it on) is orders and
    orders of magnitude bigger than it used to be.
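
    For concreteness, here is where that factor of 2 comes from (a
    quick sketch of my own, in Python): ASCII-range text doubles
    under UTF-16 but is byte-for-byte unchanged under UTF-8, while
    a non-Latin script pays about the same factor under UTF-8
    relative to an old single-byte national character set:

        # Measure the same strings in two Unicode encoding forms.
        # "utf-16-le" is used so the byte-order mark doesn't skew
        # the count.
        for text in ("plain old ASCII text", "Это кириллица"):
            print(repr(text))
            print("  code points: ", len(text))
            print("  UTF-8 bytes: ", len(text.encode("utf-8")))
            print("  UTF-16 bytes:", len(text.encode("utf-16-le")))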

    And, of course, it's not at all the case that "there's no good
    reason for it". Internationalization is an eminently worthy goal.
    The uniform way in which Unicode permits internationalization is
    tremendously beneficial. Sweeping away the old biases in favor
    of 8-bit Roman text is a very fine thing. (For someone to be
    carping that there's still some "bias" towards Roman scripts even
    under Unicode is a stunning example of missing the forest for the
    trees.)

    Yes, it's more work to write software that uses Unicode than
    it was to write software that used 7-bit ASCII. But (a) 7-bit
    ASCII just isn't an option any
    more (the world expects i18n), and (b) it's a hell of a lot
    easier to use Unicode than to use the welter of incompatible
    national character sets it replaced, and (c) it's *happening*.
    The support is out there: the tools, the libraries, the fonts,
    the whole nine yards. Using Unicode these days is not like
    pulling teeth; it's almost as easy as using 7-bit ASCII used to be.
    Most of the hard work has been done.
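
    As one small data point on that ease-of-use claim (a sketch in
    Python, and purely illustrative), a modern language's ordinary
    string type handles Unicode end to end, with bytes appearing
    only at the I/O boundary:

        s = "Grüße, 世界"               # ordinary string of code points
        print(len(s))                    # counts code points, not bytes
        print(s.upper())                 # full Unicode case mapping
        data = s.encode("utf-8")         # bytes only at the I/O boundary
        print(data.decode("utf-8") == s) # True: a clean round trip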

    We work in a tremendously wasteful industry. Hundreds if
    not thousands of man-years are wasted rewriting existing
    functionality in new languages du jour. Megabytes and
    gigabytes of memory and disk space are wasted on glitzy little
    user interface gewgaws that have nothing to do with fundamental
    usability or functionality. Modern programming languages and
    development environments allow barely-trained, careless
    programmers to churn out mountainously complex systems that,
    somehow, mostly work, and are not much more than a factor of 10
    or 100 bigger, and a factor of 10 or 100 slower, than
    equivalently functional, hand-crafted, micro-optimized
    assembler would be. All of this waste, and then some,
    disappears inside the relentlessly marching maw of Moore's law.

    In the face of all that, am I willing to accept a factor-of-two
    expansion in raw text encoding in order to support worldwide i18n?
    In a heartbeat.

    Now, I do understand that there remain a few aberrations. SMS
    text messages, as I understand it, are still limited to 160
    seven-bit characters (140 octets), or some similarly absurd
    number. (On phones that all have cameras in them
    now, and are themselves beginning to support video!) And there
    will always be those few naysayers and handwringers and foot-
    draggers, beating the dead horse of the factor-of-N text expansion
    as if it's some new revelation, or an interesting argument.
    (Personally, I suspect their concern about memory usage is just a
    smokescreen for various kinds of xenophobia. Either they don't
    want to internationalize at all, or they're still harboring one
    of those myopic little grudges about some particular aspect of
    the way Unicode did it.) But those aberrations are just that:
    aberrations.
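
    For the record, the SMS limit works roughly like this (my own
    sketch in Python; the GSM_BASIC table below is abbreviated, and
    the real rules add escape-extension characters and multi-part
    messages): a single message carries 140 octets, which is 160
    characters in the 7-bit GSM default alphabet but only 70 UCS-2
    code units the moment any character falls outside it:

        # Abbreviated approximation of the GSM 03.38 default
        # alphabet; the real table has a few more entries plus an
        # escape extension for characters like {, }, and the euro.
        GSM_BASIC = set(
            "@£$¥èéùìòÇØøÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./"
            "0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ"
            "¿abcdefghijklmnopqrstuvwxyzäöñüà\n\r"
        )

        def single_sms_limit(text):
            """Characters fitting in one SMS, given the encoding
            this text forces."""
            return 160 if all(ch in GSM_BASIC for ch in text) else 70

        print(single_sms_limit("hello"))   # 160: fits 7-bit alphabet
        print(single_sms_limit("привет"))  # 70: forces UCS-2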

    I'm no scholar on this subject -- as anyone who cares about
    citable references has seen, there weren't any next to any of
    the pulled-out-of-the-air numbers I've been brandishing in this
    message -- but from where I sit, there's really no argument about
    Unicode any more. It's basically here, and by all appearances to
    stay, and I'm certainly glad to have it that way.


