Re: Demystifying the Politburo (was: Re: Arabic encoding model (alas, static!))

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 08 2005 - 20:22:33 CDT

    Gregg said:

    > And as a matter of policy I see no reason why a *standards* body
    > (especially an industry standard body) should have a requirement for
    > native speaker participation; after all, the (industry-defined) goal is
    > to get a standard, not to make everybody happy. No doubt such
    > participation is desirable, but it's quite a different thing to say it's
    > required. Standards have to work in the marketplace in order to become
    > standards.

    Correct. But as an aside, and perhaps no surprise to you, historically
    both the UTC and particularly WG2 have been extraordinarily polyglot,
    in terms of first languages, second languages, and languages
    known by study.

    >
    > On the other hand, it's pretty obvious (to me at least) that
    > participation of native speakers in standardization of cultural
    > artifacts like written language is a Good Thing.

    Yes. Oh, and for anyone following along here, an oft-overlooked section
    of the standard is the Acknowledgments -- on this question in
    particular for Unicode 4.0, see pp. vi-vii. Unlike ISO standards,
    which, for various procedural reasons, spring unauthored and
    unacknowledged from the brow of an ISO Secretariat, the Unicode
    Standard has always made a serious effort to try to acknowledge
    the many, many people from around the world who contributed in one
    way or another to the ongoing construction of the massive edifice.

    > (List: I know, I
    > know, Unicode does not encode written language, it encodes
    > characters/scripts/whatever. But the perception will always and
    > inevitably be that it is an encoding or modeling of written language.)

    This, however, is a serious stumbling block.

    The Unicode Standard does not standardize languages. It does not
    standardize writing systems. It does not standardize orthographies.
    It does not standardize alphabets or syllabaries.
    It does not standardize spelling systems. It does not standardize
    letters. It does not standardize fonts. It does not standardize
    formats for written materials. Heck, it does not standardize
    *anything* about written language that an average well-educated user
    of that language would recognize.

    So no wonder people get confused. If they ask, "Well, then what *does*
    it standardize?", we say: encoded characters for scripts and for
    sets of symbols. And then they may well come back with the moral
    equivalent of "Well, those are the letters of my alphabet for my
    language, and they look screwed up to me."

    And so we start all over again trying to explain the basics of
    character encoding and how that relates to the implementation of
    writing systems on computers. And some people get it and some
    people never do.
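
    To make the distinction concrete, here is a quick Python sketch
    (purely illustrative; nothing here comes from the standard's text
    itself). What the standard pins down for a bit of Arabic text is a
    sequence of abstract encoded characters -- code points with names
    and properties, stored in logical order. Everything a reader would
    recognize as "the letters of my alphabet" -- joining behavior,
    positional forms, ligatures -- is the business of fonts and
    rendering engines.

        import unicodedata

        word = "\u0633\u0644\u0627\u0645"   # seen, lam, alef, meem ("salaam")

        for ch in word:
            # The standard assigns each code point a name and properties;
            # it says nothing about the shapes a font will draw.
            print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

        # U+0633  ARABIC LETTER SEEN
        # U+0644  ARABIC LETTER LAM
        # U+0627  ARABIC LETTER ALEF
        # U+0645  ARABIC LETTER MEEM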

    > On the fourth hand, it's also clear (to me at least) that Unicode works
    > great for some linguistic communities and not so great for others. (You
    > knew it was coming, and here it is: Unicode is very bad indeed for the
    > RTL community in general and Arabic in particular. ;-) This gets back
    > to the design principles (and the interests that drive them) of Unicode,
    > which work better for some languages than others.

    for some... writing systems than others.

    But I think that has as much (or more) to do with the writing
    systems themselves as with Unicode principles in particular.

    The plain truth is that some writing systems are much more
    straightforward and simple to implement in a digital information
    system than others. Arabic is particularly difficult for a number
    of reasons: it is written right-to-left, which is not the predominant
    order among the world's writing systems and which happens (for obvious
    reasons) not to be the default that computer systems were
    originally designed for; it standardized on cursive form printing,
    and has very complex and important calligraphic traditions;
    it is a consonant writing system, with various layers of "dotting"
    that have built up on the consonant spine over the centuries for
    indicating voweling, consonant diacritics for adaptation of
    the script to other languages, and multiple levels of annotation
    for sacred, sung, or chanted text. And, like Latin and Cyrillic,
    it has been a widespread, cosmopolitan, "empire" script, which
    means it has huge variation and lots of adaptation issues as it
    moved from language to language and area to area. And perhaps
    not least, it is the script of an important sacred book, which
    means that it is fraught with religion, as well as all the
    usual cultural identity issues associated with scripts.
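
    In the encoding, those layers show up as combining marks stacked on
    the consonant spine. A small Python sketch (mine, just to illustrate
    the mechanics; the particular word and its voweling are incidental):

        import unicodedata

        vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kaf, teh, beh, each with fatha

        for ch in vowelled:
            cc = unicodedata.combining(ch)   # 0 for base letters, non-zero for marks
            print(f"U+{ord(ch):04X}  combining_class={cc:<3} {unicodedata.name(ch)}")

        # Strip the marks and the bare consonant spine is what remains:
        spine = "".join(ch for ch in vowelled if not unicodedata.combining(ch))
        print(spine)   # kaf, teh, beh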

    Furthermore, as much as it would be nice to have Arabic simply
    be implemented consistently right-to-left, in any *practical*
    implementation, you *must* deal with bidirectionality.
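
    A couple of lines of Python show why (again, just an illustration):
    the moment Arabic sits next to Latin letters or digits -- and in any
    real data it always does -- the stored logical order and the display
    order part company, and something has to run the bidirectional
    algorithm (UAX #9) to reconcile them.

        import unicodedata

        # Logical (storage) order: Latin letters, digits, then an Arabic word.
        line = "Unicode 4.0 \u0648\u0627\u0644\u0639\u0631\u0628\u064A\u0629"

        for ch in line:
            # Each character carries a bidi category -- L, EN, WS, AL, and so on --
            # which the layout engine must resolve into display order.
            print(f"U+{ord(ch):04X}  bidi={unicodedata.bidirectional(ch)}")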

    I realize that you think you may have a better mousetrap for
    encoding Arabic text than the encoding used in the Unicode
    Standard. But however you cut the pie, you are still faced with the
    difficulties that the script presents you in dealing with
    the basic information processing requirements: keyboard
    input, text storage, searching, sorting, editing, layout
    and rendering, and so on. The whole stack of information
    processing has to function -- and has to function in the
    context of existing software systems, data storage technologies,
    databases, fonts, libraries, internet protocols, and on and
    on ... or you haven't got any solution at all. Just ideas
    and a theory.
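
    One tiny Python illustration of what "the whole stack has to
    function" means in practice (the data here is contrived, the
    problem is not): text converted from older systems often arrives as
    pre-shaped Arabic presentation forms, and a naive search against
    nominally encoded text simply misses it until somebody normalizes
    both sides.

        import unicodedata

        stored = "\uFEFB"          # ARABIC LIGATURE LAM WITH ALEF, ISOLATED FORM (a presentation form)
        query  = "\u0644\u0627"    # LAM + ALEF, the nominal encoding

        print(query in stored)                                  # False -- the naive search misses
        print(query in unicodedata.normalize("NFKC", stored))   # True  -- after compatibility normalization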
     
    > And then there are the pragmatic issues which you have outlined
    > concisely in another message.

    Yep.

    A character encoding, in particular, not only has to
    work de novo; to have any success at all, it also has to
    function in transition from whatever exists a priori, and it
    has to have a 20-year transition strategy during which
    existing data stores convert, interoperate, and don't cause
    unavoidable confusion, ambiguity, and data loss.
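
    As a concrete (if toy) Python illustration, take a legacy 8-bit
    Arabic code page such as Windows-1256 as a stand-in for "whatever
    exists a priori": the old data has to round-trip into Unicode
    cleanly, and everything the old repertoire could not represent is
    exactly where the confusion and data loss creep in.

        # Nominal Unicode text ("al-arabiyya") and its legacy code page form.
        text = "\u0627\u0644\u0639\u0631\u0628\u064A\u0629"
        legacy = text.encode("cp1256")           # what an older data store might hold
        assert legacy.decode("cp1256") == text   # this conversion is lossless

        # But the legacy repertoire is small; step outside it and the
        # transition is no longer loss-free.
        try:
            "\u0670".encode("cp1256")            # ARABIC LETTER SUPERSCRIPT ALEF (Quranic annotation)
        except UnicodeEncodeError:
            print("not representable in the legacy encoding")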

    That, by the way, is part of that proverbial high mountain
    I was talking about in an earlier post.

    > Personally, I think Unicode is (well, may be) of enormous historical
    > significance, yet it flies almost entirely under the cultural radar, at
    > least in the US. I daresay most places in the world that will
    > eventually be heavily influenced by Unicode are more or less oblivious
    > to it.

    I agree. But then you could say the same thing about ASCII and
    Latin-1 before it. They are part of the guts of information
    technology, and most people are oblivious to the details, as they
    are regarding nearly *all* technology of whatever sort.

    > > http://linguistics.berkeley.edu/sei/
    > >
    >
    > Thanks, very interesting. I see many of the scripts being worked on
    > list one "Everson" as the contact. Who is this mysterious and
    > ubiquitous "Everson", anyway? Is it one person? Sounds an awful lot
    > like the fictional Cecil Adams to me:
    > (http://www.straightdope.com/index.html)

    There have been reliable sightings of an "Everson" in at least
    9 widely separated locations around the globe just within the
    last year. Our best intelligence estimate at the moment is that
    this must be an organization of agents numbering at least 5 -- with
    elaborate disguises -- to account for all the activity involved.

    The front for this organization can be seen at:

    http://www.evertype.com

    The "Everson" heard recently posting on this very list bears
    little apparent resemblance to the "Everson" I had tea with
    in Xiamen, China last January. :-)

    --Ken


