About Encoding Theory (was: Re: Again not about Phoenician)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Nov 08 2004 - 20:30:26 CST


    Peter Kirk suggested:

    > I am suggesting that the best way to get the job done properly is to lay
    > the conceptual foundation properly first, instead of trying to build a
    > structure on a foundation which doesn't match...

    Part of the problem that I think some people are having here,
    including Peter, is that they are ascribing the wrong level
    to the Unicode Standard itself.

    The Unicode Standard is a character encoding standard. What
    it standardizes are the numerical codes for representing
    abstract characters (plus quite a number of related things
    having to do with character properties and algorithms for
    manipulating characters in text to do various things).

    The Unicode Standard is *NOT* a standard for the theory
    or process of character encoding. It does not spell out
    the rules whereby character encoding committees are
    constrained in their process, nor does it lay down
    specifications that would allow anyone to follow some
    recipe in determining what "thing" is a separate script
    and what is not, nor what "entity" is an appropriate
    candidate for encoding as a character and what is not.

    Ultimately, *those* kinds of determinations are made by
    the character encoding committees, based on argumentation
    made in proposals, by proponents and opponents, and in
    the context of legacy practice, potential cost/benefit
    tradeoffs for existing and prospective implementations,
    commitments made to stability, and so on. Those determinations
    don't consist of the encoding committees -- either one of
    them -- turning to the Unicode Standard, page whatever, or
    ISO/IEC 10646, page whatever, to find the rule which
    determines what the answer is. In fact the answers evolve
    over time, because the demands on the standard evolve,
    the implementations evolve, and the impact of the dead
    hand of legacy itself changes over time.

    It is all well and good for people to point out the dynamic
    nature of scripts themselves -- their historical connections
    and change over time, which often make it notably difficult
    to decide whether to encode particular instantiations at
    particular points in history as a "script" in the character
    encoding standard.

    But I would suggest that people bring an equivalently
    refined historical analysis to the process of character
    encoding itself. We are dealing with a *very* complex set
    of conflicting requirements here for the UCS, and attempting
    a level of coverage over the entire history of writing
    systems in the world. Even *cataloging* the world's
    writing systems is immensely controversial -- let alone
    trying to hammer some significant set of "historical nodes"
    into a set of standardized encoded characters that can
    assist in digital representation of plain text content
    of the world's accumulated and prospective written heritage.

    Contrary to what Peter is suggesting, I think it is putting
    the cart before the horse to expect a standard theory of
    script encoding to precede the work to actually encode
    characters for the scripts of the world.

    The Unicode Standard will turn out the way it does, with
    all its limitations, warts, and blemishes, because of a
    decades-long historical process of decisions made by
    hundreds of people, often interacting under intense pressure.

    Future generations of scholars will study it and point out
    its errors.

    Future generations of programmers will continue to use it
    as a basis for information processing, and will continue
    to program around its limitations.

    And I expect that *THEN* a better, comprehensive theory of
    script and symbol encoding for information processing will
    be developed. And some future generation of information
    technologists will rework the Unicode encoding into a new standard
    of some sort, compatible with then-existing "legacy" Unicode
    practice, but avoiding most of the garbage, errors, and
    8-bit compatibility practice that we currently have to
    live with, for hundreds of accumulated (and accumulating)


    This archive was generated by hypermail 2.1.5 : Mon Nov 08 2004 - 20:33:13 CST