Re: About Encoding Theory (was: Re: Again not about Phoenician)

From: Peter Kirk
Date: Tue Nov 09 2004 - 07:51:27 CST


    On 09/11/2004 02:30, Kenneth Whistler wrote:

    >Peter Kirk suggested:
    >>I am suggesting that the best way to get the job done properly is to lay
    >>the conceptual foundation properly first, instead of trying to build a
    >>structure on a foundation which doesn't match...
    >Part of the problem that I think some people are having here,
    >including Peter, is that they are ascribing the wrong level
    >to the Unicode Standard itself.

    Maybe. But why is this? Is it because the Standard describes itself
    misleadingly? Is it because it has been oversold? Is it because people
    who are looking for a conceptual framework look to the text of the
    Standard, and think they have found one there when in fact what they
    find is something different?

    For example, a professor described on this list as one of the most
    famous in his field wrote that each of the proposers and supporters of a
    script proposal "either does not understand Unicode or (and probably
    "and") does not understand what a glyph is" (quoted on this list in May
    this year). Implicitly his criticism applies even to the majority of UTC
    members who accepted the proposal. Was he being unreasonable? What was
    his basis for claiming to understand Unicode better than the UTC
    members? I can't speak for the professor, but I would suppose that his
    claim to understand Unicode is based to a large extent on his reading of
    the Standard, and explanations from others who have read it. If this
    professor, a leading expert in his field, is finding such
    inconsistencies, and as a result of them is slandering the UTC and
    rejecting Unicode, doesn't this suggest that there is something wrong?

    >The Unicode Standard is *NOT* a standard for the theory
    >or process of character encoding. It does not spell out
    >the rules whereby character encoding committees are
    >constrained in their process, nor does it lay down
    >specifications that would allow anyone to follow some
    >recipe in determining what "thing" is a separate script
    >and what is not, nor what "entity" is an appropriate
    >candidate for encoding as a character and what is not.

    It does not normatively specify such things, agreed. But it does appear
    to describe them, at least in outline, in its informative section
    entitled "Unicode Design Principles". And these outline descriptions are
    misleading. All I am asking is that the misleading text be adjusted so
    that it is not misleading and is consistent with the actual practice of
    the UTC. I have proposed one way to do so. You may prefer another way,
    perhaps something like replacing "Characters are the abstract
    representations of the smallest components of written language that have
    semantic value." on p.15 by "... the smallest components of written
    language which have been determined by the character encoding committees
    to be usefully distinguishable." That may be too obviously ad hoc, but
    at least it stops people trying to interpret "semantic value" as
    something of theoretical significance.

    >... Even *cataloging* the world's
    >writing systems is immensely controversial -- let alone
    >trying to hammer some significant set of "historical nodes"
    >into a set of standardized encoded characters that can
    >assist in digital representation of plain text content
    >of the world's accumulated and prospective written heritage.

    Indeed. But if such a standardised set is to be generally acceptable,
    the controversies have to be resolved, and they should be resolved by
    open discussion and diplomatic decision-making, not by imposition of one
    view and accusations that those who hold other views are not "reasonable".

    >Contrary to what Peter is suggesting, I think it is putting
    >the cart before the horse to expect a standard theory of
    >script encoding to precede the work to actually encode
    >characters for the scripts of the world.

    Well, a standard theory is more than what I was asking for. I was
    looking for an accurate summary description of the criteria currently
    being used; or failing that, at least deletion of the current inaccurate

    >The Unicode Standard will turn out the way it does, with
    >all its limitations, warts, and blemishes, because of a
    >decades-long historical process of decisions made by
    >hundreds of people, often interacting under intense pressure.
    >Future generations of scholars will study it and point out
    >its errors.
    >Future generations of programmers will continue to use it
    >as a basis for information processing, and will continue
    >to program around its limitations.

    I agree, of course, that Unicode will not be perfect. But that is no
    argument against doing the best job we can now. Future scholars will
    have fewer errors to point out if present-day scholars who point out
    supposed errors in proposals are listened to and not told things
    like "I can't say that I care a fig". And future programmers will have
    fewer limitations to program around, at great expense, if more care is
    taken to avoid defining and stabilising such limitations. Anyway, what
    is the great hurry? There may be one with certain modern scripts, but I
    don't see much urgency with historic scripts. Just listening more and
    taking more care will help to put off the inevitable *THEN* when Unicode
    has to be replaced.

    >And I expect that *THEN* a better, comprehensive theory of
    >script and symbol encoding for information processing will
    >be developed. And some future generation of information
    >technologists will rework the Unicode encoding into a new standard
    >of some sort, compatible with then-existing "legacy" Unicode
    >practice, but avoiding most of the garbage, errors, and
    >8-bit compatibility practice that we currently have to
    >live with, for hundreds of accumulated (and accumulating)

    Peter Kirk
