Re: Transliterating ancient scripts [was: ASCII and Unicode lifespan]

From: Dean Snyder (dean.snyder@jhu.edu)
Date: Wed May 25 2005 - 12:51:12 CDT


    Gregg Reynolds wrote at 3:01 AM on Tuesday, May 24, 2005:

    >Well I wouldn't argue against the utility of such an encoding; but
    >unfortunately the "transliteration is lossy" argument works against you,
    >for a very simple reason:
    >
    >*computational models of "characters" encode no "glyphic information"*
    >
    >None. Nada. Zipzilchzero. x0041 encodes Latin upper case A; it encodes
    >an identity; it does not encode "glyphic information". Not even a set
    >of glyphs. It's a theoretical impossibility. (btw Unicode has always
    >been a bit confused about this.)
    >
    >And it's fairly easy to see this. There is no rule you can find that
    >will tell you, for any given image, if it is a member of the set of all
    >Latin upper case A glyphs. Pretty much any blob of ink can be construed
    >as "A" in the right context. It's also impossible to enumerate all "A"
    >glyphs.
    >
    >(Idea for a contest: slap a blob of ink in a random pattern in an
    >em-square; a sufficiently creative typeface designer will be able to
    >design a latin font in which the blob will be recognizably "A". Free
    >beer for a week to the best design.)
    >
    >So even if you encode your ancient scripts, you are not protected
    >against the kind of lossiness you want to avoid. There's always a font
    >and a rendering logic involved. You're lost as soon as you lay finger
    >to keyboard and your idea of a glyph is transl(iter)ated into an
    >integer. To guarantee correct decoding of a message in the way you
    >(seem to) want, you would have to transmit specific glyph images along
    >with the encoded message; in which case there's not much point of
    >designing an encoding.
    >
    >Take a look at Douglas Hofstadter's essays on Metafont in "Metamagical
    >Themas" for some fascinating discussion of such stuff.

    This is all typical, sound-good, philosophical mumbo-jumbo originating
    from wrong-headed escapes into irreality.

    The word "abstract", as used in the phrase "abstract encoded
    characters", does not mean arbitrary, random, chaotic - your blobs of
    ink. If that were true, your email would be unintelligible.

    No, in Unicode an abstract character is an association of a unique code
    point with a unique name, a set of properties, and a unique-within-its-
    subscript representative glyph; it's a sort of contract, or gentleman's
    agreement, that makes possible the efficient and intelligible
    interchange of encoded text. As such, each character (ignoring legacy
    stuff) represents a SEMANTIC and GLYPHIC contrastive unit within its
    script or sub-script. (I'm aware, of course, of edge cases like one and
    el, zero and O, trema and umlaut, where context is used for
    disambiguation. But these are extremely rare within a given script or
    subscript.) I challenge anyone, for example, to show us ANY Arabic font
    that does not have exactly the same basic shape for "r" and
    "z" (other than, of course, those playful fonts that are specifically
    designed to mimic documents composed by cutting out printed letters from
    different fonts). Such glyphic information is lost in transliteration,
    but is retained for encoded characters in 99.99% of all existing fonts.
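
    As a concrete illustration of that "contract", here is a minimal Python
    sketch (using the standard unicodedata module) of what an abstract
    character looks like from the programmer's side - each code point
    carries a unique name and a set of properties, entirely independent of
    whatever font eventually renders it. The Arabic reh/zain pair from my
    example is included; their glyphs differ only by a dot, yet their
    encoded identities are distinct:

        import unicodedata

        # U+0041 LATIN CAPITAL LETTER A, U+0631 ARABIC LETTER REH,
        # U+0632 ARABIC LETTER ZAIN - three abstract characters.
        for ch in ("\u0041", "\u0631", "\u0632"):
            print("U+%04X  %-25s  category=%s  bidi=%s" % (
                ord(ch),
                unicodedata.name(ch),
                unicodedata.category(ch),
                unicodedata.bidirectional(ch)))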

    An abstract character is like a genotype, with variable renderings in
    fonts, its phenotypes. Phenotypes form RECOGNIZABLE, CONTRASTIVE
    CLUSTERS around their genotypes. Obviously the amount of stylistic
    variability within any given phenotypical cluster is theoretically
    infinite; but that does not mean that the variability is unbounded or
    random. Playful, perverse, and accidental renderings of abstract
    characters are the exceptions that only prove the rule - they are easily
    recognized for what they are, non-phenotypical "mutations" - and they
    are typically avoided. I haven't seen too many books, newspapers, or
    websites published in dingbats. [Another way to look at this is that
    perverse, playful, or accidental renderings of glyphs could not even be
    recognized as such were it not for the existence of "core" renderings of
    glyphs.]

    There are whole industries, in the real world, built around the concept
    of phenotypical clustering, industries involved in feature detection and
    feature recognition. In the text arena it's called optical character
    recognition, and it DEPENDS upon the phenotypical clustering of the
    renderings of abstract characters.

    Read some OCR algorithms if you insist on thinking that "There is no
    rule you can find that will tell you, for any given image, if it is a
    member of the set of all Latin upper case A glyphs." The operative words
    here are "rule" and "all". Just because you cannot formulate the rules
    doesn't mean they don't or can't exist. Even though OCR algorithms
    ("rules") are not as good as the human brain at recognizing characters
    from glyphs, they are becoming more and more sophisticated all the time,
    continually approaching the ideal of recognizing all A's. So there are
    rules - they're just very complex and haven't been completely formalized
    yet. [By the way, this disparity between human and computer glyph
    recognition is the basis for the various human-based glyph recognition
    schemes used by several online services to verify that a respondent is
    indeed a human. But here, again, the exception proves the rule - the
    very success of such glyph-based schemes DEPENDS on the RECOGNIZABILITY
    of those glyphs as phenotypical members of the clusters associated with
    their genotypes, their abstract characters.]
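
    To make the "phenotypical clustering" point concrete, here is a toy
    sketch - emphatically not a real OCR algorithm, which would use far
    more sophisticated feature extraction and statistical models - of the
    kind of rule involved: each abstract character gets a prototype bitmap,
    and a rendered glyph is assigned to whichever prototype it most
    resembles. The bitmaps and the pixel-mismatch metric are purely
    illustrative assumptions:

        # Toy nearest-prototype classifier: the prototype plays the role of
        # the "genotype"; an incoming glyph image is one of its "phenotypes".
        PROTOTYPES = {
            "A": ["..#..",
                  ".#.#.",
                  "#...#",
                  "#####",
                  "#...#"],
            "T": ["#####",
                  "..#..",
                  "..#..",
                  "..#..",
                  "..#.."],
        }

        def distance(a, b):
            # Count mismatching pixels between two same-sized bitmaps.
            return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

        def classify(glyph):
            # Assign the glyph to the nearest prototype (fewest mismatches).
            return min(PROTOTYPES, key=lambda c: distance(PROTOTYPES[c], glyph))

        # A slightly "mutated" A - still well inside the A cluster.
        sloppy_a = ["..#..",
                    ".#.#.",
                    "#...#",
                    "####.",
                    "#...#"]
        print(classify(sloppy_a))  # -> A

    The point is not that five-by-five bitmaps settle anything, but that a
    perfectly explicit rule can assign an unseen glyph to its cluster -
    which is all the "rule" in question needs to do.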

    ***********************************

    It seems that several people here have gotten hung up on my phrase
    "transliteration is lossy" and it is partially my fault. What I have not
    meant to imply, of course, is that encoding is lossless; that would be
    silly, and I presumed would be self-evident to everyone. But I should
    have made a more explicit statement, such as - "Transliteration is
    orders of magnitude more lossy than encoding." I will say, however, that
    in my original post to this thread I did make the statement (one, by the
    way, that has been largely ignored) that, "Encoded scripts more closely
    model autograph text and therefore either enable or greatly improve the
    execution of these activities (without, of course, replacing the need
    for the autopsy of original texts)." And that continues to be the main
    reason why I think ancient scripts should be encoded and not JUST
    transliterated.

    Dean A. Snyder

    Assistant Research Scholar
    Manager, Digital Hammurabi Project
    Computer Science Department
    Whiting School of Engineering
    218C New Engineering Building
    3400 North Charles Street
    Johns Hopkins University
    Baltimore, Maryland, USA 21218

    office: 410 516-6850
    cell: 717 817-4897
    www.jhu.edu/digitalhammurabi/
    http://users.adelphia.net/~deansnyder/


