Re: Stateful encoding mechanisms

From: Dean Snyder (dean.snyder@jhu.edu)
Date: Thu May 19 2005 - 12:07:45 CDT


    Peter Constable wrote at 8:46 AM on Thursday, May 19, 2005:

    >Note that surrogates, BOM and annotation characters (FFF9..FFFB) are not
    >used in the text content of a file:

    I do not understand why you are making the irrelevant, and even
    partially wrong, assertion here that surrogates, BOM and annotation
    characters are not used in the text content of a file.

    I was not addressing the concept of "text content of a file"; I
    specifically addressed "stateful mechanisms for plain text encoding".

    Surely you are not denying that surrogates, BOM and annotation
    characters are stateful mechanisms? It is irrelevant for the discussion
    of stateful mechanisms in encoding and the problems they pose for
    fragment interpretability whether or not those mechanisms are in the
    text content; they are in the text stream and must be dealt with.

    And for that matter, I don't understand why you left out the bidi
    operators here, which I also mentioned. Do you consider them part of the
    text content?

    SURROGATES:

    The Unicode Standard 4.1, section 3.9
    "In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
    represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02>
    corresponds to U+10302."

    How can you say that, for example, the surrogates in this very example
    in TUS are not used in text content?
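A minimal sketch of the arithmetic behind the TUS example above, and of why a fragment that begins in the middle of a surrogate pair cannot be interpreted on its own. The function names `to_utf16_units` and `starts_mid_pair` are hypothetical, chosen only for illustration.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code unit(s) for one Unicode code point."""
    if code_point < 0x10000:
        return [code_point]            # BMP: a single code unit
    offset = code_point - 0x10000      # 20-bit offset into the supplementary planes
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return [high, low]


def starts_mid_pair(units):
    """True if a fragment begins with a low surrogate, i.e. its lead unit is missing."""
    return bool(units) and 0xDC00 <= units[0] <= 0xDFFF


# The code point sequence quoted from TUS section 3.9.
code_points = [0x004D, 0x0430, 0x4E8C, 0x10302]
units = [u for cp in code_points for u in to_utf16_units(cp)]
print([f"{u:04X}" for u in units])     # ['004D', '0430', '4E8C', 'D800', 'DF02']

# A fragment cut after D800 starts with DF02 alone and cannot be decoded by itself.
fragment = units[4:]
print(starts_mid_pair(fragment))       # True
```

The second surrogate only has meaning in light of the unit before it, which is the sense of "stateful" at issue here.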

    BOM:

    The Unicode Standard 4.1, section 15.8
    "Detection of U+FFFE at the start of an input stream should be taken as
    a strong indication that the input stream should be byte-swapped before
    interpretation."

    Note the use of the word "strong" here, signaling the BOM's ambiguity.
    U+FEFF can occur almost anywhere in a text stream but if it is a BOM it
    is used to interpret the text content, and is therefore, by definition,
    a stateful mechanism. Notice the troublesome possibility of a text
    fragment that happens to begin with U+FEFF used originally as a zero
    width no-break space but now "should be taken as a strong [yet wrong]
    indication that the input stream should be byte-swapped before
    interpretation".
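A minimal sketch of this ambiguity, assuming UTF-16 big-endian data and using only Python's standard codecs. The helper `sniff_utf16_byte_order` is hypothetical. The fragment below begins with a U+FEFF that was written as a zero width no-break space, yet a BOM-sniffing reader treats it as a signature and silently drops it.

```python
import codecs


def sniff_utf16_byte_order(data):
    """Guess byte order from the first two bytes, as a BOM-aware reader would."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"


# U+FEFF used inside the text as a zero width no-break space, not as a BOM.
text = "foo\ufeffbar"
encoded = text.encode("utf-16-be")             # this codec writes no BOM of its own

# Cut a fragment that happens to start at the U+FEFF (3 characters * 2 bytes in).
fragment = encoded[6:]

print(sniff_utf16_byte_order(fragment))        # big-endian
print(repr(fragment.decode("utf-16")))         # 'bar'  (U+FEFF consumed as a BOM)
print(repr(fragment.decode("utf-16-be")))      # '\ufeffbar'  (U+FEFF kept as content)
```

Which of the two decodings is right cannot be decided from the fragment alone; it depends on knowledge of where the fragment came from.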

    ANNOTATION CHARACTERS:

    The Unicode Standard 4.1, section 15.9
    "For all regular editing and text-processing algorithms, the annotated
    characters are treated as part of the text stream. The annotating text
    is also part of the content, but for all or some text processing, it
    does not form part of the main text stream."

    Obviously the annotation characters themselves are not rendered but they
    are rendering triggers, and they are in the text stream, and they are
    stateful.
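A minimal sketch of that statefulness, assuming the simple non-nested form anchor (U+FFF9), annotated text, separator (U+FFFA), annotating text, terminator (U+FFFB). The function `split_annotations` is hypothetical and is not a complete implementation of the behavior described in TUS.

```python
ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"


def split_annotations(text):
    """Separate the main text stream from annotating text with a one-flag state machine."""
    main, notes = [], []
    in_annotation = False              # state carried from one character to the next
    for ch in text:
        if ch == ANCHOR:
            in_annotation = False      # annotated text follows; it belongs to the main stream
        elif ch == SEPARATOR:
            in_annotation = True       # annotating text follows
        elif ch == TERMINATOR:
            in_annotation = False      # back to ordinary text
        elif in_annotation:
            notes.append(ch)
        else:
            main.append(ch)
    return "".join(main), "".join(notes)


sample = "a" + ANCHOR + "base" + SEPARATOR + "note" + TERMINATOR + "z"
print(split_annotations(sample))       # ('abasez', 'note')

# A fragment that starts inside the annotating text has lost the state set by
# U+FFFA, so the tail of the annotation is misread as main text.
fragment = sample[sample.index("note") + 2:]
print(split_annotations(fragment))     # ('tez', '')
```

The same characters "te" are annotation in the full stream and main text in the fragment; only the preceding control characters distinguish the two readings.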

    Dean A. Snyder

    Assistant Research Scholar
    Manager, Digital Hammurabi Project
    Computer Science Department
    Whiting School of Engineering
    218C New Engineering Building
    3400 North Charles Street
    Johns Hopkins University
    Baltimore, Maryland, USA 21218

    office: 410 516-6850
    cell: 717 817-4897
    www.jhu.edu/digitalhammurabi/
    http://users.adelphia.net/~deansnyder/


