RE: Stateful encoding mechanisms

From: Peter Constable (petercon@microsoft.com)
Date: Thu May 19 2005 - 15:10:12 CDT

    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    > On Behalf Of Dean Snyder

    > Surely you are not denying that surrogates, BOM and annotation
    > characters are stateful mechanisms?

    Surrogates may be a stateful mechanism, but they are not a stateful
    *character* mechanism. Annotation characters may be stateful, but they
    are intended for use only within software processes, where state is not
    an issue.

    Sure, if someone sends me a file with a sequence < a, b, FFF9, c, d,
    FFFA, e, f, FFFB >, I could cut and paste < d, FFFA, e > into some other
    location, completely messing up the annotation syntax; but they
    shouldn't be creating such content in the first place.
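
As an illustration of the cut-and-paste case above, here is a minimal Python sketch that checks whether a code point sequence keeps the anchor/separator/terminator order (U+FFF9, U+FFFA, U+FFFB). The function name `annotation_syntax_ok` and the flat, non-nesting model are assumptions made for this example, not the standard's full definition of interlinear annotation.

```python
# Interlinear annotation characters named in the message above.
ANCHOR, SEPARATOR, TERMINATOR = 0xFFF9, 0xFFFA, 0xFFFB


def annotation_syntax_ok(code_points):
    """Return True if the annotation characters appear in a consistent order.

    Simplified, hypothetical model: no nesting, and every anchor must be
    followed by at least one separator and then a terminator.
    """
    state = "outside"  # one of: "outside", "base", "annotation"
    for cp in code_points:
        if cp == ANCHOR:
            if state != "outside":
                return False
            state = "base"
        elif cp == SEPARATOR:
            if state == "outside":
                return False
            state = "annotation"
        elif cp == TERMINATOR:
            if state != "annotation":
                return False
            state = "outside"
    # An anchor that was never terminated is also a syntax error.
    return state == "outside"


# < a, b, FFF9, c, d, FFFA, e, f, FFFB > from the message.
full = [ord("a"), ord("b"), 0xFFF9, ord("c"), ord("d"),
        0xFFFA, ord("e"), ord("f"), 0xFFFB]
# The pasted fragment < d, FFFA, e >.
fragment = [ord("d"), 0xFFFA, ord("e")]

print(annotation_syntax_ok(full))
print(annotation_syntax_ok(fragment))
```

The fragment fails because its separator has no preceding anchor: interpreting it depends on state that was left behind in the original sequence.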

    Sure, if an app that uses UTF-16 representation internally displays a
    surrogate-pair sequence as a pair of boxes, I could select a run
    beginning or ending in the middle of such a pair and then make some
    change that would produce garbage; but I don't expect to successfully
    work on supplementary-plane text in an app that doesn't actually support
    supplementary-plane text.
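
The selection problem described above can be sketched in Python by treating text as a list of UTF-16 code units and cutting it at an arbitrary index. The helper names `is_well_formed_utf16` and `split_code_units` are hypothetical, and the sketch models a naive app rather than any particular product.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit is part of a proper pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


def split_code_units(units, index):
    """Cut a code unit sequence at a code unit index, as a naive app might."""
    return units[:index], units[index:]


# U+004D followed by U+10302 in UTF-16: <004D D800 DF02>.
units = [0x004D, 0xD800, 0xDF02]

safe_left, safe_right = split_code_units(units, 1)  # between characters
bad_left, bad_right = split_code_units(units, 2)    # inside the pair

print(is_well_formed_utf16(safe_left), is_well_formed_utf16(safe_right))
print(is_well_formed_utf16(bad_left), is_well_formed_utf16(bad_right))
```

Cutting at index 2 leaves an unpaired high surrogate on one side and an unpaired low surrogate on the other, which is the "garbage" case; cutting on a character boundary leaves both halves well-formed.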

    > And for that matter, I don't understand why you left out the bidi
    > operators here, which I also mentioned. Do you consider them part of
    > the text content?

    Yes; that is, they get processed at the same level of representation as
    (say) "a"; they do not get processed at the same level of
    representation as (say) the BOM or surrogate code units.

    > SURROGATES:
    >
    > The Unicode Standard 4.1, section 3.9
    > "In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
    > represented as
    > <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302."
    >
    > How can you say that, for example, the surrogates in this very example
    > in TUS are not used in text content?

    By "text content" I meant the character content -- i.e. what is
    recognized at the level of character interpretation. (IIUC, this is
    analogous to the notion of "infoset" used in relation to XML and SGML.)
    D800 and DF02 are not characters; they are code units used in the UTF-16
    encoding form. They may be part of a stream, but they are not
    individually part of the character-information content of that stream.
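
The distinction between code points and code units can be checked with a short Python sketch, using the sequence quoted from TUS 4.1, section 3.9. The helper names `to_surrogate_pair` and `utf16_code_units` are assumptions for this example; the arithmetic is the standard UTF-16 mapping for supplementary code points.

```python
from typing import List, Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Map a supplementary code point to its UTF-16 (high, low) code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of text in the UTF-16 encoding form."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# The code point sequence <004D, 0430, 4E8C, 10302> from the quoted passage.
text = "\u004D\u0430\u4E8C\U00010302"

print(len(text))                                   # number of code points
print([f"{u:04X}" for u in utf16_code_units(text)])
print([f"{u:04X}" for u in to_surrogate_pair(0x10302)])
```

The string holds four characters but five UTF-16 code units: D800 and DF02 appear only at the encoding-form level and jointly stand for the single character U+10302.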

    Peter Constable
