Re: Stateful encoding mechanisms

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 19 2005 - 16:51:09 CDT

    From: "Dean Snyder" <dean.snyder@jhu.edu>
    > SURROGATES:
    >
    > The Unicode Standard 4.1, section 3.9
    > "In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
    > represented as
    > <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302."
    >
    > How can you say that, for example, the surrogates in this very example
    > in TUS are not used in text content?

    A stream of code units is NOT text content. "Text" means a stream of
    (abstract) characters, i.e. of *assigned* code points. Nothing guarantees
    that the code units in a UTF-16 stream represent text, or even that the
    code points they represent are characters: they may be <unassigned>, i.e.
    <reserved> for future allocation, or <non-characters>.
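
    To make the distinction concrete, here is a minimal Python sketch. It
    is only an illustration: the helper names utf16_code_units and
    decode_code_units are hypothetical, and Python's "utf-16-be" codec is
    used merely as a convenient way to look at 16-bit code units. It
    reproduces the code unit sequence quoted from TUS, then shows a code
    unit stream that is not decodable at all, and one that decodes to a
    noncharacter.

```python
import unicodedata


def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units for a string, as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def decode_code_units(units):
    """Interpret a list of 16-bit code units as UTF-16 and return a string."""
    data = b"".join(u.to_bytes(2, "big") for u in units)
    return data.decode("utf-16-be")


# The code point sequence from the quoted passage of the standard.
code_points = [0x004D, 0x0430, 0x4E8C, 0x10302]
text = "".join(chr(cp) for cp in code_points)
units = utf16_code_units(text)
print(" ".join(f"{u:04X}" for u in units))  # 004D 0430 4E8C D800 DF02

# A stream of code units need not be well-formed: an unpaired surrogate
# code unit does not represent any code point, so decoding fails.
try:
    decode_code_units([0xD800, 0x004D])
    lone_surrogate_decodes = True
except UnicodeDecodeError:
    lone_surrogate_decodes = False

# A well-formed stream need not be text either: U+FFFF is a noncharacter,
# reported by unicodedata with the general category "Cn" (not assigned).
nonchar = decode_code_units([0xFFFF])
nonchar_category = unicodedata.category(nonchar)
```

    The surrogate code units D800 and DF02 exist only at the code unit
    level; after decoding, the string holds the single code point U+10302.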

    > BOM:
    > The Unicode Standard 4.1, section 15.8
    > "Detection of U+FFFE at the start of an input stream should be taken as
    > a strong indication that the input stream should be byte-swapped before
    > interpretation."
    >
    > Note the use of the word "strong" here, signaling the BOM's ambiguity.
    > U+FEFF can occur almost anywhere in a text stream but if it is a BOM it
    > is used to interpret the text content, and is therefore, by definition,
    > a stateful mechanism. Notice the troublesome possibility of a text
    > fragment that happens to begin with U+FEFF used originally as a zero
    > width no-break space but now "should be taken as a strong [yet wrong]
    > indication that the input stream should be byte-swapped before
    > interpretation".

    The BOM is NOT a character. The BOM is NOT the code point U+FEFF. A BOM is
    only a code unit that may be present within a stream of code units, and
    which appears to have the same value as the code unit of the (deprecated and
    not recommended) ZWNBSP character assigned at code point U+FEFF.

    In a UTF-16 encoding *scheme* the leading BOM is fully ignorable. But in a
    UTF-16 encoding form, there's simply NO BOM and the code point U+FEFF is
    legal and represents ZWNBSP.
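
    The difference can be observed with Python's standard codecs, as a
    rough sketch. Assumption: the "utf-16" codec stands in for the UTF-16
    encoding scheme (a leading BOM selects the byte order and is dropped),
    while "utf-16-be" stands in for reading code units with no BOM
    convention, so a leading FEFF stays in the text as U+FEFF. The
    variable names are illustrative only.

```python
import codecs

# Bytes FE FF 00 41: a leading FEFF code unit followed by "A", big-endian.
be_bytes = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")

# Bytes FF FE 41 00: the same content serialized little-endian.
le_bytes = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")

# Encoding scheme with BOM: the BOM only signals byte order and is ignored.
as_scheme_be = be_bytes.decode("utf-16")
as_scheme_le = le_bytes.decode("utf-16")

# No BOM convention: the same leading code unit is the character U+FEFF.
as_code_units = be_bytes.decode("utf-16-be")

print(repr(as_scheme_be))   # 'A'
print(repr(as_scheme_le))   # 'A'
print(repr(as_code_units))  # '\ufeffA'
```

    The same two bytes are dropped in one case and kept as ZWNBSP in the
    other; which one applies is decided by the level at which the stream
    is being interpreted, not by the bytes themselves.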

    You are mixing several levels in the Unicode character model.


