Re: Actually, this wasn't rhetorical

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Jan 24 2005 - 19:31:06 CST

    On 24/01/2005 20:34, Peter Constable wrote:

    > ...
    >
    >>However, (and this is where it gets complicated):
    >>
    >>In the UTF-8, UTF-16, and UTF-32 Unicode encoding FORMs, an initial
    >>U+FEFF represents ZERO WIDTH NO-BREAK SPACE.
    >>In the UTF-8, UTF-16, and UTF-32 Unicode encoding SCHEMEs, an initial
    >>U+FEFF is not considered part of the text.
    >
    >Careful; you're not distinguishing between the coded character U+FEFF
    >and an octet sequence such as 0xFF 0xFE, or 0xEF 0xBB 0xBF. One is an
    >entity of a CCS; the others are sequences of entities at the level of
    >CES. The Standard indicates that an initial octet sequence 0xEF 0xBB
    >0xBF *may* be interpreted at the CES level as a UTF-8 BOM. The Standard
    >neither requires nor recommends this, however; a conformant process may
    >rather interpret that initial sequence as the coded character
    >representation U+FEFF, which in turn it may or may not interpret as
    >ZWNBSP.
    >
    Peter, I am surprised at your statement that "The Standard neither
    requires nor recommends this", as I read it as stating the opposite:
    that this interpretation is mandatory. I note from C12b as quoted by Jill:

    > C12b: When a process interprets a byte sequence which purports to be
    > in a Unicode character encoding scheme, it shall interpret that byte
    > sequence according to the byte order and specifications for the use of
    > the byte order mark established by this standard for that character
    > encoding scheme.

    And then in the definition of the UTF-8 encoding scheme, D39:

    > When represented in UTF-8, the byte order mark turns into the byte
    > sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream
    > is neither required nor recommended by the Unicode Standard, but its
    > presence does not affect conformance to the UTF-8 encoding scheme.

    Thus the "specifications for the use of the byte order mark established
    by this standard for that character encoding scheme" include that the
    sequence <EF BB BF> at the beginning of a UTF-8 data stream represents a
    BOM. While this usage of the BOM is "neither required nor recommended by
    the Unicode Standard", the wording in C12b "it shall interpret" surely
    implies that the interpretation of this byte sequence as either a BOM or
    U+FEFF is mandatory, and that a process which interprets the sequence
    always as the coded character representation U+FEFF is non-conformant.
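
    The two readings at issue can be illustrated with a minimal Python
    sketch. The use of Python's standard "utf-8" and "utf-8-sig" codecs as
    stand-ins for the two kinds of process is an assumption made for
    illustration only; neither codec is mentioned in the Standard or in
    this thread.

```python
# A UTF-8 data stream that begins with the octet sequence <EF BB BF>.
data = b"\xef\xbb\xbfhello"

# Reading 1: the initial sequence is the coded character U+FEFF and stays
# in the text (Python's plain "utf-8" codec behaves this way).
as_character = data.decode("utf-8")

# Reading 2: the initial sequence is a BOM (signature) and is not part of
# the text (Python's "utf-8-sig" codec removes it).
as_signature = data.decode("utf-8-sig")

print(repr(as_character))
print(repr(as_signature))
```

    The same octets yield different text depending on which reading the
    process applies, which is the ambiguity under discussion.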

    But I note from the last paragraph of p.81 of TUS 4.0 that there is a
    serious issue of ambiguity here. In my opinion, the ambiguity could be
    resolved if it were clearly specified that the sequence <EF BB BF> at the
    beginning of a UTF-8 data stream always represents a BOM, and never
    U+FEFF. The recommendation against using this sequence would remain,
    except in the very rare case in which a UTF-8 data stream is intended to
    start with the deprecated character U+FEFF. In that case it should be
    mandatory to include the BOM, and so to start the file with
    <EF BB BF EF BB BF>.
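
    The proposed rule could be sketched roughly as follows in Python. The
    function names decode_proposed and encode_proposed are hypothetical,
    and this is only one possible reading of the proposal, not anything
    the Standard specifies.

```python
BOM = b"\xef\xbb\xbf"  # the octet sequence <EF BB BF>


def decode_proposed(data: bytes) -> str:
    """Hypothetical decoder: an initial <EF BB BF> is always a BOM."""
    if data.startswith(BOM):
        data = data[len(BOM):]  # drop exactly one initial BOM
    return data.decode("utf-8")


def encode_proposed(text: str) -> bytes:
    """Hypothetical encoder: no BOM is written unless the text itself
    starts with U+FEFF, in which case a BOM must precede it."""
    encoded = text.encode("utf-8")
    if text.startswith("\ufeff"):
        return BOM + encoded  # yields <EF BB BF EF BB BF ...>
    return encoded


# A text that really starts with U+FEFF survives a round trip.
original = "\ufeffabc"
stream = encode_proposed(original)
restored = decode_proposed(stream)
```

    Under this sketch an initial <EF BB BF> is never ambiguous: the decoder
    removes it unconditionally, and the encoder doubles it only in the rare
    case where the text begins with U+FEFF.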

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/


    This archive was generated by hypermail 2.1.5 : Mon Jan 24 2005 - 19:38:09 CST