RE: Conformance (was UTF, BOM, etc)

From: Richard T. Gillam (rgillam@las-inc.com)
Date: Fri Jan 21 2005 - 13:26:10 CST


    Jill--

    >[Jill's Important Question 1]:
    >So the first question I must ask is: Which of these two clauses takes
    >precedence, C8 or C12b?
    >
    >If C12b takes precedence, then when a process interprets a byte sequence
    >which purports to be in the Unicode Encoding Scheme UTF-8, it shall
    >interpret that byte sequence according to the specifications for the use
    >of the byte order Mark established by the Unicode Standard for the
    >Unicode Encoding Scheme UTF-8.
    >
    >But if C8 takes precedence, then a process shall not assume that it is
    >required to interpret U+FEFF.
    >
    >They can't both be right.

    Peter Kirk had this one right. Certain encoding SCHEMES treat the byte
    sequence FEFF (or some variant of it) as a byte order mark when it
    appears at the beginning of a text stream. In these contexts, it's not
    a character at all; it's part of the communication protocol. A process
    operating on the actual text, after it's been deserialized and converted
    into an in-memory representation (an encoding FORM), doesn't see it.

    Other encoding schemes don't treat FEFF as special. A process operating
    on the actual text after it's been deserialized will see this as the
    character U+FEFF, the ZWNBSP.
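
    To make the distinction concrete, here is a small Python sketch. It
    assumes Python's "utf-16" codec behaves like the BOM-aware UTF-16
    encoding scheme and "utf-16-be" like the fixed-byte-order one; the
    codec names are Python's, not the Standard's.

```python
# Bytes FE FF followed by big-endian "A" (00 41).
data = b"\xfe\xff\x00A"

# "utf-16" inspects the first two bytes for a byte order mark and
# consumes it, so the decoded text never contains U+FEFF.
with_bom_handling = data.decode("utf-16")

# "utf-16-be" has its byte order fixed in advance, so FE FF is ordinary
# text and decodes to the character U+FEFF.
without_bom_handling = data.decode("utf-16-be")

print(repr(with_bom_handling))
print(repr(without_bom_handling))
```

    The same four bytes yield one character in the first case and two in
    the second, which is the whole difference between the two kinds of
    scheme.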

    >[Jill's Important Question 2]:
    >And the second question I must ask is: if a file is labelled by some
    >higher level protocol (for example, Unix locale, HTTP header, etc) as
    >"UTF-8", should a conformant process interpret that as UTF-8, the Unicode
    >Encoding FORM (which prohibits a BOM) or as UTF-8, the Unicode Encoding
    >SCHEME (which allows one)?

    UTF-8 is both an encoding form and an encoding scheme, and it doesn't do
    anything special with EF BB BF. It always comes through as U+FEFF, the
    ZWNBSP. Applications that use EF BB BF as a signal that the text stream
    is in UTF-8 and not some other encoding are implementing a higher-level
    protocol based on UTF-8. UTF-8 itself doesn't treat this sequence as
    special.
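
    As an illustration only, Python happens to expose both behaviors as
    separate codecs: "utf-8" passes EF BB BF through as U+FEFF, while
    "utf-8-sig" is the signature-stripping layer built on top of it. The
    sample bytes below are made up for the example.

```python
# EF BB BF followed by the ASCII text "hi".
data = b"\xef\xbb\xbfhi"

# Plain UTF-8 decoding: the three bytes are just the encoding of U+FEFF.
plain = data.decode("utf-8")

# "utf-8-sig" treats a leading EF BB BF as a signature and drops it.
stripped = data.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```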

    For that matter, applications that use the full panoply of
    signature-byte sequences (0000FEFF for UTF-32BE, FFFE0000 for UTF-32LE,
    FEFF for UTF-16BE, FFFE for UTF-16LE, EF BB BF for UTF-8, etc.) to
    determine whether a byte stream is Unicode and what Unicode encoding
    scheme it is are also implementing a higher-level protocol based on
    Unicode.
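
    A minimal sketch of such a signature-sniffing protocol in Python
    follows. The function name sniff_signature and the table layout are
    hypothetical; only the BOM constants come from the standard codecs
    module. The UTF-32 signatures must be tried before the UTF-16 ones,
    because FF FE 00 00 begins with FF FE.

```python
import codecs

# Longest signatures first: FF FE 00 00 (UTF-32LE) starts with FF FE
# (UTF-16LE), so checking UTF-16 first would misreport UTF-32LE streams.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def sniff_signature(data: bytes):
    """Return (scheme name or None, payload with the signature removed)."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, data[len(signature):]
    # No signature found: the caller needs some other way to pick an encoding.
    return None, data
```

    This is a heuristic, not a proof: UTF-16LE text whose first character
    after the signature is U+0000 looks identical to a UTF-32LE signature,
    and unsigned text tells you nothing at all.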

    >What with all the BOM difficulties, and the fact that U+FEFF doubles up
    >as ZERO WIDTH NO-BREAK SPACE, a new possibility occurred to me.
    >
    >Imagine if the codepoint U+D7FD were reserved as NOP, having properties
    >which essentially made it completely ignorable and invisible. It could
    >simply be thrown away, wherever it were encountered.

    This isn't a bad idea, but it's pretty much unnecessary. With Unicode
    3.2, the meaning of U+FEFF as ZWNBSP was deprecated and a new character,
    U+2060 WORD JOINER, was introduced to fulfill the ZWNBSP function. Over
    time, this means you'll see more and more applications that use U+2060
    to glue things together and treat U+FEFF as a no-op. These applications
    will have some backward-compatibility problems (older documents will
    have some "glued" sequences coming "unglued"), but this will die out.
    In fact, I think the more recent versions of Unicode make it legal to
    turn U+FEFF into U+2060 without documenting you're changing the text.
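
    For what it's worth, the migration is a one-liner in most languages.
    A rough Python sketch, where the helper name migrate_zwnbsp is
    hypothetical and dropping a single leading U+FEFF as a leftover
    signature is my assumption rather than anything the Standard
    requires:

```python
ZWNBSP = "\ufeff"       # ZERO WIDTH NO-BREAK SPACE (also the BOM)
WORD_JOINER = "\u2060"  # WORD JOINER, introduced in Unicode 3.2


def migrate_zwnbsp(text: str) -> str:
    """Replace interior U+FEFF with U+2060 in already-decoded text."""
    # Assumption: one leading U+FEFF is a leftover signature, not a joiner.
    if text.startswith(ZWNBSP):
        text = text[1:]
    # Every remaining U+FEFF is taken to be acting as a word joiner.
    return text.replace(ZWNBSP, WORD_JOINER)
```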

    --Rich Gillam
      Language Analysis Systems


