Re: Conformance (was UTF, BOM, etc)

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Jan 21 2005 - 09:33:35 CST

    On 21/01/2005 12:25, Arcane Jill wrote:

    > ... Okay, so you don't have to interpret ALL characters, and the BOM
    > is just a character, so you don't have to interpret it. ...
    >
    > [Jill's Important Question 1]:
    > So the first question I must ask is: Which of these two clauses takes
    > precedence, C8 or C12b?
    >
    > If C12b takes precedence, then when a process interprets a byte
    > sequence which purports to be in the Unicode Encoding Scheme UTF-8, it
    > shall interpret that byte sequence according to the specifications for
    > the use of the byte order mark established by the Unicode Standard for
    > the Unicode Encoding Scheme UTF-8.
    >
    > But if C8 takes precedence, then a process shall not assume that it is
    > required to interpret U+FEFF.
    >
    > They can't both be right.

    This issue may seem arcane, Jill :-) , but it is central to the recent
    dispute.

    As I see it, your mistake here is to assume that "the BOM is just a
    character". It is not; it is something quite different, an element of an
    encoding scheme. As such, according to C12b, its correct
    interpretation is mandatory.

    The byte sequence corresponding to a BOM, which depends on the encoding
    scheme, has two possible interpretations. One of these interpretations
    is the character U+FEFF ZERO WIDTH NO-BREAK SPACE. The other
    interpretation is as a BOM, which is not a character at all and does not
    form part of the string of characters which is encoded. As a BOM is not
    a character, C8 does not apply.
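
    To make the two interpretations concrete, here is a minimal sketch in
    Python. It assumes Python's standard codecs, where "utf-8" leaves a
    leading U+FEFF in the decoded string and "utf-8-sig" treats the same
    bytes as a signature and drops them. Those codec names are Python's
    own, not terms from the Unicode Standard.

```python
import codecs

# The UTF-8 byte sequence EF BB BF followed by the text "abc".
data = codecs.BOM_UTF8 + "abc".encode("utf-8")

# Interpretation 1: the leading bytes are the character U+FEFF
# (ZERO WIDTH NO-BREAK SPACE) and belong to the encoded string.
as_character = data.decode("utf-8")

# Interpretation 2: the leading bytes are a BOM/signature and are
# not part of the string of characters that is encoded.
as_signature = data.decode("utf-8-sig")

print(repr(as_character))  # four code points, the first is U+FEFF
print(repr(as_signature))  # three code points
```

    The same bytes yield a four-character string under the first reading
    and a three-character string under the second, which is why the
    choice of interpretation cannot be left open.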

    In Unicode 1.0, as quoted by Ken yesterday, the BOM was referred to as a
    "Unicode special character". But I note that the quotations from the
    conformance clauses of Unicode 4.0 carefully avoid calling the BOM a
    character. On the other hand, in section 15.9 of Unicode 4.0, although
    this section describes "code points that are interpreted as neither
    control nor graphic characters", the BOM is referred to as a "special
    interpretation" of "the character U+FEFF". It seems to me that this
    wording confuses the issue, especially because later in the same section
    "U+FEFF also has significance as a character" refers only to the
    interpretation as ZERO WIDTH NO-BREAK SPACE. For consistency, it would
    be better to refer only to "the code point U+FEFF", or to "the character
    U+FEFF" only when this code point is interpreted as ZERO WIDTH NO-BREAK
    SPACE. This requires some rather minor editing to section 15.9.

    >
    > [Jill's Important Question 2]:
    > And the second question I must ask is: if a file is labelled by some
    > higher level protocol (for example, Unix locale, HTTP header, etc) as
    > "UTF-8", should a conformant process interpret that as UTF-8, the
    > Unicode Encoding FORM (which prohibits a BOM) or as UTF-8, the Unicode
    > Encoding SCHEME (which allows one)?
    >
    Excellent question! And what if it is not labelled at all, but expected
    to be UTF-8?

    But meanwhile, a practical suggestion for Unix systems and users. Text
    files originating on other systems may include a number of conventions
    which are not native to Unix, such as CRLF for line breaks, and also
    BOMs. For these to be processed correctly by Unix systems, they need to
    be converted to use Unix conventions. Such a conversion would include
    stripping out BOMs, and also perhaps (at least if the locale is UTF-8)
    conversion from other UTFs to UTF-8. In the Windows world such a
    conversion might be implemented best by specifying a new mode for
    opening a file. But I guess that in the Unix world it would be best to
    use a filter here. It would be rather trivial, using ICU or similar, to
    write such a filter. This filter could be invoked by default when
    opening or saving Internet downloads, e-mail attachments, etc., perhaps
    depending on the MIME type. Users might need to decide for themselves
    whether to use this filter when reading files received from other
    systems on removable media.
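
    A rough sketch of such a filter follows. It uses Python's standard
    codecs module in place of ICU, purely for illustration. The function
    name to_unix_text and the default_encoding parameter are hypothetical,
    and treating BOM-less input as UTF-8 is an assumption, not something
    the standard prescribes.

```python
import codecs

# BOM byte sequences paired with the encoding they signal. The UTF-32
# entries come first because the UTF-32 LE BOM (FF FE 00 00) begins
# with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def to_unix_text(data: bytes, default_encoding: str = "utf-8") -> bytes:
    """Strip a leading BOM, convert CRLF to LF, and re-encode as UTF-8.

    Hypothetical helper: input without a BOM is assumed to be in
    default_encoding.
    """
    encoding = default_encoding
    for bom, name in _BOMS:
        if data.startswith(bom):
            data = data[len(bom):]  # drop the BOM bytes themselves
            encoding = name
            break
    text = data.decode(encoding)
    text = text.replace("\r\n", "\n")  # Unix line-break convention
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "line one\r\nline two\r\n".encode("utf-16-le")
    print(to_unix_text(sample))
```

    Used as a filter, it would read the raw bytes of a downloaded file or
    attachment and write the result back out. Only a BOM at the very
    start of the data is removed; a U+FEFF elsewhere in the text is left
    alone.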

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/


    This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 10:35:27 CST