Conformance (was UTF, BOM, etc)

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Fri Jan 21 2005 - 06:25:17 CST

    I don't claim to be an expert, and my interpretation may be wrong. But there
    are two questions at the bottom of this email on which my understanding of
    the issues is cloudy. I would appreciate it if someone who actually knows
    what they're talking about could answer them for me. Thanks. Anyway...

    I have found and studied the relevant conformance clauses which DEFINE Unicode
    Conformance, in particular with regard to the BOM. All quotes are from TUS Ch 3
    (Conformance). There are two clauses (plus a supporting definition) which seem
    to me to be relevant:

    OFFICIAL TEXT:
    =================================================
    C8: A process shall not assume that it is required to interpret any particular
    coded character representation.

    * Processes that interpret only a subset of Unicode characters are allowed;
    there is no blanket requirement to interpret all Unicode characters.
    =================================================

    JILL'S INTERPRETATION:

    Important bit, that - processes that interpret only a subset of Unicode
    characters /are allowed/. (According to my strictly mathematical upbringing,
    the empty set is a subset of the set of Unicode characters, so a process that
    interprets /no/ Unicode characters is conformant by this definition.) Okay, so
    you don't have to interpret ALL characters, and the BOM is just a character,
    so you don't have to interpret it. So it would appear from /this/ conformance
    clause that a conformant process is allowed to interpret some subset of
    Unicode characters which excludes the BOM. But wait - there's another clause
    which seems to contradict that...

    OFFICIAL TEXT:
    =================================================
    C12b: When a process interprets a byte sequence which purports to be in a
    Unicode character encoding scheme, it shall interpret that byte sequence
    according to the byte order and specifications for the use of the byte order
    mark established by this standard for that character encoding scheme.

    * Machine architectures differ in ordering in terms of whether the most
    significant byte or the least significant byte comes first. These sequences
    are known as "big-endian" and "little-endian" orders, respectively.

    * For example, when using UTF-16LE, pairs of bytes must be interpreted as
    UTF-16 code units using the little-endian byte order convention, and any
    initial <FF FE> sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE
    (part of the text), rather than as a byte order mark (not part of the text).
    (See D41.)
    =================================================

    and...

    OFFICIAL TEXT:
    =================================================
    D41: UTF-16LE encoding scheme: The Unicode encoding scheme that serializes a
    UTF-16 code unit sequence as a byte sequence in little-endian format.
    * In UTF-16LE, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is
    serialized as <4D 00 30 04 8C 4E 00 D8 02 DF>.
    * In UTF-16LE, an initial byte sequence <FF FE> is interpreted as U+FEFF
    ZERO WIDTH NO-BREAK SPACE.
    =================================================

    JILL'S INTERPRETATION:

    For the benefit of those who haven't studied Unicode jargon too closely, there
    is a difference between a "Unicode Encoding FORM" and a "Unicode Encoding
    SCHEME". A Unicode Encoding FORM deals in code units of various widths (8-bit
    for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32). A Unicode Encoding
    SCHEME, on the other hand, always deals in octets. Thus, UTF-16LE is a Unicode
    Encoding SCHEME because it defines a stream of octets, not a stream of 16-bit
    words.
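
    To make the distinction concrete, here is a minimal sketch (my own
    illustration, in Python - not part of the standard or the quoted text) which
    serializes the exact code unit sequence from D41 above:

        # The UTF-16 encoding FORM yields 16-bit code units; the UTF-16LE
        # encoding SCHEME serializes each code unit as two octets, least
        # significant byte first.
        text = "M\u0430\u4e8c\U00010302"   # code units <004D 0430 4E8C D800 DF02>
        octets = text.encode("utf-16-le")  # the SCHEME: a stream of octets
        print(octets.hex(" "))             # -> 4d 00 30 04 8c 4e 00 d8 02 df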

    So, by definition D41, a BOM is /prohibited/ in UTF-16LE: an initial <FF FE>
    is always text, never a byte order mark. It is similarly prohibited in
    UTF-16BE, UTF-32LE, and UTF-32BE.
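
    A hypothetical Python sketch of the practical consequence (the codec names
    are Python's, not the standard's; "utf-16" here plays the role of the
    unmarked UTF-16 scheme):

        # The UTF-16LE scheme never writes a BOM; the unmarked UTF-16 scheme
        # prepends one (byte order, and hence the exact bytes, may vary by
        # platform in this codec).
        print("A".encode("utf-16-le").hex(" "))  # 41 00        (no BOM)
        print("A".encode("utf-16").hex(" "))     # ff fe 41 00  (BOM + 'A'
                                                 #  on a little-endian host)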

    However, (and this is where it gets complicated):

    In the UTF-8, UTF-16, and UTF-32 Unicode encoding FORMs, an initial U+FEFF
    represents ZERO WIDTH NO-BREAK SPACE.
    In the UTF-8, UTF-16, and UTF-32 Unicode encoding SCHEMEs, an initial U+FEFF is
    not considered part of the text.
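
    That contrast can be seen side by side in code. Another hedged sketch, again
    using Python codec names as stand-ins: "utf-16-le" decodes every code unit,
    including an initial U+FEFF, as text (the FORM's behaviour), while "utf-16"
    treats an initial BOM as not part of the text (the unmarked SCHEME's
    behaviour):

        import codecs

        data = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")  # ff fe 41 00

        # Unmarked SCHEME: the initial FF FE is a byte order mark, not text.
        print(repr(data.decode("utf-16")))     # 'A'
        # UTF-16LE: the initial FF FE is U+FEFF ZWNBSP, part of the text.
        print(repr(data.decode("utf-16-le")))  # '\ufeffA'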

    [Jill's Important Question 1]:
    So the first question I must ask is: Which of these two clauses takes
    precedence, C8 or C12b?

    If C12b takes precedence, then when a process interprets a byte sequence which
    purports to be in the Unicode Encoding Scheme UTF-8, it shall interpret that
    byte sequence according to the specifications for the use of the byte order
    mark established by the Unicode Standard for the Unicode Encoding Scheme UTF-8.

    But if C8 takes precedence, then a process shall not assume that it is required
    to interpret U+FEFF.

    They can't both be right.

    [Jill's Important Question 2]:
    And the second question I must ask is: if a file is labelled by some
    higher-level protocol (for example, Unix locale, HTTP header, etc.) as
    "UTF-8", should a conformant process interpret that as UTF-8, the Unicode
    Encoding FORM (which prohibits a BOM), or as UTF-8, the Unicode Encoding
    SCHEME (which allows one)?
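
    Real-world decoders do have to pick an answer. One last hypothetical Python
    sketch (the codec names "utf-8" and "utf-8-sig" are Python's, not the
    standard's, and only approximate the FORM/SCHEME split):

        labelled_utf8 = b"\xef\xbb\xbfHello"  # a "UTF-8" file with a leading BOM

        # Read it as the encoding FORM: the three BOM bytes decode to a real
        # U+FEFF at the start of the text.
        print(repr(labelled_utf8.decode("utf-8")))      # '\ufeffHello'

        # Read it as the encoding SCHEME: "utf-8-sig" strips an initial BOM,
        # treating it as not part of the text.
        print(repr(labelled_utf8.decode("utf-8-sig")))  # 'Hello'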

    Jill


