RE: Actually, this wasn't rhetorical

From: Peter Constable (petercon@microsoft.com)
Date: Mon Jan 24 2005 - 14:34:48 CST


    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    > On Behalf Of Arcane Jill

    > OFFICIAL TEXT:
    > =================================================
    > C8: A process shall not assume that it is required to interpret any
    > particular coded character representation.
    >
    > * Processes that interpret only a subset of Unicode characters are
    > allowed; there is no blanket requirement to interpret all Unicode
    > characters.
    > =================================================
    >
    > JILL'S INTERPRETATION:
    >
    > Important bit that - processes that interpret only a subset of Unicode
    > characters /are allowed/. (According to my strictly mathematical
    > upbringing, the empty set is a subset of Unicode characters, so a
    > process that interprets /no/ Unicode characters is conformant by this
    > definition).

    A process that interprets no Unicode characters doesn't thereby break
    any conformance requirement. The question is, does it support Unicode in
    any meaningful way? Perhaps. E.g. a data comm transport process that
    passes all UTF-8 code units but doesn't interpret any sequences of them
    can reasonably be said to support and be conformant to Unicode (as
    opposed to another process that can only handle octets with values 0 -
    127).
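
    (As a throwaway illustration of that distinction -- a minimal sketch in
    Python, mine and not anything from the Standard -- an 8-bit-clean relay
    passes every UTF-8 code unit through without interpreting any of them,
    whereas a channel limited to octets 0 - 127 cannot even carry them:)

        # Hypothetical sketch: two "transport" processes handling the same UTF-8 octets.
        data = "café".encode("utf-8")             # contains code units above 0x7F

        def eight_bit_clean_relay(octets: bytes) -> bytes:
            # Passes every octet through untouched; interprets no character sequences.
            return bytes(octets)

        def seven_bit_channel(octets: bytes) -> bytes:
            # Handles only octets 0 - 127; the rest are dropped, corrupting the text.
            return bytes(b for b in octets if b < 0x80)

        assert eight_bit_clean_relay(data).decode("utf-8") == "café"
        print(seven_bit_channel(data).decode("utf-8", errors="replace"))   # "caf" - data lost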

    > Okay, so you don't have to interpret ALL characters, and the BOM is
    > just a character, so you don't have to interpret it.

    Is the minor premise of this syllogism correct? The conformance
    criterion refers to a "coded character representation", which is an
    entity at the Coded Character Set level of the encoding model. The BOM
    is an element of a character encoding scheme -- a different level of the
    encoding model. So, I'm not sure that your syllogism is logically valid.

    > So it would appear from /this/ conformance clause that a conformant
    > process is allowed to interpret some subset of Unicode characters
    > which excludes the BOM. But wait - there's another clause which seems
    > to contradict that...

    [quotes C12b and D41]

    > So, by definition D41, a BOM is /prohibited/ in UTF-16LE. It is
    > similarly prohibited in UTF-16BE, UTF-32LE and UTF-32BE.

    True. That doesn't mean a UTF-16LE byte sequence cannot begin with the
    octet sequence 0xFF 0xFE; it just means that, if it does, then that
    octet sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE.
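
    (A quick illustration, using Python's codecs as a stand-in for a
    conformant decoder -- my example, not the Standard's: in the
    explicit-endian scheme the leading FF FE is ordinary character content,
    while the plain "UTF-16" scheme consumes it as a byte-order signature.)

        octets = b"\xff\xfe\x41\x00"             # FF FE followed by "A" in little-endian order

        print(repr(octets.decode("utf-16-le")))  # '\ufeffA' - FF FE is U+FEFF, part of the text
        print(repr(octets.decode("utf-16")))     # 'A'       - FF FE consumed as a BOM, not part of the text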

     
    > However, (and this is where it gets complicated):
    >
    > In the UTF-8, UTF-16, and UTF-32 Unicode encoding FORMs, an initial
    > U+FEFF represents ZERO WIDTH NO-BREAK SPACE.
    > In the UTF-8, UTF-16, and UTF-32 Unicode encoding SCHEMEs, an initial
    > U+FEFF is not considered part of the text.

    Careful; you're not distinguishing between the coded character U+FEFF
    and an octet sequence such as 0xFF 0xFE, or 0xEF 0xBB 0xBF. One is an
    entity of a CCS; the others are sequences of entities at the level of
    CES. The Standard indicates that an initial octet sequence 0xEF 0xBB
    0xBF *may* be interpreted at the CES level as a UTF-8 BOM. The Standard
    neither requires nor recommends this, however; a conformant process may
    rather interpret that initial sequence as the coded character
    representation U+FEFF, which in turn it may or may not interpret as
    ZWNBSP.
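
    (To make the two permitted readings concrete -- again a sketch using
    Python's codecs, with no claim that either behavior is required: the
    plain "utf-8" codec yields the coded character U+FEFF, while the
    "utf-8-sig" codec treats the same three octets as a signature and drops
    them. Both are allowed.)

        octets = b"\xef\xbb\xbfHello"

        print(repr(octets.decode("utf-8")))      # '\ufeffHello' - initial octets read as U+FEFF
        print(repr(octets.decode("utf-8-sig")))  # 'Hello'       - initial octets read as a BOM and removed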

    > [Jill's Important Question 1]:
    > So the first question I must ask is: Which of these two clauses takes
    > precedence, C8 or C12b?

    Precedence is not a question here. Both apply.

     
    > If C12b takes precedence, then when a process interprets a byte
    > sequence which purports to be in the Unicode Encoding Scheme UTF-8, it
    > shall interpret that byte sequence according to the specifications for
    > the use of the byte order mark established by the Unicode Standard for
    > the Unicode Encoding Scheme UTF-8.
    >
    > But if C8 takes precedence, then a process shall not assume that it is
    > required to interpret U+FEFF.
    >
    > They can't both be right.

    Your logical error occurred above, when you failed to distinguish
    between a coded character representation at the CCS level and an octet
    sequence at the CES level. Both C12b and C8 apply. If a byte sequence
    purports to be in the UTF-8 encoding scheme and begins with the octet
    sequence 0xEF 0xBB 0xBF, then a conformant process *may* interpret that
    as the BOM, or as the coded character representation U+FEFF. If it does
    the latter, it may or may not interpret that coded character
    representation as ZWNBSP. (But it must not interpret it as any other
    abstract character.)

    > [Jill's Important Question 2]:
    > And the second question I must ask is: if a file is labelled by some
    > higher-level protocol (for example, Unix locale, HTTP header, etc) as
    > "UTF-8", should a conformant process interpret that as UTF-8, the
    > Unicode Encoding FORM (which prohibits a BOM) or as UTF-8, the Unicode
    > Encoding SCHEME (which allows one)?

    This may be up to the higher-level protocol, but in general I would say
    that a process must begin by interpreting such a file as being in a
    Unicode Encoding Scheme. Even so, it is not required to interpret an
    initial octet sequence 0xEF 0xBB 0xBF as the CES-level construct known
    as the BOM; it may do so, but need not.

    Peter Constable



    This archive was generated by hypermail 2.1.5 : Mon Jan 24 2005 - 14:36:16 CST