From: Peter Constable (petercon@microsoft.com)
Date: Mon Jan 24 2005 - 14:34:48 CST
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
> On Behalf Of Arcane Jill
> OFFICIAL TEXT:
> =================================================
> C8: A process shall not assume that it is required to interpret any
> particular coded character representation.
>
> * Processes that interpret only a subset of Unicode characters are
> allowed; there is no blanket requirement to interpret all Unicode
> characters.
> =================================================
>
> JILL'S INTERPRETATION:
>
> Important bit that - processes that interpret only a subset of Unicode
> characters /are allowed/. (According to my strictly mathematical
> upbringing, the empty set is a subset of Unicode characters, so a
> process that interprets /no/ Unicode characters is conformant by this
> definition).
A process that interprets no Unicode characters doesn't thereby break
any conformance requirement. The question is, does it support Unicode in
any meaningful way? Perhaps. E.g. a data comm transport process that
passes all UTF-8 code units but doesn't interpret any sequences of them
can reasonably be said to support and be conformant to Unicode (as
opposed to another process that can only handle octets with values 0 -
127).
> Okay, so you don't have to interpret ALL characters, and the BOM is
> just a character, so you don't have to interpret it.
Is the minor premise of this syllogism correct? The conformance
criterion refers to a "coded character representation", which is an
entity at the Coded Character Set level of the encoding model. The BOM
is an element of a coded character scheme -- a different level of the
encoding model. So, I'm not sure that your syllogism is logically valid.
> So it would appear from /this/ conformance clause that a conformant
> process is allowed to interpret some subset of Unicode characters
> which excludes the BOM. But wait - there's another clause which seems
> to contradict that...
[quotes C12b and D41]
> So, by definition D41, a BOM is /prohibited/ in UTF-16LE. It is
> similarly prohibited in UTF-16BE, UTF-32LE and UTF-32BE.
True. That doesn't mean UTF-16LE text cannot begin with the octet
sequence 0xFF 0xFE; it just means that, if it does, that octet sequence
is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE.
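The distinction shows up directly in decoders. A minimal Python sketch
(the byte string is an invented example) contrasting UTF-16LE, which has
no BOM, with the UTF-16 encoding scheme, which does:

```python
# Octets purporting to be UTF-16LE, beginning with 0xFF 0xFE.
data = b"\xff\xfeA\x00B\x00"

# In UTF-16LE there is no BOM: the initial 0xFF 0xFE simply encodes
# U+FEFF, which the text layer may treat as ZERO WIDTH NO-BREAK SPACE.
print(repr(data.decode("utf-16-le")))  # '\ufeffAB'

# Contrast with the UTF-16 encoding *scheme*, where the same octets
# are consumed as a byte order mark and are not part of the text.
print(repr(data.decode("utf-16")))  # 'AB'
```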
> However, (and this is where it gets complicated):
>
> In the UTF-8, UTF-16, and UTF-32 Unicode encoding FORMs, an initial
> U+FEFF represents ZERO WIDTH NO-BREAK SPACE.
> In the UTF-8, UTF-16, and UTF-32 Unicode encoding SCHEMEs, an initial
> U+FEFF is not considered part of the text.
Careful; you're not distinguishing between the coded character U+FEFF
and an octet sequence such as 0xFF 0xFE, or 0xEF 0xBB 0xBF. One is an
entity of a CCS; the others are sequences of entities at the level of
CES. The Standard indicates that an initial octet sequence 0xEF 0xBB
0xBF *may* be interpreted at the CES level as a UTF-8 BOM. The Standard
neither requires nor recommends this, however; a conformant process may
instead interpret that initial sequence as the coded character
representation U+FEFF, which in turn it may or may not interpret as
ZWNBSP.
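In Python's codec terms (a convenient illustration, not part of the
Standard), the two permissible readings of an initial 0xEF 0xBB 0xBF
correspond to the 'utf-8-sig' and plain 'utf-8' codecs:

```python
data = b"\xef\xbb\xbfhello"

# Reading the initial octets as a CES-level signature (BOM): stripped.
print(repr(data.decode("utf-8-sig")))  # 'hello'

# Reading them as the coded character representation U+FEFF: retained,
# and the text layer may (or may not) interpret it as ZWNBSP.
print(repr(data.decode("utf-8")))  # '\ufeffhello'
```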
> [Jill's Important Question 1]:
> So the first question I must ask is: Which of these two clauses takes
> precedence, C8 or C12b?
Precedence is not a question here. Both apply.
> If C12b takes precedence, then when a process interprets a byte
> sequence which purports to be in the Unicode Encoding Scheme UTF-8, it
> shall interpret that byte sequence according to the specifications for
> the use of the byte order mark established by the Unicode Standard for
> the Unicode Encoding Scheme UTF-8.
>
> But if C8 takes precedence, then a process shall not assume that it
> is required to interpret U+FEFF.
>
> They can't both be right.
Your logical error occurred above, when you failed to distinguish
between a coded character representation at the CCS level and an octet
sequence at the CES level. Both C12b and C8 apply. If a byte sequence
purports to be in the UTF-8 encoding scheme and begins with the octet
sequence 0xEF 0xBB 0xBF, then a conformant process *may* interpret that
as the BOM, or as the coded character representation U+FEFF. If it does
the latter, it may or may not interpret that coded character
representation as ZWNBSP. (But it must not interpret it as any other
abstract character.)
> [Jill's Important Question 2]:
> And the second question I must ask is: if a file is labelled by some
> higher level protocol (for example, Unix locale, HTTP header, etc.) as
> "UTF-8", should a conformant process interpret that as UTF-8, the
> Unicode Encoding FORM (which prohibits a BOM) or as UTF-8, the Unicode
> Encoding SCHEME (which allows one)?
This may be up to the higher-level protocol, but in general I would say
that it must begin by interpreting it as a Unicode Encoding Scheme. But
it is not required to interpret the initial octet sequence 0xEF 0xBB
0xBF as the CES-level construct known as BOM; it may, but is not
required to do so.
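A process following that advice might dispatch on the charset label,
decode at the scheme level, and make the signature treatment an explicit
policy choice. A hedged sketch (the function name and parameters are
invented for illustration):

```python
def decode_labeled(data: bytes, charset: str, strip_signature: bool = True) -> str:
    """Decode octets labelled by a higher-level protocol (e.g. an HTTP
    Content-Type charset). "UTF-8" names the encoding scheme, so decode
    at the scheme level; whether an initial U+FEFF is taken as a BOM is
    a policy choice, not a conformance requirement."""
    if charset.lower() != "utf-8":
        raise ValueError("sketch handles only the UTF-8 label")
    text = data.decode("utf-8")
    if strip_signature and text.startswith("\ufeff"):
        text = text[1:]  # treat the initial U+FEFF as a signature
    return text

print(decode_labeled(b"\xef\xbb\xbfabc", "UTF-8"))               # abc
print(repr(decode_labeled(b"\xef\xbb\xbfabc", "UTF-8", False)))  # '\ufeffabc'
```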
Peter Constable
This archive was generated by hypermail 2.1.5 : Mon Jan 24 2005 - 14:36:16 CST