From: Arcane Jill (firstname.lastname@example.org)
Date: Fri Jan 21 2005 - 06:25:17 CST
I don't claim to be an expert, and my interpretation may be wrong. But there
are two questions at the bottom of this email which cloud my understanding of
the issues. I would appreciate it if someone who actually knows what they're
talking about could answer these for me. Thanks. Anyway...
I have found and studied the relevant conformance clauses which DEFINE Unicode
Conformance, in particular with regard to the BOM. All quotes are from TUS Ch 3
(Conformance). There are two sections which seem to me to be relevant:
C8: A process shall not assume that it is required to interpret any particular
coded character representation.
* Processes that interpret only a subset of Unicode characters are allowed;
there is no blanket requirement to interpret all Unicode characters.
Important bit that - processes that interpret only a subset of Unicode
characters /are allowed/. (According to my strictly mathematical upbringing,
the empty set is a subset of Unicode characters, so a process that interprets
/no/ Unicode characters is conformant by this definition). Okay, so you don't
have to interpret ALL characters, and the BOM is just a character, so you don't
have to interpret it. So it would appear from /this/ conformance clause that a
conformant process is allowed to interpret some subset of Unicode characters
which excludes the BOM. But wait - there's another clause which seems to
contradict this:
C12b: When a process interprets a byte sequence which purports to be in a
Unicode character encoding scheme, it shall interpret that byte sequence
according to the byte order and specifications for the use of the byte order
mark established by this standard for that character encoding scheme.
* Machine architectures differ in ordering in terms of whether the most
significant byte or the least significant byte comes first. These sequences are
known as "big-endian" and "little-endian" orders, respectively.
* For example, when using UTF-16LE, pairs of bytes must be interpreted as
UTF-16 code units using the little-endian byte order convention, and any
initial <FF FE> sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE
(part of the text), rather than as a byte order mark (not part of the text).
(See D41.)
D41: UTF-16LE encoding scheme: The Unicode encoding scheme that serializes a
UTF-16 code unit sequence as a byte sequence in little-endian format.
* In UTF-16LE, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is
serialized as <4D 00 30 04 8C 4E 00 D8 02 DF>.
* In UTF-16LE, an initial byte sequence <FF FE> is interpreted as U+FEFF ZERO
WIDTH NO-BREAK SPACE.
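(Not part of the original post, but D41's serialization example is easy to
check with Python's codecs. The code unit sequence <004D 0430 4E8C D800 DF02>
is "M", CYRILLIC SMALL LETTER A, a CJK ideograph, and the surrogate pair for
U+10302.)

```python
# Reproduce D41's example: serialize the UTF-16 code unit sequence
# <004D 0430 4E8C D800 DF02> under the UTF-16LE encoding scheme.
text = "\u004D\u0430\u4E8C\U00010302"
data = text.encode("utf-16-le")
print(data.hex(" ").upper())  # 4D 00 30 04 8C 4E 00 D8 02 DF
```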
For the benefit of those that haven't studied Unicode jargon too closely, there
is a difference between a "Unicode Encoding FORM" and a "Unicode Encoding
SCHEME". A Unicode Encoding FORM deals in code units of various widths (8-bit
for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32). A Unicode Encoding
SCHEME, on the other hand, always deals in octets. Thus, UTF-16LE is a Unicode
Encoding SCHEME because it defines a stream of octets, not a stream of 16-bit
code units.
So, by definition D41, a BOM is /prohibited/ in UTF-16LE. It is similarly
prohibited in UTF-16BE, UTF-32LE and UTF-32BE.
However, (and this is where it gets complicated):
In the UTF-8, UTF-16, and UTF-32 Unicode encoding FORMs, an initial U+FEFF
represents ZERO WIDTH NO-BREAK SPACE.
In the UTF-8, UTF-16, and UTF-32 Unicode encoding SCHEMEs, an initial U+FEFF is
not considered part of the text.
[Jill's Important Question 1]:
So the first question I must ask is: Which of these two clauses takes
precedence, C8 or C12b?
If C12b takes precedence, then when a process interprets a byte sequence which
purports to be in the Unicode Encoding Scheme UTF-8, it shall interpret that
byte sequence according to the specifications for the use of the byte order
mark established by the Unicode Standard for the Unicode Encoding Scheme UTF-8.
But if C8 takes precedence, then a process shall not assume that it is required
to interpret U+FEFF.
They can't both be right.
[Jill's Important Question 2]:
And the second question I must ask is: if a file is labelled by some higher
level protocol (for example, Unix locale, HTTP header, etc) as "UTF-8", should
a conformant process interpret that as UTF-8, the Unicode Encoding FORM (which
prohibits a BOM) or as UTF-8, the Unicode Encoding SCHEME (which allows one)?
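(For what it's worth - again a sketch, not from the original post - Python
exposes exactly this ambiguity as two separate codecs: plain "utf-8" keeps an
initial U+FEFF as text, while "utf-8-sig" treats the same three bytes as a
signature and strips them.)

```python
# The same bytes, decoded two ways: is the leading EF BB BF a ZWNBSP
# that is part of the text, or a signature that is not?
data = b"\xef\xbb\xbfhello"            # UTF-8-encoded U+FEFF, then "hello"

print(repr(data.decode("utf-8")))      # '\ufeffhello' - U+FEFF kept as text
print(repr(data.decode("utf-8-sig")))  # 'hello'       - BOM stripped
```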
This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 06:30:30 CST