From: Arcane Jill (firstname.lastname@example.org)
Date: Fri Jan 21 2005 - 06:25:17 CST
I don't claim to be an expert, and my interpretation may be wrong. But there
are two questions at the bottom of this email which cloud my understanding of
the issues. I would appreciate it if someone who actually knows what they're
talking about could answer these for me. Thanks. Anyway...
I have found and studied the relevant conformance clauses which DEFINE Unicode
Conformance, in particular with regard to the BOM. All quotes are from TUS Ch 3
(Conformance). There are two sections which seem to me to be relevant:
C8: A process shall not assume that it is required to interpret any particular
coded character representation.
* Processes that interpret only a subset of Unicode characters are allowed;
there is no blanket requirement to interpret all Unicode characters.
Important bit that - processes that interpret only a subset of Unicode
characters /are allowed/. (According to my strictly mathematical upbringing,
the empty set is a subset of Unicode characters, so a process that interprets
/no/ Unicode characters is conformant by this definition). Okay, so you don't
have to interpret ALL characters, and the BOM is just a character, so you don't
have to interpret it. So it would appear from /this/ conformance clause that a
conformant process is allowed to interpret some subset of Unicode characters
which excludes the BOM. But wait - there's another clause which seems to
contradict this:
C12b: When a process interprets a byte sequence which purports to be in a
Unicode character encoding scheme, it shall interpret that byte sequence
according to the byte order and specifications for the use of the byte order
mark established by this standard for that character encoding scheme.
* Machine architectures differ in ordering in terms of whether the most
significant byte or the least significant byte comes first. These sequences are
known as "big-endian" and "little-endian" orders, respectively.
* For example, when using UTF-16LE, pairs of bytes must be interpreted as
UTF-16 code units using the little-endian byte order convention, and any
initial <FF FE> sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE
(part of the text), rather than as a byte order mark (not part of the text).
(See D41.)
D41: UTF-16LE encoding scheme: The Unicode encoding scheme that serializes a
UTF-16 code unit sequence as a byte sequence in little-endian format.
* In UTF-16LE, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is
serialized as <4D 00 30 04 8C 4E 00 D8 02 DF>.
* In UTF-16LE, an initial byte sequence <FF FE> is interpreted as U+FEFF ZERO
WIDTH NO-BREAK SPACE.
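(Not part of the original post, but D41's serialization example is easy to
check with Python's codecs. The code unit sequence <004D 0430 4E8C D800 DF02>
is "M", CYRILLIC SMALL LETTER A, a CJK ideograph, and the surrogate pair for
U+10302.)

```python
# Reproduce D41's example: serialize the UTF-16 code unit sequence
# <004D 0430 4E8C D800 DF02> under the UTF-16LE encoding scheme.
text = "\u004D\u0430\u4E8C\U00010302"
data = text.encode("utf-16-le")
print(data.hex(" ").upper())  # 4D 00 30 04 8C 4E 00 D8 02 DF
```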
For the benefit of those that haven't studied Unicode jargon too closely, there
is a difference between a "Unicode Encoding FORM" and a "Unicode Encoding
SCHEME". A Unicode Encoding FORM deals in code units of various widths (8-bit
for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32). A Unicode Encoding
SCHEME, on the other hand, always deals in octets. Thus, UTF-16LE is a Unicode
Encoding SCHEME because it defines a stream of octets, not a stream of 16-bit
code units.
So, by definition D41, a BOM is /prohibited/ in UTF-16LE. It is similarly
prohibited in UTF-16BE, UTF-32LE and UTF-32BE.
However, (and this is where it gets complicated):
In the UTF-8, UTF-16, and UTF-32 Unicode encoding FORMs, an initial U+FEFF
represents ZERO WIDTH NO-BREAK SPACE.
In the UTF-8, UTF-16, and UTF-32 Unicode encoding SCHEMEs, an initial U+FEFF is
not considered part of the text.
[Jill's Important Question 1]:
So the first question I must ask is: Which of these two clauses takes
precedence, C8 or C12b?
If C12b takes precedence, then when a process interprets a byte sequence which
purports to be in the Unicode Encoding Scheme UTF-8, it shall interpret that
byte sequence according to the specifications for the use of the byte order
mark established by the Unicode Standard for the Unicode Encoding Scheme UTF-8.
But if C8 takes precedence, then a process shall not assume that it is required
to interpret U+FEFF.
They can't both be right.
[Jill's Important Question 2]:
And the second question I must ask is: if a file is labelled by some higher
level protocol (for example, Unix locale, HTTP header, etc) as "UTF-8", should
a conformant process interpret that as UTF-8, the Unicode Encoding FORM (which
prohibits a BOM) or as UTF-8, the Unicode Encoding SCHEME (which allows one)?
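(For what it's worth - again a sketch, not from the original post - Python
exposes exactly this ambiguity as two separate codecs: plain "utf-8" keeps an
initial U+FEFF as text, while "utf-8-sig" treats the same three bytes as a
signature and strips them.)

```python
# The same bytes, decoded two ways: is the leading EF BB BF a ZWNBSP
# that is part of the text, or a signature that is not?
data = b"\xef\xbb\xbfhello"            # UTF-8-encoded U+FEFF, then "hello"

print(repr(data.decode("utf-8")))      # '\ufeffhello' - U+FEFF kept as text
print(repr(data.decode("utf-8-sig")))  # 'hello'       - BOM stripped
```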
This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 06:30:30 CST