From: Hans Aberg (firstname.lastname@example.org)
Date: Tue Jan 25 2005 - 13:36:49 CST
At 08:50 +0000 2005/01/24, Arcane Jill wrote:
>C8: A process shall not assume that it is required to interpret any particular
>* Processes that interpret only a subset of Unicode characters are allowed;
>there is no
>blanket requirement to interpret all Unicode characters.
>Important bit that - processes that interpret only a subset of Unicode
>characters /are allowed/. (According to my strictly mathematical upbringing,
>the empty set is a subset of Unicode characters, so a process that interprets
>/no/ Unicode characters is conformant by this definition). Okay, so you don't
>have to interpret ALL characters, and the BOM is just a character, so you don't
>have to interpret it. So it would appear from /this/ conformance clause that a
>conformant process is allowed to interpret some subset of Unicode characters
>which excludes the BOM. But wait - there's another clause which seems to
In my opinion, the Unicode standard mixes things into the bag that should
not properly be there, which adds to the confusion. Regardless of what the
Unicode standard actually says, it should (in my opinion) define:
1. Characters and character numbering, plus well-formed character strings.
2. Encoding schemes.
Here, 1 just provides the characters, and deals with defining well-formed
character sequences. And 2 just deals with the question of well-formed
encoded sequences. An encoded string is well formed if it has a correct
encoded form and translates into a well-formed Unicode string.
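As a rough illustration of this two-step notion of well-formedness, the
following Python sketch treats a byte string as well formed under an encoding
scheme if a strict decoder accepts it. The helper name `is_well_formed` is made
up for this example, and using Python's built-in codecs as the arbiter is an
assumption, not something the standard prescribes.

```python
def is_well_formed(data: bytes, scheme: str = "utf-16-le") -> bool:
    """Return True if `data` decodes cleanly under `scheme`.

    Hypothetical helper: Python's strict decoder stands in for the
    well-formedness rules of the encoding scheme.
    """
    try:
        data.decode(scheme, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# U+004D U+0430 U+4E8C U+10302 in UTF-16LE (the byte sequence from D41).
good = bytes.fromhex("4D0030048C4E00D802DF")
# A lone high surrogate (D800) followed by 'M': correct width, bad sequence.
bad = bytes.fromhex("00D84D00")
# An odd number of bytes cannot be a sequence of 16-bit code units.
truncated = bytes.fromhex("4D0030")

print(is_well_formed(good), is_well_formed(bad), is_well_formed(truncated))
```

Note that the check says nothing about processes or files; it only asks
whether the bytes map back to a well-formed character sequence.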
There is in effect a sequence of different protocols describing 1 and 2.
Processes and files should not be mixed into this bag, but regulated by
other protocols. Unicode should merely remark that such questions are
outside the protocols of 1 and 2 above, and should neither encourage nor
discourage any particular practices in this category.
This is perhaps what the Unicode standard tries to say above, in a clumsy
>C12b: When a process interprets a byte sequence which purports to be in a
>encoding scheme, it shall interpret that byte sequence according to the byte
>specifications for the use of the byte order mark established by this standard
>character encoding scheme.
>* Machine architectures differ in ordering in terms of whether the most significant
>byte or the least significant byte comes first. These sequences are known as "big-endian"
>and "little-endian" orders, respectively.
>* For example, when using UTF-16LE, pairs of bytes must be interpreted as UTF-16
>code units using the little-endian byte order convention, and any initial <FF FE>
>sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE (part of the text),
>rather than as a byte order mark (not part of the text). (See D41.)
>D41: UTF-16LE encoding scheme: The Unicode encoding scheme that serializes a
>code unit sequence as a byte sequence in little-endian format.
>* In UTF-16LE, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is
>serialized as <4D 00 30 04 8C 4E 00 D8 02 DF>.
>* In UTF-16LE, an initial byte sequence <FF FE> is interpreted as U+FEFF ZERO
>WIDTH NO-BREAK SPACE.
>For the benefit of those that haven't studied Unicode jargon too closely, there
>is a difference between a "Unicode Encoding FORM" and a "Unicode Encoding
>SCHEME". A Unicode Encoding FORM deals in code units of various widths (8-bit
>for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32). A Unicode Encoding
>SCHEME, on the other hand, always deals in octets. Thus, UTF-16LE is a Unicode
>Encoding SCHEME because it defines a stream of octets, not a stream of
The computer has an internal big- or little-endian convention for how 16-bit
words are laid out as bytes and bits. It is as if you were only
using a 16-bit wchar_t in C. This is the character encoding form. As a user,
one will never get to know whether it is big- or little-endian unless one
starts to manipulate bits or serializes it into bytes, as when writing to
a file, or converting a C wchar_t string into a char string by coercing the
pointer to it. If one does the latter, one gets the character encoding scheme.
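The distinction can be sketched in Python rather than C (a choice made here
for brevity): an `array` of 16-bit integers plays the role of the wchar_t
string, and `struct.pack` with an explicit byte order plays the role of
serialization. The helper name `serialize` is hypothetical, and the sketch
assumes the platform's unsigned short is 16 bits wide.

```python
import struct
import sys
from array import array

# Encoding form: UTF-16 code units as 16-bit integers. The values are the
# same on every machine (this is the code unit sequence used in D41).
# Assumes array type "H" (unsigned short) is 2 bytes on this platform.
code_units = array("H", [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02])


def serialize(units, byte_order):
    """Hypothetical helper: turn 16-bit code units into bytes.

    byte_order is "<" for little-endian or ">" for big-endian.
    """
    return struct.pack(f"{byte_order}{len(units)}H", *units)


le = serialize(code_units, "<")  # bytes as in the UTF-16LE encoding scheme
be = serialize(code_units, ">")  # bytes as in the UTF-16BE encoding scheme
print(le.hex())
print(be.hex())

# Dumping the raw memory is the analogue of coercing the wchar_t pointer to
# a char pointer: the result depends on the machine's native byte order.
native = code_units.tobytes()
print(sys.byteorder, native == (le if sys.byteorder == "little" else be))
```

The integers in `code_units` never change; only the act of turning them into
bytes forces a choice of byte order.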
Again, the terminology is not very intuitive. The first might perhaps be
called "character encoded (n-bit) words", and the second "character encoded
>So, by definition D41, a BOM is /prohibited/ in UTF-16LE. It is similarly
>prohibited in UTF-16BE, UTF-32LE and UTF-32BE.
No. It seems that the BOM is just the Unicode code point 0xFEFF, reserved
for special private use other than the one indicated in the Unicode standard,
and it takes whatever form the applied encoding gives it. So if you
apply UTF-16LE and then serialize it into bytes, it becomes the
sequence (0xFF, 0xFE).
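A quick check of this point, using Python's built-in codecs as a stand-in
(their behaviour is an implementation choice, not the standard itself):
U+FEFF is encoded like any other code point, and only the codec without an
explicit byte order treats a leading FF FE as a signature.

```python
U_FEFF = "\ufeff"  # the code point in question

# Under each encoding, U+FEFF simply takes the form that encoding gives it.
for scheme in ("utf-16-le", "utf-16-be", "utf-8"):
    print(scheme, U_FEFF.encode(scheme).hex())

data = bytes.fromhex("FFFE4D00")  # FF FE followed by 'M', little-endian

# With an explicit byte order, the initial FF FE stays in the text as U+FEFF.
as_le = data.decode("utf-16-le")
# With the unmarked "utf-16" codec, FF FE is consumed as a byte order mark.
as_utf16 = data.decode("utf-16")

print([hex(ord(c)) for c in as_le])
print([hex(ord(c)) for c in as_utf16])
```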
There is, it seems, nothing special about the BOM, except that it has been
used by some to indicate that a file's contents are Unicode encoded. It is
strange for it to be mentioned in the standard in a way that gives the
impression that it has some kind of special status relative to 1 and 2 above.