Re: Actually, this wasn't rhetorical

From: Hans Aberg (
Date: Tue Jan 25 2005 - 13:36:49 CST

    At 08:50 +0000 2005/01/24, Arcane Jill wrote:
    >C8: A process shall not assume that it is required to interpret any particular
    >coded character
    >* Processes that interpret only a subset of Unicode characters are allowed;
    >there is no
    >blanket requirement to interpret all Unicode characters.
    >Important bit that - processes that interpret only a subset of Unicode
    >characters /are allowed/. (According to my strictly mathematical upbringing,
    >the empty set is a subset of Unicode characters, so a process that interprets
    >/no/ Unicode characters is conformant by this definition). Okay, so you don't
    >have to interpret ALL characters, and the BOM is just a character, so you don't
    >have to interpret it. So it would appear from /this/ conformance clause that a
    >conformant process is allowed to interpret some subset of Unicode characters
    >which excludes the BOM. But wait - there's another clause which seems to
    >contradict that...

    In my opinion, the Unicode standard mixes things into the bag that should
    not properly be there, which adds to the confusion. Regardless of what the
    Unicode standard actually says, it should (in my opinion) define:
    1. Characters and character numbering, plus well formed character strings.
    2. Encoding schemes.

    Here, 1 just provides the characters and defines well-formed character
    sequences, and 2 deals only with the question of well-formed encoded
    sequences. An encoded string is well formed if it has a correct encoded
    form and translates into a well-formed Unicode string.
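
    The two-part notion of well-formedness above can be sketched in Python.
    This is only an illustration under assumptions: the function name
    is_well_formed is hypothetical, and "well-formed Unicode string" is taken
    here to mean "contains no surrogate code points", which is one reading of
    the paragraph and not a quotation from the standard.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: check an encoded byte string in two steps."""
    # Step 1: the bytes must have a correct encoded form for `encoding`.
    try:
        text = data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    # Step 2: the decoded string must itself be well formed. Assumption:
    # that means no surrogate code points. Python's strict UTF-8/16/32
    # decoders already reject these, so this is a second, explicit check.
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# <4D 00 30 04> is "M" followed by U+0430 in UTF-16LE.
print(is_well_formed(b"M\x000\x04", "utf-16-le"))    # True
# A lone high surrogate <00 D8> followed by "M" is not well formed.
print(is_well_formed(b"\x00\xd8M\x00", "utf-16-le"))  # False
```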

    There is in effect a sequence of different protocols describing 1 and 2.
    Processes and files should not be mixed into this bag, but regulated by
    other protocols. Unicode should merely remark that such questions are
    outside the protocols of 1 and 2 above, and should neither encourage nor
    discourage any particular practices in this category.

    This is perhaps what the Unicode standard tries to say above, in a clumsy
    way.

    >C12b: When a process interprets a byte sequence which purports to be in a
    >Unicode character
    >encoding scheme, it shall interpret that byte sequence according to the byte
    >order and
    >specifications for the use of the byte order mark established by this standard
    >for that
    >character encoding scheme.
    >* Machine architectures differ in ordering in terms of whether the most
    >significant byte or the least significant byte comes first. These sequences
    >are known as "big-endian" and "little-endian" orders, respectively.
    >* For example, when using UTF-16LE, pairs of bytes must be interpreted as
    >code units using the little-endian byte order convention, and any initial <FF FE>
    >sequence is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE (part of the text),
    >rather than as a byte order mark (not part of the text). (See D41.)
    >D41: UTF-16LE encoding scheme: The Unicode encoding scheme that serializes a
    >code unit sequence as a byte sequence in little-endian format.
    >* In UTF-16LE, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is
    >serialized as <4D 00 30 04 8C 4E 00 D8 02 DF>.
    >* In UTF-16LE, an initial byte sequence <FF FE> is interpreted as U+FEFF ZERO
    >WIDTH NO-BREAK SPACE.
    >For the benefit of those that haven't studied Unicode jargon too closely, there
    >is a difference between a "Unicode Encoding FORM" and a "Unicode Encoding
    >SCHEME". A Unicode Encoding FORM deals in code units of various widths (8-bit
    >for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32). A Unicode Encoding
    >SCHEME, on the other hand, always deals in octets. Thus, UTF-16LE is a Unicode
    >Encoding SCHEME because it defines a stream of octets, not a stream of
    >16-bit code units.

    The computer has an internal big- or little-endian convention for how
    16-bit words should be translated into bytes. It is as if you were only
    using a 16-bit wchar_t in C. This is the character encoding form. As a
    user, one never gets to know whether it is big or little endian unless one
    starts to manipulate bits or serializes it into bytes, as when writing to a
    file, or converting a C wchar_t string into a char string by casting the
    pointer to it. If one does the latter, one gets the character encoding
    scheme.
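
    The difference between code units in memory and their byte serialization
    can be made visible with a short Python sketch. It assumes that the array
    type code "H" is a 16-bit unsigned integer, which holds on mainstream
    platforms; the variable names are illustrative only. The code units are
    the ones from the D41 example quoted above.

```python
import array
import struct
import sys

# UTF-16 code units from the D41 example: <004D 0430 4E8C D800 DF02>.
code_units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]

# An array of native 16-bit words, comparable to a 16-bit wchar_t buffer in C.
units = array.array("H", code_units)
assert units.itemsize == 2  # assumption: "H" is 2 bytes on this platform

# Viewing the same memory as bytes exposes the machine's byte order,
# much like casting a wchar_t pointer to a char pointer.
native_bytes = units.tobytes()

# Explicit serializations, independent of the machine.
le_bytes = struct.pack("<5H", *code_units)  # little-endian
be_bytes = struct.pack(">5H", *code_units)  # big-endian

print(sys.byteorder)
print(le_bytes.hex(" "))  # 4d 00 30 04 8c 4e 00 d8 02 df
print(be_bytes.hex(" "))  # 00 4d 04 30 4e 8c d8 00 df 02
print(native_bytes == (le_bytes if sys.byteorder == "little" else be_bytes))
```

    The little-endian output matches the byte sequence given for UTF-16LE in
    D41, and the last line prints True on either kind of machine.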

    Again, the terminology is not very intuitive. The first might perhaps be
    called "character encoded (n-bit) words", and the second "character encoded

    >So, by definition D41, a BOM is /prohibited/ in UTF-16LE. It is similarly
    >prohibited in UTF-16BE, UTF-32LE and UTF-32BE.

    No. It seems that the BOM is just the Unicode code point 0xFEFF, reserved
    for a special private use other than the one indicated in the Unicode
    standard, and it takes whatever form it gets under the applied encoding.
    So if you apply UTF-16LE and then serialize it into bytes, it becomes the
    sequence (0xFF, 0xFE).
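
    This is easy to check with Python's built-in codecs. The sketch below
    shows only how those codecs behave, which is not necessarily how every
    conformant process must behave; the variable names are illustrative.

```python
BOM = "\ufeff"

# U+FEFF takes whatever byte form the chosen encoding scheme gives it.
le = BOM.encode("utf-16-le")  # b'\xff\xfe'
be = BOM.encode("utf-16-be")  # b'\xfe\xff'
u8 = BOM.encode("utf-8")      # b'\xef\xbb\xbf'

# The same bytes, <FF FE 4D 00>, read under two different schemes.
data = b"\xff\xfeM\x00"
as_le = data.decode("utf-16-le")  # '\ufeffM': U+FEFF is kept as text
as_utf16 = data.decode("utf-16")  # 'M': the codec consumes <FF FE> as a BOM

print(le, be, u8)
print(repr(as_le), repr(as_utf16))
```

    Python's "utf-16-le" codec keeps the initial U+FEFF as part of the text,
    which is the reading D41 describes, while its "utf-16" codec treats the
    same two bytes as a byte order mark and drops them.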

    There is, it seems, nothing special about the BOM, except that some have
    used it to indicate that a file's contents are Unicode encoded. It is
    strange for it to be mentioned in the standard in a way that gives the
    impression that it has some kind of special status relative to 1 and 2
    above.

      Hans Aberg
