Re: leading BOMs in encoding schemes (was: Re: 32'nd bit & UTF-8)

From: Philippe VERDY (
Date: Sat Jan 22 2005 - 09:58:40 CST

  • Next message: Peter Kirk: "Re: Conformance (was UTF, BOM, etc)"

    > Again, many people have addressed this point and you're ignoring
    > them.
    > UTF-8 HAS NO BOM. There is nothing in the Unicode standard mandating
    > or
    > even encouraging the use of EF BB BF at the beginning of a UTF-8
    > file.
    > That sequence has no special meaning in UTF-8; it's just a zero-width
    > non-breaking space. FE FF at the top of a UTF-8 file is just flat
    > illegal.
    > The practice of using EF BB BF as a signature byte to indicate that a
    > file is in UTF-8 is mentioned in one spot in the standard, but not
    > encouraged. Some applications (notably Notepad) do this; many do
    > not.

    You're being too categorical here:

    - Using U+FEFF to mean a zero-width no-break space is now STRONGLY discouraged, since WORD JOINER (U+2060) has been standardized. Unicode made this clear, so that U+FEFF usage remains encouraged ONLY as a byte-order mark, or as a signature for a Unicode encoding scheme.

    - UTF-8, contrary to what you affirm, DOES HAVE a BOM. Not in the encoding form, but in the UTF-8 encoding scheme, where it MUST be recognized as such. This differentiation between encoding forms and encoding schemes (which is discussed and detailed at length in the Unicode standard) means that a U+FEFF serialized at the beginning of an encoding scheme translates only to an empty string, and thus effectively maps to NO abstract character in the Unicode encoding forms. It is as if there were no BOM in any Unicode encoding form; so a U+FEFF encoded at the start of a stream of bytes under the UTF-8 encoding scheme is not a character. In that sense, it's true that there's no BOM encoded as an abstract character in Unicode, but it's still true that the U+FEFF code point is allocated and precisely described (much like the standard assignments of surrogate or noncharacter code points, which are also allocated and described even though they don't encode abstract characters).
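The distinction is observable in practice. A sketch using Python's standard codecs (an illustration only, not part of the Unicode standard): the plain "utf-8" codec models a bare serialization of the encoding form, while "utf-8-sig" models a BOM-aware encoding scheme.

```python
# EF BB BF is the UTF-8 serialization of U+FEFF.
data = b"\xef\xbb\xbfhello"

# Decoded with the plain "utf-8" codec (no signature handling),
# the leading U+FEFF survives as a code point in the text.
print(repr(data.decode("utf-8")))       # '\ufeffhello'

# Decoded with "utf-8-sig" (a BOM-aware encoding scheme),
# the leading signature maps to NO character at all.
print(repr(data.decode("utf-8-sig")))   # 'hello'

# A U+FEFF that is NOT at the start of the stream is an
# ordinary code point in both cases.
tail = b"a\xef\xbb\xbfb"
print(repr(tail.decode("utf-8-sig")))   # 'a\ufeffb'
```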

    Don't reduce Unicode or ISO/IEC 10646 to just their encoded repertoires of abstract characters. Both standards also contain some code points dedicated to special uses, which are not mapped to abstract characters.

    For sure, these special code points are not part of the definition of "plain text". They are permanently reserved and assigned for technical reasons. So U+FEFF stays assigned, with a deprecated (and now highly discouraged) role as an abstract character, but with an important (and standard) technical role in encoding schemes; that role can't be removed.

    So keep this distinction clear: encoding schemes do not merely serialize encoding forms. They also contain additional byte sequences that specify or control the effective format of the serialization, or that disambiguate it.

    Once you realize that encoding schemes are used for I/O on a sequential stream, you can define what the beginning of the stream is, so there is no ambiguity about how to interpret a U+FEFF code point at the beginning of a stream of bytes.

    Then realize that the Unicode algorithms do not work at the level of encoding schemes, or even at the level of encoding forms. These algorithms work at the higher level of complete code points mapped to abstract characters.

    So the special role of U+FEFF at the beginning of the standard Unicode encoding *schemes* like CESU-8, UTF-8, UTF-16 or UTF-32 (I don't mean UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, which are specifically defined as restrictions of the more general UTF-16 and UTF-32 encoding schemes) is standard. It can't be removed, and fortunately it is useful.
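This is why signature detection only makes sense at the encoding-scheme level, where "beginning of the stream" is defined. A minimal sketch of the usual signature-sniffing convention (the function name and return shape are illustrative, not from any standard API); note that the 4-byte UTF-32LE signature must be tested before the 2-byte UTF-16LE one, because FF FE is a prefix of FF FE 00 00:

```python
# Signatures ordered longest-first so UTF-32LE (FF FE 00 00)
# wins over its UTF-16LE prefix (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
]

def sniff_bom(data: bytes):
    """Return (detected scheme, signature length), or (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xff\xfe\x00\x00A\x00\x00\x00"))  # ('UTF-32LE', 4)
print(sniff_bom(b"\xef\xbb\xbfabc"))                # ('UTF-8', 3)
print(sniff_bom(b"plain"))                          # (None, 0)
```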

    It's also true that Unicode already offers alternative encoding *schemes* for those applications that can't support leading BOMs in encoding schemes: UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.

    But it's also true that Unicode should standardize a restriction of the UTF-8 encoding scheme for those applications that don't want, or can't support, a leading BOM at the beginning of a byte stream.
    So to settle all those problems and discussions, why not standardize a "UTF-8N" encoding scheme, defined as the restriction of the "UTF-8" encoding scheme with NO leading BOM?
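As a sketch of what such a "UTF-8N" restriction could mean (hypothetical; neither the name nor the scheme is standardized): by analogy with UTF-16BE, where an initial FE FF is interpreted as an ordinary character, a leading EF BB BF would decode to an ordinary U+FEFF, never to a signature, making the scheme a bijective serialization of the encoding form.

```python
def decode_utf8n(data: bytes) -> str:
    # Hypothetical UTF-8N: byte-identical to UTF-8, but a leading
    # EF BB BF is an ordinary U+FEFF character, never a signature
    # (the same rule UTF-16BE applies to a leading FE FF).
    return data.decode("utf-8")

def encode_utf8n(text: str) -> bytes:
    # Never prepends a signature; the mapping is bijective.
    return text.encode("utf-8")

# Round trip preserves an initial U+FEFF unchanged.
sample = "\ufeffkept"
print(decode_utf8n(encode_utf8n(sample)) == sample)  # True
```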

    Such standard restrictions could also be added to the other standard *schemes*:
    - "BOCU-1N": same as "BOCU-1" but with no leading BOM.
    - "CESU-8N": same as "CESU-8" but with no leading BOM.
    - "SCSU-N": same as "SCSU" but with no leading BOM.
    Unicode could also say that every conforming encoding scheme should support a leading BOM in an "open" version, with a second encoding scheme specifically defined as the restriction of the open one that forbids the leading BOM. Unicode should also clearly say that there is effectively NO BOM in the corresponding standard encoding form, and that the mapping between the encoding form and a restricted encoding scheme is fully bijective.

    For applications using the standard "open to BOM" encoding schemes, there must also exist a non-lossy conversion for strings that start with the U+FEFF abstract character.

    The only way to achieve this is that a string of abstract characters (or code points), or an encoding form of that string, that starts with the U+FEFF code point MUST have a supplementary leading BOM encoded in the encoding scheme, so that this leading BOM can be safely stripped when converting back to an encoding form, the second U+FEFF code point being left unchanged.
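Python's "utf-8-sig" codec happens to implement exactly this rule for the UTF-8 scheme: its encoder always prepends one signature, and its decoder strips exactly one, so the round trip is lossless even for text that begins with U+FEFF.

```python
text = "\ufeffstarts with ZWNBSP"

# Encoding under the BOM-open scheme prepends ONE extra signature...
data = text.encode("utf-8-sig")
print(data.startswith(b"\xef\xbb\xbf\xef\xbb\xbf"))  # True

# ...and decoding strips exactly ONE, leaving the real U+FEFF intact.
print(data.decode("utf-8-sig") == text)              # True
```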

    This would settle most of these discussions, as programmers would then be given the choice of the appropriate encoding scheme to use in various contexts.

    This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 12:53:16 CST