RE: Conformance (was UTF, BOM, etc)

From: Lars Kristan
Date: Sat Jan 22 2005 - 03:44:44 CST

    Richard T. Gillam wrote:

    > Peter Kirk had this one right. Certain encoding SCHEMES treat the byte
    > sequence FEFF (or some variant of it) as a byte order mark when it
    > appears at the beginning of a text stream. In these contexts, it's not
    > a character at all; it's part of the communication protocol.

    Not a character at all? Very well put! It is exactly what it should be. A
    non-character. So not only the reverse-BOM, but also the BOM should both be

    > A process operating on the actual text, after it's been deserialized
    > and converted into an in-memory representation (an encoding FORM),
    > doesn't see it.

    And it might treat the BOM as a no-op. Whether this should be done at processing
    time or at deserialization is up to the implementation. Either could prove
    to be impractical or dangerous. Just a thought.
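
    To make the two options concrete, here is a minimal sketch in Python. The
    helper names (deserialize_stripping_bom, first_word) are made up for
    illustration; only the codec names "utf-8" and "utf-8-sig" and the
    constant codecs.BOM_UTF8 come from the standard library.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the character that doubles as the byte order mark


def deserialize_stripping_bom(data: bytes) -> str:
    """Deserialization-time handling: 'utf-8-sig' drops one leading BOM, if present."""
    return data.decode("utf-8-sig")


def first_word(text: str) -> str:
    """Processing-time handling (hypothetical): treat every U+FEFF as a no-op."""
    return text.replace(BOM, "").split()[0]


raw = codecs.BOM_UTF8 + "plain text".encode("utf-8")

stripped_early = deserialize_stripping_bom(raw)  # BOM is gone before processing
kept_until_processing = raw.decode("utf-8")      # BOM survives as U+FEFF

print(repr(stripped_early))
print(repr(kept_until_processing))
print(first_word(kept_until_processing))
```

    Note that the processing-time variant also discards any U+FEFF that was
    meant as a ZWNBSP in the middle of the text, which is one way such a
    choice could turn out to be dangerous.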

    > Other encoding schemes don't treat FEFF as special. A process operating
    > on the actual text after it's been deserialized will see this as the
    > character U+FEFF, the ZWNBSP.

    This is where the problem lies. In an effort to make the BOM as harmless as
    possible, sloppiness was allowed. Much is said about differentiating text
    from binary data. Well, then those people should also be strict about
    differentiating plain text from serialized documents.
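
    The difference between the schemes quoted above can be seen with
    Python's standard codecs. This is only a sketch: the variable names are
    made up, and it assumes Python's "utf-16" codec stands in for a scheme
    that treats a leading FEFF as a signature, while "utf-16-be" stands in
    for one that does not.

```python
import codecs

# The same bytes on the wire: FE FF followed by "hi" in big-endian UTF-16.
payload = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

# Scheme with a signature: the leading FEFF selects the byte order and is
# consumed by the decoder, so the process never sees it.
seen_as_protocol = payload.decode("utf-16")

# Scheme with a fixed byte order: FEFF is not special, so it is handed to
# the process as the character U+FEFF.
seen_as_character = payload.decode("utf-16-be")

print(repr(seen_as_protocol))
print(repr(seen_as_character))
```

    The bytes are identical in both cases; only the declared scheme decides
    whether the receiving process gets two characters or three.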

    Back to Notepad - it produces documents, not plain text. For that matter,
    Microsoft should provide a plain text editor, or extend Notepad with that
    capability. But it is really up to them. They can leave it to other people
    to do it. After all, in Windows, you don't need a text editor. There is no
    plain text in Windows. Which is sometimes good, and sometimes bad.

