From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 03:44:44 CST
Richard T. Gillam wrote:
> Peter Kirk had this one right. Certain encoding SCHEMES treat the byte
> sequence FEFF (or some variant of it) as a byte order mark when it
> appears at the beginning of a text stream. In these contexts, it's not
> a character at all; it's part of the communication protocol.
Not a character at all? Very well put! That is exactly what it should be: a
non-character. So not only the reverse BOM but also the BOM itself should be
a non-character.
> A process operating on the actual text, after it's been deserialized
> and converted into an in-memory representation (an encoding FORM),
> doesn't see it.
Or it might see it and treat the BOM as a NOP. Whether this should be done at processing
time or at deserialization is up to the implementation. Either could prove
to be impractical or dangerous. Just a thought.
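
To make the two options concrete, here is a minimal Python sketch. Both
function names are hypothetical, and the fallback to big-endian when no
signature is present is an assumption made for the example, not something
stated in this thread. The first function consumes the BOM at
deserialization; the second leaves it in the text and skips it at
processing time.

```python
import codecs

def deserialize_utf16(data: bytes) -> str:
    """Hypothetical helper: consume a leading BOM as protocol, not as text."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # Assumption for this sketch: no signature means big-endian.
    return data.decode("utf-16-be")

def count_characters(text: str) -> int:
    """Hypothetical processing step that treats a leading U+FEFF as a NOP."""
    if text.startswith("\ufeff"):
        text = text[1:]
    return len(text)
```

With the first approach the text never contains the BOM; with the second,
every processing step has to remember to skip it, which is where either
choice can become impractical.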
> Other encoding schemes don't treat FEFF as special. A process operating
> on the actual text after it's been deserialized will see this as the
> character U+FEFF, the ZWNBSP.
This is where the problem lies. In an effort to make the BOM as harmless as
possible, sloppiness was allowed. A lot is said about differentiating text
from binary data. Well, then those people should also be strict about
differentiating plain text from serialized documents.
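
The difference between the schemes can be seen with Python's built-in
codecs, used here only as an illustration: the same four bytes give
different text depending on which encoding scheme the reader assumes. The
variable names are made up for the example.

```python
# The bytes FE FF followed by big-endian "A" (00 41).
data = b"\xfe\xff\x00A"

# UTF-16 scheme: the leading FE FF is a byte order mark and is consumed.
as_utf16 = data.decode("utf-16")

# UTF-16BE scheme: FE FF is not special, so U+FEFF stays in the text.
as_utf16be = data.decode("utf-16-be")

print(len(as_utf16), len(as_utf16be))
```

The first result has one character and the second has two, the extra one
being U+FEFF.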
Back to Notepad - it produces documents, not plain text. For that matter,
Microsoft should provide a plain text editor, or extend Notepad with that
capability. But it is really up to them. They can leave it to other people
to do it. After all, in Windows, you don't need a text editor. There is no
plain text in Windows. Which is sometimes good, and sometimes bad.
Lars