Re: Conformance (was UTF, BOM, etc)

From: Peter Kirk (
Date: Sat Jan 22 2005 - 07:44:58 CST

    On 22/01/2005 09:44, Lars Kristan wrote:

    > ...
    > Not a character at all? Very well put! It is exactly what it should
    > be. A non-character. So not only the reverse-BOM, but also the BOM
    > should both be non-characters.

    Agreed. The UTC and the ISO guys messed up when they allowed the
    alternative interpretation as the character ZWNBSP. And they have more
    or less admitted it by deprecating ZWNBSP. Unfortunately this dual
    interpretation has made things much worse.


    > And might treat the BOM as NOP. Whether this should be done at
    > processing time or at deserialization is up to the implementation.
    > Either could prove to be impractical or dangerous. Just a thought.

    I realise that my account yesterday of what a process might do in these
    circumstances was a bit confused. What is a higher level protocol in
    these circumstances, and what is a lower level? Perhaps the following
    description might help.

    As I interpret the Unicode standard, there are four different notional
    interfaces (of course these don't all have to be separately visible in
    real implementations) which need to be considered here, across which the
    data transferred is as follows:

    A. Strings of abstract Unicode characters.
    B. Sequences of Unicode code points.
    C. Sequences of code units in a Unicode encoding form.
    D. Streams of bytes according to a Unicode encoding scheme.

    I note that form D is required only for storage and transfer (assumed to
    be byte-oriented operations), as internal operations may operate
    directly on code units.
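    As a rough sketch of these four levels, assuming Python and UTF-16 as the
    encoding form (the variable names are purely illustrative, and Python's
    own str type does not really separate A from B):

```python
# Illustrative only: one short text seen at each of the notional interfaces.

# A: a string of abstract characters (LATIN A, E WITH ACUTE, DESERET LONG I)
text = "A\u00e9\U00010400"

# B: the same text as a sequence of Unicode code points
code_points = [ord(ch) for ch in text]

# C: code units of the UTF-16 encoding form (16-bit integers, no byte order yet)
raw = text.encode("utf-16-be")
code_units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# D: bytes of the UTF-16LE encoding scheme (byte order now fixed, no BOM added)
byte_stream = text.encode("utf-16-le")

print([hex(cp) for cp in code_points])
print([hex(cu) for cu in code_units])
print(byte_stream.hex())
```

    Note that the supplementary character is one code point at B but two
    code units at C, and that byte order only becomes a question at D.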

    When a string of characters is converted to a byte stream, the BOM
    certainly should not be included at interface A. Nor should it be
    included at interface C as it is not part of the Unicode encoding form.
    So it must be the responsibility of the serialisation process which
    converts form C to form D to add a BOM when this is required.
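    A minimal sketch of such a serialisation step, assuming UTF-16 and a
    hypothetical helper named serialise_utf16 (not any library's API):

```python
def serialise_utf16(code_units, byteorder="big", with_bom=True):
    """Convert UTF-16 code units (form C) into a byte stream (form D).

    The BOM is written here, by the serialiser, and never appears in the
    list of code units handed in by the caller.
    """
    out = bytearray()
    if with_bom:
        # The signature is U+FEFF written in the chosen byte order.
        out += (0xFEFF).to_bytes(2, byteorder)
    for unit in code_units:
        out += unit.to_bytes(2, byteorder)
    return bytes(out)


# The caller supplies only the text's own code units.
print(serialise_utf16([0x0041], "little").hex())
print(serialise_utf16([0x0041], "big").hex())
```

    The point of the design is that the with_bom decision belongs to the C to
    D conversion alone; nothing above that interface needs to know about it.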

    When the reverse process takes place, it is certainly not correct,
    whatever the encoding form, to pass the BOM across all of these
    interfaces and present it at interface A as the character U+FEFF at the
    start of the character string, or at interface B as the code point
    U+FEFF. Indeed it should not even be present at interface C as again it
    is not part of the Unicode encoding form. So it must again be the
    responsibility of the deserialisation process which converts form D to
    form C also to remove the BOM. This process is of course complicated by
    the dual interpretation of the signature bytes.

    Although deserialisation of UTF-8 is otherwise trivial, the need to
    strip out the BOM makes it more than a no-op, and so it is not simply
    the "process UTF-8 data as it is" which you mentioned elsewhere.
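    To illustrate with Python's standard codecs (my choice of language, not
    anything mandated by the standard): the plain "utf-8" codec passes the
    signature through as U+FEFF, while "utf-8-sig" removes it on decoding.

```python
import codecs

# A UTF-8 byte stream that starts with the three signature bytes EF BB BF.
data = codecs.BOM_UTF8 + "hi".encode("utf-8")

# "Process UTF-8 data as it is": the BOM reaches the character string.
passed_through = data.decode("utf-8")

# Deserialisation that strips the signature before the text is handed on.
stripped = data.decode("utf-8-sig")

print(repr(passed_through))
print(repr(stripped))
```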

    The implication of this is that the BOM signature bytes, if found at the
    start of a byte stream in any encoding scheme and so intended as a BOM
    rather than as ZWNBSP, should not even be decoded as the code point
    U+FEFF, but should be stripped from the stream before conversion at the
    very earliest stage.
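    One way this might look, as a sketch only: the function name deserialise,
    the signature table and the UTF-8 default are my assumptions. The
    signature bytes are removed from the byte stream before any decoding
    takes place, and only at the very start of the stream.

```python
import codecs

# Longer signatures first: the UTF-32LE signature begins with the same two
# bytes as the UTF-16LE one. (UTF-16LE text whose first character is U+0000
# would be misread here; that ambiguity is inherent in signature sniffing.)
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def deserialise(data, default="utf-8"):
    """Strip a leading signature from the bytes, then decode the rest."""
    for signature, encoding in SIGNATURES:
        if data.startswith(signature):
            # Cut the signature off as bytes; it is never decoded as U+FEFF.
            return data[len(signature):].decode(encoding)
    return data.decode(default)


print(repr(deserialise(b"\xef\xbb\xbfabc")))
print(repr(deserialise("a\ufeffb".encode("utf-8"))))
```

    A U+FEFF anywhere other than the start of the stream is left alone, since
    there it can only be intended as ZWNBSP.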

    > ...
    > This is where the problem lies. In effort to make the BOM as harmless
    > as possible, sloppiness was allowed. A lot is spoken about
    > differentiating text from binary data. Well, then those people should
    > also be strict about differentiating plain text from serialized documents.
    > Back to Notepad - it produces documents, not plain text. For that
    > matter, Microsoft should provide a plain text editor, or extend
    > Notepad with that capability. But it is really up to them. They can
    > leave it to other people to do it. After all, in Windows, you don't
    > need a text editor. There is no plain text in Windows. Which is
    > sometimes good, and sometimes bad.

    Well, I think this depends on how you define "plain text". I define
    "plain text" as a string of characters which represent text with no
    markup etc. This is what plain text is on Windows. And when this string
    of characters is saved as a file encoded in UTF-8, Windows (or at least
    some Windows applications) indicates this encoding (as permitted but not
    encouraged in the Unicode standard) by preceding the string of
    characters with a BOM, which is not one of those characters. But your
    definition of "plain text" seems rather different, more like a string of
    arbitrary bytes which is supposed to have some interpretation as
    characters but whose encoding is unknown at this level, rather like the
    serialised data passed across my interface D above. This is perhaps
    more like what Unix does in practice. But I don't think it is helpful to
    define "plain text" in this way.

    Windows presumes that batch files (a DOS concept) and all other
    non-Unicode data (including that saved by Notepad in "ANSI" mode) are
    encoded according to the system's default code page. This cannot be
    UTF-8, and so these files cannot start with a BOM (although in principle
    they can start with a UTF-8 BOM signature interpreted as three
    characters in the code page). Of course the system gets confused if a
    UTF-8 file is passed to a process which expects a file in a code page
    format. This confusion might be reduced if Windows recognised BOM
    signatures at the start of files opened by non-Unicode processes and
    pre-converted them to the system code page (with loss of data for
    characters not supported by the code page). But this strategy is
    dangerous because BOM signatures are legal as bytes in legacy and binary
    data, and because some non-Unicode processes intend to operate on the
    data at the byte level. And so this is not done by default.
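    A small illustration of both effects, assuming for the sake of example
    that the system code page is Windows-1252 (Python's "cp1252" codec); other
    code pages would give different characters.

```python
import codecs

# A UTF-8 file with a signature, as some Windows applications save it.
data = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")

# A non-Unicode process reads the same bytes in the code page: the signature
# becomes three ordinary characters and the e-acute becomes two.
as_code_page = data.decode("cp1252")

# Pre-converting Unicode text to the code page loses unsupported characters.
# CYRILLIC CAPITAL LETTER ZHE is not in Windows-1252, so it is replaced.
lossy = "\u0416".encode("cp1252", errors="replace")

print(as_code_page)
print(lossy)
```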

    The implication of this is that the only safe way to identify a file's
    encoding is to indicate it out of band. Unfortunately this cannot be
    done reliably.
    Windows goes some way towards doing this with its file extension
    mechanism. This actually makes it difficult to create batch files with
    Notepad (the extension has to be changed manually), but it is still only
    a partial answer.

    Peter Kirk (personal) (work)
