Re: Conformance (was UTF, BOM, etc)

From: Peter Kirk (peterkirk@qaya.org)
Date: Sat Jan 22 2005 - 07:44:58 CST

Next message: Lokesh Joshi: "Need help for Arabic text processing"

Previous message: Lars Kristan: "RE: Conformance (was UTF, BOM, etc)"
In reply to: Lars Kristan: "RE: Conformance (was UTF, BOM, etc)"
Next in thread: Lars Kristan: "RE: Conformance (was UTF, BOM, etc)"

On 22/01/2005 09:44, Lars Kristan wrote:

> ...
>
> Not a character at all? Very well put! It is exactly what it should
> be. A non-character. So not only the reverse-BOM, but also the BOM
> should both be non-characters.
>

Agreed. The UTC and the ISO guys messed up when they allowed the
alternative interpretation of U+FEFF as the character ZWNBSP. And they
have more or less admitted it by deprecating the use of U+FEFF as
ZWNBSP. Unfortunately this dual interpretation has made things much
worse.

...

> And might treat the BOM as NOP. Whether this should be done at
> processing time or at deserialization is up to the implementation.
> Either could prove to be impractical or dangerous. Just a thought.
>

I realise that my account yesterday of what a process might do in these
circumstances was a bit confused. What is a higher level protocol in
these circumstances, and what is a lower level? Perhaps the following
description might help.

As I interpret the Unicode standard, there are four different notional
interfaces (of course these don't all have to be separately visible in
real implementations) which need to be considered here, across which the
data transferred is as follows:

A. Strings of abstract Unicode characters.
B. Sequences of Unicode code points.
C. Sequences of code units in a Unicode encoding form.
D. Streams of bytes according to a Unicode encoding scheme.

I note that form D is required only for storage and transfer (assumed to
be byte-oriented operations), as internal operations may operate
directly on code units.

When a string of characters is converted to a byte stream, the BOM
certainly should not be included at interface A. Nor should it be
included at interface B or C, as it is neither a code point of the text
nor part of the Unicode encoding form. So it must be the responsibility
of the serialisation process which converts form C to form D to add a
BOM when this is required.
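
As a rough illustration of that division of labour, here is a minimal
Python sketch for UTF-16. The function names and the with_bom flag are
my own inventions for the example, not anything taken from the standard.
The step from B to C yields bare code units, and only the step from C to
D knows anything about a BOM:

```python
import struct

BOM = 0xFEFF  # the code unit written as a signature, if one is wanted


def to_code_units_utf16(text):
    """Interface B -> C: code points to UTF-16 code units, never a BOM."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            units.append(cp)
    return units


def serialise_utf16(units, big_endian=True, with_bom=True):
    """Interface C -> D: code units to bytes; the BOM is added only here."""
    fmt = ">H" if big_endian else "<H"
    out = bytearray()
    if with_bom:
        out += struct.pack(fmt, BOM)
    for unit in units:
        out += struct.pack(fmt, unit)
    return bytes(out)
```

With this split, nothing above interface C ever sees U+FEFF unless the
text itself really contains one.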

When the reverse process takes place, it is certainly not correct,
whatever the encoding form, to pass the BOM across all of these
interfaces and present it at interface A as the character U+FEFF at the
start of the character string, or at interface B as the code point
U+FEFF. Indeed it should not even be present at interface C, as again
it is not part of the Unicode encoding form. So it must be the
responsibility of the deserialisation process, which converts form D to
form C, to remove the BOM as well. This process is of course
complicated by the dual interpretation of the signature bytes.

Although in other ways deserialisation of UTF-8 is trivial, the need to
strip out the BOM makes it more than a no-op, and more than the "process
UTF-8 data as it is" approach which you mentioned elsewhere.

The implication of this is that the BOM signature bytes, if found at the
start of a byte stream in any encoding scheme and so intended as a BOM
rather than as ZWNBSP, should not even be decoded as the code point
U+FEFF, but should be stripped from the stream before conversion at the
very earliest stage.
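
A minimal Python sketch of such a deserialiser follows. The name
deserialise, the SIGNATURES table and the UTF-8 default are assumptions
made for the example; the BOM constants come from Python's standard
codecs module. The signature is removed at the byte level, before any
decoding, so it never turns into the code point U+FEFF:

```python
import codecs

# Longer signatures first: the UTF-32LE signature FF FE 00 00 begins
# with the UTF-16LE signature FF FE.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def deserialise(data, default="utf-8"):
    """Interface D -> A: strip a leading signature as bytes, then decode."""
    for signature, encoding in SIGNATURES:
        if data.startswith(signature):
            # The signature is dropped here and is never decoded.
            return data[len(signature):].decode(encoding)
    # No signature: fall back to an encoding assumed by the caller.
    return data.decode(default)
```

Only the bytes at the very start of the stream are treated as a
signature; the same bytes later in the stream are still decoded as
U+FEFF, which is where the dual interpretation comes back in. Note also
that a UTF-16LE stream whose first character is U+0000 would be
mistaken for UTF-32LE by this simple table.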

> ...
>
> This is where the problem lies. In effort to make the BOM as harmless
> as possible, sloppiness was allowed. A lot is spoken about
> differentiating text from binary data. Well, then those people should
> also be strict about differentiating plain text from serialized documents.
>
> Back to Notepad - it produces documents, not plain text. For that
> matter, Microsoft should provide a plain text editor, or extend
> Notepad with that capability. But it is really up to them. They can
> leave it to other people to do it. After all, in Windows, you don't
> need a text editor. There is no plain text in Windows. Which is
> sometimes good, and sometimes bad.
>
>
Well, I think this depends on how you define "plain text". I define
"plain text" as a string of characters which represent text with no
markup etc. This is what plain text is on Windows. And when this string
of characters is saved as a file encoded in UTF-8, Windows (or at least
some Windows applications) indicates this encoding (as permitted but not
encouraged in the Unicode standard) by preceding the string of
characters with a BOM, which is not one of those characters. But your
definition of "plain text" seems rather different, more like a string of
arbitrary bytes which is supposed to have some interpretation as
characters but whose encoding is unknown at this level, rather like the
serialised data passed across my interface D above. This is perhaps
more like what Unix does in practice. But I don't think it is helpful to
define "plain text" in this way.

Windows presumes that batch files (a DOS concept) and all other
non-Unicode data (including that saved by Notepad in "ANSI" mode) are
encoded according to the system's default code page. This cannot be
UTF-8, and so these files cannot start with a BOM (although in principle
they can start with a UTF-8 BOM signature interpreted as three
characters in the code page). Of course the system gets confused if a
UTF-8 file is passed to a process which expects a file in a code page
format. This confusion might be reduced if Windows recognised BOM
signatures at the start of files opened by non-Unicode processes and
pre-converted them to the system code page (with loss of data for
characters not supported by the code page). But this strategy is
dangerous because BOM signatures are legal as bytes in legacy and binary
data, and because some non-Unicode processes intend to operate on the
data at the byte level. And so this is not done by default.
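
To make the confusion concrete, here is a small Python sketch which
reads a UTF-8 file with a BOM as if it were code page text. Windows-1252
is only my assumption for the system default code page, and the file
content is made up for the example:

```python
import codecs

# A UTF-8 "batch file" saved with a signature, as Notepad might save it.
utf8_file = codecs.BOM_UTF8 + "echo héllo".encode("utf-8")

# A non-Unicode process decodes the same bytes in the system code page
# (assumed here to be Windows-1252).
as_code_page = utf8_file.decode("cp1252")

print(repr(as_code_page))
```

The three signature bytes EF BB BF come out as the three characters
U+00EF, U+00BB and U+00BF at the start of the text, and the two bytes
of the UTF-8 encoded "é" come out as two separate characters.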

The implication of this is that the only safe way is to indicate every
file's encoding out of band. Unfortunately this cannot be done reliably.
Windows goes some way towards doing this with its file extension
mechanism. This actually makes it difficult to create batch files with
Notepad (the extension has to be changed manually), but it is still only
a partial answer.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

