Re: Conformance (was UTF, BOM, etc)

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Jan 21 2005 - 09:33:35 CST

Next message: Richard T. Gillam: "RE: Subject: Re: 32'nd bit & UTF-8"

Previous message: Peter Kirk: "Re: So how about U+D7FD for a NOP then?"
In reply to: Arcane Jill: "Conformance (was UTF, BOM, etc)"
Next in thread: Hans Aberg: "Re: Conformance"
Reply: Hans Aberg: "Re: Conformance"

On 21/01/2005 12:25, Arcane Jill wrote:

> ... Okay, so you don't have to interpret ALL characters, and the BOM
> is just a character, so you don't have to interpret it. ...
>
> [Jill's Important Question 1]:
> So the first question I must ask is: Which of these two clauses takes
> precedence, C8 or C12b?
>
> If C12b takes precedence, then when a process interprets a byte
> sequence which purports to be in the Unicode Encoding Scheme UTF-8, it
> shall interpret that byte sequence according to the specifications for
> the use of the byte order mark established by the Unicode Standard for
> the Unicode Encoding Scheme UTF-8.
>
> But if C8 takes precedence, then a process shall not assume that it is
> required to interpret U+FEFF.
>
> They can't both be right.

This issue may seem arcane, Jill :-) , but it is central to the recent
dispute.

As I see it, your mistake here is to assume that "the BOM is just a
character". It is not; it is something quite different: an element in an
encoding scheme. And as such, according to C12b, its correct
interpretation is mandatory.

The byte sequence corresponding to a BOM, which depends on the encoding
scheme, has two possible interpretations. One of these interpretations
is the character U+FEFF ZERO WIDTH NO-BREAK SPACE. The other
interpretation is as a BOM, which is not a character at all and does not
form part of the string of characters which is encoded. As a BOM is not
a character, C8 does not apply.
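
To make the two interpretations concrete, here is a small Python sketch.
It is only an illustration, not anything taken from the standard: it
assumes Python's built-in "utf-8" and "utf-8-sig" codecs, which happen to
behave like the two readings described above. The sample text "abc" is
made up.

```python
import codecs

# The byte sequence that serves as a BOM depends on the encoding scheme.
BOM_BYTES = {
    "UTF-8": codecs.BOM_UTF8,                       # EF BB BF
    "UTF-16 (big-endian)": codecs.BOM_UTF16_BE,     # FE FF
    "UTF-16 (little-endian)": codecs.BOM_UTF16_LE,  # FF FE
}

# Hypothetical file contents: a UTF-8 signature followed by "abc".
data = codecs.BOM_UTF8 + "abc".encode("utf-8")

# Interpretation 1: the bytes encode the character U+FEFF
# ZERO WIDTH NO-BREAK SPACE, which is part of the decoded string.
as_character = data.decode("utf-8")

# Interpretation 2: the bytes are a BOM (signature) belonging to the
# encoding scheme; they are consumed and are not part of the string.
as_signature = data.decode("utf-8-sig")

print(len(as_character), len(as_signature))
```

Under the first reading the decoded string has four code points, the
first being U+FEFF; under the second it has three.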

In Unicode 1.0, as quoted by Ken yesterday, the BOM was referred to as a
"Unicode special character". But I note that the quotations from the
conformance clauses of Unicode 4.0 carefully avoid calling the BOM a
character. On the other hand, in section 15.9 of Unicode 4.0, although
this section describes "code points that are interpreted as neither
control nor graphic characters", the BOM is referred to as a "special
interpretation" of "the character U+FEFF". It seems to me that this
wording confuses the issue, especially because later in the same section
"U+FEFF also has significance as a character" refers only to the
interpretation as ZERO WIDTH NO-BREAK SPACE. For consistency, it would
be better to refer only to "the code point U+FEFF", or to "the character
U+FEFF" only when this code point is interpreted as ZERO WIDTH NO-BREAK
SPACE. This requires some rather minor editing to section 15.9.

>
> [Jill's Important Question 2]:
> And the second question I must ask is: if a file is labelled by some
> higher level protocol (for example, Unix locale, HTTP header, etc) as
> "UTF-8", should a conformant process interpret that as UTF-8, the
> Unicode Encoding FORM (which prohibits a BOM) or as UTF-8, the Unicode
> Encoding SCHEME (which allows one)?
>
Excellent question! And what if it is not labelled at all, but expected
to be UTF-8?

But meanwhile, a practical suggestion for Unix systems and users. Text
files originating on other systems may include a number of conventions
which are not native to Unix, such as CRLF for line breaks, and also
BOMs. For these to be processed correctly by Unix systems, they need to
be converted to use Unix conventions. Such a conversion would include
stripping out BOMs, and also perhaps (at least if the locale is UTF-8)
conversion from other UTFs to UTF-8. In the Windows world such a
conversion might be implemented best by specifying a new mode for
opening a file. But I guess that in the Unix world it would be best to
use a filter here. It would be rather trivial, using ICU or similar, to
write such a filter. This filter could be invoked by default when
opening or saving Internet downloads, e-mail attachments etc., perhaps
depending on the MIME type. Users might need to decide for themselves
whether to use this filter when reading files received from other
systems on removable media.
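
As a rough sketch of such a filter, here is one written with only the
Python standard library rather than ICU. The names to_unix_text and main,
and the choice of UTF-8 as the fallback when no BOM is present, are my own
assumptions for illustration. It detects the UTF from an initial BOM,
drops the BOM, turns CRLF into LF, and writes UTF-8.

```python
import codecs
import sys

# Longer signatures are checked first, because the UTF-32 little-endian
# BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
# Caveat: UTF-16LE text starting with BOM + U+0000 would be misdetected.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def to_unix_text(raw: bytes, default_encoding: str = "utf-8") -> bytes:
    """Return raw converted to BOM-less UTF-8 with LF line breaks."""
    encoding = default_encoding  # assumption: unlabelled input is UTF-8
    for bom, name in _SIGNATURES:
        if raw.startswith(bom):
            raw = raw[len(bom):]  # strip the BOM; it is not text
            encoding = name
            break
    text = raw.decode(encoding)
    text = text.replace("\r\n", "\n")  # CRLF -> LF
    return text.encode("utf-8")


def main() -> None:
    # Read all of stdin as bytes, write the converted bytes to stdout.
    sys.stdout.buffer.write(to_unix_text(sys.stdin.buffer.read()))
```

Calling main() under the usual `if __name__ == "__main__":` guard would
let this run in a pipeline. Only an initial BOM is removed; a U+FEFF
later in the text is left alone, since there it can only be ZERO WIDTH
NO-BREAK SPACE.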

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 10:35:27 CST