From: Hans Aberg (email@example.com)
Date: Fri Jan 21 2005 - 12:47:21 CST
On 2005/01/21 16:33, Peter Kirk at firstname.lastname@example.org wrote:
>> [Jill's Important Question 2]:
>> And the second question I must ask is: if a file is labelled by some
>> higher level protocol (for example, Unix locale, HTTP header, etc) as
>> "UTF-8", should a conformant process interpret that as UTF-8, the
>> Unicode Encoding FORM (which prohibits a BOM) or as UTF-8, the Unicode
>> Encoding SCHEME (which allows one)?
> Excellent question! And what if it is not labelled at all, but expected
> to be UTF-8?
Here, it seems, the higher-level protocol should define what happens to a
BOM, just as it would for any other character. The label "UTF-8" just means
that the byte sequence is well formed according to UTF-8.
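As a rough illustration of that division of labour, here is a minimal Python
sketch. The function name decode_utf8 and the strip_bom flag are hypothetical,
standing in for whatever rule a given protocol chooses; they are not part of
Unicode or of any particular protocol. Decoding only checks well-formedness,
and the caller decides whether a leading U+FEFF is a signature or content.

```python
import codecs


def decode_utf8(data: bytes, strip_bom: bool) -> str:
    """Decode well-formed UTF-8; the caller's policy decides the BOM's fate.

    strip_bom is a hypothetical stand-in for a higher-level protocol's rule.
    """
    # Raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
    # The plain "utf-8" codec keeps a leading U+FEFF as an ordinary character.
    text = data.decode("utf-8")
    if strip_bom and text.startswith("\ufeff"):
        # This protocol treats a leading U+FEFF as a signature, not content.
        return text[1:]
    return text


# The same bytes, read under two different (assumed) protocol policies.
payload = codecs.BOM_UTF8 + "hello".encode("utf-8")
kept = decode_utf8(payload, strip_bom=False)
stripped = decode_utf8(payload, strip_bom=True)
```

Both calls accept the same byte sequence as valid UTF-8; only the policy
applied afterwards differs.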
The Unicode standard, it seems, is prone to misinterpretation on this
point and should be rewritten. There appears to be no need for it to mention
the BOM, except as a side note observing that programs and other protocols
may treat it differently from the ordinary character semantics of U+FEFF. In
this respect, it is no different from any other valid character sequence in
Unicode. Shell script or PS markers do not make those files nonconforming
to Unicode. Unicode, as a character protocol, just provides the characters
and encodings; it does not enforce any particular program behavior.