From: Peter Kirk (firstname.lastname@example.org)
Date: Mon Jan 24 2005 - 19:31:06 CST
On 24/01/2005 20:34, Peter Constable wrote:
>>However, (and this is where it gets complicated):
>>In the UTF-8, UTF-16, and UTF-32 Unicode encoding FORMs, an initial
>>U+FEFF represents ZERO WIDTH NO-BREAK SPACE.
>>In the UTF-8, UTF-16, and UTF-32 Unicode encoding SCHEMEs, an initial
>>U+FEFF is not considered part of the text.
>Careful; you're not distinguishing between the coded character U+FEFF
>and an octet sequence such as 0xFF 0xFE, or 0xEF 0xBB 0xBF. One is an
>entity of a CCS; the others are sequences of entities at the level of
>CES. The Standard indicates that an initial octet sequence 0xEF 0xBB
>0xBF *may* be interpreted at the CES level as a UTF-8 BOM. The Standard
>neither requires nor recommends this, however; a conformant process may
>rather interpret that initial sequence as the coded character
>representation U+FEFF, which in turn it may or may not interpret as
Peter, I am surprised at your statement that "The Standard neither requires
nor recommends this", as I read it as saying the opposite: that this
interpretation is mandatory. I note from C12b as quoted by Jill:
> C12b: When a process interprets a byte sequence which purports to be
> in a Unicode character encoding scheme, it shall interpret that byte
> sequence according to the byte order and specifications for the use of
> the byte order mark established by this standard for that character
> encoding scheme.
And then in the definition of the UTF-8 encoding scheme, D39:
> When represented in UTF-8, the byte order mark turns into the byte
> sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream
> is neither required nor recommended by the Unicode Standard, but its
> presence does not affect conformance to the UTF-8 encoding scheme.
Thus the "specifications for the use of the byte order mark established
by this standard for that character encoding scheme" include that the
sequence <EF BB BF> at the beginning of a UTF-8 data stream represents a
BOM. While this usage of the BOM is "neither required nor recommended by
the Unicode Standard", the wording in C12b "it shall interpret" surely
implies that the interpretation of this byte sequence as either a BOM or
U+FEFF is mandatory, and that a process which interprets the sequence
always as the coded character representation U+FEFF is non-conformant.
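
The two readings at issue can be made concrete with a small Python sketch. It assumes Python's standard codecs: `utf-8`, which keeps an initial U+FEFF as a character, and `utf-8-sig`, which drops one initial <EF BB BF> as a signature. The sample stream and variable names are illustrative only, and neither codec is claimed here to be what the Standard means by a conformant process.

```python
import codecs

# U+FEFF encoded in UTF-8 is the byte sequence <EF BB BF>.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Hypothetical sample stream: <EF BB BF> followed by the text "abc".
data = codecs.BOM_UTF8 + "abc".encode("utf-8")

# Reading 1: the initial bytes are the coded character U+FEFF,
# so they are part of the text (four characters in total).
as_character = data.decode("utf-8")

# Reading 2: the initial bytes are a BOM at the encoding scheme level,
# so they are not part of the text (three characters in total).
as_bom = data.decode("utf-8-sig")

print(len(as_character), len(as_bom))
```

The same bytes give two different character sequences, which is why it matters whether C12b makes one of these readings mandatory.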
But I note from the last paragraph of p.81 of TUS 4.0 that there is a
serious issue of ambiguity here. In my opinion, the ambiguity could be
resolved if it were clearly specified that the sequence <EF BB BF> at the
beginning of a UTF-8 data stream always represents a BOM, and never U+FEFF.
The recommendation against using this sequence could be maintained, except
in the very rare case in which a UTF-8 data stream is intended to start
with the deprecated character U+FEFF. In that case it should be mandatory
to include the BOM, so that the file starts with <EF BB BF EF BB BF>.
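
A minimal Python sketch of the rule suggested above, as one possible reading of it rather than anything the Standard specifies. The function names `decode_proposed` and `encode_proposed` and the `with_bom` parameter are hypothetical, introduced only for illustration.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def decode_proposed(data: bytes) -> str:
    """Treat exactly one initial <EF BB BF> as a BOM; the rest is text."""
    if data.startswith(BOM):
        data = data[len(BOM):]
    return data.decode("utf-8")


def encode_proposed(text: str, with_bom: bool = False) -> bytes:
    """Emit a BOM on request, and always when the text starts with U+FEFF."""
    if with_bom or text.startswith("\ufeff"):
        return BOM + text.encode("utf-8")
    return text.encode("utf-8")


# Text that really begins with U+FEFF is written as <EF BB BF EF BB BF> ...
encoded = encode_proposed("\ufeffabc")
# ... and survives a round trip, because only the first sequence is dropped.
decoded = decode_proposed(encoded)
```

Under this reading a decoder never has to guess: the first <EF BB BF> is always a signature, and a second one immediately after it is always the character.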
-- Peter Kirk email@example.com (personal) firstname.lastname@example.org (work) http://www.qaya.org/