Re: UTF-8N?

From: John Cowan (jcowan@reutershealth.com)
Date: Wed Jun 21 2000 - 18:33:57 EDT


Peter_Constable@sil.org wrote:

> The BOM is explicitly not to be interpreted as part of
> the text stream. D35 (U3, p47) states (at least for UTF-16):
>
> "The byte order mark is not considered part of the content of the text."

Absolutely. What that means is that if there is a BOM, it is not translated
into a character; per contra, when encoding the first character, a BOM is prefixed
to its byte representation.

> The standard doesn't ever discuss the BOM in the context of UTF-8,

See section 13.6 (page 324).

> By the way, I don't know why you singled out U+0020 here; your claim could
> equally have been made about any other character (and would have been
> equally inaccurate).

Any other character, yes; inaccurate, no.

> [U+FEFF U+0020:] An unlikely initial character sequence,

But legal.

> This isn't analogous to UTF-16 since
> D33 - D35 spell out how an initial U+FEFF is to be interpreted (though it
> would be analogous if D33 - D35 didn't make that clear - perhaps that's
> what you meant).

For the "UTF-16" encoding, yes. For the encodings "UTF-16BE" and "UTF-16LE"
defined in D33-34, no. However, D35 tolerates using the term "UTF-16" in
either a specific or a generic sense.
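
As an illustration only: Python's codecs happen to draw the same line between the label that interprets an initial BOM and the labels that fix the byte order. This is a minimal sketch, assuming Python 3's built-in "utf-16" and "utf-16-be" codecs as stand-ins for the two cases. It shows one implementation's behavior, not what the standard requires, and the variable names are made up for the example.

```python
# Bytes: FE FF (U+FEFF in big-endian order), then 00 20 (U+0020).
data = bytes([0xFE, 0xFF, 0x00, 0x20])

# "utf-16": the initial FE FF is read as a BOM and is not part of the text.
generic = data.decode("utf-16")

# "utf-16-be": the label already fixes the byte order, so FE FF is
# decoded as an ordinary character, U+FEFF.
specific = data.decode("utf-16-be")

print([hex(ord(c)) for c in generic])   # ['0x20']
print([hex(ord(c)) for c in specific])  # ['0xfeff', '0x20']
```

Same bytes, two different character sequences, and only the label tells them apart.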

> - A UTF-8 file that begins with the byte sequence 0xEF 0xBB 0xBF 0x20 ...
> could be interpreted as either < ZWNBSP U+0020 ... >, or as BOM < U+0020
> ... > (where I'm using angle brackets to denote the start and end of the
> content of text). Furthermore, there is nothing to indicate which
> interpretation is correct. (On this we agree.)

Yes. And thus new charset labels need to be introduced to distinguish
the two cases. A charset label, as RFC 1345 says, "unambiguously and
completely determines which sequence of characters, if any, is
represented by each possible sequence of n-bit bytes for a certain
value of n." The label "UTF-8" does not do so.
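
To make the ambiguity concrete, here is a minimal sketch assuming Python 3, whose codec names "utf-8" and "utf-8-sig" happen to behave like the two labels that would be needed. Those names are Python's, not anything Unicode 3.0 or the RFCs define, and the variable names are hypothetical.

```python
# The four bytes from the example above: EF BB BF 20.
data = bytes([0xEF, 0xBB, 0xBF, 0x20])

# Reading 1: the first three bytes are a character, < ZWNBSP U+0020 >.
as_zwnbsp = data.decode("utf-8")

# Reading 2: the first three bytes are a BOM, leaving < U+0020 >.
as_bom = data.decode("utf-8-sig")

print([hex(ord(c)) for c in as_zwnbsp])  # ['0xfeff', '0x20']
print([hex(ord(c)) for c in as_bom])     # ['0x20']

# Each reading re-encodes to the identical byte sequence under its own
# label, so the bytes alone cannot say which text was meant.
assert as_zwnbsp.encode("utf-8") == data
assert as_bom.encode("utf-8-sig") == data
```

Both decodings are consistent with the bytes; only an out-of-band label picks one.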

(I am not to be understood as favoring this result: it would be much
better to suppress 8-BOMs, and talk only of UTF-8. But that's not what
Unicode 3.0 entails.)

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.      -- Coleridge (tr. Politzer)


