Re: UTF-8N?

From: John Cowan (jcowan@reutershealth.com)
Date: Wed Jun 21 2000 - 16:36:35 EDT


Peter_Constable@sil.org wrote:

> Eh??? John, either I'm really missing your intent, or you're saying
> something that I know you don't mean. U+0020 in UTF-8 is always 0x20,
> whether or not the file begins with a BOM.

Sure. Let me try again at greater length.

Encodings are mappings between sequences of characters and sequences
of bytes. Suppose we have a character sequence that begins with
the character U+0020. Here are some possible encodings of that sequence
into bytes:

US-ASCII: 0x20 ...
UTF-16: 0xFE 0xFF 0x00 0x20 ...
UTF-16: 0xFF 0xFE 0x20 0x00 ...
UTF-16BE: 0x00 0x20 ...
UTF-16LE: 0x20 0x00 ...
UTF-8N: 0x20 ...
UTF-8B: 0xEF 0xBB 0xBF 0x20 ...

Now suppose we have a character sequence beginning with U+FEFF U+0020.
This would be encoded as follows:

US-ASCII: (not possible)
UTF-16: 0xFE 0xFF 0xFE 0xFF 0x00 0x20 ...
UTF-16: 0xFF 0xFE 0xFF 0xFE 0x20 0x00 ...
UTF-16BE: 0xFE 0xFF 0x00 0x20 ...
UTF-16LE: 0xFF 0xFE 0x20 0x00 ...
UTF-8N: 0xEF 0xBB 0xBF 0x20 ...
UTF-8B: 0xEF 0xBB 0xBF 0xEF 0xBB 0xBF 0x20 ...
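The two tables above can be reproduced with a minimal Python sketch. It rests on some assumptions that are not part of the original discussion: Python's "utf-8" codec stands in for UTF-8N, "utf-8-sig" stands in for UTF-8B, and the BOM-prefixed UTF-16 rows are built by hand so that the byte order does not depend on the machine. The helper name encode_labeled is hypothetical.

```python
import codecs

def encode_labeled(text):
    """Encode text under each label used in the tables (hypothetical helper)."""
    return {
        # BOM-prefixed UTF-16, built explicitly so byte order is deterministic
        "UTF-16 (BE BOM)": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
        "UTF-16 (LE BOM)": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
        # Explicit byte order, no BOM added by the encoder
        "UTF-16BE": text.encode("utf-16-be"),
        "UTF-16LE": text.encode("utf-16-le"),
        # Assumption: "utf-8" plays the role of UTF-8N (no signature)
        "UTF-8N": text.encode("utf-8"),
        # Assumption: "utf-8-sig" plays the role of UTF-8B (signature prepended)
        "UTF-8B": text.encode("utf-8-sig"),
    }

for text in ("\u0020", "\ufeff\u0020"):
    print("characters:", " ".join(f"U+{ord(c):04X}" for c in text))
    for label, data in encode_labeled(text).items():
        print(f"  {label:16} {data.hex(' ')}")
```

US-ASCII is left out of the helper because encoding U+FEFF with the "ascii" codec raises UnicodeEncodeError, which matches the "(not possible)" row.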

Without distinct labels UTF-8N and UTF-8B (or whatever), we cannot tell
if the byte sequence 0xEF 0xBB 0xBF 0x20 should be decoded as U+0020 or
U+FEFF U+0020. This is exactly analogous to the statement that without
distinct labels UTF-16 and UTF-16BE, we cannot tell if the byte sequence
0xFE 0xFF 0x00 0x20 should be decoded as U+0020 or U+FEFF U+0020.
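A short Python sketch of this ambiguity, again assuming that the "utf-8" and "utf-8-sig" codecs correspond to the proposed UTF-8N and UTF-8B labels: the same bytes decode to different character sequences depending only on the label chosen.

```python
# The same four bytes, decoded under two different UTF-8 labels.
ambiguous_utf8 = bytes([0xEF, 0xBB, 0xBF, 0x20])
as_utf8n = ambiguous_utf8.decode("utf-8")      # signature kept: U+FEFF U+0020
as_utf8b = ambiguous_utf8.decode("utf-8-sig")  # signature stripped: U+0020

# The analogous case for UTF-16 versus UTF-16BE.
ambiguous_utf16 = bytes([0xFE, 0xFF, 0x00, 0x20])
as_utf16be = ambiguous_utf16.decode("utf-16-be")  # no BOM handling: U+FEFF U+0020
as_utf16 = ambiguous_utf16.decode("utf-16")       # BOM consumed: U+0020

def code_points(text):
    """Render a string as a list of U+XXXX labels for display."""
    return [f"U+{ord(c):04X}" for c in text]

print(code_points(as_utf8n), code_points(as_utf8b))
print(code_points(as_utf16be), code_points(as_utf16))
```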

The counterargument is that the sequence U+FEFF U+0020 simply makes no sense,
and the case is not worth worrying about. The rejoinders to *that* are:
1) it can be represented in UTF-16 of any flavor, and the mapping from UTF-16
to UTF-8 must be 1-1 and reversible, and 2) there is no such thing in Unicode
as a forbidden sequence of characters.
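The reversibility point in 1) can be illustrated with a small Python round trip. This is only a sketch under the same assumption as before ("utf-8" for the signature-free reading, "utf-8-sig" for the signature-stripping one): a decoder that silently drops a leading 0xEF 0xBB 0xBF does not give back the UTF-16 data it started from.

```python
# U+FEFF U+0020 as UTF-16BE bytes (no BOM; U+FEFF is real content here).
original_utf16be = bytes([0xFE, 0xFF, 0x00, 0x20])

# UTF-16BE -> characters -> UTF-8 without a signature.
chars = original_utf16be.decode("utf-16-be")
as_utf8 = chars.encode("utf-8")

# Reading the UTF-8 bytes back without signature stripping is reversible.
lossless = as_utf8.decode("utf-8").encode("utf-16-be")

# Reading them back with signature stripping loses the leading U+FEFF.
lossy = as_utf8.decode("utf-8-sig").encode("utf-16-be")

print(lossless == original_utf16be)
print(lossy == original_utf16be)
```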

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.      -- Coleridge (tr. Politzer)


