Re: UTF-8N?

From: John Cowan (jcowan@reutershealth.com)
Date: Thu Jun 22 2000 - 13:49:23 EDT


Antoine Leca wrote:

> Now I ask a slightly different question then. What is the name of the
> encoding where the byte order is known (for example, any application
> on an Intel machine that receives its data from the system, as opposed
> to from the network or a similar hazardous source), and where a
> received BOM should be silently eaten up?

If it is a 16-bit encoding, then its name is UTF-16. The reason for
having three different names "UTF-16", "UTF-16BE", "UTF-16LE" is that
the same byte sequence "0xFE 0xFF 0x00 0x20" is interpreted in three
different ways:

        in UTF-16, it is a U+0020
        in UTF-16BE, it is a U+FEFF U+0020
        in UTF-16LE, it is a U+FFFE U+2000

In the last case, of course, U+FFFE is not a character, so the byte
sequence is ill-formed, just as it would be ill-formed if the
charset were US-ASCII (because 0xFE and 0xFF are illegal in that charset).
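
As a rough illustration of the three readings, here is a minimal Python
sketch. It assumes Python's built-in "utf-16", "utf-16-be" and "utf-16-le"
codecs as stand-ins for the three charset labels; the helper code_points is
made up for this example. Note that Python's decoder does not reject U+FFFE,
so the ill-formed case shows up as ordinary output rather than as an error.

```python
# The same four bytes, decoded under the three labels discussed above.
data = bytes([0xFE, 0xFF, 0x00, 0x20])

# "utf-16": the leading FE FF is taken as a BOM, selects big-endian, and is dropped.
as_utf16 = data.decode("utf-16")

# "utf-16-be": no BOM handling, so FE FF is kept as the character U+FEFF.
as_utf16be = data.decode("utf-16-be")

# "utf-16-le": no BOM handling, so FE FF is read little-endian as U+FFFE.
as_utf16le = data.decode("utf-16-le")


def code_points(text):
    """Hypothetical helper: render a string as a list of U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in text]


print(code_points(as_utf16))    # ['U+0020']
print(code_points(as_utf16be))  # ['U+FEFF', 'U+0020']
print(code_points(as_utf16le))  # ['U+FFFE', 'U+2000']
```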

But there is no need to distinguish between the charset that encodes
U+0020 as 0xFE 0xFF 0x00 0x20 and the one that encodes it as 0xFF 0xFE 0x20 0x00,
because the decoding is unambiguous and that's all that counts; both
cases are labeled "UTF-16".
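
A small sketch of that point, again assuming Python's built-in "utf-16"
codec behaves like the charset described here: both byte orders decode to
the same single character, because the BOM tells the decoder which order
to use.

```python
# Two serializations of U+0020, one per byte order, each with its BOM.
big_endian_form = bytes([0xFE, 0xFF, 0x00, 0x20])
little_endian_form = bytes([0xFF, 0xFE, 0x20, 0x00])

# The "utf-16" decoder reads the BOM, picks the byte order, and drops the BOM.
decoded_be = big_endian_form.decode("utf-16")
decoded_le = little_endian_form.decode("utf-16")

print(decoded_be == decoded_le == "\u0020")  # True
```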

> If I am right, the correct way to encode an initial ZWNBSP in UTF-8
> would then be the byte sequence 0xEF 0xBB 0xBF 0xEF 0xBB 0xBF.

It's *a* correct way. But many (most?) UTF-8 interpreters have no notion
of a BOM, and will decode this as U+FEFF U+FEFF.
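
To make the difference concrete, here is a hedged Python sketch. It uses
Python's "utf-8" codec as an example of a decoder with no notion of a BOM,
and its "utf-8-sig" codec as an example of one that eats a single leading
BOM; other decoders may behave differently.

```python
# U+FEFF encoded twice in UTF-8: EF BB BF EF BB BF.
doubled = bytes([0xEF, 0xBB, 0xBF, 0xEF, 0xBB, 0xBF])

# Plain UTF-8 decoding keeps both code points.
plain = doubled.decode("utf-8")

# A BOM-aware decoder strips only the first one and keeps the ZWNBSP.
bom_aware = doubled.decode("utf-8-sig")

print([hex(ord(ch)) for ch in plain])      # ['0xfeff', '0xfeff']
print([hex(ord(ch)) for ch in bom_aware])  # ['0xfeff']
```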
 

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.      -- Coleridge (tr. Politzer)


