Re: UTF-8N?

From: John Cowan ([email protected])
Date: Thu Jun 22 2000 - 13:49:23 EDT

Next message: John Cowan: "Re: UTF-8 BOM Nonsense"
Previous message: Karlsson Kent - keka: "RE: UTF-8 BOM Nonsense"
Maybe in reply to: Masahiko Maedera: "UTF-8N?"
Next in thread: Kenneth Whistler: "Re: UTF-8N?"

Antoine Leca wrote:

> Now I ask a slightly different question then. What is the name of the
> encoding where the byte order is known (for example, any application
> on an Intel machine that receives its data from the system, as opposed
> to from the network or a similar hazardous source), and where a
> received BOM should be silently eaten up?

If it is a 16-bit encoding, then its name is UTF-16. The reason for
having three different names "UTF-16", "UTF-16BE", "UTF-16LE" is that
the same byte sequence "0xFE 0xFF 0x00 0x20" is interpreted in three
different ways:

        in UTF-16, it is a U+0020
        in UTF-16BE, it is a U+FEFF U+0020
        in UTF-16LE, it is a U+FFFE U+2000

In the last case, of course, U+FFFE is not a character, so the byte
sequence is ill-formed, just as it would be ill-formed if the
charset were US-ASCII (because 0xFE and 0xFF are illegal in that charset).
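
As a rough illustration of the three readings, here is a minimal Python
sketch. It assumes the standard CPython codecs "utf-16", "utf-16-be" and
"utf-16-le" correspond to the three labels; the names `data` and
`code_points` are illustrative only. Note that Python's decoder does not
itself reject the noncharacter U+FFFE, so the ill-formedness of the last
case would have to be checked separately.

```python
# The same four bytes, decoded under three different charset labels.
data = bytes([0xFE, 0xFF, 0x00, 0x20])

def code_points(text):
    """Render a string as a list of U+XXXX labels (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# "utf-16": the leading FE FF is taken as a BOM and consumed.
as_utf16 = code_points(data.decode("utf-16"))

# "utf-16-be": byte order is fixed, so FE FF is an ordinary U+FEFF.
as_utf16be = code_points(data.decode("utf-16-be"))

# "utf-16-le": byte order is fixed the other way round.
as_utf16le = code_points(data.decode("utf-16-le"))

# US-ASCII: bytes above 0x7F cannot be decoded at all.
try:
    data.decode("ascii")
    ascii_decodes = True
except UnicodeDecodeError:
    ascii_decodes = False

print(as_utf16, as_utf16be, as_utf16le, ascii_decodes)
```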

But there is no need to distinguish between the charset that encodes
U+0020 as 0xFE 0xFF 0x00 0x20 and the one that encodes it as 0xFF 0xFE 0x20 0x00,
because the decoding is unambiguous and that's all that counts; both
cases are labeled "UTF-16".
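
A short Python sketch of that point, assuming CPython's "utf-16" codec
uses the BOM to pick the byte order (the variable names are made up for
the example): both byte sequences decode to the same single character.

```python
# Two different byte sequences, both labeled "UTF-16".
big_endian = bytes([0xFE, 0xFF, 0x00, 0x20])     # BOM FE FF, then 00 20
little_endian = bytes([0xFF, 0xFE, 0x20, 0x00])  # BOM FF FE, then 20 00

# The BOM tells the decoder which byte order follows, and is then dropped.
decoded_big = big_endian.decode("utf-16")
decoded_little = little_endian.decode("utf-16")

# Both come out as U+0020, so one label suffices for decoding.
same_result = decoded_big == decoded_little == "\u0020"
print(same_result)
```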

> If I am right, the correct way to encode an initial ZWNBSP in UTF-8
> would then be code 0xEF 0xBB 0xBF 0xEF 0xBB 0xBF.

It's *a* correct way. But many (most?) UTF-8 interpreters have no notion
of a BOM, and will decode this as U+FEFF U+FEFF.
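
For illustration only, a Python sketch of the two behaviours. It assumes
CPython's "utf-8" codec stands in for an interpreter with no notion of a
BOM, and "utf-8-sig" for one that silently eats a single leading
signature; the variable names are hypothetical.

```python
# The UTF-8 encoding of U+FEFF, twice in a row.
data = bytes([0xEF, 0xBB, 0xBF, 0xEF, 0xBB, 0xBF])

# A decoder with no notion of a BOM keeps both characters.
plain = data.decode("utf-8")

# A BOM-aware decoder drops the first one and keeps the second
# as a real ZWNBSP at the start of the text.
bom_aware = data.decode("utf-8-sig")

print([f"U+{ord(ch):04X}" for ch in plain])
print([f"U+{ord(ch):04X}" for ch in bom_aware])
```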

-- 
Schlingt dreifach einen Kreis um dies! || John Cowan <[email protected]>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT