Re: UTF-8N?

From: John Cowan (jcowan@reutershealth.com)
Date: Thu Jun 22 2000 - 14:08:15 EDT


"Ayers, Mike" wrote:

> Am I reading this wrong? Here's what I get:
>
> I hand you a UTF-16 document. This document is:
>
> FE FF 00 48 00 65 00 6C 00 6C 00 6F
>
> ..so it says "Hello". Then I say, "Oh, by the way, that's
> big-endian." *POOF* The content of the document has changed, and there is
> now a 'ZERO WIDTH NO BREAK SPACE' at the beginning. Smells pretty skunky...

No, what you have said is that this document is in the "UTF-16BE" encoding.
That's the name for an encoding that is known a priori to be big-endian and
does not permit a BOM. It is not the name for an encoding that has a BOM but
just happens to be big-endian.

Since you have changed the encoding, the content has naturally
changed too, just as if you had declared an 8859-1 document
to be 8859-2.
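
To make the difference concrete, here is a small sketch in Python. It assumes
Python's built-in "utf-16" and "utf-16-be" codecs as stand-ins for the two
encoding labels; the variable names are only for illustration. The same twelve
bytes decode to different text depending on the label:

```python
# The bytes from the example above: FE FF 00 48 00 65 00 6C 00 6C 00 6F
data = bytes.fromhex("FEFF00480065006C006C006F")

# Labelled "UTF-16": the leading FE FF is read as a BOM, used to pick the
# byte order, and is not part of the decoded text.
as_utf16 = data.decode("utf-16")

# Labelled "UTF-16BE": byte order is already known, so FE FF is ordinary
# content and decodes to U+FEFF ahead of the letters.
as_utf16be = data.decode("utf-16-be")

print(repr(as_utf16))    # 'Hello'
print(repr(as_utf16be))  # '\ufeffHello'
```

The bytes never change; only the declared encoding does, and with it the
number of characters (five versus six).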

> BTW, what is a ZWNBSP anyway? From here it seems like a
> non-character. Is there an actual use for it?

Yes. It indicates that a line break may not be introduced at this point.
It is similar to the NO-BREAK SPACE (U+00A0) which you may be familiar
with under its HTML name of &nbsp;, except that it doesn't produce any actual
whitespace. ZWNBSP is useful in languages that don't use whitespace, and
in strings like "M.T.A." where a line breaker might be tempted to break after
a period.

Its opposite number is ZWSP (U+200B), which likewise doesn't generate any
actual whitespace, but indicates that line breaking *is* permitted here.
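
As a rough illustration of the three characters mentioned above, the following
Python sketch looks up their names and glues "M.T.A." together with ZWNBSP.
The helper glue_after_periods is hypothetical, written only for this example;
whether a given line breaker honours the character depends on that breaker.

```python
import unicodedata

NBSP = "\u00a0"    # NO-BREAK SPACE: visible space, no break allowed
ZWNBSP = "\ufeff"  # no visible space, no break allowed
ZWSP = "\u200b"    # no visible space, break allowed

def glue_after_periods(text):
    """Insert ZWNBSP after every period that is followed by another character."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if ch == "." and i + 1 < len(text):
            out.append(ZWNBSP)
    return "".join(out)

glued = glue_after_periods("M.T.A.")
names = [unicodedata.name(c) for c in (NBSP, ZWNBSP, ZWSP)]

print(names)
print(repr(glued))  # 'M.\ufeffT.\ufeffA.'
```

The glued string looks the same when displayed, but it carries two extra
code points that tell a line breaker not to split after the inner periods.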

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.      -- Coleridge (tr. Politzer)


