Re: ZWNBSP vs. WJ (was: How is NBH (U0083) Implemented?)

From: Doug Ewell
Date: Fri, 05 Aug 2011 08:49:00 -0700

Jukka K. Korpela <jkorpela at cs dot tut dot fi> wrote:

> So? It was, and it still often is, better to use ISO 8859-1 rather
> than Unicode, in situations where there is no tangible benefit, or just a
> small benefit, from using Unicode. For example, many people are still
> conservative about encodings in e-mail, for good reasons, so they use
> ISO 8859-1 or, as you did in your message, windows-1252.

A word about my encoding "choices." My first message on Thursday was
sent from my home PC, using Windows Live Mail, and it used UTF-8 because
I configured Windows Live Mail to do so. My second message was sent
from my mobile device, and used Windows-1252. I don't know if there is
a way to tell the device to use UTF-8 for outgoing messages, but I can
say it was not my conscious intent to prefer Windows-1252 over Unicode.

This message is being sent via a Web interface; I guess we'll find out
what encoding it chooses for me.

> On the other hand, this isn’t comparable to ZWNBSP vs. WJ. These
> control characters do the same job in text, as per the standard, so
> the practical question is simply which one is better supported.

ZWNBSP, like WJ, is intended to inhibit breaking between words. Despite
the other (and original) intended use of U+FEFF at the start of a text
as a byte-order mark, there is a pervasive belief that an initial U+FEFF
means the text should be treated as beginning with some kind of space
character. This is silly, since there is no concept of "between words"
at the start of a text, but it is nevertheless the way people perceive

WJ was introduced to encourage users to separate these two functions.
If users don't adopt it, the problem will never be solved. There are
enough issues in Unicode that cannot be fixed due to stability concerns;
it would be nice to be able to fix this one at least.
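
As a rough illustration of what "separating the two functions" could look
like in practice, here is a minimal Python sketch. It assumes the text has
already been decoded to a string, treats only a leading U+FEFF as a
byte-order mark, and rewrites any interior U+FEFF as U+2060. The function
name separate_feff_roles is hypothetical, invented for this example.

```python
from typing import Tuple

FEFF = "\ufeff"  # U+FEFF: BOM at the start of a text, ZWNBSP elsewhere
WJ = "\u2060"    # U+2060 WORD JOINER


def separate_feff_roles(text: str) -> Tuple[bool, str]:
    """Return (had_initial_bom, body).

    A single leading U+FEFF is treated as a byte-order mark and dropped;
    every remaining U+FEFF is assumed to be a ZWNBSP and replaced by WJ.
    """
    had_bom = text.startswith(FEFF)
    body = text[1:] if had_bom else text
    return had_bom, body.replace(FEFF, WJ)
```

Whether such a rewrite is wise depends on the consumer of the text; as
noted further down in this thread, some software renders U+2060 poorly.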

I still question how many real-world texts use either U+FEFF or U+2060
to achieve this non-breaking behavior.

> ISO 8859-1 and Unicode perform very different jobs, so that using ISO
> 8859-1, you limit your character repertoire (at least as regards
> directly representable characters, as opposed to various “escape
> notations”). If you don’t need anything outside ISO 8859-1, the
> choice used to be very simple, though nowadays it has become a little
> more complicated (as e.g. Google Groups seems to munge ISO 8859-1 data
> in quotations but processes UTF-8 properly)

UTF-8 has the property of being easily detected and verified as such,
which solves part of the Google Groups problem (inability to detect
which SBCS is being used). The other part of the problem is the
practice of using heuristics to override an explicit charset
declaration, but that is a topic for another day.
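
The "easily detected and verified" property can be sketched in a few lines
of Python: a strict UTF-8 decode either succeeds or fails, whereas almost
any byte sequence is "valid" in a single-byte character set. This is only a
sketch under that assumption; the function name looks_like_utf8 is made up
for the example, and it says nothing about what Google Groups actually does.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Note that pure ASCII passes this check too, which is harmless, since ASCII
bytes mean the same thing in UTF-8, ISO 8859-1, and Windows-1252. The check
cannot tell the single-byte sets apart from each other.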

> I won’t make any statements about full compliance, but in Microsoft
> Office Word 2007, U+FEFF alias ZWNBSP does its basic job (inside text)
> in most situations whereas U+2060 alias WJ seems to be not recognized
> at all and appears as some sort of a visible box. So to have a job
> done, there is not much of a choice. (Word 2007 fails to honor ZWNBSP
> semantics after EN DASH, which is bad, but it does not make it useless
> in other situations.)

It does always come down to a complaint against Microsoft, doesn't it?
Unfortunately, Yucca is right here: opening Word 2007 and pasting a
snippet of text with embedded ZWNBSP does display correctly, while the
same experiment with embedded WJ shows a .notdef box. This seems to be
a font-coverage problem, amplified by Word's silent overriding of user
font choices—changing the font from the default Calibri to DejaVu Sans
(and optionally back to Calibri) makes the display problem go away, but
of course no user could reasonably be expected to go through that.

Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 | @DougEwell
Received on Fri Aug 05 2011 - 10:53:07 CDT
