Re: ZWNBSP vs. WJ (was: How is NBH (U0083) Implemented?)

From: Asmus Freytag (w) <asmusf_at_ix.netcom.com>
Date: Fri, 5 Aug 2011 14:21:43 -0700 (GMT-07:00)

The ambiguity of an initial FEFF was not desirable, but this discussion shows that certain things can't be so easily "fixed" by adding characters at a later stage.

The more time elapses between the encoding of the ambiguous character and the later "fix", the more software, data, and protocols exist that support the original character, creating backward-compatibility issues.

Incidentally, this is totally what I expected when the WJ was proposed, but sentiment in favor of its addition ran high at the time...

The ZWNBSP was present in Unicode 1.0 (1991), while the WJ was added in 3.2 (2002), about 10 years later. We are now an additional 10 years down the road, and the interim result is that WJ has muddied the waters rather than clarifying the issue.

Somewhere here are lessons to be learned.

A./

-----Original Message-----
>From: Doug Ewell <doug_at_ewellic.org>
>Sent: Aug 5, 2011 8:49 AM
>To: unicode_at_unicode.org
>Subject: Re: ZWNBSP vs. WJ (was: How is NBH (U0083) Implemented?)
>
>Jukka K. Korpela <jkorpela at cs dot tut dot fi> wrote:
>
>> So? It was, and it still often is, better to use ISO 8859-1 rather
>> than Unicode, in situations where there is no tangible benefit, or just a
>> small benefit, from using Unicode. For example, many people are still
>> conservative about encodings in e-mail, for good reasons, so they use
>> ISO 8859-1 or, as you did in your message, windows-1252.
>
>A word about my encoding "choices." My first message on Thursday was
>sent from my home PC, using Windows Live Mail, and it used UTF-8 because
>I configured Windows Live Mail to do so. My second message was sent
>from my mobile device, and used Windows-1252. I don't know if there is
>a way to tell the device to use UTF-8 for outgoing messages, but I can
>say it was not my conscious intent to prefer Windows-1252 over Unicode.
>
>This message is being sent via a Web interface; I guess we'll find out
>what encoding it chooses for me.
>
>> On the other hand, this isn’t comparable to ZWNBSP vs. WJ. These
>> control characters do the same job in text, as per the standard, so
>> the practical question is simply which one is better supported.
>
>ZWNBSP, like WJ, is intended to inhibit breaking between words. Despite
>the other (and original) intended use of U+FEFF at the start of a text
>as a byte-order mark, there is a pervasive belief that an initial U+FEFF
>means the text should be treated as beginning with some kind of space
>character. This is silly, since there is no concept of "between words"
>at the start of a text, but it is nevertheless the way people perceive
>things.
>
>WJ was introduced to encourage users to separate these two functions.
>If users don't adopt it, the problem will never be solved. There are
>enough issues in Unicode that cannot be fixed due to stability concerns;
>it would be nice to be able to fix this one at least.
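
The separation described above can be sketched in a few lines. This is only an illustration, assuming Python 3 and text that has already been decoded to a string; the helper name split_bom_and_joiners is hypothetical, and treating every non-initial U+FEFF as a word joiner is an assumption, not something the standard requires a converter to do.

```python
from typing import Tuple

BOM = "\ufeff"          # U+FEFF: byte-order mark, also ZERO WIDTH NO-BREAK SPACE
WORD_JOINER = "\u2060"  # U+2060: WORD JOINER


def split_bom_and_joiners(text: str) -> Tuple[bool, str]:
    """Return (had_bom, body).

    A single leading U+FEFF is treated as a byte-order mark and dropped.
    Any U+FEFF left in the body is assumed to be meant as a zero-width
    no-break space and is rewritten as U+2060.
    """
    had_bom = text.startswith(BOM)
    body = text[1:] if had_bom else text
    return had_bom, body.replace(BOM, WORD_JOINER)
```

Only the first U+FEFF is read as a signature; a second one directly after it falls into the body and is converted like any other interior occurrence.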
>
>I still question how many real-world texts use either U+FEFF or U+2060
>to achieve this non-breaking behavior.
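
One rough way to put a number on that question for a given corpus is to count the two characters directly. The sketch below assumes Python 3 and already-decoded text; the function name count_joiners is hypothetical, and a single leading U+FEFF is assumed to be a byte-order mark and is left out of the count.

```python
from collections import Counter


def count_joiners(text):
    """Count interior U+FEFF and U+2060 in text.

    One leading U+FEFF is assumed to be a BOM and is not counted.
    """
    if text.startswith("\ufeff"):
        text = text[1:]
    counts = Counter(ch for ch in text if ch in ("\ufeff", "\u2060"))
    return {"U+FEFF": counts["\ufeff"], "U+2060": counts["\u2060"]}
```

A count like this says nothing about intent: an interior U+FEFF may also be a stray BOM left behind when files were concatenated.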
>
>> ISO 8859-1 and Unicode perform very different jobs, so that using ISO
>> 8859-1, you limit your character repertoire (at least as regards
>> directly representable characters, as opposed to various “escape
>> notations”). If you don’t need anything outside the ISO 8859-1, the
>> choice used to be very simple, though nowadays it has become a little
>> more complicated (as e.g. Google Groups seems to munge ISO 8859-1 data
>> in quotations but processes UTF-8 properly)
>
>UTF-8 has the property of being easily detected and verified as such,
>which solves part of the Google Groups problem (inability to detect
>which SBCS is being used). The other part of the problem is the
>practice of using heuristics to override an explicit charset
>declaration, but that is a topic for another day.
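
The detection property mentioned above can be illustrated with a strict decode. This is a minimal sketch assuming Python 3; the function name looks_like_utf8 is hypothetical, and a successful decode is only strong evidence, not proof, that the sender meant UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII passes this check as well, since it is valid UTF-8. Text in a single-byte character set that contains accented letters will usually fail it, but the check cannot say which single-byte set was used.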
>
>> I won’t make any statements about full compliance, but in Microsoft
>> Office Word 2007, U+FEFF alias ZWNBSP does its basic job (inside text)
>> in most situations whereas U+2060 alias WJ seems to be not recognized
>> at all and appears as some sort of a visible box. So to have a job
>> done, there is not much of a choice. (Word 2007 fails to honor ZWNBSP
>> semantics after EN DASH, which is bad, but it does not make it useless
>> in other situations.)
>
>It does always come down to a complaint against Microsoft, doesn't it?
>Unfortunately, Yucca is right here: opening Word 2007 and pasting a
>snippet of text with embedded ZWNBSP does display correctly, while the
>same experiment with embedded WJ shows a .notdef box. This seems to be
>a font-coverage problem, amplified by Word's silent overriding of user
>font choices—changing the font from the default Calibri to DejaVu Sans
>(and optionally back to Calibri) makes the display problem go away, but
>of course no user could reasonably be expected to go through that.
>
>--
>Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
>www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
>
Received on Fri Aug 05 2011 - 16:24:56 CDT
