Re: No Invisible Character - NBSP at the start of a word

From: Asmus Freytag (
Date: Sat Nov 27 2004 - 15:48:21 CST

    At 04:23 PM 11/26/2004, Peter Kirk wrote:
    >As I understand it (and I asked for confirmation of this but have not
    >received it), according to the current version of UAX #14 there is no
    >break opportunity between SPACE and NBSP, because rule LB11b precedes rule
    >LB12, although there is a note "Many existing implementations reverse the
    >order of precedence between rules LB11b and LB12." There is a proposed
    >update to UAX #14 which has the effect of reversing these rules (except
    >for WJ). But until this change has been accepted and fully implemented,
    >surely I need to use the ZWSP. Indeed, to be safe I will always need the
    >ZWSP as I can never be sure that the update has been implemented.

    This is a fine case of mis-applied conservatism.

    The issue of relative *strength* of NBSP and SPACE predates Unicode, since
    both characters are already available as part of 8859-1 and many other
    character sets based on or equivalent to this standard.

    The change that the UTC has approved for UAX#14 simply recognizes the fact
    that this was not an open issue for Unicode to settle, but an issue long
    settled by custom, with implementations found to favor what is now also the
    officially recommended approach.

    Getting the recommendation in line with existing practice is important to
    allow users like you to rely on the behavior of certain specialized
    characters, such as NBSP and SPACE, so that you don't need to try to add
    ZWSP on suspicion.
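
To make the precedence question concrete, here is a minimal Python sketch of how the order of two rules decides whether a break is allowed between SPACE and NBSP. The functions `glue_rule` and `space_rule` are simplified, hypothetical stand-ins for the two rules discussed above (they are not the rule text of UAX#14), and `can_break` is an illustrative helper, not part of any library.

```python
SPACE = "\u0020"
NBSP = "\u00A0"

def glue_rule(before, after):
    """Simplified stand-in for the NBSP rule: no break on either side of NBSP."""
    if before == NBSP or after == NBSP:
        return False
    return None  # rule does not apply to this pair

def space_rule(before, after):
    """Simplified stand-in for the SPACE rule: a break is allowed after SPACE."""
    if before == SPACE:
        return True
    return None  # rule does not apply to this pair

def can_break(before, after, rules):
    """Return the verdict of the first applicable rule; default to no break."""
    for rule in rules:
        verdict = rule(before, after)
        if verdict is not None:
            return verdict
    return False

# Assumed orderings: NBSP rule first (the order described in the quoted text),
# and SPACE rule first (the order that matches existing practice).
OLD_ORDER = (glue_rule, space_rule)
NEW_ORDER = (space_rule, glue_rule)

print(can_break(SPACE, NBSP, OLD_ORDER))  # False
print(can_break(SPACE, NBSP, NEW_ORDER))  # True
```

Only the SPACE-then-NBSP pair changes between the two orderings in this sketch; a pair such as NBSP followed by a letter stays unbreakable either way.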

    It's important to note that the specifications in UAX#14 are, for the most
    part, not mandatory, nor can they be correct for all publishing styles,
    languages or types of documents. They also completely punt on Southeast
    Asian scripts, since those require a different type of algorithm.

    They are intended as a serviceable baseline, which many less demanding
    applications could implement as-is, and which could serve as a basis for
    further tailoring in more sophisticated implementations.

    There simply is no portable way to guarantee exactly the same linebreak
    behavior across implementations, across protocols and across markup
    languages. Where such stability is required, say for legal documents, you
    are limited to any of the protocols that express final form documents, such
    as PDF.

    If you just load up your text with ZWSP, you run the risk of encountering an
    implementation that does not support ZWSP at all, with potentially
    interesting (and unintended) results. I believe that risk is much greater
    than the risk of relying on a break between SPACE and NBSP.
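
For text that already carries a ZWSP added on suspicion, the following Python sketch shows one way to locate and drop a ZWSP that sits directly between SPACE and NBSP. The helper names `find_defensive_zwsp` and `strip_defensive_zwsp` are hypothetical, and the assumption that only this exact SPACE, ZWSP, NBSP sequence counts as "defensive" is mine; any other ZWSP is left alone.

```python
import re

SPACE, NBSP, ZWSP = "\u0020", "\u00A0", "\u200B"

# Matches a ZWSP only when it is preceded by SPACE and followed by NBSP.
_DEFENSIVE_ZWSP = re.compile(f"(?<={SPACE}){ZWSP}(?={NBSP})")

def find_defensive_zwsp(text):
    """Return the index of every ZWSP found between SPACE and NBSP."""
    return [m.start() for m in _DEFENSIVE_ZWSP.finditer(text)]

def strip_defensive_zwsp(text):
    """Remove each ZWSP found between SPACE and NBSP; keep all others."""
    return _DEFENSIVE_ZWSP.sub("", text)

sample = "a" + SPACE + ZWSP + NBSP + "b" + ZWSP + "c"
print(find_defensive_zwsp(sample))  # [2]
print(strip_defensive_zwsp(sample) == "a" + SPACE + NBSP + "b" + ZWSP + "c")  # True
```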


    PS: The revised text of UAX#14 will not be published until Unicode 4.1, but
    the change to the rules has been endorsed by the UTC. While the UTC can
    change its mind before publication, it could do so after publication as
    well. This is different from assigning character codes, as you know.
