Re: No Invisible Character - NBSP at the start of a word

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Nov 26 2004 - 18:23:36 CST


    On 26/11/2004 23:24, Doug Ewell wrote:

    > ...
    >
    >Most "break opportunities" are between words, a concept often indicated
    >by an ordinary space (U+0020). So you wouldn't generally have to
    >precede *every* combination of NBSP+combining mark with ZWSP "to ensure
    >a break opportunity," only those combinations preceded by a character
    >other than U+0020 that might inhibit the break. For example, if you
    >wanted to ensure a break opportunity following U+2014 EM DASH, you would
    >probably use the ZWSP, but you don't have to use it everywhere.
    >

    As I understand it (and I asked for confirmation of this but have not
    received it), according to the current version of UAX #14 there is no
    break opportunity between SPACE and NBSP, because rule LB11b precedes
    rule LB12, although there is a note "Many existing implementations
    reverse the order of precedence between rules LB11b and LB12." There is
    a proposed update to UAX #14 which has the effect of reversing these
    rules (except for WJ). But until this change has been accepted and fully
    implemented, surely I need to use the ZWSP. Indeed, to be safe I will
    always need the ZWSP as I can never be sure that the update has been
    implemented.
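
    To make the defensive sequence concrete, here is a minimal Python
    sketch. It assumes the isolated mark is carried on NBSP and that ZWSP
    is always prefixed, whatever character precedes it. The helper name
    isolated_mark and the choice of HEBREW POINT HIRIQ as the example mark
    are illustrative assumptions, not anything defined in UAX #14.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE: explicit line break opportunity
NBSP = "\u00a0"  # NO-BREAK SPACE: base character carrying the isolated mark


def isolated_mark(mark: str) -> str:
    """Return ZWSP + NBSP + mark (hypothetical helper, for illustration only)."""
    # General categories Mn, Mc and Me all start with "M".
    if len(mark) != 1 or not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a single combining mark")
    return ZWSP + NBSP + mark


seq = isolated_mark("\u05b4")  # HEBREW POINT HIRIQ, an arbitrary example
for ch in seq:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

    Always emitting the ZWSP costs one invisible character per occurrence,
    but the result no longer depends on which ordering of LB11b and LB12
    a given implementation follows.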

    >
    >I also wonder whether the RLM is needed for a construction that is
    >expected to occur amid a sea of Hebrew. U+00A0 is of type CS, which is
    >weak directional, meaning its directionality is dictated by that of
    >surrounding characters. If the surrounding characters are Hebrew (RTL),
    >the RLM seems redundant (though of course not "forbidden").
    >
    >
    The point here is that individual Hebrew words and short phrases are
    often embedded within LTR text, which may be some kind of markup. I
    don't want to see Hebrew words being garbled because markup has been
    added, or because they have been quoted in an otherwise LTR document. So
    again the safest thing is to use the RLM in every case, and to keep it
    with the rest of the word, e.g. when copying and pasting.
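
    The bidi classes involved can be checked with Python's standard
    unicodedata module. This is only a lookup of the character properties
    mentioned above, not a run of the bidirectional algorithm; the
    variable names are mine.

```python
import unicodedata

NBSP = "\u00a0"  # NO-BREAK SPACE
RLM = "\u200f"   # RIGHT-TO-LEFT MARK
ALEF = "\u05d0"  # HEBREW LETTER ALEF, standing in for any Hebrew letter

# Map each character name to its Bidi_Class as reported by unicodedata.
bidi_classes = {
    unicodedata.name(ch): unicodedata.bidirectional(ch)
    for ch in (NBSP, RLM, ALEF, "a")
}

for name, bidi_class in bidi_classes.items():
    print(f"{name}: {bidi_class}")
```

    NBSP is reported as CS (weak), while RLM and the Hebrew letter are R
    (strong) and the Latin letter is L. So when the only strong characters
    next to the NBSP are Latin, as in quoted or marked-up text, a leading
    RLM is what supplies the RTL context.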

    In fact this apparently leads to a small problem with text boundaries.
    If I understand UAX #29 correctly, in the combination <SPACE, RLM, X>,
    where X is any character which might form part of a word (including
    NBSP), the word boundary will be between RLM (as with any other format
    character) and X, not between SPACE and RLM. Is that correct? Or are
    both positions word boundaries? Either way, this seems undesirable. In
    such a situation, RLM affects what follows, not what precedes, and so
    the word (or other text) boundary should be only before RLM. Is this
    perhaps a change which should be made to UAX #29? My proposal would be
    to add rules for certain format characters (RLM, LRM, LRO, RLO, LRE,
    RLE, perhaps others?) which prevent a word break after these characters
    and before any ALetter or Numeric. But for PDF (POP DIRECTIONAL
    FORMATTING), which closes an embedding or override rather than opening
    one, the rule should perhaps prevent a word break before it.
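
    As a rough sketch of the proposed rule only (not of UAX #29 itself),
    the check could look like the Python below. The function name
    proposal_allows_break is hypothetical, and str.isalpha() and
    str.isdigit() are used as crude stand-ins for the ALetter and Numeric
    word-break properties, which Python's standard library does not
    expose.

```python
import unicodedata

# Directional format characters named in the proposal.
OPENERS = {
    "\u200e",  # LRM
    "\u200f",  # RLM
    "\u202a",  # LRE
    "\u202b",  # RLE
    "\u202d",  # LRO
    "\u202e",  # RLO
}
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def proposal_allows_break(before: str, after: str) -> bool:
    """Return False where the proposed rules would forbid a word break.

    Hypothetical helper: it models only the two proposed rules and
    approximates ALetter/Numeric with isalpha()/isdigit().
    """
    if before in OPENERS and (after.isalpha() or after.isdigit()):
        return False  # keep the mark attached to the word that follows it
    if after == PDF:
        return False  # keep PDF attached to the text that precedes it
    return True


# All of these are general category Cf (Format).
assert all(unicodedata.category(ch) == "Cf" for ch in OPENERS | {PDF})

print(proposal_allows_break(" ", "\u200f"))       # SPACE, RLM
print(proposal_allows_break("\u200f", "\u05d0"))  # RLM, HEBREW LETTER ALEF
```

    Under this sketch the boundary in <SPACE, RLM, X> falls before the
    RLM when X is a letter or digit, which is the behaviour I am asking
    for.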

    Perhaps this discussion should be moved to the bidi list?

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

