From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Nov 26 2004 - 18:23:36 CST
On 26/11/2004 23:24, Doug Ewell wrote:
> ...
>
>Most "break opportunities" are between words, a concept often indicated
>by an ordinary space (U+0020). So you wouldn't generally have to
>precede *every* combination of NBSP+combining mark with ZWSP "to ensure
>a break opportunity," only those combinations preceded by a character
>other than U+0020 that might inhibit the break. For example, if you
>wanted to ensure a break opportunity following U+2014 EM DASH, you would
>probably use the ZWSP, but you don't have to use it everywhere.
>
As I understand it (and I asked for confirmation of this but have not
received it), according to the current version of UAX #14 there is no
break opportunity between SPACE and NBSP, because rule LB11b precedes
rule LB12, although there is a note "Many existing implementations
reverse the order of precedence between rules LB11b and LB12." There is
a proposed update to UAX #14 which has the effect of reversing these
rules (except for WJ). But until this change has been accepted and fully
implemented, surely I need to use the ZWSP. Indeed, to be safe I will
always need the ZWSP as I can never be sure that the update has been
implemented.
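
A minimal Python sketch of the defensive sequence I mean, for anyone who wants to generate it. The helper name `isolated_mark` is hypothetical, and U+05B8 HEBREW POINT QAMATS is used only as an illustrative combining mark. The sketch just builds the string; it does not run any line-breaking algorithm.

```python
import unicodedata

ZWSP = "\u200B"  # ZERO WIDTH SPACE: explicit break opportunity
NBSP = "\u00A0"  # NO-BREAK SPACE: carrier for the isolated combining mark


def isolated_mark(mark: str, force_break: bool = True) -> str:
    """Return NBSP + mark, optionally preceded by ZWSP (hypothetical helper)."""
    # Refuse anything that is not a combining mark (general category M*).
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining mark")
    return (ZWSP if force_break else "") + NBSP + mark


QAMATS = "\u05B8"  # illustrative choice of mark, not taken from the thread
text = "word " + isolated_mark(QAMATS)

# List the characters after "word" by name to make the sequence visible.
names = [unicodedata.name(ch) for ch in text[4:]]
print(names)
```

Passing `force_break=False` gives the bare NBSP + mark form, which relies on the line breaker allowing a break between SPACE and NBSP.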
>
>I also wonder whether the RLM is needed for a construction that is
>expected to occur amid a sea of Hebrew. U+00A0 is of type CS, which is
>weak directional, meaning its directionality is dictated by that of
>surrounding characters. If the surrounding characters are Hebrew (RTL),
>the RLM seems redundant (though of course not "forbidden").
>
>
The point here is that individual Hebrew words and short phrases are
often embedded within LTR text, which may be some kind of markup. I
don't want to see Hebrew words being garbled because markup has been
added, or because they have been quoted in an otherwise LTR document. So
again the safest thing is to use the RLM in every case, and to keep it
with the rest of the word e.g. when copying and pasting.
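
The bidi classes involved can be checked with Python's standard `unicodedata` module. This is only a sketch: the helper `protect` is hypothetical, the choice of ALEF and "a" as sample strong characters is mine, and the code only looks up properties and prefixes RLM; it does not run the bidirectional algorithm.

```python
import unicodedata

RLM = "\u200F"   # RIGHT-TO-LEFT MARK
NBSP = "\u00A0"  # NO-BREAK SPACE
ALEF = "\u05D0"  # HEBREW LETTER ALEF, a sample strong RTL character

# Bidi classes as reported by the Unicode data shipped with Python.
classes = {
    "RLM": unicodedata.bidirectional(RLM),    # strong right-to-left
    "NBSP": unicodedata.bidirectional(NBSP),  # weak (common separator)
    "ALEF": unicodedata.bidirectional(ALEF),  # strong right-to-left
    "a": unicodedata.bidirectional("a"),      # strong left-to-right
}
print(classes)


def protect(fragment: str) -> str:
    """Prefix RLM unless the fragment already starts with it (hypothetical helper)."""
    return fragment if fragment.startswith(RLM) else RLM + fragment


# NBSP + mark embedded in LTR markup: the RLM travels with the fragment.
quoted = "<b>" + protect(NBSP + "\u05B8") + "</b>"
```

Because NBSP is weak, its resolved direction in `quoted` would otherwise depend on the surrounding Latin markup; the leading RLM supplies a strong RTL character next to it.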
In fact this apparently leads to a small problem with text boundaries.
If I understand it correctly from UAX #29, in the combination <SPACE,
RLM, X>, where X is any character which might form part of a word
(including NBSP), the word boundary will be between RLM (as with any
other format character) and X, not between SPACE and RLM. Is that
correct? Or are both word boundaries? If so, this seems undesirable. In
such a situation, RLM affects what follows, not what precedes, and so
the word (or other) boundary should be only before RLM. Is this perhaps a
change which should be made to UAX #29? My proposal would be to add
rules for certain format characters (RLM, LRM, LRO, RLO, LRE, RLE,
perhaps others?) which prevent a word break after these characters and
before any ALetter or Numeric. But for PDF (POP DIRECTIONAL FORMATTING)
the rule should perhaps prevent a word break before it.
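
To make the proposal concrete, here is a rough Python sketch of the extra rule. It is not part of UAX #29 and is not a word-boundary implementation: the function names are hypothetical, and ALetter/Numeric are only approximated by general categories L* and Nd, since the standard library does not expose the Word_Break property.

```python
import unicodedata

# Format characters after which the proposal would forbid a word break:
# LRM, RLM, LRE, RLE, LRO, RLO.
NO_BREAK_AFTER = {"\u200E", "\u200F", "\u202A", "\u202B", "\u202D", "\u202E"}
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def is_wordlike(ch: str) -> bool:
    """Crude stand-in for ALetter or Numeric (assumption: category L* or Nd)."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"


def proposal_forbids_break(before: str, after: str) -> bool:
    """True if the proposed rule would forbid a word break between the two characters."""
    # No break after a directional mark or embedding/override opener
    # when a word-like character follows it.
    if before in NO_BREAK_AFTER and is_wordlike(after):
        return True
    # PDF closes what precedes it, so no break immediately before it.
    if after == PDF:
        return True
    return False


# <SPACE, RLM, ALEF>: the break stays before RLM, none between RLM and ALEF.
print(proposal_forbids_break(" ", "\u200F"))       # rule does not apply
print(proposal_forbids_break("\u200F", "\u05D0"))  # rule forbids the break
```

The function only answers whether this one proposed rule applies; every other boundary decision would still come from the existing rules.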
Perhaps this discussion should be moved to the bidi list?
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/