RE: Bidi reordering of soft hyphen

From: Roozbeh Pournader <roozbeh_at_unicode.org>
Date: Tue, 1 Apr 2014 14:00:25 -0700

Adding Behdad for his insight on the rendering stack.

But as for user requirements and expectations, the first option, with the
hyphen on the right side of "car" as "car-" is what a good publisher would
want to print in his magazine or book. The second option is harder to
decipher for an RTL reader.

(Note that breaking opposite-direction phrases across lines in bidi
paragraphs is also avoided as much as possible in good typography, as the
output is weird to some readers anyway.)

On Apr 1, 2014 1:21 PM, "Whistler, Ken" <ken.whistler_at_sap.com> wrote:

> I don’t think the answer is directly deduced from UAX #9, because
>
> it involves deciding where to insert a visible hyphen for display.
>
> However, I think the correct answer here is your number two guess,
>
> i.e. (in a RTL paragraph context):
>
>
>
> -car SI TORRAC
>
>
>
> A way to think about this, rather than starting from the BN nature
>
> of U+00AD, is to ask what would happen if there was an *explicit*
>
> hyphen-minus at the same position. Shortening your example
>
> line “CARROT IS car\u00AD” to just the equivalent of “ABC car-”,
>
> the outcome of the bidiref processing for a RTL paragraph context is:
>
>
>
> Trace: Entering br_UBA_ReverseLevels [L2]
>
> Current State: 19
>
> Text: 05D0 05D1 05D2 0020 0063 0061 0072 002D
>
> Bidi_Class: R R R R L L L R
>
> Levels: 1 1 1 1 2 2 2 1
>
> Runs: <R-----------------------------------R>
>
>
>
> Order: [7 4 5 6 3 2 1 0]
>
>
>
> In other words, on display:
>
>
>
> -car CBA
>
> <---------
>
>
>
> with the hyphen-minus at the *end* of the reordered line, as
>
> expected.
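The reversal step in the trace above can be reproduced with a small sketch of rule L2. This is an illustration only, not the bidiref code: the function name `reorder_visual` is hypothetical, and it assumes the embedding levels have already been resolved and that a removed BN character is simply absent from the input.

```python
def reorder_visual(levels):
    """Return logical indices in visual (left-to-right) order, per rule L2.

    From the highest level down to the lowest odd level, reverse every
    contiguous run of characters at that level or higher.
    """
    order = list(range(len(levels)))
    if not levels:
        return order
    highest = max(levels)
    # If there is no odd level at all, nothing needs reversing.
    lowest_odd = min((lv for lv in levels if lv % 2), default=highest + 1)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = order[i:j][::-1]  # reverse this run in place
                i = j
            else:
                i += 1
    return order


# Levels from the trace with an explicit hyphen-minus (U+002D):
print(reorder_visual([1, 1, 1, 1, 2, 2, 2, 1]))  # [7, 4, 5, 6, 3, 2, 1, 0]
# Levels from the trace where the soft hyphen was removed as BN:
print(reorder_visual([1, 1, 1, 1, 2, 2, 2]))     # [4, 5, 6, 3, 2, 1, 0]
```

Index 7 (the hyphen-minus) comes first in the visual order, i.e. at the far left, which is the visual end of a right-to-left line.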
>
>
>
> If you run the same example, but substituting U+00AD for U+002D, you get:
>
>
>
> Trace: Entering br_UBA_ReverseLevels [L2]
>
> Current State: 19
>
> Text: 05D0 05D1 05D2 0020 0063 0061 0072 00AD
>
> Bidi_Class: R R R R L L L BN
>
> Levels: 1 1 1 1 2 2 2 x
>
> Runs: <R-----------------------------------R>
>
>
>
> Order: [4 5 6 3 2 1 0]
>
>
>
> And the display for that would be:
>
>
>
> car CBA
>
>
>
> But *then* your hyphenation algorithm would presumably kick in and decide
>
> that the U+00AD is at the end of the line and should display as a visible
>
> hyphen glyph. But “end of the line” here means the same as it would for
>
> the explicit hyphen-minus, so when you insert the visible hyphen glyph, you
>
> end up with the same result:
>
>
>
> -car CBA
>
>
>
> Another way of looking at this is that in order to line break your text in
>
> the first place, you need to be able to calculate the resolved display width
>
> to fit in the line. That would have to include the visual display of the inserted
>
> hyphen glyph. So once you have *decided* to break the line at the soft
>
> hyphen, in effect, you substitute a visual display symbol U+002D (or
>
> the actual hyphen U+2010, etc.) for U+00AD. *Then* run the UBA on the
>
> results to get the resolved order of all the elements on the line. The net
>
> effect should be the same.
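The "substitute, then reorder" approach described above can be sketched as follows. This is a toy under strong assumptions, not a UBA implementation: it handles only an RTL paragraph containing strong L/R characters plus neutrals, resolves neutrals with simplified N1/N2 rules, and ignores numbers, explicit embeddings, isolates, and brackets. All function names are hypothetical.

```python
import unicodedata

SOFT_HYPHEN = "\u00AD"


def substitute_break_hyphen(line, visible="\u2010"):
    """Replace a line-final soft hyphen with a visible hyphen character."""
    if line.endswith(SOFT_HYPHEN):
        return line[:-1] + visible
    return line


def toy_rtl_levels(text):
    """Toy level resolution for an RTL paragraph (assumption: no digits,
    no explicit formatting characters). Strong L -> 2, everything that
    resolves to R -> 1."""
    classes = []
    for ch in text:
        bc = unicodedata.bidirectional(ch)
        classes.append("L" if bc == "L" else "R" if bc in ("R", "AL") else "N")
    levels = []
    n = len(classes)
    for i, c in enumerate(classes):
        if c == "N":
            # Nearest strong type on each side; the paragraph edges count as R.
            before = next((classes[k] for k in range(i - 1, -1, -1)
                           if classes[k] != "N"), "R")
            after = next((classes[k] for k in range(i + 1, n)
                          if classes[k] != "N"), "R")
            c = before if before == after else "R"
        levels.append(2 if c == "L" else 1)
    return levels


def visual_string(text):
    """Apply rule L2 to the toy levels and return the characters in
    left-to-right display order."""
    levels = toy_rtl_levels(text)
    order = list(range(len(text)))
    for level in (2, 1):
        i = 0
        while i < len(order):
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            if j > i:
                order[i:j] = order[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(text[k] for k in order)


line = "\u05D0\u05D1\u05D2 car\u00AD"          # logical order, ends in soft hyphen
shown = substitute_break_hyphen(line, "-")     # decide to break here
print(toy_rtl_levels(shown))                   # [1, 1, 1, 1, 2, 2, 2, 1]
print(visual_string(shown) == "-car \u05D2\u05D1\u05D0")  # True
```

Under these assumptions the substituted hyphen resolves to the paragraph direction, so it lands at the left edge of the line, matching the trace for the explicit hyphen-minus.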
>
>
>
> Maybe folks with full implementations of bidi rendering would have more to
>
> contribute on this, but that would be my own take on the problem.
>
>
>
> --Ken
>
>
>
>
>
>
>
> Suppose I have a paragraph (uppercase = RTL):
>
>
>
> CARROT IS car\u00ADrot IN ENGLISH
>
>
>
> and the paragraph gets broken at the soft hyphen.
>
>
>
> Is the correct ordering for the first line
>
>
>
> car- SI TORRAC
>
>
>
> or
>
>
>
> -car SI TORRAC
>
>
>
> ? I did not succeed in deducing the answer from UAX#9. Soft hyphen has
> bidi class BN, which means it gets removed by rule X9, and so, if I have
> understood correctly, doesn't have a defined embedding level.
>
>
>
> I'm guessing the correct ordering is the first one, but I don't trust my
> instincts here. (In particular, I wondered whether this was analogous to
> the case where rule L1 resets embedding levels so that trailing whitespace
> is at the visual end of the line.)
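The point about BN and rule X9 can be checked with Python's standard `unicodedata` module. The helper below is a minimal sketch (the name `remove_x9` is hypothetical) that drops the character classes listed in rule X9 and keeps the original logical index of each surviving character. It assumes the pre-isolate wording of X9 and does not model the alternative of retaining BN characters that UAX #9 also permits.

```python
import unicodedata

# Bidi classes that rule X9 removes before further processing.
X9_REMOVED = {"RLE", "LRE", "RLO", "LRO", "PDF", "BN"}


def remove_x9(text):
    """Return (logical_index, char) pairs for characters that survive X9."""
    return [(i, ch) for i, ch in enumerate(text)
            if unicodedata.bidirectional(ch) not in X9_REMOVED]


line = "\u05D0\u05D1\u05D2 car\u00AD"
print(unicodedata.bidirectional("\u00AD"))   # BN
print([i for i, _ in remove_x9(line)])       # [0, 1, 2, 3, 4, 5, 6]
```

The soft hyphen at index 7 is gone after this step, which is why it has no level in the second trace and why its placement has to be decided by the line breaker rather than by the reordering rules.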
>
>
>
> More generally, suppose you have a markup language which has a construct
> for discretionary breaks, as in TeX, with pre-break, post-break and
> no-break text. Soft hyphen is a special case of this (where the pre-break
> text consists of a hyphen, and the post- and no-break texts are empty); you
> can also regard space as a kind of discretionary break (post-break text
> empty, no-break text contains the space, pre-break text either contains the
> space or is empty, depending on how you want to think about it). Obviously
> the embedding level for the no-break text should be resolved as if
> discretionary break was replaced by the no-break text (which is consistent
> with a bidi class of BN for soft hyphen). However, for the pre- and
> post-break text, it is not clear to me what the right way is to resolve
> embedding levels (or how their content should be restricted so that there
> is a sensible way to resolve the embedding levels). I would be grateful for
> any suggestions.
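The discretionary-break construct described above can be modelled with a small data structure. This is only a sketch of the model, with hypothetical names (`Discretionary`, `materialize`); it does not answer the open question of how levels for pre- and post-break text should be resolved. It merely shows one option consistent with the reply earlier in this thread: build each line's text after the break decision, then run the bidi algorithm on that text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discretionary:
    pre_break: str    # shown at the end of the line if the break is taken
    post_break: str   # shown at the start of the next line if taken
    no_break: str     # shown in place if the break is not taken


# Soft hyphen as a special case: only the pre-break text is non-empty.
SOFT_HYPHEN = Discretionary(pre_break="\u2010", post_break="", no_break="")
# Space as a discretionary (here with empty pre-break text).
SPACE = Discretionary(pre_break="", post_break="", no_break=" ")


def materialize(items, break_at=None):
    """Build the line strings for a sequence of str / Discretionary items.

    break_at is the index of the one discretionary taken as a line break,
    or None if no break is taken.
    """
    lines = [""]
    for i, item in enumerate(items):
        if isinstance(item, str):
            lines[-1] += item
        elif i == break_at:
            lines[-1] += item.pre_break
            lines.append(item.post_break)
        else:
            lines[-1] += item.no_break
    return lines


items = ["car", SOFT_HYPHEN, "rot", SPACE, "cake"]
print(materialize(items))              # ['carrot cake']
print(materialize(items, break_at=1))  # ['car‐', 'rot cake']
```

Each resulting line string could then be passed to a bidi implementation as ordinary text, so the pre-break hyphen is resolved like any other character at the end of the line.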
>
>
>
> James
>
> _______________________________________________
> Unicode mailing list
> Unicode_at_unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>

Received on Tue Apr 01 2014 - 16:02:21 CDT
