Adding Behdad for his insight on the rendering stack.
As for user requirements and expectations, though, the first option, with the
hyphen on the right side of "car" (that is, "car-"), is what a good publisher
would want to print in a magazine or book. The second option is harder for an
RTL reader to decipher.
(Note that good typography also avoids, as much as possible, breaking
opposite-direction phrases across lines in bidi paragraphs, since the output
looks odd to some readers either way.)
On Apr 1, 2014 1:21 PM, "Whistler, Ken" <ken.whistler_at_sap.com> wrote:
> I don’t think the answer is directly deduced from UAX #9, because
> it involves deciding where to insert a visible hyphen for display.
> However, I think the correct answer here is your number two guess,
> i.e. (in a RTL paragraph context):
>
>
>
> -car SI TORRAC
>
>
>
> A way to think about this, rather than starting from the BN nature
> of U+00AD, is to ask what would happen if there was an *explicit*
> hyphen-minus at the same position. Shortening your example
> line “CARROT IS car\u00AD” to just the equivalent of “ABC car-”,
> the outcome of the bidiref processing for a RTL paragraph context is:
>
>
>
> Trace: Entering br_UBA_ReverseLevels [L2]
> Current State: 19
> Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
> Bidi_Class:  R    R    R    R    L    L    L    R
> Levels:      1    1    1    1    2    2    2    1
> Runs:        <R-----------------------------------R>
>
> Order:       [7 4 5 6 3 2 1 0]
>
>
>
> In other words, on display:
>
> -car CBA
> <---------
>
> with the hyphen-minus at the *end* of the reordered line, as
> expected.
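
The reversal step in the trace above can be reproduced with a few lines of Python. This is a minimal sketch of UAX #9 rule L2 only, assuming the embedding levels have already been resolved; the function name reorder_l2 is made up for illustration and is not part of bidiref.

```python
def reorder_l2(levels):
    """Return logical indices in left-to-right visual order (UAX #9 rule L2).

    `levels` holds one resolved embedding level per character on the line.
    """
    order = list(range(len(levels)))
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return order  # purely left-to-right line: nothing to reverse
    # From the highest level down to the lowest odd level, reverse every
    # contiguous run of characters at that level or higher.
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return order


# "ABC car-" with uppercase standing in for RTL letters, levels as in the trace.
text = "ABC car-"
order = reorder_l2([1, 1, 1, 1, 2, 2, 2, 1])
print(order)                             # [7, 4, 5, 6, 3, 2, 1, 0]
print("".join(text[k] for k in order))   # -car CBA
```

Feeding it the seven levels that remain once the BN character is dropped, [1, 1, 1, 1, 2, 2, 2], gives [4, 5, 6, 3, 2, 1, 0], which matches the second trace.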
>
>
>
> If you run the same example, but substituting U+00AD for U+002D, you get:
>
>
>
> Trace: Entering br_UBA_ReverseLevels [L2]
> Current State: 19
> Text:        05D0 05D1 05D2 0020 0063 0061 0072 00AD
> Bidi_Class:  R    R    R    R    L    L    L    BN
> Levels:      1    1    1    1    2    2    2    x
> Runs:        <R-----------------------------------R>
>
> Order:       [4 5 6 3 2 1 0]
>
>
>
> And the display for that would be:
>
>
>
> car CBA
>
>
>
> But *then* your hyphenation algorithm would presumably kick in and decide
> that the U+00AD is at the end of the line and should display as a visible
> hyphen glyph. But “end of the line” here means the same as it would for
> the explicit hyphen-minus, so when you insert the visible hyphen glyph, you
> end up with the same result:
>
>
>
> -car CBA
>
>
>
> Another way of looking at this is that in order to line break your text in
> the first place, you need to be able to calculate the resolved display width
> to fit in the line. That would have to include the visual display of the
> inserted hyphen glyph. So once you have *decided* to break the line at the
> soft hyphen, in effect, you substitute a visual display symbol U+002D (or
> the actual hyphen U+2010, etc.) for U+00AD. *Then* run the UBA on the
> results to get the resolved order of all the elements on the line. The net
> effect should be the same.
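
The "substitute first, then run the UBA" order described above might look like this in outline. It is only a sketch: the names materialize_soft_hyphen and break_fits are hypothetical, and measuring width with len() assumes a monospaced display where every character is one cell wide, which a real renderer would replace with glyph metrics.

```python
SOFT_HYPHEN = "\u00ad"


def materialize_soft_hyphen(line, visible="\u002d"):
    """Swap a line-final U+00AD for a visible hyphen before bidi reordering.

    Soft hyphens elsewhere on the line are left alone; they are class BN
    and play no part in the reordering.
    """
    if line.endswith(SOFT_HYPHEN):
        return line[:-1] + visible
    return line


def break_fits(line, max_width, measure=len):
    """Check the width of the line *including* the inserted hyphen glyph.

    `measure` is an assumed stand-in for a real text-width function.
    """
    return measure(materialize_soft_hyphen(line)) <= max_width


first_line = "ABC car" + SOFT_HYPHEN
print(materialize_soft_hyphen(first_line))   # ABC car-
print(break_fits(first_line, 8))             # True
print(break_fits(first_line, 7))             # False
```

The string returned by materialize_soft_hyphen is what would then be handed to the bidi algorithm, so the visible hyphen gets a level like any other character on the line.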
>
>
>
> Maybe folks with full implementations of bidi rendering would have more to
> contribute on this, but that would be my own take on the problem.
>
>
>
> --Ken
>
> Suppose I have a paragraph (uppercase = RTL):
>
>
>
> CARROT IS car\u00ADrot IN ENGLISH
>
>
>
> and the paragraph gets broken at the soft hyphen.
>
>
>
> Is the correct ordering for the first line
>
>
>
> car- SI TORRAC
>
>
>
> or
>
>
>
> -car SI TORRAC
>
>
>
> ? I did not succeed in deducing the answer from UAX#9. Soft hyphen has
> bidi class BN, which means it gets removed in stage X9, and so, if I have
> understood correctly, doesn't have a defined embedding level.
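
The BN classification and the effect of X9 can be checked with Python's standard unicodedata module. A small sketch, where surviving_indices is a made-up helper and Hebrew letters stand in for the uppercase RTL text:

```python
import unicodedata

# Bidi classes that rule X9 removes from consideration.
X9_REMOVED = {"RLE", "LRE", "RLO", "LRO", "PDF", "BN"}


def surviving_indices(text):
    """Logical indices of the characters that are still present after X9."""
    return [
        i for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) not in X9_REMOVED
    ]


sample = "\u05d0\u05d1\u05d2 car\u00ad"
print([unicodedata.bidirectional(ch) for ch in sample])
# ['R', 'R', 'R', 'WS', 'L', 'L', 'L', 'BN']
print(surviving_indices(sample))
# [0, 1, 2, 3, 4, 5, 6]
print(unicodedata.bidirectional("\u002d"))
# ES
```

Index 7, the soft hyphen, never reaches the level-resolution rules, which is why the bidiref trace shows "x" for its level.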
>
>
>
> I'm guessing the correct ordering is the first one, but I don't trust my
> instincts here. (In particular, I wondered whether this was analogous to
> the case where rule L1 resets embedding levels so that trailing whitespace
> is at the visual end of the line.)
>
>
>
> More generally, suppose you have a markup language which has a construct
> for discretionary breaks, as in TeX, with pre-break, post-break and
> no-break text. Soft hyphen is a special case of this (where the pre-break
> text consists of a hyphen, and the post-break and no-break texts are empty); you
> can also regard space as a kind of discretionary break (post-break text
> empty, no-break text contains the space, pre-break text either contains the
> space or is empty, depending on how you want to think about it). Obviously
> the embedding level for the no-break text should be resolved as if
> discretionary break was replaced by the no-break text (which is consistent
> with a bidi class of BN for soft hyphen). However, for the pre- and
> post-break text, it is not clear to me what the right way is to resolve
> embedding levels (or how their content should be restricted so that there
> is a sensible way to resolve the embedding levels). I would be grateful for
> any suggestions.
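
For concreteness, the discretionary-break construct described above could be modelled as follows. This is a sketch of the data model only and does not settle how embedding levels for the pre- and post-break text should be resolved. Discretionary, unbroken_text and split_at are illustrative names, not an existing API.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Discretionary:
    """A TeX-style discretionary break with its three alternative texts."""
    pre: str = ""       # shown at the end of the line if the break is taken
    post: str = ""      # shown at the start of the next line if taken
    nobreak: str = ""   # shown in place if the break is not taken


# Soft hyphen as a special case: only the pre-break text is non-empty.
SOFT_HYPHEN = Discretionary(pre="-")

Item = Union[str, Discretionary]


def unbroken_text(items: List[Item]) -> str:
    """Text with every discretionary replaced by its no-break text."""
    return "".join(
        it.nobreak if isinstance(it, Discretionary) else it for it in items
    )


def split_at(items: List[Item], index: int) -> Tuple[str, str]:
    """Take the break at items[index]; return (first line, remainder)."""
    brk = items[index]
    if not isinstance(brk, Discretionary):
        raise ValueError("items[index] is not a discretionary break")
    head = unbroken_text(items[:index]) + brk.pre
    tail = brk.post + unbroken_text(items[index + 1:])
    return head, tail


paragraph = ["CARROT IS car", SOFT_HYPHEN, "rot IN ENGLISH"]
print(unbroken_text(paragraph))   # CARROT IS carrot IN ENGLISH
print(split_at(paragraph, 1))     # ('CARROT IS car-', 'rot IN ENGLISH')
```

One possible reading of the substitute-then-reorder approach is to run the bidi algorithm on each string returned by split_at, so that the pre- and post-break text are resolved as ordinary characters of their lines. That is an assumption, not something UAX #9 spells out.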
>
>
>
> James
>
> _______________________________________________
> Unicode mailing list
> Unicode_at_unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
Received on Tue Apr 01 2014 - 16:02:21 CDT