RE: Unclear text in the UBA (UAX#9) of Unicode 6.3 from Whistler, Ken on 2014-04-21 (Unicode Mail List Archive)

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Tue, 22 Apr 2014 00:44:12 +0000

Ilya noted:

> [Below, I completely ignore BIDI part of the specification, and
> concentrate ONLY on the parens match. I do not understand why this
> question is interlaced with BIDI determination; I trust that it is.]

Actually, it is, because the bracket-matching is really only interesting
in the cases where the boundaries of the isolating runs are in
question, and there are some directional differences in the runs.
The whole point of introducing the paired bracket complication was
to deal with edge cases for that, but...

> So one may ask: what will be the result of the CURRENT UNICODE parsing
> applied to Philippe’s example?
>
> This is an [«] example [»] for demonstration only.

That is easily answered. Let's crank up the bidi reference code with
a shorter example that contains the relevant units: a [«] b [»] c

Turn up the trace output to see what rule N0 is actually doing,
and you get the following. (For best interpretation, set your display
wide enough that the output lines do not wrap.)

Trace: Entering br_UBA_ResolvePairedBrackets [N0]
Trace: br_PushBracketStack, bracket=005D, pos=2
Trace: br_PeekBracketStack, stack=00614808, top=00614810, tsptr=00614810
Trace: br_PeekBracketStack, bracket=005D, pos=2
Appended pair: opening pos 2, closing pos 4
Trace: br_PopBracketStack, #elements=1
Matched bracket
Trace: br_PushBracketStack, bracket=005D, pos=8
Trace: br_PeekBracketStack, stack=00614808, top=00614810, tsptr=00614810
Trace: br_PeekBracketStack, bracket=005D, pos=8
Appended pair: opening pos 8, closing pos 10
Trace: br_PopBracketStack, #elements=1
Matched bracket
Trace: Entering br_SortPairList
Pair list: {2,4} {8,10}
Append at end
Trace: Exiting br_SortPairList
Pair list: {2,4} {8,10}
Debug: No strong direction between brackets
Debug: No strong direction between brackets
Current State: 14
  Text: 0061 0020 005B 00AB 005D 0020 0062 0020 005B 00BB 005D 0020 0063
  Bidi_Class: L WS ON ON ON WS L WS ON ON ON WS L
  Levels: 0 0 0 0 0 0 0 0 0 0 0 0 0
  Runs: <L------------------------------------------------------------L>

Because of the way the stack processing is defined, the first bracket pair is [«]
and the second bracket pair is [»]. The algorithm does not push down potential
matches while seeking a largest outer pair to match. One could (particularly
if one is mathematically inclined) argue that that is not the right way to do the
matching, but it *is* the way the algorithm is currently defined. And, to the
best of my knowledge, it is the way both of the bidi reference implementations,
the ICU implementation, the Microsoft implementation, and the HarfBuzz
implementation behave, and it is what all of the BidiCharacterTest.txt data
expects. Other implementations would have to be doing the same, or they would
be failing the conformance tests in BidiCharacterTest.txt.
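
The stack processing described above can be sketched roughly as follows.
This is only an illustration in Python, not the reference code: the
function name find_bracket_pairs and the three-entry bracket table are
made up for the example (the real data lives in BidiBrackets.txt), and the
sketch ignores details such as canonically equivalent brackets and the
restriction to brackets whose current bidi type is ON. Note that « and »
are not in the table, because they are not paired brackets for this
purpose. As in the trace, what gets pushed for an opening bracket is the
closing bracket it expects (005D at the position of 005B).

```python
# Hypothetical, hand-picked subset of the paired-bracket data.
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPEN_TO_CLOSE.values())


def find_bracket_pairs(text):
    """Return (opening pos, closing pos) pairs, sorted by opening position."""
    stack = []  # entries: (expected closing bracket, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in OPEN_TO_CLOSE:
            stack.append((OPEN_TO_CLOSE[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack down for a matching opener.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    pairs.append((stack[i][1], pos))
                    # Pop the match and everything pushed after it.
                    del stack[i:]
                    break
            # No match anywhere on the stack: ignore this closing bracket.
    return sorted(pairs)


print(find_bracket_pairs("a [\u00ab] b [\u00bb] c"))  # [(2, 4), (8, 10)]
```

Each closing bracket is matched as soon as it is seen, so the pairs come
out as {2,4} and {8,10}, the same pair list the trace reports.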

Note that for an all left-to-right run of text like this, with no isolating runs and
no embeddings, the implications of rule N0 are trivial and uninteresting. The
bracket matches don’t end up *doing* anything relevant to the text reordering
for bidi in this example. But once you start mixing directions of text and adding
embeddings and isolating runs, the bracket matching starts to affect the output
in non-trivial ways.
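
To make the "No strong direction between brackets" lines in the trace
concrete, here is a rough sketch of how the direction of each matched pair
might be resolved, based on my reading of rule N0. It is an assumption-laden
simplification, not the reference code: the names strong, resolve_pair and
apply_n0 are invented, bidi types are passed in as plain strings, and sos
(the start-of-sequence direction) is supplied by the caller.

```python
def strong(bidi_type):
    """Map a bidi type to 'L', 'R' or None; EN and AN count as R here."""
    if bidi_type == "L":
        return "L"
    if bidi_type in ("R", "EN", "AN"):
        return "R"
    return None


def resolve_pair(types, opener, closer, embedding, sos):
    """Return the direction for one bracket pair, or None to leave it alone."""
    inside = {strong(t) for t in types[opener + 1:closer]} - {None}
    if not inside:
        return None  # no strong type between the brackets
    if embedding in inside:
        return embedding  # a strong type matching the embedding direction
    # Only the opposite direction occurs inside: look at the context that
    # precedes the opening bracket, falling back to sos.
    opposite = "R" if embedding == "L" else "L"
    context = sos
    for t in reversed(types[:opener]):
        if strong(t) is not None:
            context = strong(t)
            break
    return opposite if context == opposite else embedding


def apply_n0(types, pairs, embedding, sos):
    """Resolve pairs in order of their opening brackets; returns a new list."""
    types = list(types)
    for opener, closer in sorted(pairs):
        direction = resolve_pair(types, opener, closer, embedding, sos)
        if direction is not None:
            types[opener] = types[closer] = direction
    return types


example = "L WS ON ON ON WS L WS ON ON ON WS L".split()
print(apply_n0(example, [(2, 4), (8, 10)], "L", "L") == example)  # True
```

For the example above, each pair encloses only an ON character, so neither
pair is assigned a direction and the types are left unchanged.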

--Ken

Received on Mon Apr 21 2014 - 19:45:22 CDT
