Proposed Update UAX #9, Unicode Bidirectional Algorithm from CE Whitehead on 2013-01-19 (Unicode Mail List Archive)

From: CE Whitehead <cewcathar_at_hotmail.com>
Date: Sat, 19 Jan 2013 13:11:33 -0500

Hi, I am commenting on Marcin
Grzegorczyk's comments here; I also have one comment on Philippe Verdy's
comments, which follow Marcin's on the feedback page (http://www.unicode.org/review/pri232/). I hope it's clear who is speaking below.

Date/Time: Thu Dec 13 17:39:27 CST 2012

Contact: mgrzegor_at_poczta.onet.pl

Name: Marcin Grzegorczyk

Report Type: Public Review Issue

Opt Subject: Feedback on PRI #232

> In addition to Aharon Lanin’s comments, I would like to point out
that the term “external neighbor” in the proposed rule N0 is ambiguous
without a definition. It could mean either the adjacent character,
or the nearest strong type; and in both cases sos/eos may be included or
not.

> Also, I am not happy about the idea of having the UBA refer to
properties not directly related to bidi (namely, the General Category
property). In fact, since the proposed update already adds new bidi
classes for isolates, it might add new bidi classes for paired
punctuation as well. I believe it would not only allow for more
flexibility (e.g. if a need arises to include characters of a different
General Category), but also enable expressing rule N0 more clearly.

> Below is my list of proposed changes (relative to UAX #9 rev. 28 draft 4) based on this idea.

>------

> Add two new values to X_Bidi_Class (or Bidi_Class_X as per Asmus’s
suggestion): Opening_Punctuation (OP) and Closing_Punctuation (CP).

> Assign OP to all characters with General_Category=Open_Punctuation for which Bidi_Mirroring_Glyph is not .

I like this idea, which has been discussed previously.

> Assign CP to all characters with General_Category=Close_Punctuation for which Bidi_Mirroring_Glyph is not .

> In Tables 3 and 4, add the new two classes to the Neutral category.

O.k. so far.

> [Note: I believe that as of 6.2.0 all characters with gc=Ps or gc=Pe have bc=ON.]

This is my understanding, too.
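
To make the two assignments above concrete, here is a minimal Python sketch. It rests on an assumption: Python's standard library does not expose Bidi_Mirroring_Glyph, so Bidi_Mirrored=Yes (unicodedata.mirrored) is used as a stand-in for "has a mirroring glyph", which is close but not identical. The function name proposed_class is only for illustration and is not part of Marcin's proposal.

```python
import unicodedata

def proposed_class(ch):
    """Return 'OP', 'CP', or the character's existing Bidi_Class.

    Assumption: unicodedata.mirrored() (Bidi_Mirrored) stands in for
    "Bidi_Mirroring_Glyph is set", which the stdlib does not expose.
    """
    gc = unicodedata.category(ch)
    if unicodedata.mirrored(ch):
        if gc == "Ps":   # General_Category=Open_Punctuation
            return "OP"
        if gc == "Pe":   # General_Category=Close_Punctuation
            return "CP"
    return unicodedata.bidirectional(ch)

# U+201A is gc=Ps but not mirrored, so it keeps its existing class.
for ch in "()[]\u201a-":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), proposed_class(ch))
```

Characters such as U+201A (SINGLE LOW-9 QUOTATION MARK) show why the mirroring condition matters: it is opening punctuation by General Category but has no mirrored counterpart, so it would stay ON.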

> Add a new definition:

> BD11. Character A forms a mirrored pair with character B if the
property Bidi_Mirrored is Yes for both A and B, and Bidi_Mirroring_Glyph
of A is B.
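
A small Python sketch of BD11 as quoted above may help. The MIRRORING_GLYPH table is a hand-copied excerpt that stands in for the Bidi_Mirroring_Glyph property (the full data lives in BidiMirroring.txt); the table and the function name are illustrative assumptions, not an existing API.

```python
import unicodedata

# Hand-copied excerpt standing in for Bidi_Mirroring_Glyph; the full
# mapping is in BidiMirroring.txt of the Unicode Character Database.
MIRRORING_GLYPH = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}

def forms_mirrored_pair(a, b):
    """BD11 as quoted: Bidi_Mirrored is Yes for both characters and
    the Bidi_Mirroring_Glyph of a is b."""
    return (
        bool(unicodedata.mirrored(a))
        and bool(unicodedata.mirrored(b))
        and MIRRORING_GLYPH.get(a) == b
    )

print(forms_mirrored_pair("(", ")"))  # a matching pair
print(forms_mirrored_pair("(", "]"))  # both mirrored, but not each other's glyph
```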

> Rephrase N0 as follows:

> N0. Search backward from each instance of a closing punctuation
(CP) until either the first opening punctuation (OP) or sos is found. If
an OP is found, and it does not form a mirrored pair with the CP
character, change that OP and all OPs preceding it in the isolating run
sequence to Other Neutral (ON). [1] If an OP is found, and it forms a
mirrored pair with the CP character, then:

> If the text between the OP and the CP contains at least one
non-neutral type [2] (L, R, EN or AN) of the same direction as the
embedding direction [3], change both the OP and the CP to the
strong type (L or R) corresponding to the embedding direction.

> Otherwise, if the text between the OP and the CP contains at
least one non-neutral type of the direction opposite to the embedding
direction, and at least one of the following conditions is true:

> the last non-neutral type, if any, preceding the OP [4]
is also of the direction opposite to the embedding direction,

> the first non-neutral type, if any, following the CP is
also of the direction opposite to the embedding direction,

> then change both the OP and the CP to the strong type opposite to the embedding direction.

> Otherwise, change both the OP and the CP to ON. [5]
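
To check my reading of Marcin's rephrased N0, here is a rough Python sketch of it, not a reference implementation. Several things are assumptions on my part: the input is one isolating run sequence whose types have already been through the weak-type rules; EN and AN are counted as R (as rule N1 does); closing marks are processed left to right; and the small MIRROR table stands in for the mirrored-pair test of BD11. All names are hypothetical.

```python
OPPOSITE = {"L": "R", "R": "L"}
# Assumption: EN and AN count as R here, as they do in rule N1.
DIRECTION = {"L": "L", "R": "R", "EN": "R", "AN": "R"}
# Excerpt standing in for the BD11 mirrored-pair test.
MIRROR = {"(": ")", "[": "]", "{": "}"}

def first_direction(types):
    """Direction of the first non-neutral type in `types`, or None."""
    return next((DIRECTION[t] for t in types if t in DIRECTION), None)

def resolve_pairs(chars, types, embedding):
    """Apply the proposed N0 to one isolating run sequence.

    chars: the characters; types: their classes, with 'OP'/'CP' for
    paired punctuation; embedding: 'L' or 'R'. Returns a new list.
    """
    types = list(types)
    opposite = OPPOSITE[embedding]
    for j in range(len(types)):
        if types[j] != "CP":
            continue
        # Search backward for the nearest OP, stopping at sos.
        i = next((k for k in range(j - 1, -1, -1) if types[k] == "OP"), None)
        if i is None:
            continue
        if MIRROR.get(chars[i]) != chars[j]:
            # Mismatch: this OP and every OP before it become ON.
            for k in range(i + 1):
                if types[k] == "OP":
                    types[k] = "ON"
            continue
        inside = {DIRECTION[t] for t in types[i + 1:j] if t in DIRECTION}
        before = first_direction(reversed(types[:i]))  # sos not included
        after = first_direction(types[j + 1:])         # eos not included
        if embedding in inside:
            new = embedding
        elif opposite in inside and opposite in (before, after):
            new = opposite
        else:
            new = "ON"
        types[i] = types[j] = new
    return types

# Uppercase letters stand for RTL characters, as in my example text below.
chars = list("ab (CD) EF")
types = ["L", "L", "WS", "OP", "R", "R", "CP", "WS", "R", "R"]
print(resolve_pairs(chars, types, "L"))
```

In this LTR example the parentheses enclose only R text and are followed by R text, so under this reading both marks resolve to R.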

I do think mirrored characters need to be addressed in UAX 9, and so far they are:
"Paired punctuation marks are considered as a pair so that they both resolve to the same direction."

(http://www.unicode.org/reports/tr9/tr9-28.html#Resolving_Neutral_Types)

but I am not completely in agreement with Marcin's algorithm above.

The original algorithm discussed for mirrored pairs (which I like; this
algorithm may be found at:
http://www.unicode.org/review/pri231/pri231-background.pdf) was, as I
understand things (I am quoting here):
"Once the paired punctuation marks have been identified,
they should be resolved to the embedding direction except in the
following cases which are resolved, based on context, opposite the
embedding direction:

"* The directionality of the enclosed content is opposite the embedding
direction, and at least one neighbor has a bidi level opposite to
the embedding direction O(O)E, E(O)O, or O(O)O.

"* The enclosed content is neutral and both neighbors have a bidi level
opposite to the embedding direction O(N)O. Resolving to opposite to the
embedding direction is current behavior under the UBA (N1)."

Here the algorithm is again, expressed as a rule:

"*N0. Paired punctuation marks take the embedding direction if the
enclosed text contains a strong type of the same direction. Else, if the
enclosed text contains a strong type of the opposite direction and at
least one external neighbor also has that direction the paired
punctuation marks take the direction opposite the embedding direction."

As I read it, this rule gives the paired punctuation the embedding
direction whenever any enclosed strong text matches the embedding
direction, since the "if" and "else, if" clauses are applied in sequence.
This is fine IMO. (Otherwise, if all the text inside the mirrored
punctuation is neutral, I would suppose the pair should take the
embedding direction rather than remain neutral, based on the algorithm
given at the URL above, which, as I've said, I like.)
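
The decision described in the background document can be sketched as a single function. This is only my reading of the quoted text, under stated assumptions: "inside" is the set of strong directions found between the marks, and "before"/"after" are whatever the caller decides the external neighbors are (the very point Marcin says is ambiguous). The function name is hypothetical.

```python
def pair_direction(inside, before, after, embedding):
    """Direction for a matched pair, per the quoted background text.

    inside: set of strong directions ('L'/'R') between the marks;
    before/after: direction of each external neighbor, however the
    caller defines "neighbor"; embedding: 'L' or 'R'.
    """
    opposite = "R" if embedding == "L" else "L"
    if embedding in inside:
        return embedding              # any enclosed text matches
    if opposite in inside and opposite in (before, after):
        return opposite               # O(O)E, E(O)O, O(O)O
    if not inside and before == opposite and after == opposite:
        return opposite               # O(N)O, current N1 behavior
    return embedding                  # everything else

print(pair_direction({"R"}, "R", "L", "L"))   # O(O)E in an LTR context
print(pair_direction(set(), "L", "R", "L"))   # E(N)O in an LTR context
```

Because the first test is applied first, mixed enclosed text (both L and R inside) always resolves to the embedding direction.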

However, as far as the bidi parentheses algorithm goes, what about
the following symbols formed from various punctuation marks?

(-: , :-)

Would I treat the text between the two happy faces as neutral opening
and closing text? These sequences should be somehow excepted, I think.

The above text is a comma separating two happy faces, all of whose
characters will probably work as neutrals. I believe that these
characters would be treated as the following sequence (a "/" separates
each class): ON/ES/CS/WS/CS/WS/CS/ES/ON. That is, these are all weak or
neutral types. So this case might pose no problem.
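
The classes can be checked against the Unicode Character Database, for instance with Python's unicodedata module. The values come from whatever Unicode version the interpreter bundles, though I would expect the classes of these ASCII characters to be stable.

```python
import unicodedata

text = "(-: , :-)"
# Bidi_Class of each character, in logical order.
classes = [unicodedata.bidirectional(ch) for ch in text]
print("/".join(classes))  # ON/ES/CS/WS/CS/WS/CS/ES/ON
```

Only the two parentheses are ON; the hyphens are ES, the colons and the comma are CS, and the spaces are WS.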

I suppose we have to resolve the "ES" and "CS" characters first, though.
These are then resolved to other neutrals, so all we have are neutrals,
which take the embedding direction, and now of course the parentheses
are interpreted as such.

But what about the following text (set off from my comments by asterisks)?

* * *

Salam my friend! KAYFA HALUK? ANAA LHAMDU ULLAA (-: some problems
though making my emails work with this new algorithm so ANAA LASTU
SA'IYDUN )-: any suggestions?

* * *

Although I would tend to support exempting the happy face sequence from
the parentheses algorithm, the happy faces here enclose parenthetical
text.

According to the rules Marcin has suggested, but not really to those of
the parentheses algorithm, the above "enclosed text" would be treated as
RTL, and thus some ordering would be reversed, though I've not traced it
through. Marcin's rules treat this as RTL since an R character
immediately precedes the parenthetical comment and since there are some R
(strong RTL) characters within the parenthetical comment.

The levels are: 0s for the L text

then 1 for text KAYFA HALUK ANAA LHAMDU ULLA

Then we find a mirrored piece of punctuation, and then a bit later a
closing parenthesis (now we have to pop the stack back to the previous
one, and we find a match, and so have opening and closing punctuation).
Whatever algorithm we use for display, I hope these two faces, if they
are to be treated as mirrored at all, will display as left-to-right.

One question: what level will the text in parentheses/happy faces be: 1s
and 0s still (or 3s and 2s)? Sorry for asking this, but would it work
better to treat the text inside the mirrored punctuation as a new
embedding level? (I'm not a developer but may try to think through this
sometime. I don't see how it will improve things to treat this text as a
new embedding level.)

> Notes:

> [1] This means that, if there is any mismatched pair of punctuation
marks, the rule will be applied neither to that pair, nor to any
enclosing pair.

> From Aharon Lanin’s comment #5 I understand that to be the original
intent of the BPA, the current (ambiguous) wording notwithstanding. If a
more complicated algorithm is desired, it would have to be spelled
out here.

> [2] I prefer “non-neutral” to “strong” here, to remind the reader
that EN and AN also have to be taken into account (other weak types and
AL having been resolved already).

> [3] A check for mixed types seems to be redundant; if there are
mixed-direction types, then at least one is in the embedding direction.
(This is based on my reading of the current draft; if “mixed
strong types” was intended to include mix of e.g. R and AN, then this
condition would have to read “… more than one non-neutral type (L,
R, EN or AN), or at least one non-neutral type of the same direction as
the embedding direction”.)

> [4] This is based on the way I understand what “external neighbor”
was intended to mean. The wording “if any” indicates that sos/eos are
not included (if they are, then every character in an isolating
run sequence is preceded and followed by some strong type).

> [5] This covers the case when the enclosed text does not contain
any strong character; changing both marks to ON prevents mis-pairing the
OP with a later CP. Note that the CP does not actually have to be
changed to ON, as it makes no difference to further applications of rule
N0 or to rules N1 and N2. (However, if a more complicated pairing
algorithm is specified, it may become important to change both OP and CP
here.)

> Note also that the new bidi classes may create additional ‘legacy’
classes of conforming systems (see chapter 4.2), namely those that use
Bidi_Class instead of X_Bidi_Class (and thus effectively ignore rule
N0).

One more comment from me at this point: I tend to agree with one of Philippe's comments:

Date/Time: Sat Dec 22 09:08:55 CST 2012
Contact: verdy_p_at_wanadoo.fr

Name: Philippe Verdy

Report Type: Public Review Issue

Opt Subject: UAX#9 (UBA) PRI 3.3.4 Resolving Neutral Types and stability

> "(5) If a character A is mapped to a character B for mirroring
(Bidi_Mirroring_Glyph=code point B), the character A and B must be distinct
and NOT canonically equivalent to A: NFC(A) != NFC(B)"
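
The condition NFC(A) != NFC(B) is easy to test mechanically. Here is a minimal Python sketch using unicodedata.normalize; the function name is only illustrative, and the sample characters are my own choice rather than anything from Philippe's comment.

```python
import unicodedata

def distinct_under_nfc(a, b):
    """True if a and b are still different after NFC normalization,
    i.e. they are not canonically equivalent."""
    return unicodedata.normalize("NFC", a) != unicodedata.normalize("NFC", b)

# U+2329/U+232A (angle brackets) normalize to U+3008/U+3009, which differ.
print(distinct_under_nfc("\u2329", "\u232a"))
# U+2329 and U+3008 are canonically equivalent, so the check fails.
print(distinct_under_nfc("\u2329", "\u3008"))
```

The angle brackets are a useful case because U+2329 and U+232A have canonical singleton decompositions, so a mirroring table that mixed them with U+3008 and U+3009 could run into exactly this condition.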

(I may also agree with the comment that follows, [6]; I am sorry, but I need to think it through.)

Best,

--C. E. Whitehead

cewcathar_at_hotmail.com

Received on Sat Jan 19 2013 - 12:18:13 CST