3 big bidi bugs

From: Bernard Miller (Bernard_R_Miller@bytext.org)
Date: Wed May 29 2002 - 11:57:27 EDT


This letter describes 3 major technical problems with the current Unicode
bidirectional algorithm as described in UAX #9, version 3.20. Problems 1 and
3 have security implications. Other problems with the whole Unicode
bidirectional encoding approach, and their solutions, are discussed in the
recently updated Bytext FAQ and documentation (www.bytext.org).

(1) Line width dependent mangling, general case:
Step L2 of UAX #9 indicates that a line that resolves into a sequence of
characters with homogeneous embedding levels will ALWAYS be displayed right
to left, regardless of what the embedding level is.

So, for example, a line with the following embedding levels, as resolved
through step L1:
2222222222222222222222222 will display right to left
3333333333333333333333333 will display right to left
4444444444444444444444444 will display right to left
etc

Likewise:
in 3333333333333333333333331, the 3’s will display left to right
in 5555555555555555555555551, the 5’s will display left to right
etc

This directly contradicts the writer's intentions. It means that different
Unicode-compliant applications will display the same characters in a
different order (depending on available line width). Examples of how this is
bad are given in question 12 of the Bytext FAQ (www.bytext.org/faq#12).
This can be fixed by rewording step L2 such that a reversal happens from the
highest embedding level to each lower contiguous embedding level, regardless
of whether that embedding level is represented by a character on the line,
until the embedding level of 1 is reached (or, as an optimization, until the
first odd embedding level equal to or lower than the lowest embedding level
represented by a character on the line).
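
As a rough illustration of the difference, here is a minimal Python sketch.
It is not taken from UAX #9 or from the Bytext documentation; the function
names are hypothetical. The first function models one reading of step L2
that reproduces the outcomes listed above (a reversal only for levels
actually present on the line); that reading is an assumption made for this
sketch. The second function models the rewording proposed here (a reversal
for every level from the highest down to 1).

```python
def _reverse_runs(levels, chars, threshold):
    """Reverse each maximal run of characters whose level is >= threshold."""
    lv, out = list(levels), list(chars)
    i, n = 0, len(out)
    while i < n:
        if lv[i] >= threshold:
            j = i
            while j < n and lv[j] >= threshold:
                j += 1
            # Reverse the characters and their levels together.
            out[i:j] = out[i:j][::-1]
            lv[i:j] = lv[i:j][::-1]
            i = j
        else:
            i += 1
    return lv, out


def reorder_present_levels_only(levels, chars):
    """ASSUMED reading: reverse only at levels that occur on this line."""
    lv, out = list(levels), list(chars)
    for level in sorted({l for l in levels if l >= 1}, reverse=True):
        lv, out = _reverse_runs(lv, out, level)
    return "".join(out)


def reorder_all_levels(levels, chars):
    """Proposed rewording: reverse at every level from the highest down to 1,
    whether or not a character on the line carries that level."""
    lv, out = list(levels), list(chars)
    for level in range(max(levels, default=0), 0, -1):
        lv, out = _reverse_runs(lv, out, level)
    return "".join(out)


if __name__ == "__main__":
    for levels, chars in [([2, 2, 2], "abc"), ([4, 4, 4], "abc"),
                          ([3, 3, 3, 1], "abcX")]:
        print(levels,
              reorder_present_levels_only(levels, chars),
              reorder_all_levels(levels, chars))
```

Under the first function a line at level 2 or 4 comes out reversed, and the
level 3 characters in the last case come out in their original order; under
the second, even levels keep their order and odd levels are reversed, which
does not depend on which other levels happen to share the line.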

(2) Line width dependent mangling, spelling conventions for quotes:
What is the purpose of step X10 if not to allow something like LEFT DOUBLE
QUOTATION MARK to be used as if it were an OPEN DOUBLE QUOTATION MARK? One
simply puts an embedding inside a quotation, such as “<RLE>quotation<PDF>”.
The problem with this is that it only works if the quotation begins and ends
on the same line. Examples of how the text is mangled when the quotation
spans multiple lines are given in question 13 of the Bytext FAQ
(www.bytext.org/faq#13).
This cannot really be fixed with minor changes other than to notify users
that the whole left=open, right=closed idea may not work as such when the
default automatic line breaking is used. Users should not rely on any
spelling conventions that do not bypass the effects of step X10 and
mirroring; how this can be done is described in the Bytext documentation.

(3) Mirroring ambiguities:
What if eor = sor?

text:                          R  RLO whatever PDF  N  LRO whatever PDF
embedding level at step X9:    1        3  3        1        2  2
directional type at step X10:  R        R  R        ?        L  L

The above example should be in a monospace font. The original is at
www.bytext.org/faq#12.
Step X10 is ambiguous as to whether the “N” should be L or R. This means
that if N has the mirrored property, some implementations might display the
mirrored form, others the non-mirrored form, and others might result in an
error.
This can be fixed by deciding on a single form for such cases. Also, the
statement: “for two adjacent runs, the eor of the first run is the same as
the sor of the second” needs to be removed because it is not true.
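
The embedding levels in the example above can be reproduced with a small
Python sketch of the explicit-level rules (RLO/RLE move to the next greater
odd level, LRO/LRE to the next greater even level, PDF restores the previous
level). This is a simplified illustration, not an implementation of UAX #9:
it ignores the maximum depth limit and the override status itself, the
function name is hypothetical, and it assumes a paragraph level of 1 and
that each "whatever" stands for two characters.

```python
RLE, LRE, RLO, LRO, PDF = "RLE", "LRE", "RLO", "LRO", "PDF"


def explicit_levels(tokens, paragraph_level):
    """Return (token, level) pairs for the non-formatting tokens.

    Simplified sketch: no depth limit, override status not tracked.
    """
    stack = []
    level = paragraph_level
    result = []
    for tok in tokens:
        if tok in (RLE, RLO):
            stack.append(level)
            # Next greater odd level.
            level = level + 1 if level % 2 == 0 else level + 2
        elif tok in (LRE, LRO):
            stack.append(level)
            # Next greater even level.
            level = level + 2 if level % 2 == 0 else level + 1
        elif tok == PDF:
            if stack:
                level = stack.pop()
        else:
            result.append((tok, level))
    return result


if __name__ == "__main__":
    # "w1".."w4" are placeholders for the characters of each "whatever".
    tokens = ["R", RLO, "w1", "w2", PDF, "N", LRO, "w3", "w4", PDF]
    print(explicit_levels(tokens, paragraph_level=1))
```

With these assumptions the levels come out as 1 3 3 1 2 2, so the run
containing N sits at level 1 between a run at level 3 and a run at level 2.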

Bernard

---
Bernard Rafael Miller, email: bernard_r_miller@bytext.org
Format enabling simplified 8 bit regexes of UCS characters: www.bytext.org
---


