RE: BIDI: possible fix

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon May 21 2001 - 09:46:30 EDT


Roozbeh Pournader wrote:
> Do you agree with me that rule W4 should be fixed to also
> change the type
> if there are NSMs over the separator? They should be counted as one I
> mean, so if I want to underline a separator, that underlined separator
> should count as one normal separator, not two.

Roozbeh's fix sounds correct from a theoretical point of view.
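
To make the problem concrete, here is a rough sketch in Python of how rule W1 (an NSM takes the type of the preceding character) interacts with rule W4. It is only an illustration under simplifying assumptions: it skips W2, W3 and everything about embedding levels, the function name resolve_w1_w4 and its sor parameter are made up for this example, and the bidi classes come from whatever Unicode data ships with the Python in use.

```python
import unicodedata

def resolve_w1_w4(text, sor="L"):
    """Apply simplified versions of rules W1 and W4 to the bidi classes of text.

    Assumption: the whole string is one level run whose start-of-run
    type is `sor`; rules W2 and W3 are deliberately left out.
    """
    types = [unicodedata.bidirectional(ch) for ch in text]

    # W1: a non-spacing mark takes the type of the previous character
    # (or of `sor` when it is the first character of the run).
    prev = sor
    for i, t in enumerate(types):
        if t == "NSM":
            types[i] = prev
        prev = types[i]

    # W4: a single ES between two EN becomes EN; a single CS between two
    # numbers of the same type (EN or AN) takes that type.
    for i in range(1, len(types) - 1):
        before, cur, after = types[i - 1], types[i], types[i + 1]
        if cur == "ES" and before == after == "EN":
            types[i] = "EN"
        elif cur == "CS" and before == after and before in ("EN", "AN"):
            types[i] = before
    return types

plain = resolve_w1_w4("123,456")
circled = resolve_w1_w4("123,\u20DD456")  # comma + COMBINING ENCLOSING CIRCLE
print(plain)
print(circled)
```

In this sketch the plain comma resolves to EN, while the circled comma ends up as two CS in a row (the comma and its mark), so W4 no longer sees "a single separator" and leaves both untouched.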

Yet circled or underlined numeric separators seem such an unlikely case in
the real world! I wonder whether it would be worth forcing all existing
implementations to change just for this.

And I am not so sure that these implementations are so few: just imagine
re-shipping all the Windows NT, Linux, or Mac OS boxes out there... And I
just named three that came off the top of my head.

I notice two things:

1) Just a tiny subset of Unicode characters is defined as "number
separator" or "number terminator" but, in the real world, all sorts of
characters can play one or both of these roles.

E.g., the "thousands separator" is a space in France, while it is an
apostrophe in Switzerland; the "decimal separator", for monetary amounts,
can be any sort of currency symbol (e.g. "$" or even "US$").
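
A quick way to see which of these characters the algorithm actually treats as separators or terminators is to ask Python's unicodedata module for their bidi classes. This is just a sketch: the CANDIDATES table and the bidi_classes helper are names invented here, and the values reported depend on the Unicode data bundled with the interpreter.

```python
import unicodedata

# Characters that can act as numeric separators or terminators in practice.
CANDIDATES = {
    "comma": ",",
    "full stop": ".",
    "plus sign": "+",
    "dollar sign": "$",
    "space": " ",
    "apostrophe": "'",
}

def bidi_classes(chars=CANDIDATES):
    """Map each character name to its Unicode bidi class (CS, ES, ET, ...)."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}

for name, cls in bidi_classes().items():
    print(f"{name:12} {cls}")
```

The comma and full stop come out as CS and the dollar sign as ET, but the plain space is WS and the apostrophe is ON, so neither of the latter two gets any separator treatment from rule W4.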

2) The Bidi Algorithm explicitly allows applications to define their
"higher-level protocols" to override behaviors that are not optimal in a
given environment or application. See
<http://www.unicode.org/unicode/reports/tr9/#Higher-Level_Protocols>. And
defining alternative ways of handling numbers is one of the cases which is
explicitly listed:

    "- Override the number handling to use information provided by a
       broader context.
       For example, information from other paragraphs in a document
       could be used to conclude that the document was fundamentally
       Arabic [...]"

But I think that Roozbeh is very interested in interchanging text in a
standard way so, apparently, a higher-level protocol is not the way to go.

But the same section of UAX#9 suggests how standard interchange in plain
text could take place *also* in the presence of a higher-level protocol:

    "When text using a higher-level protocol is to be converted to
    Unicode plain text, formatting codes should be inserted to ensure
    that the order matches that of the higher-level protocol [...]"

In the case of Roozbeh's underlined or circled numeric separator this could
mean that, when text is to be exported as plain text, each number containing
such a separator (here, a "circled" one) could be enclosed in a
left-to-right embedding:

        <LRE>123,<COMBINING ENCLOSING CIRCLE>456<PDF>
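
An export step along these lines could be sketched as follows. This is only an assumption about how an application might do it, not something UAX#9 prescribes: the function name wrap_marked_numbers is hypothetical, and the scan only recognises European digits (EN) joined by ES or CS separators.

```python
import unicodedata

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def _cls(ch):
    return unicodedata.bidirectional(ch)

def wrap_marked_numbers(text):
    """Enclose in LRE...PDF each number whose separator carries combining marks.

    A "number" here is a run of EN digits, optionally continued by a single
    ES/CS separator, any NSMs on that separator, and more EN digits.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if _cls(text[i]) != "EN":
            out.append(text[i])
            i += 1
            continue
        j = i
        marked = False
        while j < n and _cls(text[j]) == "EN":
            j += 1
        while j < n and _cls(text[j]) in ("ES", "CS"):
            k = j + 1                      # first position after the separator
            m = k
            while m < n and _cls(text[m]) == "NSM":
                m += 1                     # skip marks applied to the separator
            if m >= n or _cls(text[m]) != "EN":
                break                      # separator is not followed by digits
            if m > k:
                marked = True              # at least one mark on the separator
            j = m
            while j < n and _cls(text[j]) == "EN":
                j += 1
        number = text[i:j]
        out.append(LRE + number + PDF if marked else number)
        i = j
    return "".join(out)

print(ascii(wrap_marked_numbers("123,\u20DD456")))
```

Numbers with ordinary separators pass through unchanged, so the extra formatting codes appear only where the plain algorithm would otherwise split the number.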

_ Marco


