L2/04-093

From: Martin Duerst
Date: Feb 03, 2004
Subject: Re: Bidi conformance

Hello Mark,

I'm sorry I'm late with feedback on this. You write:

> 4. Interpret spans of text separately in the presence of markup
>
> When displaying a document consisting of textual data and visible markup,
> a higher-level process can handle syntactic elements in the markup
> separately from the textual data.

I think we should clearly distinguish between two things:
a) interpretation of the markup (e.g. for HTML dir=rtl)
b) display of the source

My understanding is that a) is covered in
"3. Emulate directional overrides or embedding codes".

This item is intended to be about b), but it does not seem to
say so clearly. I would suggest indicating somehow that this
is about source text, as opposed to final rendering.


> For example, suppose an XML document contains the following fragment:
>
> * ARABICenglishARABIC<e1 type='ab'>ARABIC2<e2 type='cd'>english
>
> One can consider this as consisting of 5 separate pieces:
>
> 1. ARABICenglishARABIC
> 2. <e1 type='ab'>
> 3. ARABIC2
> 4. <e2 type='cd'>
> 5. english

This is a highly simplified example. Element names, attribute names,
and attribute values can also be affected, since they too may contain
right-to-left text.
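
As a rough illustration of the decomposition quoted above, here is a
minimal Python sketch. The names TAG and split_pieces are hypothetical,
and the pattern assumes that anything between '<' and '>' is one piece of
markup; a real XML source view would need a proper tokenizer (comments,
CDATA sections, '>' inside attribute values). The uppercase ARABIC is
kept as the stand-in used in the example.

```python
import re

# Assumption: a markup piece is anything from '<' up to the next '>'.
TAG = re.compile(r"(<[^>]*>)")

def split_pieces(source):
    """Split source text into (kind, text) pairs, kind being 'markup' or 'text'."""
    pieces = []
    # re.split with a capturing group keeps the matched tags in the result.
    for part in TAG.split(source):
        if not part:
            continue  # skip empty strings produced between adjacent tags
        kind = "markup" if TAG.fullmatch(part) else "text"
        pieces.append((kind, part))
    return pieces

fragment = "ARABICenglishARABIC<e1 type='ab'>ARABIC2<e2 type='cd'>english"
for number, (kind, text) in enumerate(split_pieces(fragment), start=1):
    print(number, kind, text)
```

For the fragment above this yields the same 5 pieces as in the quoted
list. It does not look inside the markup pieces, which is exactly where
element names, attribute names, and attribute values would need further
treatment.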

> To make the XML file readable, the display can order these elements all
> in a uniform direction (e.g. all left-to-right), and apply the bidi
> algorithm to each field separately. Alternatively, the process could treat
> each of the pieces of syntax as if it were a single inline object, e.g.
> as U+FFFC OBJECT REPLACEMENT CHARACTER, and apply the bidi algorithm to
> the whole.

I think another way of thinking about this is even better:
The process can assign strong directional types to the syntactically
relevant characters.
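
To make the difference between the object-replacement approach and the
strong-type approach concrete, here is a minimal Python sketch. It only
prepares the input for a bidi implementation; it does not reorder
anything itself. All names (TAG, SYNTAX_CHARS, markup_as_objects,
bidi_classes) are hypothetical, and the choice of which characters count
as "syntactically relevant" is an assumption made for illustration.

```python
import re
import unicodedata

# Assumption: a markup piece is anything from '<' up to the next '>'.
TAG = re.compile(r"<[^>]*>")
OBJECT_REPLACEMENT = "\ufffc"  # U+FFFC OBJECT REPLACEMENT CHARACTER
# Assumption: these characters (plus whitespace inside a tag) are "syntax".
SYNTAX_CHARS = set("<>/='\"")

def markup_as_objects(source):
    """Approach from the quoted text: each markup piece becomes one U+FFFC."""
    return TAG.sub(OBJECT_REPLACEMENT, source)

def bidi_classes(source, syntax_class="L"):
    """Alternative: keep the text, but force a strong type on syntax characters.

    Returns one bidi class per character. Names and values inside the
    markup keep their own classes; only the syntax characters change.
    """
    classes = [unicodedata.bidirectional(ch) for ch in source]
    for match in TAG.finditer(source):
        for i in range(match.start(), match.end()):
            if source[i] in SYNTAX_CHARS or source[i].isspace():
                classes[i] = syntax_class
    return classes

# Hebrew alef, a start tag, Hebrew bet.
source = "\u05d0<e1 type='ab'>\u05d1"
print(repr(markup_as_objects(source)))
print(bidi_classes(source))
```

With the second function, '<', '=', the quotes, and the space inside the
tag are all reported as L instead of their neutral classes, while the
Hebrew letters remain R. Whether L or R is the right strong type would
depend on the overall direction chosen by the editor.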


In the above context, the sentence:

"When text using a higher-level protocol is to be converted to Unicode 
plain text, formatting codes should be inserted to ensure that the order 
matches that of the higher-level protocol."

does not feel right. Copy/paste of source text would operate on the
logical order; different editors might reorder that text for display
in somewhat different ways, or may be configurable, just as different
source editors color source code in different ways (and are usually
configurable).

You also write (earlier in the document):

"An open issue is whether we want IRIs (URLs) to fall
under the span of #4 or not."

The current solution for bidi IRI is described at
http://www.w3.org/International/iri-edit/draft-duerst-iri.html#Bidi
Actual Hebrew/Arabic examples corresponding to the ones at that
place are at http://www.w3.org/International/iri-edit/BidiExamples.

This solution has some limitations, and there are some cases
where it is questionable whether the result is okay or confusing.
However, it follows a few simple general rules that correspond
closely to how people read general bidi text, and it's the best
we were able to come up with in several years of mulling over
this problem (the solution is due to Mati Allouche).


My position is currently that IRIs should not be covered by the
item above. I see at least the following distinctions:

- IRIs are small things that appear in many different contexts
that have to be consistent. XML and other source text usually
appears in big chunks (usually whole files) where the main
concern in an editor (software) is efficient operation by
the (human) editor.

- People working on XML and other source text should be ready
to learn the conventions of an editor. Using IRIs should
require as little knowledge as possible.

- The main case of typing in IRIs is copying them from paper,
napkins, and so on. Typing XML from paper is also possible, but
much less common.

- IRIs appear in an amazing number of contexts. All of these contexts
would have to apply the same treatment, one that differs from the
usual bidi algorithm. This would be easy in some places, but
very difficult in others (e.g. plain text email).

- As said, for IRIs, we need completely consistent display.
For XML or other source views, this is not the case. Here
are a few points where people might have different preferences:
  - How to pick up the overall bidi context: on a per-line basis,
    on a per-element basis, or on a per-file or per-user basis.
  - Whether to invert the order of start and end tags or not.
  - On which criteria to decide whether attributes should
    come right or left of the element name in a start tag
    (e.g. overall direction, script in element name,...)
    and whether attribute names should come left or right
    of attribute values.


Regards, Martin.