L2/04-093

From: Martin Duerst
Date: Feb 03, 2004
Subject: Re: Bidi conformance

Hello Mark,

I'm sorry I'm late with feedback on this. You write:

> 4. Interpret spans of text separately in the presence of markup
>
> When displaying a document consisting of textual data and visible markup,
> a higher-level process can handle syntactic elements in the markup
> separately from the textual data.

I think we should clearly separate two things:

a) interpretation of the markup (e.g. for HTML dir=rtl)
b) display of the source

My understanding is that a) is covered in "3. Emulate directional overrides or embedding codes". This item is intended to be about b), but it does not say so clearly. I would suggest indicating somehow that this is about source text, as opposed to final rendering.

> For example, suppose an XML document contains the following fragment:
>
> * ARABICenglishARABICARABIC2english
>
> One can consider this as consisting of 5 separate pieces:
>
> 1. ARABICenglishARABIC
> 2.
> 3. ARABIC2
> 4.
> 5. english

This is a highly simplified example. Element names, attribute names, and attribute values can also be affected.

> To make the XML file readable, the display can order these elements all
> in a uniform direction (e.g. all left-to-right), and apply the bidi
> algorithm to each field separately. Alternatively, the process could treat
> each of the pieces of syntax as if it were a single inline object, e.g.
> as U+FFFC OBJECT REPLACEMENT CHARACTER, and apply the bidi algorithm to
> the whole.

I think one other way of thinking about this is even better: The process can assign strong types to the syntactically relevant characters.

In the above context, the sentence:

"When text using a higher-level protocol is to be converted to Unicode plain text, formatting codes should be inserted to ensure that the order matches that of the higher-level protocol."

does not feel right.
Copy/paste of source text would be logical; different editors might have somewhat different ways of reordering it, or may be configurable, in the same way that different source editors have different ways of coloring source code (and are usually configurable).

You also write (earlier in the document):

"An open issue is whether we want IRIs (URLs) to fall under the span of #4 or not."

The current solution for bidi IRI is described at
http://www.w3.org/International/iri-edit/draft-duerst-iri.html#Bidi
Actual Hebrew/Arabic examples corresponding to the ones at that place are at
http://www.w3.org/International/iri-edit/BidiExamples.

This solution has some limitations, and there are some cases where it is questionable whether the result is okay or confusing, but there are some simple general rules that correspond closely with how people read general bidi text, and it's the best we were able to come up with in several years of mulling over this problem (the solution is due to Mati Allouche).

My position is currently that IRIs should not be covered by the item above. I see at least the following distinctions:

- IRIs are small things that appear in many different contexts that have to be consistent. XML and other source text usually appears in big chunks (usually whole files) where the main concern in an editor (software) is efficient operation by the (human) editor.
- People working on XML and other source text should be ready to learn the conventions of an editor. Use of IRIs should be possible with as little knowledge as possible.
- The main case of typing in IRIs is copying them from paper, napkins,... Typing XML from paper is also possible, but not that usual.
- IRIs appear in an amazing number of contexts. All these contexts would have to do the same thing, different from the usual bidi algorithm. This would be easy for some places, but very difficult for some others (e.g. plain text email).
- As said, for IRIs, we need completely consistent display.
  For XML or other source views, this is not the case.

Here are a few points where people might have different preferences:

- How to pick up the overall bidi context: on a per-line basis, on a per-element basis, or on a per-file or per-user basis.
- Whether to invert the order of start and end tags or not.
- Which criteria to use to decide whether attributes should come right or left of the element name in a start tag (e.g. overall direction, script in element name,...) and whether attribute names should come left or right of attribute values.

Regards, Martin.
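
For illustration only, the two display strategies quoted under item 4 (ordering the pieces separately, or treating each piece of syntax as one U+FFFC) can be sketched in Python roughly as follows. This is a minimal sketch, not part of the proposal: the tag names `<e1>` and `<e2>` are invented, since the markup of the quoted fragment is not visible in this text, uppercase Latin letters stand in for Arabic, and the helper names are hypothetical. The sketch only prepares the input and does not run the bidi algorithm itself.

```python
import re
import unicodedata

OBJECT_REPLACEMENT = "\ufffc"  # U+FFFC OBJECT REPLACEMENT CHARACTER

# Hypothetical fragment: uppercase stands in for Arabic letters, and the
# tag names <e1> and <e2> are invented for this sketch.
FRAGMENT = "ARABICenglishARABIC<e1>ARABIC2<e2>english"

# A deliberately naive tag pattern; a real source view would use a proper
# XML tokenizer (comments, CDATA, and attribute values are not handled).
TAG = re.compile(r"(<[^<>]*>)")


def split_pieces(source):
    """Split source text into alternating text and markup pieces."""
    return [piece for piece in TAG.split(source) if piece]


def replace_markup(source):
    """Replace each markup piece with a single U+FFFC inline object."""
    return TAG.sub(OBJECT_REPLACEMENT, source)


def bidi_classes(text):
    """Return the Unicode bidi class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Strategy 1: each piece would be laid out in a uniform direction, with
# the bidi algorithm applied to every piece separately.
pieces = split_pieces(FRAGMENT)

# Strategy 2: the bidi algorithm would be applied once to this string.
flattened = replace_markup(FRAGMENT)

# The tag delimiters are bidi-neutral, which is why markup characters can
# be pulled around by the surrounding text in a plain bidi display.
tag_classes = bidi_classes("<e1>")
```

With the invented tags, `pieces` holds the five pieces listed in the quoted example. The neutral class of `<` and `>` is what the third approach, assigning strong types to the syntactically relevant characters, would override.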