Re: Bidi in HTML

From: Martin J Duerst (mduerst@ifi.unizh.ch)
Date: Sat May 18 1996 - 15:58:59 EDT


Jonathan Rosenne wrote:

>This is a summary of my understanding of the discussion on bidi so far:

Many thanks for this very clear summary. As far as I can see, we have
made quite some progress, and I am positive that we will finally
obtain a good solution, even if at the current time, some differences
still remain.
I am writing back so quickly because I will not have very good
net access for the next week or so, and I want to advance the
discussion as far as possible.

>1. HTML, as a higher level protocol in the sense used by Unicode,
>provides the base directionality for each "block-type" element. The
>directionality may be specified by means of the DIR attribute or
>inherited from a higher level element or from the global directionality
>of the page.
>
>For each such element, the embedding level is reset according to the
>base directionality.

Agreed. This is what the draft is intending to say, but I guess your
expression is clearer.

>2. HTML also provides the global directionality of the page by means of
>a DIR attribute in the HTML element. If not specified, it is left to
>right. This global directionality is the default directionality for all
>the block-type elements in the page.

Agreed. We definitely have to specify the default global directionality,
and LTR makes much sense here.
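
To make the inheritance rule in points 1 and 2 concrete, here is a
minimal sketch. The Element class and the function names are
hypothetical illustrations, not anything from the draft; the mapping
of LTR to embedding level 0 and RTL to level 1 follows the Unicode
bidi algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a parsed HTML element (not from the draft)."""
    name: str
    dir: Optional[str] = None            # "LTR", "RTL", or None if DIR is absent
    parent: Optional["Element"] = None


def base_direction(element: Element) -> str:
    """Use the nearest DIR attribute going up the tree; default to LTR."""
    node: Optional[Element] = element
    while node is not None:
        if node.dir is not None:
            return node.dir.upper()
        node = node.parent
    return "LTR"  # point 2: global default when nothing is specified


def base_embedding_level(element: Element) -> int:
    """Embedding level a block-type element is reset to (0 = LTR, 1 = RTL)."""
    return 0 if base_direction(element) == "LTR" else 1
```

With this, a P without DIR inside an HTML element with DIR=RTL
resolves to RTL, and a page with no DIR anywhere resolves to LTR.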

>3. The subject draft had proposed an elaborate system with tags and
>attributes. For embedding, in addition to the Unicode codes, a DIR
>attribute on in-line elements was proposed. For overrides, a BDO tag
>had been proposed.
>
>In fact, the actual use of embeddings and overrides is very rare
>(although necessary) and it is not justified to burden HTML with these
>rare occurrences, especially as they are available in the underlying
>character set and the proposed HTML extensions are an alternative way of
>doing the same thing.
>
>HTML should allow the specification of bidi formatting, when required,
>by means of the Unicode formatting characters and corresponding named character
>entities.

I guess this was and still is the main point of disagreement.

There are two main arguments against this solution:
- Raw text HTML editing.
- Interference of bidi structure and markup structure.

One of the main arguments for having embedding/overriding as
markup was the problem of distinguishing markup bidirectionality
from final text bidirectionality when using a raw text editor.
You have shown some techniques that can reduce this problem
(although it is not eliminated), and we agree that tool support
can alleviate or eliminate this problem.

The other main argument, which I have mainly explained to you
in private mail, and which I do not in any way see answered
here, is the question of interference of markup structure and
bidi structure. By moving bidi embedding and override to markup,
we can assure that these two structures are in sync. This helps
keep documents clean and nicely structured. This may not be
of extremely high importance in some cases and for some users,
but for users working with large document collections and using
SGML techniques, structural integrity is very important, and can
best be guaranteed by making bidi embedding and override markup.

To give an example, does it make sense to have something like:

<Q>text text text &RLE; text text text</Q> text text text &PDF;?

Whatever directionality the "text" snippets are, it does not make
sense. If we have a quote (<Q>), then it is the quote that is
embedded; indeed, this is the most frequent embedding case,
and I do not expect many other in-line elements with a DIR
attribute. Now the above degenerate case cannot be checked
if &RLE; and &PDF; are characters like any other;
they need to be markup of some form.

Some people might say that this is not of importance because
today's browsers allow things such as
        <B>text <I> text </B> text </I> text
The fact is that this is, for good reasons, not allowed or defined
by any standard or DTD whatsoever, should not be produced
by any reasonable tools, and hopefully will die out.
The i18n standard we are discussing here similarly should
define a reasonable but well structured solution; if a browser
decides to parse some faulty text in a way that *might* make
sense, this is not what the standard needs to care about.

So what I personally could accept as a compromise is that
we keep the embedding/overriding as markup, but allow
(without having to define shortrefs in the DTD) the in-line
directionality formatting codes RLE, LRE, RLO, LRO, and PDF, if:

- Either any pairs of something+PDF all fit completely inside
        markup (i.e. cases such as &RLE; text <Q> text </Q> text &PDF;
        would not be allowed, only <Q> text &RLE; text &PDF; text </Q>
        and the like).

- Alternatively, allow pairs of something+PDF if they fit into the
        markup structure and don't disturb the hierarchy
        (i.e. cases such as &RLE; text <Q> text </Q> text &PDF;
        would be allowed; &RLE; and &PDF; would be a short notation
        for <SPAN DIR=RTL> and </SPAN>).

In either of these proposals, all other combinations of
markup and structural bidi formatting characters should be
defined illegal or undefined. Only in this way can we guarantee
reasonable structural document integrity.
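
The two alternatives can be checked mechanically. The following is a
rough sketch of such a check, under my reading of the two rules; the
function name and the allow_markup_inside flag are made up for
illustration. It assumes every element has an explicit end tag, so
empty elements and omitted end tags would need extra handling.

```python
import re

# Matches a start tag, an end tag, or one of the five bidi entities.
TOKEN = re.compile(
    r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|&(RLE|LRE|RLO|LRO|PDF);",
    re.IGNORECASE,
)


def bidi_nesting_ok(text: str, allow_markup_inside: bool = True) -> bool:
    """Return True if bidi code pairs and markup form one clean hierarchy.

    allow_markup_inside=False models the first alternative (no markup
    between a code and its PDF); True models the second alternative.
    """
    stack = []  # entries are ("tag", NAME) or ("bidi", CODE)
    for match in TOKEN.finditer(text):
        closing, tag, code = match.group(1), match.group(2), match.group(3)
        if code is not None:
            if code.upper() == "PDF":
                # PDF must close the innermost open item, and that must be a code.
                if not stack or stack[-1][0] != "bidi":
                    return False
                stack.pop()
            else:
                stack.append(("bidi", code.upper()))
        elif closing:
            # An end tag must close the innermost open item.
            if not stack or stack[-1] != ("tag", tag.upper()):
                return False
            stack.pop()
        else:
            if not allow_markup_inside and any(kind == "bidi" for kind, _ in stack):
                return False
            stack.append(("tag", tag.upper()))
    return not stack  # nothing may be left open
```

The degenerate example with &RLE; inside <Q> and &PDF; outside it is
rejected under both settings, as is the <B>/<I> overlap.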

>The full complement of Unicode formatting characters should be
>supported, including those used by Arabic.
>
>Of course, the Unicode characters may be used directly, but since HTML
>allows other character sets these names are needed.

I mostly agree, but would like to point out that they are always
available as numeric character references.
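
As a small illustration of that point (the table and helper names are
mine, not from the draft), the references for the five directional
codes follow directly from their Unicode code points:

```python
import html

# Code points of the five directional formatting characters.
BIDI_CODES = {
    "LRE": 0x202A,  # left-to-right embedding
    "RLE": 0x202B,  # right-to-left embedding
    "PDF": 0x202C,  # pop directional formatting
    "LRO": 0x202D,  # left-to-right override
    "RLO": 0x202E,  # right-to-left override
}


def numeric_reference(name: str) -> str:
    """Return the decimal numeric character reference for a bidi code."""
    return f"&#{BIDI_CODES[name]};"


def decode_reference(reference: str) -> str:
    """Turn a character reference back into the character itself."""
    return html.unescape(reference)
```

For example, numeric_reference("RLE") gives "&#8235;", whatever the
character encoding of the document is.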

>Providing character entity names for all these codes (instead of the
>partial list proposed) makes HTML more consistent, avoids the need to
>redefine bidi formatting, and avoids the possibility that the
>re-definition in HTML differs from that of Unicode.

There is no intention and no possibility that a re-definition differs
from Unicode. Even with markup, mapping back to Unicode
characters is absolutely no problem.
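
To show how direct that mapping is, here is a sketch using Python's
standard html.parser module. The class name, the function name, and
the INLINE set are assumptions for illustration. It treats DIR on an
in-line element as an embedding and BDO as an override, drops the
markup itself, and ignores DIR on block-type elements, since those
reset the base directionality instead.

```python
from html.parser import HTMLParser

EMBED = {"ltr": "\u202A", "rtl": "\u202B"}      # LRE, RLE
OVERRIDE = {"ltr": "\u202D", "rtl": "\u202E"}   # LRO, RLO
PDF = "\u202C"
# Assumption: the in-line elements on which DIR means "embedding".
INLINE = {"span", "q", "bdo", "em", "strong", "b", "i"}


class BidiMarkupToCodes(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_elements = []  # (tag, emitted_a_code) per open element

    def handle_starttag(self, tag, attrs):
        direction = (dict(attrs).get("dir") or "").lower()
        code = None
        if tag in INLINE:
            code = (OVERRIDE if tag == "bdo" else EMBED).get(direction)
        if code:
            self.out.append(code)
        self.open_elements.append((tag, code is not None))

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self.open_elements]:
            return  # stray end tag: ignore it
        # Close everything up to and including the matching element.
        while self.open_elements:
            name, emitted = self.open_elements.pop()
            if emitted:
                self.out.append(PDF)
            if name == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def markup_to_codes(html_text: str) -> str:
    """Return the text with bidi markup replaced by formatting characters."""
    parser = BidiMarkupToCodes()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Because the markup is well nested, every code emitted here gets
exactly one matching PDF.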

>The proposed solution, that an attribute to in-line elements will be
>equivalent to the automatic generation of formatting codes in front and
>at the end of the element (e.g. &lre; and &pdf;) would have been useful
>had the need been common. As the need for these codes is rare, and as a
>simple alternative is available, the solution is not justified.

The reason is not the frequency of these cases, but their structural
meaning.

>Another misunderstanding is the implied assumption that the author is
>aware of the bidi formatting codes. In fact, these are produced by the
>bidi editor without the author's knowledge based on certain interactions
>between the author and the editor, mainly the keyboard language. See,
>for example, the Accent editor. Since the author is not normally aware
>of these codes, making them markup places an additional burden on the
>author, especially as the authors of the subject draft expressed a
>desire to support HTML authoring with a raw text editor, without an HTML
>authoring tool.

The issue for raw text editing is definitely different. There, the author
will become aware of the differences between bidirectionality while
the markup is present and bidirectionality once the markup is
parsed away, and will have to deal with it.

As for tools and editors, the aim is indeed that the author does not
have to be aware of the issues. But I have the suspicion that the
current proposal of having bidi formatting only as characters is
more related to the fact that although bidi tools have some ways
to manipulate bidi text up to a point at which the author is satisfied,
the current tools don't really "understand" much about bidi.

What I mean by this is that appropriate formatting characters
are inserted whenever the user changes something explicitly
or during a copy/paste operation, but that the tool has
no or only very limited ways of reducing formatting codes
to equivalent representations with fewer formatting codes.

This could mean that bidi is still not very well understood,
that it was designed with too much complexity, or that the
problem of reducing formatting codes is intractable per se.

Maybe the above is just speculation (and I would be happy
to hear it actually is), but such issues should be discussed
in detail and not just be brushed over.

>Following is the list of additional named entities:
>
> <!ENTITY lre CDATA "&#8234;"--=left-to-right embedding-->
> <!ENTITY rle CDATA "&#8235;"--=right-to-left embedding-->
> <!ENTITY pdf CDATA "&#8236;"--=pop directional formatting-->
> <!ENTITY lro CDATA "&#8237;"--=left-to-right override-->
> <!ENTITY rlo CDATA "&#8238;"--=right-to-left override-->

>I suggest that the other formatting characters also be included.
>I copied them from ISO-10646 and invented abbreviations.
>
> <!ENTITY iss CDATA "&#8298;"--=inhibit symmetric swapping-->
> <!ENTITY ass CDATA "&#8299;"--=activate symmetric swapping-->
> <!ENTITY iafs CDATA "&#8300;"--=inhibit Arabic form shaping-->
> <!ENTITY aafs CDATA "&#8301;"--=activate Arabic form shaping-->
> <!ENTITY nads CDATA "&#8302;"--=national digit shapes-->
> <!ENTITY nods CDATA "&#8303;"--=nominal digit shapes-->

Apart from the comment above, there is a more basic issue here.
Whereas everybody agrees that without bidi embedding and
override, some text can not be represented and rendered
correctly, the necessity for these formatting characters and
their underlying mechanisms is much less clear.

I do not have all the details present at the moment, but as far
as I know, these could all be eliminated by a preprocessor.
The main reason they have been included in Unicode is
compatibility. Indeed, my draft of Unicode 2.0 says
very clearly (p. 4-80): "The use of these ... is strongly
discouraged in the Unicode standard."
Introducing them as named entities would definitely be
against this policy of strong discouragement!

>4. Unusual sequences
>
>The interaction of unusual sequences of codes and markup should not be
>addressed by this specification.
>
>This includes the cases of unmatched pairs of formatting codes, of
>markup between characters that would not normally be separated etc.

Insofar as this concerns interaction of e.g. &ZWJ; across markup, I can
agree that we just leave it unspecified. I also agree for unmatched pairs
of formatting codes, if we should indeed allow these codes.
As for things such as RLE and their interaction with markup, please see
above.

>5. the LANG attribute
>
>The LANG attribute has no effect on bidi.
>
>It is not easy nor useful to specify the list of bidi languages, since
>the number of languages that are by default written RTL is not really
>that small, and there are languages, such as the Turkish family of
>languages, that can be written with different scripts and directions.

Glad to see that you are agreeing with me. I think it is best
if we insert a paragraph like the above in the draft.

>6. Preformatted text
>
>Text under the influence of a <PRE> tag and other tags indicating
>preformatting should be considered preformatted only as far as HTML is
>concerned, not on the character level.

We can include this specification. Some people may want it the other
way round, but they can use LRO for that; we agree that LRO should
not be used in this way, but we know we cannot prohibit it.

>7. Conformance
>
>Conforming user-agents are required to apply the bidi presentation
>algorithm if they display right to left characters.
>
>If the non-displayable character is a right to left character, there
>is no requirement to apply bidi processing to that character.

Agreed, with a small modification for the second sentence:

If there are no displayable right to left characters, there
is no requirement to apply bidi processing.

Justification: It is very difficult to say what it means to not
apply bidi processing to a single character. Also, if there is
an Arabic character, e.g. one specially used in Urdu or
Farsi, which cannot be displayed, and this is then treated
LTR and mixes up the rest of the display, this would not
be desirable.
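
The modified rule is also easy to test for. A minimal sketch, using
Python's unicodedata module: the function name and the displayable
callback are hypothetical, and "R" and "AL" are the bidirectional
classes that module reports for right to left letters. The explicit
formatting codes themselves are not counted.

```python
import unicodedata
from typing import Callable


def needs_bidi_processing(
    text: str,
    displayable: Callable[[str], bool] = lambda ch: True,
) -> bool:
    """True if the text has at least one displayable right to left character.

    displayable is a hypothetical hook for "the user agent has a glyph
    for this character"; by default everything counts as displayable.
    """
    return any(
        unicodedata.bidirectional(ch) in ("R", "AL") and displayable(ch)
        for ch in text
    )
```

A user agent that cannot display any of the right to left characters
present would get False here and could skip bidi processing for the
whole text, rather than for single characters.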

>8. Additional items
>
>The following items should allow international values, i.e. the full
>character set:
>
> IMG ALT
>
> INPUT VALUE
>
> OPTION VALUE

Agreed. I hope this is technically (SGML) possible.

Regards, Martin.


