Bidi in HTML

From: Jonathan Rosenne (100320.1303@CompuServe.COM)
Date: Fri May 10 1996 - 17:34:19 EDT


Re: Re: I18N of HTML - Hebrew

>Message from Martin J Duerst <mduerst@ifi.unizh.ch>

First, I would like to apologize for joining the discussion so late.
Most of the standards people in Israel were not aware of the draft
until very recently, and it was a crash project to get the comments
out.

>When designing BIDI for HTML, we made every attempt to not deviate
>from the Unicode standard, both in its wording and in its essence.
>And as far as I can see, we indeed did not deviate from it.
>And we definitely agree that we don't need different standards.

I believe you did deviate, and will try to show where.

Main items:

>- Adding formatting characters so that the finally rendered text
> looks as desired can lead to very strange display of raw
> HTML in a bidi-aware editor!

There are simple solutions to this.

>- Formatting character pairs such as RLO-PDF LRE-PDF and so on
> could interlace with the markup structure, which is not
> desirable.

There is no interaction between the text and the markup. The bidi
algorithm applies only to the text.

>- The most straightforward implementation is to add the necessary
> formatting characters to the text and use a Unicode-compliant
> text "object". Absolutely no implementation change is necessary
> at the rendering/display level.

The most straightforward implementation is to add nothing to a
Unicode text and use a Unicode compliant browser.

>- Formally speaking, Unicode allows supplementation or overriding
> of some directional characters by higher-level protocols.

This is a misunderstanding of the Unicode specification.

>For HTML (not necessarily for SGML), we have to assume that
>quite some part of the production is written directly with
>a raw text editor, for which we will assume standard BIDI
>support, but not necessary any knowledge about HTML.

1. Using a standard bidi text editor works fine - I tried it with
Hebrew MS Word, Dagesh (the Hebrew version of Accent) and MS Notepad.
(They are not Unicode compliant, but similar enough in this respect.)
The only problem is that the English markup looks a bit strange. Of
course one has to set the basic direction to right to left.

2. A better alternative is to use Hebrew markup, then run it through
a post-processor or macro that converts the Hebrew tags to English.
This way it looks right when editing the document, and it's a lot
easier to type - one doesn't need to switch languages all the time.
This would be a straight one-to-one mapping between the English tags
and their Hebrew translations.
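
As a rough illustration of item 2, such a post-processor could look
like the Python sketch below. The function name and the regular
expression are assumptions made for the example. The two table entries
follow the Hebrew tags mentioned later in this message (Pe for ph,
Pe Alef for p); a real table would have to cover every tag in use.

```python
import re

# Hebrew tag name -> English tag name (illustrative table, two entries only).
HEBREW_TO_ENGLISH = {
    "\u05e4": "ph",        # Pe
    "\u05e4\u05d0": "p",   # Pe Alef
}

# An opening or closing tag whose name consists of Hebrew letters only.
TAG_RE = re.compile(r"<(/?)([\u05d0-\u05ea]+)([^<>]*)>")


def translate_tags(text, table=HEBREW_TO_ENGLISH):
    """Replace known Hebrew tag names with their English equivalents.

    Tags whose name is not in the table are left unchanged.
    """
    def repl(match):
        slash, name, rest = match.groups()
        if name not in table:
            return match.group(0)
        return "<" + slash + table[name] + rest + ">"

    return TAG_RE.sub(repl, text)


if __name__ == "__main__":
    source = "<\u05e4\u05d0>\u05e9\u05dc\u05d5\u05dd</\u05e4\u05d0>"
    print(translate_tags(source))
```

Only the tag names are touched; the text between the tags passes
through unchanged.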

3. It is safe to assume that sooner or later we will have native bidi
HTML editors.

4. In any case, I don't think it's a good thing to base the whole
standard for years to come on the basis of helping users use
inappropriate tools, especially as there are immediate practical
solutions (see items 1 and 2 above).

>This is not a deficiency of the Unicode BIDI algorithm
>as such; it is due to the fact that we are working on a
>META-level (i.e. describing text with text) instead of
>working on the plain text level.

The meta-level is not text and does not participate in the bidi
rendering process.

The correct process is in principle as follows: The meta level
process analyses the text and extracts a plain character string for
the element (paragraph or block, not line). For HTML it means
ignoring line ends and whitespace etc. and removing the tags. In
parallel, the process remembers the properties of each character,
such as font, size and color or to what tag they belong. The bidi
algorithm produces a physical rendering, assigning each character its
place in the re-ordered text. Each character keeps its attributes
through the re-ordering process.
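
The process described above can be sketched in Python as follows.
This is only an illustration: the class and function names are
invented for the example, whitespace normalization is omitted, and
toy_reorder is a deliberately simplified stand-in for the Unicode bidi
algorithm (left to right base direction, strong characters only), not
an implementation of it.

```python
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the plain characters and remember the enclosing tags of each."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.chars = []  # list of (character, tuple of enclosing tag names)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened tag with this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        for ch in data:
            self.chars.append((ch, tuple(self.open_tags)))


def extract(html):
    """Return the (character, attributes) list for one block of markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chars


def toy_reorder(chars):
    """Simplified stand-in for bidi reordering, LTR base direction only.

    Reverses each maximal run of strong right-to-left characters
    (bidi classes R and AL). Each item keeps its attributes.
    """
    out, run = [], []
    for item in chars:
        if unicodedata.bidirectional(item[0]) in ("R", "AL"):
            run.append(item)
        else:
            out.extend(reversed(run))
            run = []
            out.append(item)
    out.extend(reversed(run))
    return out


if __name__ == "__main__":
    # Logical order: Alef and Bet are emphasized, Gimel is not.
    logical = extract("abc <em>\u05d0\u05d1</em>\u05d2 def")
    for ch, tags in toy_reorder(logical):
        print(repr(ch), tags)
```

The tags never reach the reordering step. They survive only as
per-character attributes, so an emphasized character stays emphasized
wherever it lands in the visual order.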

Most bidi word processors work that way.

So you can see that markup does not interfere with the bidi
algorithm.

> So if you have
>a two-line embedded text delimited with RLE-PDF

In addition to my previous comment, this is a misunderstanding of the
meaning of the formatting codes. There are other examples in the
subject message. While it is "legal" Unicode to place a formatting
code in front of a large section of "visual" bidi text, this was not
the intention and is not proper use of these codes. The Unicode bidi
algorithm is based on the text being logically ordered and on
implicit directionality. The formatting codes are intended to be used
in exceptional cases and with a very local scope.
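
To make "implicit directionality" concrete: every Unicode character
carries a bidirectional class as a property, and the formatting codes
are simply characters with classes of their own. The short Python
sketch below looks a few of them up with the standard unicodedata
module. The sample table and the helper name are chosen for the
example only.

```python
import unicodedata

# A few sample characters and what they are.
SAMPLES = {
    "a": "Latin letter",
    "\u05d0": "Hebrew letter Alef",
    "\u0627": "Arabic letter Alef",
    "1": "European digit",
    " ": "space",
    "\u202a": "LRE (left-to-right embedding)",
    "\u202b": "RLE (right-to-left embedding)",
    "\u202d": "LRO (left-to-right override)",
    "\u202e": "RLO (right-to-left override)",
    "\u202c": "PDF (pop directional formatting)",
}


def bidi_class(ch):
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


if __name__ == "__main__":
    for ch, description in SAMPLES.items():
        print(f"U+{ord(ch):04X}  {bidi_class(ch):4}  {description}")
```

Letters carry their direction with them (L, R or AL), which is why
logically ordered text normally needs no formatting codes at all.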

Any other use, especially as a practical solution for the quick
conversion of non-Unicode texts, should not be given a major weight
in designing standards. Anyhow, there are programs that do a
reasonable job of converting "visual" bidi to Unicode.

>2. Formatting codes
>
>The proposed attribute is not ambiguous! Its semantics are
>defined clearly for all elements.

>A human author, when preparing some HTML text, might think
>on a <BODY> "This is Right-to-left, so let's add a DIR",
>she might similarly think on a paragraph "This is Left-to-right,
>so let's add a DIR", and then again on a <EM> or <SPAN>
>"This is Right-to-left, so let's add a DIR".

On a block level element, this is OK. She is providing the basic
direction of the block. In most cases it is superfluous, because the
majority of block items have the same direction as the document. But
on an "in-line" element she is doing something completely different:
she is saying "although I am typing English letters, I want them
displayed from right to left". I don't think you meant that.

>So we decided to use the solution that is easy to understand
>and very natural to human authors (you just indicate directionality,
>which will mean whatever appropriate on any given element) and

But the idea with Unicode is that you do not need to indicate
directionality, except for the global directionality of the document
and of any exceptional block elements.

> Plain ArabiC<EM>&zwj;emphasized Arabic</EM> plain Arabic.

What about
      Plain French boi<EM>^te</EM> plain French.

These sick examples don't mean anything. The browser would do
whatever it happens to do, and let's hope it doesn't crash.

>The second class of formatting characters is those with long-
>ranging influence, including RLE, LRE, RLO, LRO, and PDF.

These characters should not be used that way, and there is no need to
give such use special consideration.

>(Much of this, as well as most of the arguments in the
>rest of this mail, has been discussed in html-wg
>extensively, and I would suggest that the relevant
>parts of the archive is scanned by interested parties.)

I tried to access them, without success.

> whether any and all markup opens a new embedding level,
>or whether there is only a new embedding level when direction
>is explicitly specified (either by DIR or maybe be indirectly
>by LANG).

A block level markup resets the embedding level according to its base
direction (explicit, inherited or implied). Other markup has no
effect on the embedding level.

>There is no other implicit algorithm.

If you don't want a new algorithm, don't specify bidi behavior on the
character level. Just say that the text is rendered according to the
Unicode specification. The moment one adds things, explains them or
replaces them with "equivalent" or "identical" features one creates a
new, different, standard.

>This is the most reasonable way to guarantee that meta-level
>and base text level can be kept reasonably separated.

They are kept separated by the means described above. Only the base
level text is subject to the bidi algorithm.

>Also, I want to mention here that actually our proposal very
>clearly adheres to the Unicode standard. Just for the record:
>
>Unicode 2.0, in Chapter 3.11, under the title
>"Higher-Level Protocols", says: "The following are concrete
>examples of how systems may apply higher-level protocols to the
>ordering of bidirectional text." and as one of this examples
>gives:
>"* Supplement or override the directional overrides or embedding
>codes by providing information via stylesheets about the embedding
>level or character direction."
>At the end of the "higher-level protocol" subsection, the text
>also says "When text using a higher-level protocol is to be
>converted to Unicode plain text, formatting codes can be inserted
>to ensure that the order matches that of the higher-level
>protocol,...."

This is a misunderstanding. This item should be read in context. HTML
does not include stylesheets. It is not a license to ignore bidi
formatting codes or to replace them with approximately similar things.

In this paragraph, the item which relates most strongly to HTML is
the first: "A higher-level protocol may provide for overriding the
basic level embedding, such as on a field, paragraph, document or
system level." Unicode says "may", but in the context of HTML it
should be taken as a "must". The draft does this, but needs cleaning
up.

>3. Language attribute

>Why should it interact on the global level, but not elsewhere?

It is needed on the document level: we have to know whether it is a
left to right document (maybe with some right to left in it) or a
right to left document. On block level elements LANG could be
considered to imply DIR. On in-line elements, which do not have a
base direction, it cannot provide directional information - if one
uses Arabic or Hebrew characters, they are right to left. If an
override is needed, there are appropriate characters in Unicode; this
is not an attribute of the element.
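
One possible reading of this rule, as a Python sketch. Everything in
it is an assumption made for illustration: the set of block level
tags, the set of right to left languages, and the precedence (an
explicit DIR first, then a direction implied by LANG on the same
element, then the inherited direction).

```python
# Illustrative sets only; neither is meant to be complete.
BLOCK_TAGS = {"body", "p", "div", "blockquote", "li", "h1", "h2", "h3"}
RTL_LANGS = {"he", "ar"}


def base_direction(tag, attrs, inherited="ltr"):
    """Resolve the base direction of an element.

    Returns "ltr" or "rtl" for a block level element, and None for an
    in-line element, which has no base direction of its own.
    """
    if tag.lower() not in BLOCK_TAGS:
        return None

    explicit = attrs.get("dir", "").lower()
    if explicit in ("ltr", "rtl"):
        return explicit

    # Use the primary language subtag, e.g. "he" from "he-IL".
    lang = attrs.get("lang", "").lower().split("-")[0]
    if lang:
        return "rtl" if lang in RTL_LANGS else "ltr"

    return inherited


if __name__ == "__main__":
    print(base_direction("p", {"lang": "he"}))       # block: implied by LANG
    print(base_direction("em", {"lang": "he"}))      # in-line: no base direction
    print(base_direction("p", {}, inherited="rtl"))  # block: inherited
```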

>5. Justification - the ALIGN attribute

>True, but this is the responsibility of the browser implementor
>to do it that way (or fail on the market). There is no need to
>specify that in the standard.

The purpose of the HTML specification is to make sure that a document
will appear equivalently or similarly in any conforming browser. If
two browsers disagree on the interpretation of ALIGN and the other
features for which the same rationale is proposed, and if we were to
accept that this is not part of the spec, they would both be
conforming and the result would be terrible.

>>With the data of each field, the form should return the direction
>>attribute actually selected by the user in addition to the
>>character set.
>
>This idea is new. Is there any specific application where this
>would be needed, or a wide general requirement? In those probably
>rather few cases where it is really needed, why not just add
>a button where the user can set this information?

Proper interpretation of bidi text requires a base direction. If the
user is allowed to change the DIR attribute of the field, the server
needs to know.

>>As proposed, the form should be able to restrict the user's input
>>to a specific character set, according to the requirements of the
>>server.
>
>Can you make a proposal of how this could be done? Should some
>relation between the "charset" parameter of the document and
>of the field be allowed? What if I obtain a document via a
>conversion proxy in Unicode?

Maybe we should add a "regular expression" attribute to the form
field. This could be useful for other purposes too. It would be
independent of the coding scheme.
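
A minimal Python sketch of the idea, assuming the check is applied to
decoded characters rather than to bytes. The function name and the
sample pattern are invented for the example; no such attribute exists
in the draft.

```python
import re

# Sample pattern: Hebrew letters (Alef to Tav) and spaces only.
HEBREW_ONLY = r"[\u05d0-\u05ea ]*"


def field_accepts(value, pattern):
    """Return True if the whole field value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    word = "\u05e9\u05dc\u05d5\u05dd"
    # The same text arriving in two different encodings.
    from_8859_8 = word.encode("iso8859_8").decode("iso8859_8")
    from_utf_8 = word.encode("utf-8").decode("utf-8")
    print(field_accepts(from_8859_8, HEBREW_ONLY))
    print(field_accepts(from_utf_8, HEBREW_ONLY))
    print(field_accepts("shalom", HEBREW_ONLY))
```

Because the pattern is written in terms of characters, the same
pattern applies whatever encoding the field data was transmitted in.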

>4. SGML

>Does "Hebrew tag" mean that the tags are actually written
>with Hebrew letters, or just that they are additional tags
>written with Latin letters, such as <ph> below?

In Hebrew, using Hebrew letters. For example, the Hebrew tag for ph
is the letter Pe, and for p the letters Pe Alef.

>Please note that we have considered allowing (almost) any
>character from ISO-10646 in tags, but have not done it because
>we have met too much resistance, and because it would have
>been too much work for not enough benefit.

I have not suggested it. It has no added value. The author can
use his own language for tags and then post-process to English.

>This is neither very general (the Arabs will prefer <pa>) and
>nor very structured (mixing attributes with entities).

(I assume they would use <pa> for an Arabic paragraph). It has the
advantage of being short and convenient. <p> is the most common tag.

>In this description, I miss RLE and LRE. If they are indeed
>supported, it would be nice if you tell us how this is done.
>If they are not supported, I would be interested to know why.
>Also, I don't understand whith what IMD is paralles in Unicode.

RLE and LRE are meaningful only with Unicode, where they are provided
as characters. RTL, LTR and IMD are intended for non-Unicode
character codes. IMD specifies Unicode-like implicit directionality.

>Also, I would like to know what SI 1680 prescribes if the
>formatting characters are indeed available. For example,
>is it allowed to start an overriding level with an <RLO> tag
>and end it with a PDF character, or start with an LRO character
>and end with </LTR>?

There is no <RLO> tag, only an <RTL> tag. RTL provides the basic
direction, RLO overrides the character direction.

>Appendix B. Entity names for the Hebrew Characters

>We have thought about including more than the current Latin-1
>characters as named entities. It might not have been too
>difficult for a few well-defined sets, but overally, it
>might easily have become a work without end, might
>have lead to some difficult decisions (esp. in the case of
>conflicts) and would have expanded our draft too much.

This is not specifically a bidi item. I suggest a synonym mechanism.
Allow something like including ENTITY tags in the document. (Am I
confusing meta-meta with meta? So what?)

Jonathan Rosenne
JR Consulting
P O Box 33641, Tel Aviv, Israel
Phone: +972 50 246 522 Fax: +972 9 56 73 53


