Re: Unicode editing (RE: Unicode complaints)

From: Roozbeh Pournader
Date: Fri Mar 16 2001 - 14:43:53 EST

On Fri, 16 Mar 2001, Marco Cimarosti wrote:

> - The automatic shaping of Arabic, Syriac and Mongolian not consistent with
> the manual shaping of Hebrew and Greek.

There is also something I have always wanted to say about this. Automatic
shaping of Arabic, as implementors currently approach it, has problems of
its own. The user gets annoyed as she watches the monitor. Take this
example: she wants to type "MEEM-SEEN-TEH-QAF-LAM". She presses Meem and
sees an isolated Meem; she presses Seen, the Meem becomes an initial Meem,
and a final Seen is added. She presses Teh, the Seen becomes medial, and a
final Teh is added, and so on. What if she saw an initial Meem, a medial
Seen, etc. from the start? I know, this way she would see a medial Lam at
first, but that would become a final Lam as soon as she pressed the space.

This way, the changes on the monitor are kept to a minimum. Her eyes see
the fewest possible changes on the screen, and the all-moving, all-dancing
text becomes more stable.
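
To make the comparison concrete, here is a minimal sketch that counts how
many already-displayed letters change shape under each strategy. It
assumes every letter in the word is dual-joining (true for Meem, Seen,
Teh, Qaf and Lam); the function names are hypothetical and the form names
are plain labels, not real glyphs.

```python
def standard_forms(n):
    """Forms of n dual-joining letters when the word is treated as complete."""
    if n == 0:
        return []
    if n == 1:
        return ["isolated"]
    return ["initial"] + ["medial"] * (n - 2) + ["final"]


def optimistic_forms(n):
    """Forms when the editor assumes that another joining letter will follow."""
    if n == 0:
        return []
    return ["initial"] + ["medial"] * (n - 1)


def count_reshapes(forms_at, n_letters):
    """Count letters already on screen that change form while typing.

    forms_at(n) gives the displayed forms after n keystrokes. After the
    last letter the user presses space, so the word is complete and the
    standard forms apply under either strategy.
    """
    changes = 0
    previous = []
    for n in range(1, n_letters + 1):
        current = forms_at(n)
        changes += sum(1 for old, new in zip(previous, current) if old != new)
        previous = current
    final = standard_forms(n_letters)
    changes += sum(1 for old, new in zip(previous, final) if old != new)
    return changes


# The five-letter word from the example above.
print(count_reshapes(standard_forms, 5), count_reshapes(optimistic_forms, 5))
```

Under these assumptions the usual approach reshapes a letter on almost
every keystroke, while the optimistic one reshapes only the last letter
when the space is pressed. A real implementation would also have to
handle right-joining letters such as Alef, which this sketch ignores.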

The reason for the common practice is exactly the same as for bidi:
implementors like to keep a logical buffer that is exactly what the user
has typed, with nothing more. They transfer the pain to the user.

> Now, this is how I'd imagine the features of such a "WYSIWYG Unicode":
> - Strict left-to-right visual order. There are still glyphs classified as
> LTR, RTL or neutral, but this only affect whether that glyph goes to the
> left or to the right of the cursor when it is inserted.
> - No bidirectional controls whatsoever. All the Unicode bidi controls are
> stripped off after running the "forward" bidi algorithm, and are generated
> where necessary by the "backward" bidi algorithm. The "backward" algorithm
> could be parameterized to generate different levels of bidi Unicode (e.g.,
> using or not embedding controls).

This is the simplest method that works, the one implemented in good old
bidi software written by native bidi people. The users feel at ease with
it, but it may not be enough for them.

Also, please note that there are two kinds of backward bidi: one that
guarantees the same rendering, and one that also guarantees the logical
order. Many versions of the first kind exist, the simplest of all being
the insertion of LRM+LRO at the start of the line and PDF at the end. But
a good one will need much more. I believe that characters like LRE (but
not LRM) have some semantic meaning, and they should not be misused to get
the text right when there is no real embedding. When you are doing
"forward" bidi and stripping every control, you should keep the embedding
information (levels, runs, etc.), so you can put it back in. With the
current approach you are flattening the text, ripping off some semantic
meaning that was hidden in the order of the characters, which we all
recognize as the logical order.
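
As an illustration of the simplest variant only, the LRM+LRO ... PDF
wrapping could be sketched like this. The function names are hypothetical,
and the input is assumed to be text already laid out in left-to-right
visual order, one display line per "\n"-separated line.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def freeze_visual_line(visual_line):
    """Wrap one visually ordered line so it is displayed in stored order."""
    return LRM + LRO + visual_line + PDF


def freeze_visual_text(visual_text):
    """Apply the same wrapping to every line of a multi-line text."""
    return "\n".join(freeze_visual_line(line) for line in visual_text.split("\n"))
```

This keeps the rendering but nothing else: the stored characters are still
in visual order, so no logical order or embedding structure is recovered.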

In short, good bidi editing should address both the visual and logical
needs of the user. What you are suggesting, although it is one of the few
good and simple methods, only addresses her visual needs. If you want
your software to have a market in bidi countries, what you said is the
minimum requirement.

> - All sequences of characters that are perceived as single letters by users
> are treated as such (e.g., laam-alif in Arabic or the ksha ligature in many
> Indic scripts). Of course, the "DErenderer" maps these extra glyphs back to
> the corresponding sequences of Unicode characters.

"Lam-Alef" is not considered a single letter by Persian users. No recent
Persian keyboard has it. I also believe that Arabic keyboards have it
only for backward compatibility.

> - Perhaps, contextual glyphs (e.g. Arabic positional forms, Indic half or
> subjoined letters) could be mapped to independent items (like in Unicode
> Hebrew or Greek), so that users may use these forms out of context without
> having to use arcane controls like ZW(N)J. Of course, the "DErendering"
> process would properly generate the necessary ZW(N)J when the shape of a
> character is not the default for the context where it sits.

In the Arabic case, this is old behaviour, one that should be avoided at
all costs. Many Persian keyboards have ZWNJ and ZWJ on them, and the
important thing is that the users feel at home with them, which is not the
case with things like LRM or RLM. ZWNJ is considered some kind of space
here, and many keep it on shift-space. This is how the users think of
it. Some word processors that do not have it try to be clever by
automatically changing the space after some common prefixes like Meem-Yeh
(Meem-Farsi_Yeh really) to a ZWNJ. I, personally, hate this behaviour, but
novices like it. Having contextual Arabic glyphs on the keyboard belongs
to the old age of mechanical typewriters. There are some old typesetters
who like that, but I do not know of any Persian vendor that ships that
kind of keyboard with their DTP system.
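
For readers who have not seen it, the prefix heuristic mentioned above
(a space after Meem-Farsi_Yeh turned into a ZWNJ) could look roughly like
the sketch below. The prefix list and the function name are assumptions
for illustration, not taken from any particular word processor.

```python
import re

ZWNJ = "\u200C"       # ZERO WIDTH NON-JOINER
MEEM = "\u0645"       # ARABIC LETTER MEEM
FARSI_YEH = "\u06CC"  # ARABIC LETTER FARSI YEH

# Hypothetical prefix list; a real product would presumably have more.
PREFIXES = (MEEM + FARSI_YEH,)

# A prefix standing alone as a word, followed by one space and another word.
_PATTERN = re.compile(
    r"(?<!\S)(" + "|".join(re.escape(p) for p in PREFIXES) + r") (?=\S)"
)


def auto_zwnj(text):
    """Replace the space after a known prefix with a ZWNJ."""
    return _PATTERN.sub(lambda m: m.group(1) + ZWNJ, text)
```

A heuristic like this cannot tell a prefix from a standalone word that
happens to be spelled the same way, which is one reason to let the user
type the ZWNJ herself.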

