Unicode editing (RE: Unicode complaints)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Mar 16 2001 - 11:57:52 EST


Mark Davis wrote:
> to run the bidi algorithm to get visual order, insert the
> character, then run it backwards to get the logical order
> again, which tells you where to put the character.

I think that Roozbeh perfectly hit the problem of bidirectional editing with
his comments, and Mark's solution of a "backward" bidi algorithm sounds very
promising.
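
To make the "run it forward, insert, run it backward" idea concrete, here is a minimal sketch. It does NOT implement the real bidi algorithm: as a stated simplification, `reorder` just reverses every maximal run of characters whose bidi class is R or AL, in a left-to-right paragraph. That toy mapping happens to be its own inverse, which is what lets the sketch stay short; the function names are hypothetical.

```python
import unicodedata


def is_rtl(ch: str) -> bool:
    # Strong right-to-left classes only; numbers and neutrals are ignored here.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def reorder(text: str) -> str:
    """Toy stand-in for the bidi algorithm: reverse each maximal RTL run."""
    out, run = [], []
    for ch in text:
        if is_rtl(ch):
            run.append(ch)
        else:
            out.extend(reversed(run))
            run.clear()
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)


# In this toy model the same function maps logical -> visual and back.
to_visual = reorder
to_logical = reorder


def insert_at_visual(logical: str, visual_index: int, ch: str) -> str:
    """Insert ch where the user sees the cursor, then recover logical order."""
    visual = to_visual(logical)
    edited = visual[:visual_index] + ch + visual[visual_index:]
    return to_logical(edited)
```

With the logical string "ab" + alef, bet, gimel + "cd", inserting dalet at visual position 2 (the left edge of the Hebrew run) puts it logically after gimel, which is where a reader would expect it. A real implementation would have to cope with embedding levels, numbers and neutrals, where the inverse is no longer unique.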

But what about other areas that pose similar issues?

- The "logical order" of Indic scripts. When you type a Devanagari "i" you
want it to stay where you typed it, not to run away like a rat. When you
want to type a letter before/after a reordrant vowel, where do you place
your cursor? When you hit backspace in proximity of a reordrant vowel or a
repha, what gets deleted?

- The logical order of Indic scripts not consistent with the visual order of
Thai and Lao.

- The automatic shaping of Arabic, Syriac and Mongolian not consistent with
the manual shaping of Hebrew and Greek.

- The automatic conjuncts of Indic scripts (encoded with virama as a joiner)
not consistent with the manual conjuncts of Tibetan (encoded with subjoined
consonants).

- The inconsistency of alternative encodings for the same graphemes (e.g.
precomposed vs. decomposed accented letters).

- The inconsistency of compatibility characters duplicating some
functionality of rich text -- e.g. U+2080 (SUBSCRIPT ZERO) vs.
<sub>U+0030</sub>.
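
The last two points can be observed directly with Python's standard `unicodedata` module. The sketch below only shows the existing state of affairs (two encodings of one grapheme, and a compatibility character carrying a `<sub>` tag); the helper names are mine and purely illustrative.

```python
import unicodedata


def same_grapheme(a: str, b: str) -> bool:
    """True if two strings are canonically equivalent (compared in NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def compatibility_tag(ch: str):
    """Return the compatibility tag of a character (e.g. 'sub'), or None."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None


precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
subscript_zero = "\u2080"
```

Here `precomposed != decomposed` as raw strings, yet `same_grapheme(precomposed, decomposed)` holds, and `compatibility_tag(subscript_zero)` reports `sub`, which is exactly the piece of "rich text" information folded into the character.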

My very humble opinion is that all these issues call for a general solution,
which could be found by defining an INTERMEDIATE LAYER between the encoding and
the rendered text.

In other words I imagine a sort of "WYSIWYG Unicode" that can be used
internally by the rendering and editing software components.

This layer would still represent "abstract characters" (or "abstract
glyphs", if you like), but the level of abstraction should be much lower
than that of Unicode, in order to match more closely what the user sees.

This requires a very well-defined algorithm to map Unicode-encoded text to
"WYSIWYG Unicode" (let's call it the "abstract rendering algorithm") and one
for the way back (let's call it the "abstract DErendering algorithm"). The
existing bidi algorithm and a "backward" bidi algorithm would of course be
embedded as components of the two general algorithms.
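
One way to picture the two general algorithms is as a chain of invertible stages: rendering applies them in order, DErendering applies their inverses in reverse order. The sketch below is only that picture, under my own assumptions: the `Stage` type, the two placeholder substitution stages and the private-use codes U+E000 and U+E001 are all hypothetical, and the round trip only holds for input that does not already contain those codes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    render: Callable[[str], str]     # Unicode side -> "WYSIWYG" side
    derender: Callable[[str], str]   # "WYSIWYG" side -> Unicode side


def replace_stage(name: str, sequence: str, code: str) -> Stage:
    """Placeholder stage: map a character sequence to one code and back."""
    return Stage(
        name,
        lambda text: text.replace(sequence, code),
        lambda text: text.replace(code, sequence),
    )


# Hypothetical stages using private-use code points as the "single codes".
LAM_ALEF = replace_stage("lam-alef", "\u0644\u0627", "\uE000")
SLOVAK_CH = replace_stage("slovak-ch", "ch", "\uE001")


def render(text: str, stages: Sequence[Stage]) -> str:
    for stage in stages:
        text = stage.render(text)
    return text


def derender(text: str, stages: Sequence[Stage]) -> str:
    # Undo the stages in the opposite order to the one they were applied in.
    for stage in reversed(stages):
        text = stage.derender(text)
    return text


def round_trips(text: str, stages: Sequence[Stage]) -> bool:
    """The property every stage list should satisfy for well-formed input."""
    return derender(render(text, stages), stages) == text
```

The useful part is probably `round_trips`: whatever the real stages turn out to be (bidi, reordering, composition), this is the invariant one would want to test them against.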

Users would of course interact with the "WYSIWYG" layer when editing text.

Also rendering engines and smart fonts could choose to interact with this
layer, rather than with the encoding. The benefit would be that all the
"general" issues like bidirectionality, Indic reordering, decomposition,
etc. would be handled in a standard and predictable way, while the smart
font would be left only with real typographical issues, such as whether or
not two adjacent items call for a joined glyph.

Now, this is how I'd imagine the features of such a "WYSIWYG Unicode":

- Strict left-to-right visual order. There are still glyphs classified as
LTR, RTL or neutral, but this only affects whether the glyph goes to the
left or to the right of the cursor when it is inserted.

- No bidirectional controls whatsoever. All the Unicode bidi controls are
stripped off after running the "forward" bidi algorithm, and are generated
where necessary by the "backward" bidi algorithm. The "backward" algorithm
could be parameterized to generate different levels of bidi Unicode (e.g.,
using or not embedding controls).

- All Indic scripts are in visual order as well: "reordering" characters are
moved to their visual place during "rendering" and back to their logical
place during "DErendering".

- All sequences of characters that are perceived as single letters by users
are treated as such (e.g., laam-alif in Arabic or the ksha ligature in many
Indic scripts). Of course, the "DErenderer" maps these extra glyphs back to
the corresponding sequences of Unicode characters.

- The feature above might even be localized. A Slovak "renderer/DErenderer"
might map a Unicode sequence like "c" + "h" to a single code, while the
French version would not.

- Perhaps, contextual glyphs (e.g. Arabic positional forms, Indic half or
subjoined letters) could be mapped to independent items (like in Unicode
Hebrew or Greek), so that users may use these forms out of context without
having to use arcane controls like ZW(N)J. Of course, the "DErendering"
process would properly generate the necessary ZW(N)J when the shape of a
character is not the default for the context where it sits.

- To balance the obvious problems that the previous point introduces, an
optional automatic adjustment feature may be defined to automate typing.
This feature adjusts the codes of the two characters around the cursor upon
any editing action and, if it is an insertion action, it also adjusts the
codes of the first and last glyphs of the block being inserted.

- The whole process (especially the "DErenderer") behaves differently
depending on whether it works with plain text or with rich text (e.g., a
subscript zero may be DErendered as U+2080 or <sub>U+0030</sub> depending on
whether a <sub> mark-up is available or not).
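
Two of the features in the list above are small enough to sketch. The first part moves the Devanagari vowel sign i (U+093F) in front of its consonant cluster when "rendering" and back when "DErendering"; the second picks the plain-text or rich-text form of a subscript digit. Both are simplifications under my own assumptions: only U+093F is handled (no repha, nukta or other reordrant signs), the round trip only holds for well-formed logical text, and all function names are hypothetical.

```python
import re

CONSONANT = "[\u0915-\u0939]"   # Devanagari KA..HA
VIRAMA = "\u094D"
SIGN_I = "\u093F"               # DEVANAGARI VOWEL SIGN I
CLUSTER = f"{CONSONANT}(?:{VIRAMA}{CONSONANT})*"

_LOGICAL = re.compile(f"({CLUSTER}){SIGN_I}")   # cluster, then sign
_VISUAL = re.compile(f"{SIGN_I}({CLUSTER})")    # sign, then cluster


def render_indic(text: str) -> str:
    """Logical -> visual: put the sign i before the cluster it follows."""
    return _LOGICAL.sub(lambda m: SIGN_I + m.group(1), text)


def derender_indic(text: str) -> str:
    """Visual -> logical: put the sign i back after its cluster."""
    return _VISUAL.sub(lambda m: m.group(1) + SIGN_I, text)


def derender_subscript_digit(digit: str, rich_text: bool) -> str:
    """Choose between mark-up and the compatibility character."""
    if len(digit) != 1 or digit not in "0123456789":
        raise ValueError("expected a single ASCII digit")
    if rich_text:
        return f"<sub>{digit}</sub>"
    # U+2080..U+2089 are SUBSCRIPT ZERO..SUBSCRIPT NINE.
    return chr(0x2080 + int(digit))
```

For example, the logical sequence ka + sign i becomes sign i + ka in the visual layer, and `derender_subscript_digit("0", rich_text=False)` yields U+2080 while the rich-text case yields the `<sub>` mark-up around U+0030.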

This is my mental image of how Unicode rendering and editing should work. I
am curious to know how different this is from other people's mental images
and, especially, from actual software out there.

_ Marco


