RE: Unicode editing (RE: Unicode complaints)

From: Roozbeh Pournader (roozbeh@sharif.edu)
Date: Mon Mar 19 2001 - 15:50:05 EST


On Mon, 19 Mar 2001, Marco Cimarosti wrote:

> And this could happen with lam-alef, not only for users who don't have the
> ligature on their keyboard, but also for users who have it but fail to use
> it.

People will fail to use it. Software should show the ligature. Anyway, I
don't like using a Lam-Alef character in the text. I want to do sub-word
searching and things like that, and a ligature character will ruin those.
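
To illustrate the searching problem, here is a minimal Python sketch (not
part of the original discussion). It assumes the ligature is stored as the
presentation form U+FEFC; the sample word and the helper name
contains_lam_alef are made up for illustration. A plain substring search
for Lam + Alef misses the ligated text unless it is first folded with NFKC.

```python
import unicodedata

SEEN = "\u0633"
LAM = "\u0644"
ALEF = "\u0627"
MEEM = "\u0645"
LAM_ALEF_FINAL = "\uFEFC"  # ARABIC LIGATURE LAM WITH ALEF FINAL FORM


def contains_lam_alef(text: str) -> bool:
    """Hypothetical helper: plain substring search for Lam followed by Alef."""
    return (LAM + ALEF) in text


# The same word stored two ways (illustrative sample data).
plain = SEEN + LAM + ALEF + MEEM        # Lam and Alef kept as separate characters
ligated = SEEN + LAM_ALEF_FINAL + MEEM  # ligature character stored in the buffer

found_plain = contains_lam_alef(plain)
found_ligated = contains_lam_alef(ligated)
# NFKC replaces the presentation form with its compatibility decomposition.
found_folded = contains_lam_alef(unicodedata.normalize("NFKC", ligated))
```

Folding works, but every search path then has to remember to do it, which
is one more argument for keeping plain Lam + Alef in the buffer.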

> But you can always turn off the "automatic adjustment" and type an unusual
> combination of shapes, including "initial lam + final alef", if that is
> needed.

That's rarely needed. I think I'll let the user type that three-character
sequence if she wants it. There are fonts that don't have that ligature,
for aesthetic reasons, but they are few. The existence of these fonts is
another reason for not using the Lam-Alef ligature in the character
buffer. If you really want to be kind, you should also let her insert a
character between the Lam and Alef (or at least let her put an accent on
the Lam, which is a basic right ;))

> By the way, this reminds me of a point that could be interesting to you: a
> sequence that violates the normal ligating behavior (like the "initial lam +
> final alef" above) would automatically generate the proper Unicode sequence
> like LAM+ZWJ+ZWNJ+ZWJ+ALEF with no need for users to know about the sequence
> ZWJ+ZWNJ+ZWJ, that I know you don't like.

If only you knew about the hard time I had while implementing that
recently. I was writing an Arabic display engine from scratch, and that
didn't fit in. I had designed a simple automaton that worked
incrementally: add each character in the shape it will most probably
end up with later. It was a two-state automaton: we're joining, or we're
not joining. For that, I had to add four states! (It's not only
ZWJ+ZWNJ+ZWJ that should prevent ligatures; any sequence of ZWJs with at
least one ZWNJ in between should also prevent them.)
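
The rule in the parenthesis can be sketched as a small state machine. This
is only an illustration under assumptions, not the display engine described
above: the state names and the function suppresses_ligature are
hypothetical, and the rule is read as "the run of joiners starts with ZWJ,
ends with ZWJ, and has at least one ZWNJ somewhere in between".

```python
from enum import Enum, auto

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


class State(Enum):
    START = auto()        # nothing seen yet
    LEADING_ZWJ = auto()  # only ZWJs so far
    BROKEN = auto()       # a ZWNJ has been seen and was the last character
    CLOSED = auto()       # a ZWJ follows at least one ZWNJ
    REJECT = auto()       # run started with ZWNJ


def suppresses_ligature(run: str) -> bool:
    """Hypothetical check: does this run of joiners ask for 'join, no ligature'?"""
    state = State.START
    for ch in run:
        if ch not in (ZWJ, ZWNJ):
            return False  # not a pure run of joiners
        if state is State.START:
            state = State.LEADING_ZWJ if ch == ZWJ else State.REJECT
        elif state is State.LEADING_ZWJ:
            state = State.LEADING_ZWJ if ch == ZWJ else State.BROKEN
        elif state in (State.BROKEN, State.CLOSED):
            state = State.CLOSED if ch == ZWJ else State.BROKEN
        # REJECT is a sink: once there, the run can never qualify.
    return state is State.CLOSED
```

Under this reading, ZWJ+ZWNJ+ZWJ and ZWJ+ZWNJ+ZWJ+ZWNJ+ZWJ both qualify,
while ZWJ+ZWJ, a lone ZWNJ, and ZWJ+ZWNJ do not.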

> What the user sees is the same anyway. And also what will end up in the
> actual Unicode file is the same, but you automatically get rid of
> unnecessary ZW(N)J characters (i.e. a ZWJ between two characters that would
> join anyway, or a ZWNJ between two characters that cannot join).

Please note that a ZWJ between two characters that join may have semantic
meaning: "ligate them if you have the font". So if you have that between
Lam and Hah, you should not delete that, because some fonts have the
ligature.

Ouch, something came to my mind just now: future versions of Unicode may
assign some meaning to new things, like ZWNJ between two characters
that won't join, or what about ZWJ+ZWNJ+ZWJ+ZWNJ+ZWJ? You should not
delete the characters, or you will become non-conformant in a future
version. (BTW, this ZWJ+ZWNJ+ZWJ is the "worst" thing in the whole
standard, in my opinion.)
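
One way to act on this, sketched in Python under assumptions: leave every
joiner in the character buffer and drop them only in a derived key used for
matching. The helper name search_key and the Lam + ZWJ + Hah sample are
illustrative; nothing here is prescribed by the standard.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
LAM = "\u0644"
HAH = "\u062D"


def search_key(text: str) -> str:
    """Hypothetical helper: a copy of text without joiners, for matching only."""
    return text.translate({ord(ZWJ): None, ord(ZWNJ): None})


# The ZWJ here is the "ligate them if you have the font" hint.
stored = LAM + ZWJ + HAH
key = search_key(stored)  # used for comparison; the stored text is unchanged
```

The stored text keeps the ZWJ, so the hint survives a save, while a search
for Lam + Hah can still match against the key.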

--roozbeh



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:20 EDT