The "f" in "fi" (was RE: Latin ligatures and Unicode)

Date: Thu Dec 23 1999 - 11:55:16 EST

John Cowan recently wrote:

>ZWL, though, does not cause "f" to become "the f-form used with i"
>nor "i" to become "the i-form used with f preceding", because there are
>no such things, and it would be intolerably ad hoc to make them so.

My first thought about this was "Right, of course: a ligated 'fi' is a
single glyph, whether it is used to render a single code point or just an
'f' + 'i' sequence".

However, for the following two days, I could not stop visualizing
things like "the 'f' in 'fl'", "the 'i' in 'fi'", "the 'alif' in
'lām-alif'", etc.

In a WYSIWYG environment, everybody expects that any instance of sequences
like "f" + "i" is displayed as a ligature, if the font so permits and
dictates (well, everybody but some friends of mine:-).

But, once the ligature is formed, it becomes a problem for screen editing.
What the users think they do when they edit an electronic document is to
insert, delete, substitute, move or mark *characters*. What they actually do
is perform symbolic actions on *glyphs*, which are the visual representation of
characters, and this causes the software to actually change the characters
in memory.

If I wanted to type "mail" but inadvertently wrote "mil", what I want to do
is to move my caret between the "m" and the "i" and add the missing "a". And
I can do it.

If I wanted to type "mile" but inadvertently wrote "maile", what I want to
do is to move my caret between the "m" and the "a" and hit the
DEL-RIGHT-CHAR key. I can do it, and my caret correctly remains where the
"a" used to be: between the "m" and the "i".

Everybody has already noticed that, if the "m" in the above examples is
replaced with an "f", we are going to have trouble. In a system that
displays "f" + "i" as a ligature, I cannot move the caret to the proper
place in "fil" to add my "a". I can certainly delete the "a" in "faile" but,
after I do this, my caret remains in an embarrassing location: "somewhere
*inside* a ligature".
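
The caret problem can be made concrete with a small sketch. The Python below is only an illustration under assumed names: the LIGATURES table and the shape() and caret_stops() helpers are hypothetical and do not belong to any real rendering API. It forms ligatures greedily and lists the character offsets where the caret can rest on a glyph boundary.

```python
# Hypothetical ligature table: character sequence -> single glyph name.
LIGATURES = {"ffl": "f_f_l", "ffi": "f_f_i", "fi": "f_i", "fl": "f_l", "ff": "f_f"}


def shape(text):
    """Return a list of (glyph_name, number_of_characters_covered)."""
    glyphs = []
    i = 0
    while i < len(text):
        # Try the longest ligature first, then fall back to one character.
        for seq in sorted(LIGATURES, key=len, reverse=True):
            if text.startswith(seq, i):
                glyphs.append((LIGATURES[seq], len(seq)))
                i += len(seq)
                break
        else:
            glyphs.append((text[i], 1))
            i += 1
    return glyphs


def caret_stops(text):
    """Character offsets that coincide with a glyph boundary."""
    stops = [0]
    for _, covered in shape(text):
        stops.append(stops[-1] + covered)
    return stops


print(caret_stops("mil"))  # every offset between characters is reachable
print(caret_stops("fil"))  # the offset between "f" and "i" is missing
```

In "fil" the offset 1, exactly where the missing "a" should go, is not a glyph boundary, so a caret that only stops between glyphs can never reach it.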

What can programmers do about this? Some approaches:

#1 Avoid ligatures. - This is not acceptable in a WYSIWYG environment and,
for certain scripts, this is not acceptable even in the humblest text-only
environment.

#2 Split ligatures when the caret passes over them. - This is the same as #1
above, only less frequent.

#3 Once a ligature is formed, treat it as if it were a single unit. - Most
people, although perfectly literate, have never noticed that "fl" looks slightly
different from, say, "fb" or "fh". Do you want them to notice it just to
decide they don't like *your* software?

#4 Pretend that the "ffl" glyph represents the first "f" only; the second
"f" and the "l" would then be zero-width things following the visible glyph.
- This is the same as #3 above, but even more puzzling.

But if our font represents an "fi" ligature as two ad-hoc artificial glyphs
(plus an ad hoc kerning pair, plus an ad hoc contextual shaping rule), we
obtain a double score:
- The display looks pretty, just like a printed book;
- The user's perception that characters = visible glyphs = keyboard strokes
may be supported, for the sake of usability.
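
A minimal sketch of this two-glyph idea, assuming made-up glyph names ("f.before_i" and "i.after_f" are hypothetical, as is the shape_contextual() helper): a contextual rule swaps each character for a form designed to sit next to its neighbour, so the run keeps exactly one glyph per character.

```python
def shape_contextual(text):
    """Map each character to one glyph, choosing contextual forms for "fi".

    The glyph names are invented for illustration; a real font would also
    need a kerning pair so that the two forms join visually.
    """
    glyphs = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "f" and nxt == "i":
            glyphs.append("f.before_i")
        elif ch == "i" and prev == "f":
            glyphs.append("i.after_f")
        else:
            glyphs.append(ch)
    return glyphs


print(shape_contextual("fil"))
print(shape_contextual("mil"))
```

Because the glyph count always equals the character count, every position between two characters is also a position between two glyphs, and the caret can stop there.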

Of course, this idea has its problems too. It is easy to see the single
letters in Latinate "ffl"; also seeing the single letters in Arabic
"lām-alif" is easy, but not as much; but seeing the "mīm" in "lām-mīm" is
admittedly quite hard; and spotting the "ka" and "sha" in Devanagari "ksha"
requires some historical lessons from Peter T. Daniels himself.

Finally, the virama in the consonant clusters of many Indic scripts is
*really* invisible and there is no way we can visualize it *and* claim we

For the "difficult" cases like "lām-mīm" or "ksha", my idea would be to
decide arbitrary borders within the glyph, hoping that the user will follow
the reasoning.
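
One possible way to pick such arbitrary borders is to cut the glyph's advance width into equal slices, one per character. This is only a sketch: the equal division, the font-unit values and the helper names caret_positions() and caret_index_at() are all assumptions for illustration, and a font designer could just as well supply hand-tuned positions.

```python
def caret_positions(advance, n_chars):
    """X offsets of the n_chars + 1 caret stops inside one ligature glyph.

    Assumes the border between characters is an equal slice of the advance.
    """
    return [advance * k / n_chars for k in range(n_chars + 1)]


def caret_index_at(x, advance, n_chars):
    """Character offset of the caret stop nearest to a click at x."""
    stops = caret_positions(advance, n_chars)
    return min(range(len(stops)), key=lambda k: abs(x - stops[k]))


# A hypothetical "ffl" glyph that is 900 font units wide.
print(caret_positions(900, 3))
# A click at x = 200 in a hypothetical 600-unit "fi" glyph.
print(caret_index_at(200, 600, 2))
```

The same table serves both directions: drawing the caret at a given character offset, and turning a mouse click inside the glyph into a character offset.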

For the "impossible" cases like the invisible viramas, I would step back to
#3 above, trying to enforce the user's perception that virama is *not* a
character by itself. One way to suggest this is to define a keyboard where
each "full consonant" is assigned to a plain key, and each "dead consonant"
(consonant+virama, in the encoding) is assigned to the corresponding shifted
key. No key should be assigned to virama itself: when someone exceptionally
needs a stand-alone virama (e.g. in a didactic text), they would enter it
through less direct methods (e.g. using some "Insert Symbol" menu command).
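
Such a keyboard could be sketched as follows. The key assignments ("k", "z", "t") are purely hypothetical, as are the names FULL_CONSONANTS, key_to_text() and type_keys(); only the Devanagari code points are taken from Unicode. The plain key yields the full consonant, the shifted key yields consonant + virama, and no key produces a virama on its own.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

# Hypothetical key assignments, for illustration only.
FULL_CONSONANTS = {
    "k": "\u0915",  # DEVANAGARI LETTER KA
    "z": "\u0937",  # DEVANAGARI LETTER SSA
    "t": "\u0924",  # DEVANAGARI LETTER TA
}


def key_to_text(key):
    """Plain key -> full consonant; shifted key -> dead consonant."""
    if key in FULL_CONSONANTS:
        return FULL_CONSONANTS[key]
    if key.isupper() and key.lower() in FULL_CONSONANTS:
        return FULL_CONSONANTS[key.lower()] + VIRAMA
    raise KeyError(f"no character assigned to key {key!r}")


def type_keys(keys):
    """Text stored in memory after typing the given key sequence."""
    return "".join(key_to_text(k) for k in keys)


# Shift+k followed by z stores KA + VIRAMA + SSA, which renders as "ksha".
print(type_keys("Kz"))
```

The user presses two keys and sees one conjunct, while the three code points, virama included, are stored without the user ever handling the virama as a separate thing.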

So said, I wish you all a pleasant Winter Solstice/Christmas/End-of-Ramadan
Id. See y'all post-Y2K-bug.

_ Marco
