RE: Latin ligatures and Unicode

From: Marco.Cimarosti@icl.com
Date: Wed Dec 29 1999 - 08:13:40 EST


Kenneth Whistler wrote:
>> In the cases when these differences become significant, there should be a
>> way to encode them in plain text...
>You keep reasserting this, without really addressing Mark's questions
>regarding balancing the community of use versus the mechanisms needed
>for conveying these significant differences.

Ken is perfectly right here. I am just a casual i18n enthusiast and I often
stay at the surface of problems; I have been surprised more than once by the
complications that arise when someone else on the list goes deeper into an
issue.

I chose one of Mark's past mails, which laid out interesting scenarios for
the practical use of ligatures, and I will try to comment on it.

>> if some general solutions like a ligator
>> mechanism are not found, people will keep on asking for new ligatures or
>> variants for ever -- because they need them, not because they are all
>> dummies.
>And this misses the points that both Mark and John Jenkins have been
>making. Simply because there is a need for control of representation
>of ligatures in rendered text does not imply that encoding a ZWL or
>ZWNL (or both) *as* a character is required to do so.

It depends. When the control of ligatures is just a matter of taste, like
selecting the font face or the text color, then yes: there is no need for it
in plain text, and the matter is better addressed by higher-level protocols
on typography-quality systems. Someone writing an e-mail in Courier
certainly won't need it.

But there are cases where the presence or absence of a ligature affects the
meaning of the text. These are certainly rare and marginal cases (in fact,
this is being discussed after Unicode's 10th birthday) but they exist and,
because the *meaning* of the text is affected, not only its presentation,
the issue should be addressed at the plain text level.

The contexts where these graphic variations become significant are mainly
historical and meta-linguistic, but they are nevertheless important for
someone.

Do you remember Carlos Levoyer, the guy who was dealing with ancient Spanish
texts? I suggested that he drop all the ligatures in his old books and
expand them to regular modern spelling in his on-line edition. But if he
does not want to do so, he may need a way to specify ligatures like "ct"
(plus some special letters, like the long "s"). And, perhaps, he has to do
this in HTML, or in a database field: that is, in plain text, sort of.

A non-historical example of the need to control ligatures in plain text has
already been given: the "fi" ligature in Turkish. In most roman fonts, the
dot over "i" disappears in the "fi" ligature, because it merges with the
top of the "f". This aesthetic adjustment is perfectly innocent in most
languages, because the dot on "i" has no special meaning (it is just a
heritage from handwriting). In Turkish, however, dotless "i" is a separate
letter so, in certain fonts, the ligature loses the distinction between
"fi" and "fI".

In this case too, the average Turkish user won't need to control anything.
The fonts installed on a Turkish system won't have "fi" ligatures; or their
glyph will be adjusted to clearly show the dot; or they will be disabled at
the system level, etc.

The problem arises when plain text from a Turkish system is exchanged with an
arbitrary recipient; e.g. when a web page is published, or an email is sent
abroad, or (this is the case in my job) a database contains messages in
different languages to support a multilingual GUI. In this case, it would be
useful to be able to insert (manually or automatically) ZWNL controls to
split all "fi"'s.
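
As a rough illustration of the automatic insertion mentioned above, here is a
minimal Python sketch. It rests on assumptions: no ZWNL code point exists, so
U+200C (ZWNJ) is used as a stand-in, in line with the unification idea
discussed in this message, and the function name split_fi is made up for the
example.

```python
import re

# Assumption: U+200C (ZERO WIDTH NON-JOINER) stands in for the proposed ZWNL,
# which has no code point of its own.
ZWNL = "\u200c"


def split_fi(text: str) -> str:
    """Insert the stand-in ZWNL between every 'f' and an immediately following 'i'.

    Pairs that are already split are left alone, because the 'f' is then
    followed by the control character and not by 'i'.
    """
    return re.sub(r"f(?=i)", "f" + ZWNL, text)


if __name__ == "__main__":
    # Show the code points so the invisible control is visible in the output.
    print([hex(ord(ch)) for ch in split_fi("fi")])
```

Running the function twice gives the same result as running it once, so a
database could apply it on every save without piling up controls.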

Of course, this could be addressed with language tagging... It could,
provided that a long chain of ifs (the big one of which is "if this belongs
in 'plain text'") proves true.

I would add a 3rd case. Do you remember Adam Stanislav and his Slovak "CH"
letter? Well, in that case he was probably wrong, and probably there is no
need for such a ligature. Moreover, this hypothetical "CH" would not be any
different, visually, from a "C" + an "H" (especially in upper case). But
imagine that Adam was right, at least partially, and that in Slovak text
there is an actual need to distinguish a genuine Slovak "ch" from the "c"
and "h" sequence in "champagne". Wouldn't ZWL be a very handy way to encode
the *semantic* difference, regardless of the visual presentation?

>People can ask for new ligatures and variants. If they ask them of
>the appropriate vendors -- the developers of fonts, particularly those
>using the technology Mark and John are talking about -- then there
>is a reasonable chance they will get what they are asking for. If,
>however, they besiege character encoding committees asking for ligatures
>and variants to be encoded *as* characters, then they won't.

I would like to stress one point. If I am not totally wrong, Unicode should
be a standard to encode *plain text*.
AAT, OpenType, or any other font technology should not be considered as
*prerequisites* for displaying Unicode.
Or is any particular font technology now *required* by the Unicode standard?
Or is it now "non conformant" to use bitmapped fonts?

If yes, I have really missed the news. I have to retire for a while and
think carefully. Especially, I will have to find out how I am supposed to
use OpenType inside a database field, or on a 16-dot x 40-column receipt
printer.

If not, then these technologies, however clever they may be, cannot be
proposed to solve *encoding* problems, as is being done repeatedly in this
discussion, but only as *one* of the possible scenarios.

I would also like to stress some differences between Michael's proposal and
how I see it:

1) I would not propose that ZWL, in the normal case, is a prerequisite to
the appearance of a ligature. If a font maps a glyph to "x<ZWL>y" and not
to "xy" it is because, in that particular font, the "xy" ligature is
considered rare and optional (and, hence, to be manually selected). This
would be the case for an Irish font, designed to be normally used to compose
modern text but that, occasionally, may be used to reproduce old typography.
The most common case would be that the ligature happens automatically
whenever the "xy" sequence occurs in the text.

2) Because of (1) above, I advocate the existence of a ZWNL to allow the
user to forbid a ligature that would normally happen.

3) I like the idea that ZWL and ZWNL be unified with ZWJ and ZWNJ,
respectively. This makes the "new" controls readily available *today* and
avoids the "black box" problem (see further). The cost of this (limited, in
my mind) is that Unicode has to assign a very low priority to the
irrelevant "European cursive fonts" issue. There is a second price to pay:
the ZW*L function cannot be used in scripts, like Devanagari or Arabic,
where ZWJ and ZWNJ already have a well-defined meaning. But my feeling is
that the ligating behavior of these scripts is already well defined, or (for
Indic scripts other than Devanagari) will be defined soon with no need for
the new "ligator" concept.

One last thing before moving on to Mark's scenarios.

Someone compared adding ZW*L to adding italic or bold in plain text. Well,
stop laughing: the idea that these basic text attributes (or, alternatively,
abstract emphasis indicators à la HTML <em> ... </em>) belong in plain text
deserves careful consideration. In fact, italics often carry a meaning.
Someone already noted that a word like "chat", in English, could be
understood and pronounced in very different ways depending on whether it is
in italics or not. Moreover, when we quote someone else's text, we feel
perfectly free to change the font face or size, the color, the alignment.
But italics are normally retained, being considered part of the quotation!
And when an author deliberately changes these attributes, (s)he normally
takes care to warn the reader: "the italics are mine".

It is not unlikely that the third-millennium concept of "plain text" could
be extended to carry such information. By the way, it would be no more
complicated to implement than support for some existing BIDI controls. So,
as a minimum, I would not consider this a valid argument against ZWL.

Mark E. Davis wrote:
>With both font technologies, control can be offered to the user. Let's
>make the scenarios more specific.
>Suppose that the user has a "AAT/OpenType" font that has the ligature
>glyph for "ct". If he/she doesn't, then a "ct" ligature will never appear
>in any of the following scenarios (scenarii?).

False. This is a big planet, why couldn't someone come up with a "ct"
ligature in a completely different technology?

(BTW: My old English teacher would have said "scenarioes", but she was a
2nd language teacher too. In Italian it would be "scenari" or, even better,
"scenarî"; "scenarii" is very obsolete and is not to be seen in 20th century
texts).

>Note that the sequence "ct" will be rendered as a "ct" ligature if the
>font has that ligature on by default (ON), and just as "ct" if it is off
>by default (OFF)

And, of course, the non-ligated rendering will be used in any case if the
font has no glyph for that ligature.

>Here are the different cases:
>
>A. The keyboard supports ZWL.
> A1. Before typing ZWL:
> ON: the user sees "ct" ligature.
> OFF: the user sees "ct".
> A2. After typing ZWL:
> A2a. The font doesn't know about ZWL.
> ON&OFF: The user sees "c", <black box>, "t"
> A2b. The font at least knows that ZWL is in the range of format
> characters, and renders it invisibly.
> ON&OFF: The user sees "ct"
> A2c. The font supports the triple "c", "ZWL", "t".
> ON&OFF: The user sees "ct" ligature.

A2a: The black box problem is not specific to this discussion. Any newly
added character would display as a black box on older systems: this is very
bad, but the problem disappears as products get updated. Moreover, if ZWL is
unified with ZWJ, A2a falls into the less severe A2b.

There is also another possibility:

        A2d. A particular font (or rather, rendering system) knows that, in
ON, it already uses all the available ligatures. ZWL is useless for such a
font, so it is removed during glyphization; "c<ZWL>t" becomes "ct", and the
usual ligature is used.
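
Cases A2a to A2c from Mark's list, plus the A2d variant just described, can
be modelled with a toy renderer. This is only a sketch under assumptions:
U+200D (ZWJ) stands in for the hypothetical ZWL, the only ligature known to
the "font" is "ct", and the mode names ("unknown", "invisible", "triple",
"strip") are invented labels for A2a, A2b, A2c and A2d.

```python
# Assumption: U+200D (ZERO WIDTH JOINER) stands in for the proposed ZWL.
ZWL = "\u200d"
LIGATURE = "<ct ligature>"
BLACK_BOX = "<black box>"


def render(text, ligatures_on, zwl_mode="invisible"):
    """Return the list of glyphs a toy font would show for `text`.

    ligatures_on -- True if the font ligates "ct" by default (ON), else OFF.
    zwl_mode     -- "unknown"   (A2a): ZWL shows up as a black box
                    "invisible" (A2b): ZWL is drawn as nothing
                    "triple"    (A2c): the font maps c + ZWL + t to the ligature
                    "strip"     (A2d): an ON font drops ZWL before shaping
    """
    if zwl_mode == "strip" and ligatures_on:
        text = text.replace(ZWL, "")

    glyphs = []
    i = 0
    while i < len(text):
        if zwl_mode == "triple" and text.startswith("c" + ZWL + "t", i):
            glyphs.append(LIGATURE)
            i += 3
        elif ligatures_on and text.startswith("ct", i):
            glyphs.append(LIGATURE)
            i += 2
        elif text[i] == ZWL:
            if zwl_mode == "unknown":
                glyphs.append(BLACK_BOX)
            # in every other mode the control produces no glyph
            i += 1
        else:
            glyphs.append(text[i])
            i += 1
    return glyphs


if __name__ == "__main__":
    for mode in ("unknown", "invisible", "triple", "strip"):
        print(mode, render("c" + ZWL + "t", ligatures_on=True, zwl_mode=mode))
```

Note that in the "invisible" mode an ON font loses its default ligature,
because the control sits between "c" and "t"; the "strip" mode is what
avoids that.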

>B. The keyboard supports ZWNL.
> B1. Before typing ZWNL:
> ON: the user sees "ct" ligature.
> OFF: the user sees "ct".
> B2. After typing ZWNL:
> B2a. The font doesn't know about ZWNL.
> ON&OFF: The user sees "c", <black box>, "t"
> B2b. The font at least knows that ZWNL is in the range of format
> characters, and renders it invisibly.
> ON&OFF: The user sees "ct"

B2a: Same as A2a above.

B2b: Although inadvertently (or "transparently"), the system adopts the
correct behavior.

>C. The GUI supports explicit ligature formation (EON or EOFF).
> B1. Before choosing setting
> ON: the user sees "ct" ligature.
> OFF: the user sees "ct".
> B2. After choosing setting
> B2a. The font doesn't know about or support setting
> ON: the user sees "ct" ligature.
> OFF: the user sees "ct".
> B2b. The font supports the setting
> EON: ON&OFF: The user sees "ct" ligature
> EOFF: ON&OFF: The user sees "ct"

This system uses a higher-level alternative protocol, so it is a little bit
out of scope. To fully integrate the new ZWL/ZWNL in such a system, 3 steps
should be taken:

1) Forbid the use of ZWL and ZWNL while editing. They are not needed here,
and their only effect would be to mess up the existing higher-level
protocol.

2) Convert EON/EOFF sequences to equivalent ZWL/ZWNL separators when the
rich text is exported to plain text (provided that such a function exists).

3) Convert ZWL/ZWNL separators to equivalent EON/EOFF sequences when plain
text is imported into rich text (provided that such a function exists).
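
The export and import conversions in the steps above could look roughly like
this. Everything here is an assumption made for illustration: rich text is
modelled as a list of (text, setting) runs with setting "EON", "EOFF" or
None, U+200D and U+200C stand in for ZWL and ZWNL, and PAIRS is a made-up
list of the letter pairs the system can ligate.

```python
import re

# Assumptions: ZWJ/ZWNJ stand in for the proposed ZWL/ZWNL, and PAIRS is a
# hypothetical list of the letter pairs this rich-text system can ligate.
ZWL, ZWNL = "\u200d", "\u200c"
PAIRS = ("ct", "fi")

_CONTROL = re.compile("(.)([\u200d\u200c])(.)", re.DOTALL)


def export_plain(runs):
    """Rich text -> plain text: turn EON/EOFF runs into ZWL/ZWNL separators."""
    out = []
    for text, setting in runs:
        mark = {"EON": ZWL, "EOFF": ZWNL}.get(setting)
        if mark:
            for pair in PAIRS:
                text = text.replace(pair, pair[0] + mark + pair[1])
        out.append(text)
    return "".join(out)


def import_plain(text):
    """Plain text -> rich text: turn each x<ZWL>y or x<ZWNL>y into its own run."""
    runs, pos = [], 0
    for m in _CONTROL.finditer(text):
        if m.start() > pos:
            runs.append((text[pos:m.start()], None))
        setting = "EON" if m.group(2) == ZWL else "EOFF"
        runs.append((m.group(1) + m.group(3), setting))
        pos = m.end()
    if pos < len(text):
        runs.append((text[pos:], None))
    return runs


if __name__ == "__main__":
    plain = export_plain([("a perfe", None), ("ct", "EON"), (" fit", "EOFF")])
    print(import_plain(plain))
```

The round trip does not restore the original run boundaries: after import,
each controlled pair becomes a run of its own, which is one of the
limitations one would have to accept.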

>In all three scenarios, one can control whether the user sees a "ct" or a
>"ct" ligature, if all the appropriate support is there. If all the
>appropriate support is not there, they degrade in different ways, as one
>can see by the different cases.
>Lacking complete support, at best, nothing will happen. In some cases,
>typing a ZWL will actually break a ligature that would otherwise appear.
>At worst (with the format characters), the user will see "c" <black box>
>"t".
>The one further interesting scenario is text interchange:
>- When going from rich text (in C) to plain text, one loses the explicit
>ligature setting. One does not lose the ZWL/ZWNL characters. The result on
>the other end of the interchange depends on the capabilities there, of
>course. Sending a ZWL or ZWNL may not achieve the desired result, if the
>end user just sees a black box.
>- In going between different rich-text formats, it depends on the
>capabilities of the formats. Looking at the markup world, XSL does support
>EON/EOFF, while CSS does not yet.

See above for interchange to/from plain text. Interchange between different
rich text systems is rather off topic. However, ZWL/ZWNL could also be the
basis for interchange to/from rich text systems that don't have their own
mechanism for this, with the same limitations and problems as in plain text.

Ciao.
        _ Marco


