Re: Latin ligatures and Unicode

From: Gary Roberts
Date: Wed Dec 29 1999 - 18:09:33 EST

Search engines that are not markup savvy (including my eyes) find markup
more disruptive than a single character, particularly if it occurs
frequently.

Search en-gines that are not mark-up sav-vy (in-clud-ing my eyes) find
mark-up more dis-rup-tive than a sin-gle char-act-er, par-tic-u-lar-ly if
it occ-urs fre-quent-ly.

Search en<SH>gines that are not mark<SH>up sav<SH>vy (in<SH>clud<SH>ing my
eyes) find mark<SH>up more dis<SH>rup<SH>tive than a sin<SH>gle
char<SH>act<SH>er, par<SH>tic<SH>u<SH>lar<SH>ly if it occ<SH>urs
fre<SH>quent<SH>ly.

Note that I have chosen relatively non-intrusive markup. When I edit text
with markup, I want to see both WYSIWYG and exact markup at the same time.
This is not possible, but the single character is less intrusive to the
WYSIWYG model than the markup. On the other hand, seeing

<Paragraph type="Normal">
This is a paragraph of text.
</Paragraph>

isn't that intrusive. In fact, the latter markup hardly hinders
a plain text search engine. Of course, my single character soft hyphen
does hinder my plain text search engine, but at least I am not talking
about an arbitrary distance. For example, the egrep search pattern
en.?gines will produce fewer false positives than en.*gines, and it does
not require me to know the exact spelling of the markup I am looking for;
the whole purpose of my search might be to figure out the answer to that
very question. Note that using a WYSIWYG editor would not help me find
that out either.
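
A minimal sketch of the search comparison, in Python rather than egrep.
It assumes the bounded pattern is meant to allow at most one character
between "en" and "gines" (written here as en.?gines), that the soft
hyphen is U+00AD, and that <SH> is the hypothetical markup used in the
examples above. The sample strings and the helper name hits() are made
up for illustration.

```python
import re

SHY = "\u00ad"  # SOFT HYPHEN, the single-character approach

# Three spellings of the same text, plus a decoy line that merely
# contains "en" somewhere before "gines".
plain = "Search engines that are not markup savvy"
with_char = f"Search en{SHY}gines that are not mark{SHY}up sav{SHY}vy"
with_markup = "Search en<SH>gines that are not mark<SH>up sav<SH>vy"
decoy = "even though nobody imagines it"

# Bounded: at most one arbitrary character between "en" and "gines".
bounded = re.compile(r"en.?gines")
# Unbounded: any distance, as needed when the markup spelling is unknown.
unbounded = re.compile(r"en.*gines")


def hits(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    samples = [("plain", plain), ("soft hyphen", with_char),
               ("<SH> markup", with_markup), ("decoy", decoy)]
    for name, text in samples:
        print(f"{name:12} bounded={hits(bounded, text)!s:5} "
              f"unbounded={hits(unbounded, text)}")
```

The bounded pattern finds the word with or without the soft hyphen and
ignores the decoy. Finding the <SH> form without knowing its spelling
needs the unbounded pattern, which also matches the decoy.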

On Wed, 29 Dec 1999, Tex Texin wrote:

> Gary,
> Interesting comments.
> I don't see why adding a character is preferable to adding a markup
> that operates on a point in-between characters. Seems to me if I have
> a mechanism such as a markup language, I would like all commands to go
> thru the markup and not have an alternative mechanism for markup that
> operates at a point rather than a span.
> Having two mechanisms certainly makes it more difficult to design and
> implement while ensuring the two interoperate and interact reasonably.
> Certainly, a good tool would provide easy keyboard generation of the
> markup, just as easily as adding a character would require keyboard
> generation of the character, so input is not the issue.
> tex
> Gary Roberts wrote:
> >
> > Yes. There is definitely an issue of how to accomplish what one wants in
> > a way that will be implemented. For example, if the solution relies on
> > language tags (e.g. dictionary based solutions), then it is of little use
> > if companies don't provide support for your language. On the other hand,
> > the soft hyphen is generally implemented, and supports languages that
> > haven't even been invented yet. Now, one could argue whether soft hyphen
> > is best implemented as markup or as the addition of a new character. I
> > tend to read and create markup files by hand. My tendency is to prefer
> > markup when there is some span to the markup. The more characters the
> > markup is likely to affect, the more I prefer it to adding a character.
> > Soft hyphen is an example where there is no span at all, and it makes
> > sense to solve the issue with a soft hyphen character. I see ZWL
> > as a substitute for markup having a span of two or three characters, which
> > still makes it attractive as a new character solution. It also seems
> > more flexible. Say that I often deal with fonts that have only ligature
> > pairs. Given the choice of ff i or f fi, I always prefer ff i,
> > but my colleague prefers f fi. We both prefer ffi as a single ligature
> > if it exists in the font. What markup gives each of us the results we
> > prefer? For &=ZWL, the answer is f&fi for me, and ff&i for my colleague.
> > Note that ZWNL is not useful for this case. I can speculate about the
> > appropriate markup language, but I'd rather hear how others have actually
> > solved this problem.
> > *
> >
> > On Wed, 29 Dec 1999, Asmus Freytag wrote:
> >
> > > What is at the heart of this recurring request is that support for many
> > > scripts (or older typographies) is incomplete without an
> > > *interchangeable* method of indicating the presence or absence of
> > > ligatures.
> > >
> > > Plain text used to be the *only* medium with near universal
> > > interchangeability. With the web, this has changed. It is now appropriate
> > > to move this discussion to a higher plane and consider the question
> > > differently:
> > >
> > > What is the best way to interchange text containing ligatures on the web?
> > >
> > > Posing this question allows us to consider the full-featured typographic
> > > and aesthetic requirements for ligation - as well as any inherent
> > > regularities. Once we have a design in place for interchanging ligatures
> > > with marked up text, we can revisit that and see whether replacing markup
> > > instructions by character codes gives better results.
> > >
> > > I feel we have explored the semantic aspects of this long enough to
> > > conclude that there is some evidence that a ZWNL is linked slightly more to
> > > the underlying semantic content of the text than a ZWL, but that in
> > > neither case do we have enough to settle the argument in favor of making them
> > > characters today.
> > >
> > > Both concepts ('ligate here', 'don't ligate here') can in principle be
> > > expressed with HTML or XML style markup - I have seen too little discussion
> > > of what this markup should be like, and what the consequences are of it
> > > being present in the middle of words. Is that something that the HTML/XML
> > > community wants to deal with?
> > >
> > > The next question, assuming that we agree on what ligation commands look
> > > like in markup, concerns interchange between parts of a program, e.g. text
> > > processor to rendering engine. Is it meaningful to have character codes at
> > > that level, or is it more typical that each ligature is its own little
> > > style run?
> > >
> > > The strongest arguments in favor of character codes come from those who
> > > have for a long time needed to 'trick' various applications into
> > > supporting languages that they were not explicitly designed for. If
> > > character codes would result in 'enabling' many of these
> > > implementations, by letting the author communicate with the rendering
> > > engine, so to speak, that is itself a valid argument to consider. (It
> > > would need some actual case studies where this approach is shown to
> > > work.)
> > >
> > > Still, even that would need to be contrasted with the cost to applications
> > > that do not know about these as characters and end up showing 'boxes'.
> > >
> > > A./
> --
> Spanish Proverb: Don't speak unless you can improve on the silence.
> Tex's Proverb: Don't email unless you can improve on the screen saver.
> Progress Software: The #1 Embedded Database
> -------------------------------------------------------------------------------------------------------
> Tex Texin Director, International Products
> Progress Software Corp. Voice: +1-781-280-4271
> 14 Oak Park Fax: +1-781-280-4949
> Bedford, MA 01730 USA
> -------------------------------------------------------------------------------------------------------
