Order of combining marks [was Re: Normalization Form KC for Linux]

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 30 1999 - 14:48:51 EDT


J. asked:

> Dan <Dan.Oscarsson@trab.se>:
>
> D> I have not yet located why. I can see ways were software can
> D> handle them much easier if they comes before.
>
> I'd like to second Dan in his request for an explanation here.
> Putting combining characters before a spacing character allows you to
> determine the end of the composite as you reach it (think about
> reading a stream of codepoints from a network connection). With
> combining characters after the base character, you need one character
> of lookahead (is the next character combining?).
>
> So why did Unicode chose to put combining marks after the base
> character?
>

This debate was originally held in 1989 - 1990. Periodically since
then (about once a year, on average), someone gets exercised about
the topic and demands to know why Unicode combining characters do
not precede their baseform, as they so "obviously" should.

First of all, we should all understand that this is a closed issue--
decided a decade ago, and enshrined in the international standard.
So in some respects, the answer is just that the standard decrees
it to be so, so deal with it.

At the second level, the basic reasons are spelled out in the
text of the Unicode Standard, Version 2.0, on page 2-15, so you can
find the answer there.

But for the benefit of the list, there were two major reasons for
having combining marks following the base forms.

1. Combining marks are not just for accents applied to Latin letter
baseforms. Combining marks are used as a fundamental part of many
scripts of the world, including Arabic--where they represent vowelings,
as well as diacritics--and Indic scripts, where they also often
represent vowels. The logical representation of a CV (consonant vowel)
sequence in such scripts is first the consonant (usually a baseform
character) and next the vowel (often a combining mark). To have the
combining marks *precede* the consonant in memory representation
would make implementation of such scripts more complex and would
depart radically from established implementations of such scripts
(for example, ISCII for Indian scripts).

2. Modern computer font technology uses advanced image models,
with two-dimensional metrics, with independent concepts of image
width and character (= glyph) advancement. Application of a diacritic
to a baseform is treated as rendering of the baseform, followed by
a calculated two-dimensional move to the start point for the
rendering of the "non-spacing" diacritic mark. This is not
character cell rendering of fixed forms, nor backspace overstruck
technology. It is more straightforward to map a piece of plain text
(a vector of encoded characters) into the appropriate sequence of
glyphs (including all the relative metric offsets) when the combining
marks follow their baseforms -- since that naturally follows the
font models.
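
To make reason 2 concrete, here is a minimal Python sketch of base-then-mark
layout. It is an illustration only, not taken from any real font engine: the
ADVANCE and ANCHOR tables, the MARK_HEIGHT constant, and the layout() function
are all hypothetical, and every mark is assumed to attach above its baseform.

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class Placed:
    char: str
    x: float
    y: float

# Hypothetical metrics, invented for this sketch (not from a real font).
ADVANCE = {"a": 10.0, "e": 9.0}                 # pen advance per baseform
ANCHOR = {"a": (5.0, 12.0), "e": (4.5, 12.0)}   # mark attachment point per baseform
DEFAULT_ADVANCE = 10.0
MARK_HEIGHT = 3.0                               # vertical step between stacked marks

def layout(text):
    """Place each baseform at the pen, then offset following marks from it."""
    placed = []
    pen_x = 0.0      # where the next baseform will be drawn
    base_x = 0.0     # origin of the most recent baseform
    anchor = None    # attachment point of the most recent baseform
    stacked = 0      # number of marks already attached to that baseform
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and anchor is not None:
            # Two-dimensional move relative to the baseform; the pen does not advance.
            ax, ay = anchor
            placed.append(Placed(ch, base_x + ax, ay + stacked * MARK_HEIGHT))
            stacked += 1
        else:
            # A baseform (or a mark with nothing to attach to): draw and advance.
            base_x = pen_x
            placed.append(Placed(ch, pen_x, 0.0))
            advance = ADVANCE.get(ch, DEFAULT_ADVANCE)
            anchor = ANCHOR.get(ch, (advance / 2, 12.0))
            pen_x += advance
            stacked = 0
    return placed
```

Because each mark arrives after its baseform, the loop already knows the
attachment point when the mark is read. With marks first, the marks would
have to be held until the baseform (and hence the anchor) was known.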

Given that Unicode was designed as a plain text encoding, the
combination of reasons 1 and 2 essentially made it a "no-brainer"
that combining marks should follow the baseforms. To the engineers
involved, many of whom had extensive experience with the programming
of software for text rendering, the choice was an easy one. And
once it was decided that combining marks must follow baseforms, there
could not be any halfway measure--all combining marks, whether non-spacing
or not, had to follow baseforms. Any mixture of schemes would have
been chaotic in the encoding.

But why did the arguments for a non-spacing mark preceding its baseform
seem so unpersuasive to the designers of the standard?

First, the non-spacing-mark-first model derives from old
keyboard technology, as others have pointed out. The habit of
typing a diacritic first on a dead key, and then the baseform,
was dictated by the constraints of typewriter design, and not by
any natural considerations for the way people write their scripts.
But actually, given the nature of computers, that turns out to
be essentially beside the point. Keyboard drivers and input methods
have to sit in between the physical keys and the windowing system's
(or other system's) effective point of character input anyway.
Processing accents can be done in either order, with or without
a formal "dead" key. There are a half-dozen ways to do it for
simple accented letters, with or without intermediate input displayed
to the user. Experience at Apple suggested that people liked
intermediate input feedback, and that baseform + accent was an
acceptable (or preferred) mode and provided better feedback: type
a baseform -- see the baseform -- type an accent -- see the accent
applied to the baseform. But the dead key typing method can be
supported with a memory representation in which the combining mark follows.
As can the compose key/base form/accent or compose key/accent/
base form typing methods.
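
As a rough sketch of that last point, the following Python shows an
accent-first (dead key) typing order being stored with the combining mark
after the baseform. The DEAD_KEYS table and the dead_key_input() function
are hypothetical names made up for this illustration; a real keyboard driver
or input method would be considerably more involved.

```python
import unicodedata

# Hypothetical dead-key table: key typed -> combining mark stored in memory.
DEAD_KEYS = {
    "'": "\u0301",  # COMBINING ACUTE ACCENT
    "`": "\u0300",  # COMBINING GRAVE ACCENT
    "^": "\u0302",  # COMBINING CIRCUMFLEX ACCENT
}

def dead_key_input(keystrokes):
    """Turn accent-first keystrokes into baseform-first stored text."""
    out = []
    pending = None  # combining mark waiting for its baseform
    for key in keystrokes:
        if pending is None and key in DEAD_KEYS:
            # Dead key pressed: remember the mark, store nothing yet.
            pending = DEAD_KEYS[key]
        elif pending is not None:
            # Next key is taken as the baseform: store it first, then the mark.
            out.append(key + pending)
            pending = None
        else:
            out.append(key)
    # A dead key with no following baseform is simply dropped in this sketch.
    return "".join(out)

stored = dead_key_input("caf'e")                 # user types ' and then e
composed = unicodedata.normalize("NFC", stored)  # precomposed form, if wanted
```

The user still types the accent first; only the stored order differs, which
is the sense in which keyboard order and memory order are independent.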

Second, the argument that with the combining mark first, you know
when to stop, whereas with the combining marks following, you
don't, didn't wash. The problem is that when you are talking about
text rendering, you never know "when to stop" anyway, even for
baseforms. Think Arabic. I type an "sad". The text rendering process
has to do something to put that up on the screen. Next I type
an "lam". The text rendering process not only has to present the "lam",
it has to do the shaping algorithm that modifies the shape of the
preceding "sad". This is no different in concept or complexity from
what a text process has to do to render an "a" and then dealing
with a combining accent that follows in its input stream. This
was the insight that Apple had (derived from the text rendering
pioneers at Xerox) that convinced them that logical order of
baseform plus diacritic made sense even for Latin.

Third, if you are dealing with a streaming interface of some sort
that has to wait to collect a combining character sequence to
do something with it as a unit, you have to do buffering
anyway. It is not harder to collect up a combining character
sequence in either order. (Contrary to the implied claim by
Dan that it is "much easier" to handle them with the combining
mark first.) The only difference comes when you need
to provide immediate visual display on an incomplete sequence --
and the second point above has already dealt with that.
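
The buffering point can be sketched in a few lines of Python. This is a
simplified illustration, not a conformant segmentation algorithm: the
function names are invented here, and "combining mark" is approximated as
any character whose General Category starts with "M". The second function
handles the hypothetical marks-first ordering that Unicode did not adopt,
included only for comparison.

```python
import unicodedata

def is_mark(ch):
    """Approximate test for a combining mark (categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def marks_after_sequences(codepoints):
    """Unicode order: a baseform followed by zero or more marks."""
    buf = ""
    for ch in codepoints:
        if buf and not is_mark(ch):
            # A new baseform arrived, so the buffered sequence is complete.
            yield buf
            buf = ""
        buf += ch
    if buf:
        yield buf  # end of stream flushes the last sequence

def marks_first_sequences(codepoints):
    """Hypothetical order: zero or more marks followed by a baseform."""
    buf = ""
    for ch in codepoints:
        buf += ch
        if not is_mark(ch):
            # The baseform closes the sequence.
            yield buf
            buf = ""
    if buf:
        yield buf  # dangling marks at end of stream

print(list(marks_after_sequences("e\u0301a")))   # two sequences
print(list(marks_first_sequences("\u0301ea")))   # two sequences
```

Both versions hold a buffer and are about the same size. The marks-after
version emits a sequence one character later (or at end of stream), which
is the lookahead the question referred to.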

So the architects of the Unicode Standard found the arguments for
combining marks following to be compelling, and the arguments for
combining marks preceding to be unconvincing and relatively easy
to program around.

--Ken


