Re: Umlaut and diaeresis

From: Kenneth Whistler
Date: Mon Jun 21 1999 - 15:04:31 EDT

Donald Figge continued this discussion:

> Because these two characters are unified, the composition software needs to
> be smart enough to know that a word can be divided between two vowels when
> one of them has a diaeresis mark, but not necessarily if the same mark is
> intended to serve as an umlaut.

By "composition software" here I am presuming that you mean software
which "composes" text, i.e. lays it out for visual rendering on some
display device. (Usually, we abstract that concept as "the rendering
process" when talking about Unicode.)

In any case, the line-breaking algorithm (cf. Unicode Technical Report #14)
determines line-breaking opportunities (and prevented line breaks) from
pairwise comparisons of the line-breaking properties of individual
characters. Once an implementation has done that, you still have the
problem of determining syllabification breaks inside words, which
otherwise consist of runs of characters that all have identical
line-breaking properties (e.g. strings of ordinary letters in words
written in the Latin script). The traditional way to accomplish that is with
a dictionary (perhaps enhanced with a bunch of language-specific
syllabification rules, depending on your implementation).
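
To make the two stages concrete, here is a rough sketch in Python. It is
not the UTR #14 algorithm: the pairwise pass is reduced to a single toy
rule ("a break is allowed after a space"), and the SYLLABLES table and
both function names are made up for illustration. The only point is that
the pairwise pass finds nothing inside a word, so word-internal breaks
have to come from a lexicon.

```python
# Hypothetical syllable dictionary; a real one would be large and
# language-specific.
SYLLABLES = {
    "syllable": ["syl", "la", "ble"],
    "dictionary": ["dic", "tion", "ar", "y"],
}

def pairwise_breaks(text):
    """Toy stand-in for a pairwise line-breaking pass (not UTR #14).

    Returns every offset i where a break is allowed between text[i-1]
    and text[i]; the single rule is "break after a space".
    """
    return [i for i in range(1, len(text))
            if text[i - 1] == " " and text[i] != " "]

def word_internal_breaks(text):
    """Return syllable-break offsets found by dictionary lookup."""
    breaks = []
    pos = 0  # offset of the current word within text
    for word in text.split(" "):
        parts = SYLLABLES.get(word.lower())
        if parts:
            offset = pos
            for part in parts[:-1]:  # no break after the last syllable
                offset += len(part)
                breaks.append(offset)
        pos += len(word) + 1  # +1 for the separating space
    return breaks
```

For the string "a syllable dictionary", pairwise_breaks gives [2, 11]
(the starts of the two long words) and word_internal_breaks gives
[5, 7, 14, 18, 20]; the second list is invisible to the first pass.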

So if you want to find out that there is a line-break opportunity before
the diaereses in naïve or coöp, you do that by checking the entry for
those words in a dictionary. It is particularly important to do that
via lookup, and not via character encoding, in English, since diaereses
are optional, "alien" markup of the orthography, and are often omitted.
In French they are more consistent and required -- but still lexical
in nature.
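
A minimal sketch of such a lookup, assuming a hypothetical HYPHENATION
table keyed on the spelling without diacritics (the table contents and
the function names are illustrative, not from any real hyphenation
package). Because the combining marks are stripped before the lookup,
the answer is the same whether or not the writer bothered to type the
diaeresis.

```python
import unicodedata

# Hypothetical table: key is the word without combining marks, value is
# the list of letter offsets at which a break is allowed.
HYPHENATION = {
    "naive": [2],  # na-ive
    "coop": [2],   # co-op
}

def strip_marks(word):
    """Decompose to NFD and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

def break_points(word):
    """Look up break offsets, ignoring any diacritics in the input."""
    return HYPHENATION.get(strip_marks(word).lower(), [])
```

Here break_points returns [2] for both "naive" and "naïve", and an empty
list for any word missing from the table. Nothing in the character
encoding had to distinguish a diaeresis from an umlaut to get there.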

The reasons not to disunify umlaut and diaeresis still stand as Asmus
and others have stated them. They are not outweighed by the need
for finding syllabification points, where the difference between a
diaeresis and an umlaut is drowned in a much larger lexical sea that
is handled by dictionary and language-specific rules.

> The argument that alphabetic characters are pronounced differently in
> various languages but still have the same code point misses the point of my
> original question which is why unification when the umlaut and diaeresis
> have different basic functionalities.

I don't think it missed the point, really. Go back and look further at
the documentation for the combining marks in the Unicode Standard.
The design point is to unify the combining marks on form, not on function.
As others have pointed out, the functions of the combining marks, especially
for the Latin script, are so varied as to render functional unification
(or separation) hopeless. The acute mark may, in one language, indicate
a stress, in another length of a vowel, in another a diacritic modification
of the pronunciation of a vowel or consonant, in another a tone mark.
Each such difference may have implications for how software that
handles the text elements in question operates.
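
Unification on form is easy to see by decomposing a few words. The
short sketch below uses Python's standard unicodedata module; the
function name combining_marks is just a label chosen for this example.

```python
import unicodedata

def combining_marks(word):
    """List the combining marks in a word after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch))
            for ch in decomposed
            if unicodedata.combining(ch)]
```

Called on German "Bär" (umlaut) and on English "naïve" (diaeresis), it
returns the same one-element list in both cases:
["U+0308 COMBINING DIAERESIS"]. The functional difference between the
two marks is not recorded anywhere in the encoded text.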

> The illustration of the period and the decimal point being unified does not
> make a good argument for unification, because in fine typesetting, the
> decimal point is sometimes (depending on the design of the type) slightly
> smaller than a period, and sometimes it occupies a different tabular width.

The encoding decision is made not on the basis of fine typesetting, but
on the basis of distinctions required for plain text (all other things
being equal -- the universal hedge. Hehe.) The fact that in fine typography
a decimal point may be identified distinctly and given a separate appearance
cannot make up for the fact that if it were separately encoded, it would
result in massive confusion in plain text. The charter of the Unicode
Standard is to do the best possible job of encoding the plain text
backbone of fancy text. It is then the job of higher-level processes to
go about the task of making and rendering finer distinctions that need
not be made just to carry the basic information. Not all text processes
should be burdened with making *every* distinction that fine typography
in *every* script requires.

--Ken Whistler

> I am not arguing for a change in the encoding scheme. I am just attempting
> to become enlightened.
> Donald Figge
> //
