Re: Character identities

From: Jim Allan (jallan@smrtytrek.com)
Date: Thu Oct 31 2002 - 22:49:02 EST


In Unicode, code point U+0308 is assigned to COMBINING DIAERESIS. There
are a number of precomposed forms with diaeresis.
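
As a small illustration of what "precomposed" means here, the sketch
below uses Python and its standard unicodedata module (my choice of
tool for illustration, not something this thread depends on). It shows
that precomposed ü is canonically equivalent to u followed by U+0308:

```python
import unicodedata

precomposed = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# NFD splits the precomposed letter into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
names = [unicodedata.name(ch) for ch in decomposed]

print(code_points)  # ['U+0075', 'U+0308']
print(names)        # ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']

# NFC recombines the pair into the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
```

Nothing in either form records which of the meanings listed below the
mark is supposed to carry.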

Let's take one of these, ü:

    * The diaeresis may mean separate pronunciation of the u, indicating
      it is not merged with a preceding or following letter but is
      pronounced distinctly, as in the classical Greek name Peirithoüs
      or Spanish antigüedad. Similarly in Catalan. It is identified with
      the Greek dialytika of the same meaning, which is indeed the
      ultimate known origin of the symbol.

    * The diaeresis indicates umlaut modification of u, as in German
      über, a use also found in Finnish, Turkish, Pinyin Chinese
      Romanization and in many other languages.

    * In Magyar it indicates a sound like French u.

    * In IPA it indicates u with a centralized pronunciation.

There may be other phonetic interpretations.

Of these uses, only for the second (and possibly the third) might
combining superscript e be used instead of the diaeresis. The second
certainly represents the most common use of ü today, but not the
only one.

Unicode encodes the character COMBINING DIAERESIS, not a generic UMLAUT
MARKER which might take various forms. By itself it provides no way of
distinguishing between uses of diaeresis.

All the above uses might occur in German text, or Swedish text, or
Finnish text or any text which might introduce personal names or
geographical names or particular words or phrases from various languages
outside the main language of the text. The same applies for ä and ö.

Indeed, individual words with a vowel and an umlaut marker, whether
represented as COMBINING DIAERESIS, COMBINING LATIN SMALL LETTER E,
or a following e, may appear in text in any language because of the use
of technical vocabulary, e.g. Senhnsücht, or in personal or place names.

Now any use of diaeresis meaning umlaut in any language might, it seems
to me, be reasonably replaced by superscript e meaning umlaut. But it is
incorrect to replace diaeresis used for any other purpose by superscript e.

In straight, plain Unicode, if you want to use diaeresis for umlaut, use
diaeresis. If you want to use combining superscript e to indicate
umlaut, use COMBINING LATIN SMALL LETTER E. Leave any other
occurrences of diaeresis alone. This is the only possibility at the plain
text level, and the most robust way of choosing between diaeresis and
superscript e at any level.
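
To make the plain text choice concrete, here is a minimal sketch,
again assuming Python's standard unicodedata module and variable names
of my own invention. COMBINING LATIN SMALL LETTER E is U+0364. The two
spellings are different character sequences, and normalization does
not merge them:

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"
COMBINING_LATIN_SMALL_LETTER_E = "\u0364"

with_diaeresis = "u" + COMBINING_DIAERESIS
with_superscript_e = "u" + COMBINING_LATIN_SMALL_LETTER_E

# u + U+0308 has a precomposed form, so NFC yields the single character ü.
nfc_diaeresis = unicodedata.normalize("NFC", with_diaeresis)

# u + U+0364 has no precomposed form, so NFC leaves the sequence as it is.
nfc_superscript_e = unicodedata.normalize("NFC", with_superscript_e)

# The two spellings stay distinct plain text after normalization.
distinct = nfc_diaeresis != nfc_superscript_e
```

Whichever character is entered is therefore the one that any
conforming process will carry through unchanged.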

Given a higher protocol, we can do more. We might, as suggested, have a
font which uses superscript e instead of diaeresis, at least for the
combination characters with the base characters a, o, or u and in place
of the diaeresis symbol itself. If we have another generally identical
font with a true diaeresis instead, we can switch between fonts as
necessary depending on whether diaeresis is used for umlaut or not, or
whether in particular cases we wish to use one or the other symbol for
umlaut.

Switching between such alternate fonts has long been a standby when fancy
typography is required.

Yet I don't see that there is any advantage to switching between fonts
over switching between the Unicode characters COMBINING DIAERESIS and
COMBINING LATIN SMALL LETTER E. And it makes us dependent on a
particular set of fonts. That is probably not good. :-(

A better solution might be an intelligent font that recognizes some
kinds of tagging and which allows us to turn on different glyphs for
diaeresis according to the tagging, one of these glyphs being a
superscript e. So we tag words and phrases. And, magically, if that
particular font works properly, we see diaeresis where we want diaeresis
and superscript e where we want superscript e.

But it is not evident that tagging for this purpose is any easier than
entering the different Unicode characters from the beginning. And we are
again dependent on the intelligence of a particular font. Of course, we
might expect that there will soon be many such intelligent fonts. It is
less likely that they will all work exactly the same, and understand
exactly the same tags in the same way. And we are restricted to such
intelligent fonts as understand a particular system of tagging rather
than using almost any font. :-(

We might propose introducing a tag or indicator of some kind at some
level to indicate a diaeresis has umlaut function, but such a tag or
indicator would probably only be used when a user wanted to use a
superscript e, in which case it is not clear that using it would have
any advantage over actually entering COMBINING LATIN SMALL LETTER E. :-(

We might go to a still higher level of protocol, to a routine or plugin
in an application or a new style feature added to HTML or XML which
allows diaeresis replacement. Just as Microsoft Word and some other
programs now allow capitalization and small capitalization as an effect,
though the underlying text is still actually in upper and lower case, so
we might show a diaeresis as a superscript e, though in fact at the
plain text level the text has a diaeresis. Presumably for viewing and
printing the application would substitute Unicode COMBINING LATIN SMALL
LETTER E without actually changing the underlying text.
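
A rough sketch of such a display-time substitution follows. It is only
an illustration under stated assumptions: the function name
display_form and the choice of Python are mine, and the function
assumes that every diaeresis directly following a, o, or u in the text
it is given really is an umlaut. Something outside the function has to
guarantee that.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"
COMBINING_LATIN_SMALL_LETTER_E = "\u0364"
UMLAUT_BASES = set("aouAOU")

def display_form(text: str) -> str:
    """Return a display-only copy with superscript e over a, o, u.

    Hypothetical helper: it assumes each diaeresis directly following
    a, o, or u marks umlaut, which the caller must ensure.
    """
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for i, ch in enumerate(decomposed):
        follows_base = i > 0 and decomposed[i - 1] in UMLAUT_BASES
        if ch == COMBINING_DIAERESIS and follows_base:
            out.append(COMBINING_LATIN_SMALL_LETTER_E)
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))

stored = "\u00fcber"          # the underlying text keeps its diaeresis
shown = display_form(stored)  # what would be drawn: u + U+0364 + "ber"
```

Applied blindly to antigüedad, the same function would also put a
superscript e over the u, which is exactly the kind of mistake that
makes the higher protocol depend on knowing which diaeresis is which.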

We might eventually be able to translate between applications globally.

Yet ....

Is it not simpler and easier and far more robust that search engines
begin to recognize a weak equivalence between COMBINING LATIN SMALL
LETTER E and diaeresis, and that text processing applications,
particularly ones intended for use with German, allow easy
user-controlled interchange of diaeresis and superscript e at the
Unicode plain text level without particular font dependencies? :-)
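
One way a search engine might implement such a weak equivalence is to
fold both spellings to a common key before comparing. The sketch below
is an assumption about how that could be done, not a description of
any existing engine; search_key and weakly_equal are hypothetical
names. The fold goes only from superscript e to diaeresis, since a
superscript e always means umlaut while a diaeresis may not.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"
COMBINING_LATIN_SMALL_LETTER_E = "\u0364"

def search_key(text: str) -> str:
    """Fold superscript e to diaeresis so both spellings share one key."""
    decomposed = unicodedata.normalize("NFD", text)
    folded = decomposed.replace(
        COMBINING_LATIN_SMALL_LETTER_E, COMBINING_DIAERESIS
    )
    return unicodedata.normalize("NFC", folded)

def weakly_equal(a: str, b: str) -> bool:
    """True if the strings differ at most in diaeresis vs. superscript e."""
    return search_key(a) == search_key(b)

same_word = weakly_equal("u\u0364ber", "\u00fcber")  # both spell über
plain_u = weakly_equal("uber", "\u00fcber")          # no mark at all
```

The stored text is never altered; only the comparison key is folded,
so the distinction between the two characters survives in the document.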

The user might not even know the characters are represented by different
code points.

The diaeresis is less universally a version of a superscript letter e
than the cedilla is a version of the letter z, but one would probably
not want any normal font to replace ç with z topped by a superscript c.
The cedilla has long lost its unity with z.

Similarly one would not normally want a font to replace å by aa or th by
þ, or a font for French that replaces the circumflex accent with
COMBINING LATIN SMALL LETTER S, though such substitutions might also be
considered as stylistic from some points of view. A font is the wrong
level to make such substitutions robustly.

Again, should IPA symbols be replaced by the corresponding characters in
Americanist phonetic usage by a font? This could quite reasonably
be argued to be only a stylistic change. The characters mean the same,
after all.

But Unicode generally encodes characters not glyphs; and encodes
characters, not their meanings.

Jim Allan


