When is glyph decomposition warranted?

From: Dean A. Snyder (dean.snyder@jhu.edu)
Date: Sat Aug 28 1999 - 23:03:23 EDT


INTRODUCTION

I'm fairly new to the Unicode list, so forgive me if this topic has been
covered before, but I thought that Markus Kuhn raised such interesting
issues concerning glyph decomposition/precomposition that they call for a
new thread, and maybe some clarification within the community.

I would think that the topic of glyph decomposition would have a long
history in these circles and that there would be some sort of
Unicode-sanctioned guidelines for deciding such issues. Is that the case?

(Note: in my remarks below I am specifically leaving out of the discussion
the graphic, dingbat, etc. symbol areas in Unicode.)

PHILOSOPHY OF GLYPH DECOMPOSITION

In the "Normalization Form KC for Linux" thread Markus Kuhn wrote:

> My reasons for staying with precomposed characters in the Unix non-GUI
> environment for quite some time are:

...

> More philosophical (and therefore more fun to discuss :):
>
> - I also fail to see why a decomposed form should in any way
> be more natural. I see the decomposed form more as a technically necessary
> brief intermediate step for rendering fonts that provide font
> compression by storing commonly occurring glyph fractions (e.g., the base
> glyphs and accents, hooks, descenders, etc.) separately and combine
> them only on demand at rendering time. The choices made about which
> glyph components (and yes, we talk about glyphs and not characters
> here) deserve to become Unicode characters on their own right do not
> appear to be very systematic to me and seem to me to be more influenced
> by historic perception than by a clean technical analysis. I have to
> agree with the argument that there is no reason, why "ä" can be decomposed
> into a + ¨, but "i", "j", ";", and ":" can't be decomposed into a
> sequence with a dot-above combining character. After all, all of
> them exist also without the dot above, and many also with
> many other things above (iìíîï). Why isn't Q represented as an
> O with a lower-left stroke? Because all these precomposed characters
> have just stopped to be perceived as being composed by those who
> designed Unicode and its predecessors (ASCII, Baudot, Morse, etc.)
> Nevertheless, G is historically a C combined with a hook, Ws are
> two Vs (or Us) with a negative space in between, + is just a
> "not -" and therefore crossed out, $ = S + |, and @ is just an
> "a" in a circle. It would be just fair to decompose ASCII before
> you start treating the ä as a second-class citizen. :)

Markus' reason given here for sticking with precomposed glyphs in non-GUI
Unix was introduced as "More philosophical (and therefore more fun to
discuss)". I agree with the "fun" part. But I also see the topic as both
serious and crucial, since one's views here largely determine one's
methodology for deciding whether or not to decompose a given glyph.
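
For concreteness, here is a minimal sketch (in Python, using the standard
library's unicodedata module; an illustration, not a normative reference) of
the precomposed and decomposed spellings of "ä" that are at issue, and of the
canonical equivalence between them:

  import unicodedata

  precomposed = "\u00E4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
  decomposed = "a\u0308"   # "a" followed by U+0308 COMBINING DIAERESIS

  # The two spellings are canonically equivalent; normalization maps between them.
  print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
  print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True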

> The choices made about which glyph components ... deserve to become
> Unicode characters on their own right do not
> appear to be very systematic to me and seem to me to be more influenced
> by historic perception than by a clean technical analysis.

There seem to be both a categorical denigration of "historic perception" here
and an assumption that it is mutually exclusive with "clean technical
analysis", neither of which I understand.

In what way does a "clean technical analysis" of encoding issues necessarily
preclude "historic perceptions"? What if the linguistic history embodied in
a decomposed glyph is precisely what one finds useful, both for cultural and
computational reasons? How then would that not be just as "clean" and
"technical" an analysis as any other?

I could be wrong, but I believe that the examples given to illustrate this
"pollution" of the technical analysis actually illustrate a confusion between
a glyph's visual history and its linguistic history.

SIMPLE EXAMPLE

Markus wrote:
> Why isn't Q represented as an O with a lower-left stroke?

(Or, more provocatively, I would ask: why aren't certain capital letters
represented as minuscules writ large, that is, as minuscules accompanied by an
abstract "bigger" glyph (c/C)? Or why isn't Z represented as N plus an
abstract "rotate left" glyph?)

The example given above, Q and O, exhibits visual similarity alone with no
linguistic reality behind it.

In a similar vein, and one nearer to my interests, the hundreds of glyphs
making up Akkadian cuneiform are composed from a handful of basic elements,
such as:

   / o o o / and o---
   \ | / \ o

Are the syllabograms then candidates for glyph decomposition? I say no,
because the similarity across multiple syllabograms is purely visual and (to
simplify a bit) has little utility for Akkadian scholarship.

So the short answer as to why there is no decomposition in such examples is
that encoding engineers have not found these decompositions to be "useful".

The long answer, I suggest, is that, for human language oriented encodings,
encoding engineers have decided it is not useful, for cultural or
computational reasons, to decompose glyphs into separate elements based
solely upon visual similarity, but rather to decompose only those glyphs
having elements which are perceived as reflecting (sometimes historical)
linguistic realities.

(One can, of course, envision legitimate encoding schemes in which
non-linguistically related glyph decomposition could be useful, such as the
visual kind mentioned above, or an encoding based on the physical location
of keys on various keyboards, but that is another topic, for another
consortium!)

NOT SO SIMPLE EXAMPLES

Markus wrote:
> I have to agree with the argument that there is no reason, why "ä" can be
> decomposed into a + ¨, but "i", "j", ";", and ":" can't be decomposed ...

Here, I believe, we have a confusion between the linguistic and visual
commonality shared by a and ä, on the one hand, and the purely visual
commonality of the superior dot shared across i, j, :, and ;, on the other.

The interchange of a with ä in German, for example, has underpinnings in
linguistic reality. It represents in some instances a linguistically
predictable, and productive, phenomenon (Faden/Fäden, string/strings), and
in other instances, a non-predictable, phonemic alteration (Bar/Bär,
bar/bear). But in both cases we are dealing with phonetically related,
(almost) homorganic, vowels, one of which was apparently perceived
historically as a modification of the other - "the umlaut is added to the
a".

The decomposition of ä into a and ¨ (along with the related pairs in German,
o/ö and u/ü) can be useful for searching, sorting, and morphological
parsing, since the umlauted vowels can be algorithmically related to their
non-umlauted counterparts.
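
To make the computational utility concrete, here is a minimal sketch (again
Python's standard unicodedata module; the function name strip_marks is mine
and purely illustrative) of how canonical decomposition lets the umlauted
vowels be folded onto their base vowels for mark-insensitive searching or
sorting:

  import unicodedata

  def strip_marks(text):
      """Decompose to NFD, then drop combining marks (general category Mn)."""
      decomposed = unicodedata.normalize("NFD", text)
      return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

  print(strip_marks("Fäden"))                          # Faden
  print(strip_marks("Bär"))                            # Bar
  print(strip_marks("Fäden") == strip_marks("Faden"))  # True

A search or collation routine can then compare the stripped forms - exactly
the algorithmic relationship between ä and a described above.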

One would be hard pressed, however, to find a meaningful linguistic
commonality for the superior dot across "i", "j", ";", and ":". And there
are also troubles even within the more natural subgroupings.

"i", like "a", can occur with other superior marks (e.g., ìíîï) but never,
unlike "a", as far as I know, without SOME superior mark; "j", unlike "a",
has only one form. So the superior dot here carries no real information. One
could write, albeit "incorrectly", both the "i" and the "j" without the dot
and there would be no ambiguity. That cannot be said for "a". So this is
truly a purely visual phenomenon (but a vestige of some earlier linguistic
phenomenon?), and is not a candidate for decomposition.

";" is "stronger" punctuation than ",", while ":" is "weaker" punctuation
than "." - at least in American English. So what is the superior dot here
other than a visual adjunct? But even if the intuitive hierarchical
relationships held true in the language and time when these punctuation
marks originated (and my guess is that they would), what cultural or
computational utility would be effected by decomposition now? In other
words, how would one use the information provided by the decomposition? If
decomposition serves no purpose, it should not be done.
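
As a quick check (one more sketch with Python's unicodedata module, not a
normative statement about Unicode's design rationale), one can ask which of
these characters the Unicode Character Database actually assigns a canonical
decomposition:

  import unicodedata

  for ch in "äïij;:":
      name = unicodedata.name(ch)
      decomp = unicodedata.decomposition(ch) or "(none)"
      print("U+%04X %s: %s" % (ord(ch), name, decomp))

  # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS: 0061 0308
  # U+00EF LATIN SMALL LETTER I WITH DIAERESIS: 0069 0308
  # U+0069 LATIN SMALL LETTER I: (none)
  # U+006A LATIN SMALL LETTER J: (none)
  # U+003B SEMICOLON: (none)
  # U+003A COLON: (none)

The accented letters decompose; the dots that are merely part of how i, j, ;,
and : are drawn do not - consistent with the "linguistic reality" criterion
argued for here.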

COMPLICATIONS

There seems to be a continuum of analytical difficulty between those glyphic
elements that are purely visually related and those whose relationships are
grounded in linguistic realities. The easy cases are ones like those
mentioned so far, but Markus also raises a slightly more difficult example:

Markus wrote:
> G is historically a C combined with a hook

This, assuming its veracity, is a little trickier, since we apparently have
both visual and linguistic correspondences between C and G, much as we do
between a and ä, though here they are rather more obscure to the modern
reader. Does the hook act as a modifier carrying linguistic information
(frication/voicing)? Had you ever thought of this connection? (I hadn't.) If
you were aware of it, have you ever made use of it? Do you ever intend to
make use of it?

In such cases, therefore, the real question for the encoding engineer
becomes, "Does decomposition matter? Do we care?". In other words, is it
"useful", in a cultural or computational sense, to perpetuate the historical
relationship between C and G in our encoding scheme? Unlike the a/ä set,
the C/G set, and others like it, has little or no significance to users of
the script, I believe, and therefore should not be decomposed.

I consider "W", historically two "V"s, as merely a parallel example.

CONCLUSION

I tentatively suggest then, for a human language encoding scheme such as
Unicode (ignoring for the moment the graphic and dingbat symbol areas), that
glyph decomposition based upon purely visual criteria is, in general, not
useful, whereas glyph decomposition based upon linguistic criteria MAY be
useful. And the decision whether to decompose or not will be based both on
one's definition of "utility" and on the levels of meaningful discreteness
desired in the encoding.

Have these issues been dealt with formally and explicitly in the Unicode
documentation? If not, wouldn't it be a good idea to do so?

Respectfully,

Dean

P.S.

Just for fun!

> + is just a "not -" and therefore crossed out
Or is - really + with the vertical bar subtracted ("minused")?

> $ = S + |
Or is it an S with two vertical bars? If so how would that be decomposed? As
three slots?

> @ is just an "a" in a circle
Is it a zero, an "oh", or a circle?

How are we to define composite glyphs anyway? -
  historically (G = C + "," as an approximation of the hook)
  visually (V = \ + /)
  or both (W = V + V = \ + / + \ + /)?

Seriously, are these the true histories of these glyphs?

What about #, %, *, &, and =? Are they composites historically?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dean A. Snyder
Senior Information Technology Specialist
The Johns Hopkins University
Hopkins Information Technology Services
Research and Instructional Technologies
18 Garland Hall, 3400 N. Charles St.
Baltimore, Maryland (MD), USA 21218

Office: 410 516-6021
Mobile: 410 961-8943
Fax: 410 516-5508
Email: dean.snyder@jhu.edu
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


