Re: Mixed up priorities

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Oct 25 1999 - 16:02:29 EDT


Keld stated:

>
> > > A letter is an element of an alphabet, which itself is a structured
> > > collection of graphic symbols used to represent one or more languages,
> > > having specific elements representing vowels and consonants.
> >
> > That definition sounds like it covers Ethiopic (though not Cherokee
> > or kata/hiragana).
>
> A letter should also be a character, IMHO.
> Characters encode meaning, and letters are one of the ways
> to express meaning.

I guess Michael's !Xóõ example was water off a duck's back, then.

Many languages have digraphs as elements of their alphabets. In
the sense that Michael proposed for "letter" above, those digraphs
are letters. Some languages of Europe have trigraphs (Hungarian) or even
quadragraphs (Polish) as letters in this sense. The !Xóõ example
shows that that is not the limit: "dts'kx'" is a unit in the phonology,
and given the orthography and lexical practice that Michael quoted,
"dts'kx'" is thus also a letter of this alphabet -- a *heptagraph*.
There are hundreds of languages in the Americas, Oceania, Asia,
and Africa that are written in Latin orthographies using multigraphs
for "letters" of their alphabets. The list of such multigraphs would
run into the thousands and would take years to compile and verify.

The reasonable alternative, long ago decided upon, is to treat these
multigraphs as already encoded in the UCS by their component
parts. Language-specific applications that must treat a particular
digraph (or multigraph) as a unit for some process (such as searching,
sorting, or boundary identification) can then do so appropriately.
Standards such as ISO/IEC 14651 (International String Ordering) are under
development to assist in the definition of mechanisms for implementing
such appropriate behavior in particular, language-specific contexts.
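
As a rough illustration of that division of labor, here is a minimal
Python sketch in which the text stays encoded as ordinary component
characters and a language-specific layer groups them into "letters"
for boundary identification and sorting. Everything named here is
hypothetical: the MULTIGRAPHS table is deliberately incomplete, the
Czech ordering ignores diacritics, and none of this is the mechanism
defined by 14651 or by any existing library.

```python
# Hypothetical, incomplete multigraph tables, for illustration only.
MULTIGRAPHS = {
    "cs": ["ch"],
    "hu": ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"],
}


def letters(word, lang):
    """Split a word into alphabet 'letters' by greedy longest match."""
    # Try longer multigraphs first so that "dzs" wins over "dz".
    units = sorted(MULTIGRAPHS.get(lang, []), key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for unit in units:
            chunk = word[i:i + len(unit)]
            if chunk.lower() == unit:
                out.append(chunk)
                i += len(unit)
                break
        else:
            # No multigraph starts here: the single character is the letter.
            out.append(word[i])
            i += 1
    return out


# Simplified Czech alphabet order (assumption: diacritics ignored),
# with the digraph "ch" placed after "h".
CZECH_ORDER = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {unit: index for index, unit in enumerate(CZECH_ORDER)}


def czech_key(word):
    """Sort key that ranks each 'letter', treating "ch" as one unit."""
    # Letters missing from the simplified table sort after all known ones.
    return [RANK.get(u.lower(), len(RANK) + ord(u[0]))
            for u in letters(word, "cs")]


words = ["chata", "cihla", "hrad", "duha"]
print(letters("chata", "cs"))        # ['ch', 'a', 't', 'a']
print(letters("dzsem", "hu"))        # ['dzs', 'e', 'm']
print(sorted(words, key=czech_key))  # ['cihla', 'duha', 'hrad', 'chata']
```

The encoded text never changes; only the table chosen by the
application does. Greedy matching is itself a simplification, since it
can mis-segment a word where adjacent letters happen to spell a
multigraph across a morpheme boundary, which is one more reason such
knowledge belongs in the language-specific layer and not in the
character encoding.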

The alternative is to import the language-specific orthographic complexity
of multigraphs directly into the character encoding -- by encoding
"ch" for Slovak, Czech, and Spanish, "dts'kx'" for !Xóõ, and thousands
of others from an unspecified and never-ending list. That would vastly
complicate text processing for *all* applications.

Elementary engineering design dictates which is the appropriate direction
to take.

Just because the Unicode Standard is the *universal* character encoding
does not mean that it should be turned into the universal catalog
of all "things" relevant to the display, rendering, interpretation,
or meaning of text. It is *not* intended as the universal glyph catalog.
It is also *not* intended as the universal letter catalog (i.e. the
listing of all "things" which have the status of an element in some
alphabet for some orthography for some language somewhere).

Just because the Unicode Standard is the only "universal" encoding
game in town at the moment is not sufficient reason to keep pushing
these inappropriate types of cataloging activities on the relevant
*character* encoding standards committees. The proponents of
a universal glyph catalog or a universal letter catalog should
develop their own cataloging efforts instead of pressing the
committees to encode inappropriate things in the universal character
encoding.

--Ken


