Re: Amerindian Characters

From: Peter_Constable@sil.org
Date: Thu Jun 17 1999 - 10:14:14 EDT


Paul:

>I thought that Unicode was supposed to be a character encoding, not a glyph
encoding.

Indeed, that is the case.

>I am a bit worried about the approach by many posters in this thread that every
time a _glyph_ contains a latin character with some extras, it should be encoded
as a basic latin _character_ (A-Z) and combining marks.

>Taking for example Finnish (and Swedish) characters (Å,) Ä and Ö (which
_glyphs_ resemble the latin characters A and O), but these characters are at the
end of the alphabet after Z. Neither is there a close relationship between
A and Ä as in German (e.g. das Amt, die Ämter). In Finnish the word with the
base character and the word with the diaeresis have completely different
meanings, e.g. saari (island) and sääri (leg). Fortunately both Swedish and
Finnish have been used in computing since the 7 bit character days and these
characters have been assigned separate code points. If this had not been the
case, I guess Unicode would have assigned these characters
base_latin_character+combining_mark combinations.

I think I understand the concerns you raise (though in my mail reader your
examples didn't survive). There are two ways to solve the problems involved
here. I think you're thinking of one, while Unicode assumes the other. Let me
elaborate:

Let's suppose that Swedish and Finnish are obscure languages that have never
been given much attention by software developers and that the character Ä
(A-umlaut) doesn't exist in Unicode. Now someone points out the fact that these
two languages need this character, and suggests that it needs to be included in
Unicode. The reason given for not using LATIN CAPITAL LETTER A followed by
COMBINING DIAERESIS is that the Swedish/Finnish letter is not simply a modified
A. For example, the sort order for these languages begins with A and ends with
A-umlaut.

It would be possible to deal with the collation issue by adding A-umlaut to
Unicode, making sure it comes after Z. But then tomorrow, suppose someone else
says, "Oh, but language X also has an A-umlaut, but it sorts before the Z. So we
need another, and it has to go into Unicode before the Z." Well, obviously, this
is impossible, and we all know that it's not necessary. Nobody assumes that sort
keys must correspond to the character code values.

At this point, you're probably saying that this is too obvious to bother
mentioning. But it's just one more step to say that sort keys don't have to be
based directly on the individual encoded characters without any reanalysis;
e.g., a Spanish sorting algorithm can recognize the sequence "ch" as a single
element for sorting purposes, and situate that element in the sorting order
wherever needed. And if it's possible to do that for "ch", then it's possible to
do the same for LATIN CAPITAL LETTER A + COMBINING DIAERESIS.
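
To make the sorting argument concrete, here is a minimal sketch in Python. It
is not how any particular product does collation; the function name
make_sort_key and the three simplified alphabets are hypothetical, made up for
illustration only. It builds sort keys from an ordered table of sorting
elements, where an element can be a sequence such as "ch" or a + COMBINING
DIAERESIS, so the same encoded sequence can sit at different places in
different languages' orders.

```python
import unicodedata

def make_sort_key(elements):
    """Return a key function for an ordered list of sorting elements.

    Each element is a string of one or more code points; its position in
    the list is its rank in the sort order. (Hypothetical helper.)
    """
    rank = {element: i for i, element in enumerate(elements)}
    longest = max(len(element) for element in elements)

    def key(word):
        # Decompose first, so precomposed and decomposed input look the same.
        text = unicodedata.normalize("NFD", word.lower())
        ranks = []
        i = 0
        while i < len(text):
            # Prefer the longest element that matches at this position.
            for n in range(longest, 0, -1):
                chunk = text[i:i + n]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                # Anything not in the table sorts after it, by code point.
                ranks.append(len(rank) + ord(text[i]))
                i += 1
        return ranks

    return key

LATIN = "abcdefghijklmnopqrstuvwxyz"
A_UMLAUT = "a\u0308"  # a + COMBINING DIAERESIS

# Simplified, assumed alphabets: only the elements discussed above are added.
spanish_key = make_sort_key(list("abc") + ["ch"] + list(LATIN[3:]))
finnish_key = make_sort_key(list(LATIN) + [A_UMLAUT])               # after z
language_x_key = make_sort_key(["a", A_UMLAUT] + list(LATIN[1:]))   # after a

print(sorted(["cuna", "chico", "cosa"], key=spanish_key))
print(sorted(["zon", "\u00e4ng", "bad"], key=finnish_key))
print(sorted(["zon", "\u00e4ng", "bad"], key=language_x_key))
```

The word lists are the same in the last two calls; only the table differs.
Nothing in the encoding has to change to move A-umlaut from the end of the
order to just after A.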

The point is, the problems can be solved by adding A-umlaut to the standard, but
they can also be solved without adding A-umlaut to the standard. There is no
process that *requires* A-umlaut to be encoded as a separate, precomposed
character rather than as a decomposed sequence.
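
As a small illustration of treating the two encodings as the same thing, the
sketch below uses the normalization functions in Python's standard
unicodedata module. The helper name canonically_equal is my own, not part of
any library, and this is only one way a process could make the comparison.

```python
import unicodedata

precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

def canonically_equal(a, b):
    """Compare two strings after decomposing both (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The raw code point sequences differ ...
print(precomposed == decomposed)
# ... but a process that decomposes before comparing sees no difference.
print(canonically_equal(precomposed, decomposed))
# Converting in the other direction is equally mechanical.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```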

This has nothing to do with thinking in terms of glyphs rather than
characters. Rather, it isn't necessary for every "ortheme" (i.e. a unit
within an orthography) from every language to be encoded as a separate character
in Unicode in order to do whatever processing is needed. All that's necessary is
that every ortheme from a given orthography have a unique encoding relative to
the other orthemes in that one orthography. There's no reason why each ortheme
must be encoded as a single encoded character rather than as a sequence of
characters.

Part of the confusion, I think, has to do with the use of the term "character".
It means one thing when we are talking of languages and orthographies (=
"orthemes" here). It means something rather different when we're talking about
an encoding standard. Nobody in this thread has been talking about glyphs;
they've been discussing characters, but of two types.

Since it isn't necessary to include precomposed characters in the standard
(except for round-trip convertibility), the preference is to avoid adding
them so that the standard doesn't become bloated with them and so that work on
the standard doesn't get completely bogged down with processing proposals to add
them (see Ken Whistler's message). This isn't done with any intent to give
preference to some languages over others, or with any lack of concern to
understand the needs of the orthography of any language. Rather, it's based on
enough experience to know that a lot of needs that are raised have already been
covered adequately. If there is a real need that isn't adequately met, I know
that the Unicode and ISO folks will be interested in considering what's required
to meet that need. It's a rare case, though, in which precomposed characters are
actually needed.

Peter


