> At 12:08 16.6.1999 -0700, Peter_Constable@sil.org wrote:
> >There really is no reason why pre-composed combination characters are
> >needed, and pretty good reasons need to be provided before the Unicode
> >and ISO committees will seriously consider adding new pre-composed
> >characters. People will sometimes appeal to the fact that other
> >pre-composed combinations have already been added to the standard. In
> >most such cases, however, very strong reasons were given: that the
> >pre-composed character already existed in an existing international
> >encoding standard (e.g. ISO 8859-1). In order to provide round-trip
> >convertibility, what was before must live on. Had it not been for
> >pre-existing standards, all of these characters may not need to have been
> I thought that Unicode was supposed to be a character
> encoding, not a glyph encoding.
True, except for dingbats (and OCR 'dingbats').
> I am a bit worried about the approach by many posters in this
> thread that
> every time a _glyph_ contains a latin character with some
> extras, it should
> be encoded as a basic latin _character_ (A-Z) and combining marks.
> Take for example the Finnish (and Swedish) characters (Å,) Ä
> and Ö (whose _glyphs_ resemble the Latin characters A and O); these
In good typography, what is beneath the (apparent) diacritic here
looks exactly like the A and O glyphs, respectively, of the same
font. But that is not a requirement of Unicode.
> characters are at
> the end of the alphabet after Z. Nor is there a close relationship
> between A and Ä as in German (e.g. das Amt, die Ämter). In Finnish, a
> word with the base character and the word with the diaeresis can have
> completely different meanings, e.g. saari (island) and sääri (leg).
True (also for Swedish), but neither of these two arguments calls for
separate encoding of 'precomposed' characters. And we are likely to
get these graphemes in decomposed form in many applications/systems.
If properly implemented, which might take a while, you would not notice
the difference without essentially looking at the binary codes. They would
sort the same (for Swedish and Finnish at the end of the alphabet), they
would look the same (most likely exactly the same glyph(s) would be
used), and the handling in a UI would be the same (e.g. you select
graphemes, rather than the individual characters that comprise them).
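The point that the two encodings "would sort the same" and "would look the same" rests on canonical equivalence, which can be sketched in Python with the standard `unicodedata` module (a minimal illustration added for this archive, not part of the original mail; the Finnish word sääri is used as the example):

```python
import unicodedata

pre = "s\u00e4\u00e4ri"    # "sääri" with precomposed ä (U+00E4)
dec = "sa\u0308a\u0308ri"  # "sääri" as a + combining diaeresis (U+0308)

# Binary comparison sees two different code point sequences...
print(pre == dec)                                # False

# ...but the forms are canonically equivalent: normalizing
# either one to the other's form makes them identical.
print(unicodedata.normalize("NFC", dec) == pre)  # True
print(unicodedata.normalize("NFD", pre) == dec)  # True
```

So an implementation that normalizes before comparing (or sorting, or rendering) never exposes the difference to the user, which is exactly the behaviour described above.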
> Fortunately both
> Swedish and Finnish have been used in computing since the 7
> bit character
> days and these characters have been assigned separate code
> points. If this
> had not been the case, I guess Unicode would have assigned
> these characters
> base_latin_character+combining_mark combinations.
Yes, and it does (too). These (example) precomposed characters
(ÅÄÖ) are canonically equivalent to their decompositions. To get this
to work the way you expect with the decomposed form is an
implementation problem, not a problem with Unicode itself.
Please see Unicode Technical Report #15 (Normalization) and
Unicode Technical Report #10 (Collation; Ordering), as well as
The Unicode (2.0) book, chapter 5.
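How normalization and tailored ordering combine can be sketched as a toy in Python (this is a deliberately simplified illustration, not the UTR #10 algorithm; `ORDER` and `sort_key` are names invented here, and only the Swedish/Finnish primary ordering with å, ä, ö after z is modelled):

```python
import unicodedata

# Toy Swedish/Finnish alphabet order: å, ä, ö sort after z.
ORDER = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def sort_key(s):
    # Normalize first, so precomposed and decomposed input
    # produce the same key; characters outside ORDER are ignored.
    s = unicodedata.normalize("NFC", s).lower()
    return [ORDER.index(c) for c in s if c in ORDER]

words = ["s\u00e4\u00e4ri", "saari", "sa\u0308a\u0308ri"]
# "saari" sorts first; both encodings of "sääri" get identical keys.
print(sorted(words, key=sort_key))
```

A real implementation would use the Unicode Collation Algorithm (UTR #10) with a Swedish or Finnish tailoring, but the structure is the same: normalize, then compare by collation weights rather than code points.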
Aside: if you map a Unicode string to (e.g.) Latin-1 (or Shift-JIS, or ...),
the decomposed forms should map to the corresponding precomposed
character in Latin-1 (or Shift-JIS, or ..., as appropriate).
> I do not know the Navajo language, and I would be _positively_
> surprised if all those people defending the Unicode base character +
> combining marks approach were fluent enough in Navajo to tell the
> difference between a separate character and an add-on to the glyph.
> Paul Keinänen
(fluent in Swedish...)
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:47 EDT