From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu May 24 2007 - 09:32:56 CDT
On Thu, 24 May 2007, Otto Stolz wrote:
> The ways you have outlined are "canonically equivalent", so your
> software should treat all of them as equivalent.
Not necessarily. Canonical equivalence is not the same as identity. A 
program may treat canonically equivalent sequences as equivalent, or as 
different. Generally, a program should not expect _other_ programs to 
treat canonically equivalent sequences as different from each other.
For example, a program that renders characters in visible form often uses 
simple technologies for combining marks, possibly in a way that causes 
differences between visual appearances of canonically equivalent 
sequences. In particular, a program often uses a particular glyph for a 
precomposed character but handles a decomposed form by displaying the base 
character and positioning the diacritic somehow (generally with poorer 
results than the precomposed glyph).
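The difference between identity and canonical equivalence can also be seen at the code point level. The following is only a minimal sketch in Python using the standard unicodedata module; the helper name canonically_equivalent is made up for this illustration.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u1E6D"   # t with dot below, a single code point
decomposed = "t\u0323"   # t followed by combining dot below

# Plain comparison looks at code points, so the two strings are not identical.
identical = precomposed == decomposed

# Comparison after normalization treats them as equivalent.
equivalent = canonically_equivalent(precomposed, decomposed)
```

A program that compares strings the first way treats the two forms as different; one that compares them the second way treats them as equivalent. Both behaviours are possible, which is why the choice needs to be made deliberately.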
> Multiple diacritics are placed outward from the base character,
> so you have to enter
> - either the base letter, the diacritic which is closer, the diacritic
>  which is farther from the base letter, in that order,
> - or the base letter with the closer diacritic as one character,
>  then the farther diacritic as the combining character.
In practice, the combining classes of the diacritics imply that in 
canonical decomposition, the dot below comes before the dot above, so this 
can be regarded as the natural order. But it's not specifically designed 
as the right order. In particular, an application may (and possibly 
should) accept, on input, any of the representations for t with dot 
below and above and possibly map all of them to a single canonical 
representation, using one of the Unicode normalization forms.
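The role of the combining classes can be checked with a short sketch, again assuming Python and its unicodedata module (the variable names are just for illustration). It looks up the canonical combining class of each dot and decomposes four different ways of writing t with dot below and dot above.

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE

# Canonical combining classes; marks with a lower class sort first
# when a decomposed sequence is put into canonical order.
ccc_below = unicodedata.combining(DOT_BELOW)
ccc_above = unicodedata.combining(DOT_ABOVE)

# Four canonically equivalent spellings of t with dot below and dot above.
inputs = [
    "t\u0323\u0307",    # base, dot below, dot above
    "t\u0307\u0323",    # base, dot above, dot below
    "\u1E6D\u0307",     # t with dot below + combining dot above
    "\u1E6B\u0323",     # t with dot above + combining dot below
]

# After canonical decomposition (NFD) they should all collapse to one form.
nfd_forms = {unicodedata.normalize("NFD", s) for s in inputs}
```

Since the dot below has the lower combining class, the single resulting form has the dot below first, whatever order was typed.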
It seems natural to use the form
b) letter t/s with dot below (U+1E6D/U+1E63) + combining dot above (U+0307)
as the canonical format, since it is what we get by using Unicode 
Normalization Form C and it also corresponds to the nature of the 
transliteration system: the dot below is more or less an integral part of 
the construct, whereas the dot above, if used, is an auxiliary sign. 
Moreover, if there is a difference in rendering, this probably gives a 
better result than the fully decomposed form.
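What Normalization Form C produces can be inspected with a sketch like the following, assuming Python's unicodedata module; the function name to_nfc_codepoints is hypothetical. Only the t case is shown.

```python
import unicodedata

def to_nfc_codepoints(s: str) -> list[str]:
    """Normalize s to NFC and list the resulting code points as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]

# Input typed in the "unnatural" order: base, dot above, dot below.
result = to_nfc_codepoints("t\u0307\u0323")
```

Here result should list U+1E6D followed by U+0307. The s case is worth testing separately before relying on it: Unicode also encodes a precomposed s with dot below and dot above (U+1E69), so NFC may well return that single character instead of U+1E63 + U+0307.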
But as regards dealing with input, it is up to the application and its 
design goals whether the other forms should be accepted (and 
canonicalized) or not. This partly depends on the input methods used.
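The two possible policies for input could be sketched roughly as follows. This is an assumption about how an application might do it, not a recommendation; the function accept_input and its lenient parameter are invented for the example, and NFC is assumed as the canonical format.

```python
import unicodedata

def accept_input(text: str, lenient: bool = True) -> str:
    """Return text in NFC.

    With lenient=True, any canonically equivalent form is accepted and
    canonicalized. With lenient=False, input that is not already in NFC
    is rejected with ValueError.
    """
    canonical = unicodedata.normalize("NFC", text)
    if text != canonical and not lenient:
        raise ValueError("input is not in Normalization Form C")
    return canonical

# Lenient policy: a differently ordered sequence is mapped to the canonical form.
stored = accept_input("t\u0307\u0323")
```

Which policy fits depends on where the input comes from: a keyboard layout that can only produce one form makes the strict variant harmless, while pasted text can arrive in any form.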
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/