From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu May 24 2007 - 09:32:56 CDT
On Thu, 24 May 2007, Otto Stolz wrote:
> The ways you have outlined are "canonically equivalent", so your
> software should treat all of them as equivalent.
Not necessarily. Canonical equivalence is not the same as identity. A
program may treat canonically equivalent sequences as equivalent, or as
different. Generally, a program should not expect _other_ programs to
treat canonically equivalent sequences as different from each other.
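The distinction can be illustrated with a minimal sketch in Python, using only the standard unicodedata module. The variable names are illustrative and not taken from any particular application; the example just contrasts a plain code point comparison with a comparison after normalization.

```python
import unicodedata

# Two canonically equivalent spellings of "t with dot below".
precomposed = "\u1e6d"   # LATIN SMALL LETTER T WITH DOT BELOW
decomposed = "t\u0323"   # t + COMBINING DOT BELOW

# A plain code point comparison treats them as different strings.
identical = precomposed == decomposed

# Comparing after normalization treats them as equivalent.
equivalent = (unicodedata.normalize("NFD", precomposed)
              == unicodedata.normalize("NFD", decomposed))

print(identical, equivalent)  # False True
```

Either behaviour can be legitimate; which comparison a program uses is a design decision.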
For example, a program that renders characters in visible form often uses
simple technologies for combining marks, possibly in a way that causes
differences between visual appearances of canonically equivalent
sequences. In particular, a program often uses a particular glyph for a
precomposed character but handles a decomposed form by displaying the base
character and positioning the diacritic somehow (generally with poorer
results than the precomposed glyph).
> Multiple diacritics are placed outward from the base character,
> so you have to enter
> - either the base letter, the diacritic which is closer, the diacritic
> which is farther from the base letter, in that order,
> - or the base letter with the closer diacritic as one character,
> then the farther diacritic as the combining character.
In practice, the combining classes of the diacritics imply that in
canonical decomposition, the dot below comes before the dot above, so this
can be regarded as the natural order. But it's not specifically designed
as the right order. In particular, an application may (and possibly
should) accept, on input, any of the representations for t with dot
below and above and possibly map all of them to a single canonical
representation, using one of the Unicode normalization forms.
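As a rough sketch of how the combining classes determine the order, the following Python snippet (standard unicodedata module; the constant and variable names are only illustrative) looks up the classes of the two marks and shows that canonical decomposition brings both typing orders to the same sequence.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

# Canonical combining classes: a mark with a lower class sorts first.
ccc_below = unicodedata.combining(DOT_BELOW)  # 220
ccc_above = unicodedata.combining(DOT_ABOVE)  # 230

# The same letter entered with the marks in either order.
typed_above_first = "t" + DOT_ABOVE + DOT_BELOW
typed_below_first = "t" + DOT_BELOW + DOT_ABOVE

# NFD reorders the marks by combining class, so both inputs converge
# on t + dot below + dot above.
nfd_a = unicodedata.normalize("NFD", typed_above_first)
nfd_b = unicodedata.normalize("NFD", typed_below_first)

print(nfd_a == nfd_b)  # True
```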
It seems natural to use the form
b) letter t/s with dot below (U+1E6D/U+1E63) + combining dot above (U+0307)
as the canonical format, since it is what we get for t by using Unicode
Normalization Form C (for s, NFC composes further, to the single character
U+1E69) and it also corresponds to the nature of the
transliteration system: the dot below is more or less an integral part of
the construct, whereas the dot above, if used, is an auxiliary sign.
Moreover, if there is a difference in rendering, this probably gives a
better result than the fully decomposed form.
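A small Python check of what Normalization Form C produces may be useful here. This is only a sketch using the standard unicodedata module, and the list of input variants is an assumption about what a user might plausibly enter. Note that for s, Unicode also encodes a fully precomposed character (U+1E69, s with dot below and dot above), so NFC composes one step further than it does for t.

```python
import unicodedata

# Different ways of entering "t with dot below and dot above".
variants = [
    "t\u0323\u0307",   # base letter + dot below + dot above
    "t\u0307\u0323",   # base letter + dot above + dot below
    "\u1e6d\u0307",    # t with dot below + combining dot above
    "\u1e6b\u0323",    # t with dot above + combining dot below
]

# All variants normalize to the same NFC string: U+1E6D U+0307.
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants}

# For s, NFC yields the single precomposed character U+1E69.
s_nfc = unicodedata.normalize("NFC", "s\u0323\u0307")

print([f"U+{ord(c):04X}" for c in next(iter(nfc_forms))])
print(f"U+{ord(s_nfc):04X}")
```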
But as regards dealing with input, it is up to the application and its
design goals whether the other forms should be accepted (and
canonicalized) or not. This partly depends on the input methods used.
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/