Re: various ways of making a specific character

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu May 24 2007 - 09:32:56 CDT


    On Thu, 24 May 2007, Otto Stolz wrote:

    > The ways you have outlined are "canonically equivalent", so your
    > software should treat all of them as equivalent.

    Not necessarily. Canonical equivalence is not the same as identity. A
    program may treat canonically equivalent sequences as equivalent, or as
    different. Generally, a program should not expect _other_ programs to
    treat canonically equivalent sequences as different from each other.
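
    To make the distinction concrete, here is a minimal sketch in Python. The
    language choice and the helper name canonically_equivalent are assumptions
    made for illustration, not something from this thread. It relies only on
    the standard unicodedata module: a plain string comparison sees two
    canonically equivalent sequences as different, whereas a comparison after
    normalization sees them as equal.

```python
import unicodedata

# Two canonically equivalent spellings of "t with dot below and dot above".
precomposed = "\u1E6D\u0307"   # U+1E6D (t with dot below) + combining dot above
decomposed = "t\u0323\u0307"   # t + combining dot below + combining dot above


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after full canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Code point by code point, the strings are not identical...
print(precomposed == decomposed)
# ...but they are canonically equivalent.
print(canonically_equivalent(precomposed, decomposed))
```

    Whether a program compares raw code points or normalized forms is its own
    design decision; the sketch only shows that the two comparisons can
    disagree.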

    For example, a program that renders characters in visible form often uses
    simple technologies for combining marks, possibly in a way that causes
    differences between visual appearances of canonically equivalent
    sequences. In particular, a program often uses a particular glyph for a
    precomposed character but handles a decomposed form by displaying the base
    character and positioning the diacritic somehow (generally with poorer
    results than the precomposed glyph).

    > Multiple diacritics are placed outward from the base character,
    > so you have to enter
    > - either the base letter, the diacritic which is closer, the diacritic
    > which is farther from the base letter, in that order,
    > - or the base letter with the closer diacritic as one character,
    > then the farther diacritic as the combining character.

    In practice, the combining classes of the diacritics imply that in
    canonical decomposition, the dot below comes before the dot above, so this
    can be regarded as the natural order. But it's not specifically designed
    as the right order. In particular, an application may (and possibly
    should) accept, on input, any of the representations for t with dot
    below and above and possibly map all of them to a single canonical
    representation, using one of the Unicode normalization forms.
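
    The role of the combining classes can be checked with a short Python
    sketch, assuming the standard unicodedata module (the variable names are
    illustrative only). It reads the canonical combining class of each dot and
    shows that canonical decomposition puts the dot below first even when the
    input has the dots in the other order.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

# Canonical combining classes: marks with a lower class sort first.
class_below = unicodedata.combining(DOT_BELOW)
class_above = unicodedata.combining(DOT_ABOVE)
print(class_below, class_above)

# Input typed with the dot above first, then the dot below.
typed = "t" + DOT_ABOVE + DOT_BELOW

# NFD applies canonical ordering, so the dot below moves in front.
nfd = unicodedata.normalize("NFD", typed)
print([f"U+{ord(ch):04X}" for ch in nfd])
```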

    It seems natural to use the form
    b) letter t/s with dot below (U+1E6D/U+1E63) + combining dot above (U+0307)
    as the canonical format, since it is what we get by using Unicode
    Normalization Form C and it also corresponds to the nature of the
    transliteration system: the dot below is more or less an integral part of
    the construct, whereas the dot above, if used, is an auxiliary sign.
    Moreover, if there is a difference in rendering, this probably gives a
    better result than the fully decomposed form.
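
    As a rough sketch of canonicalizing on input, assuming Python and its
    standard unicodedata module (the function name canonicalize and the list
    of variants are hypothetical, chosen for illustration), the following maps
    several spellings of t with dot below and dot above to one NFC string.
    Note that for s the result from this module differs from the t case:
    Unicode has a fully precomposed s with dot below and dot above (U+1E69),
    and NFC produces that single character.

```python
import unicodedata


def canonicalize(text: str) -> str:
    """Hypothetical input filter: map text to Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Different ways a user or input method might produce the same t.
variants = [
    "t\u0323\u0307",   # t + dot below + dot above (fully decomposed)
    "t\u0307\u0323",   # t + dot above + dot below (marks in the other order)
    "\u1E6D\u0307",    # t with dot below + combining dot above
    "\u1E6B\u0323",    # t with dot above + combining dot below
]

canonical_forms = {canonicalize(v) for v in variants}
for form in canonical_forms:
    print([f"U+{ord(ch):04X}" for ch in form])

# The s case composes further, into one precomposed character.
s_form = canonicalize("s\u0323\u0307")
print([f"U+{ord(ch):04X}" for ch in s_form])
```

    An application that stores only the output of such a filter can then
    compare and search strings without caring which input form was used.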

    But as regards dealing with input, it is up to the application and its
    design goals whether the other forms should be accepted (and
    canonicalized) or not. This partly depends on the input methods used.

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

