Re: Normalization Form KC for Linux

From: peter_constable@sil.org
Date: Sun Aug 29 1999 - 01:16:20 EDT


>And if you work with linguistics, an ä cannot be decomposed
       when you work with Swedish, as it is a single letter. The dots
       above are not an
       accent or diacritic mark. So here is a case where you need to
       be able to represent what looks like the same glyph "an a with
       two dots above", both as one character and as an a with
       combining dots.

       When you are doing linguistic work, there are inevitably times
       when you need to treat sequences as a unit (e.g. ll or ch for
       Spanish); you may even need to treat discontiguous sequences as
       a unit (e.g. Thai sara ia). So even if Swedish a-umlaut
       (I'm guessing that's what you wrote - my mail reader is showing me
       o-tilde) must be treated as a unit for analysis purposes, it
       doesn't matter whether it is encoded as a unit or a sequence.
       You've got to be able to handle sequences in this manner
       anyway. This argument, therefore, does not follow through.
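
       To make that concrete, here is a minimal Python sketch of a
       segmenter that yields the same analysis units whether a-umlaut
       arrives precomposed or as a plus combining diaeresis, and that
       also treats Spanish ch and ll as units. The names to_units and
       SPANISH_UNITS are made up for illustration, the digraph list is
       an assumption, and case handling is deliberately left out.

```python
import unicodedata

# Hypothetical, illustrative list of multi-character units (lowercase only).
SPANISH_UNITS = ("ch", "ll")

def to_units(text, multi_units=SPANISH_UNITS):
    """Split text into analysis units, regardless of how it was encoded."""
    # Normalize so precomposed and decomposed spellings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Try longer units first so a longer match wins over a shorter one.
    ordered = sorted(multi_units, key=len, reverse=True)
    units = []
    i = 0
    while i < len(text):
        for unit in ordered:
            if text.startswith(unit, i):
                units.append(unit)
                i += len(unit)
                break
        else:
            j = i + 1
            # Keep any following combining marks attached to their base,
            # which covers sequences that have no precomposed form.
            while j < len(text) and unicodedata.combining(text[j]):
                j += 1
            units.append(text[i:j])
            i = j
    return units

# Both encodings of a-umlaut yield one unit; "chilla" yields four units.
same = to_units("a\u0308") == to_units("\u00e4")
chilla = to_units("chilla")
```

       The analysis code only ever sees units, so the choice between
       a single code point and a combining sequence is invisible to it.
       Discontiguous units such as Thai sara ia would need more than
       this left-to-right scan.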

       More generally, our software systems must, for various
       purposes, have the ability to treat n characters as a sequence
       of m units (consider Scottish name sorting which equates Mc and
       Mac). If they don't do this, then they are to that extent
       lacking in their level of internationalisation.
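
       As a rough illustration of the Mc/Mac case, the Python sketch
       below builds a sort key in which the two characters "Mc" stand
       for the three characters "Mac". The function name_sort_key and
       the sample names are invented for this example, and the rule is
       a simplification, not a complete Scottish sorting convention.

```python
def name_sort_key(name):
    """Sort key that treats a leading 'Mc' as if it were spelled 'Mac'."""
    folded = name.casefold()
    # Assumption: only a leading "mc" is rewritten; nothing else is equated.
    if folded.startswith("mc"):
        folded = "mac" + folded[2:]
    return folded

# Made-up sample data for illustration.
names = ["MacDonald", "Mabon", "McAdam", "Macbeth", "McDonald", "Madden"]
ordered_names = sorted(names, key=name_sort_key)
# ordered_names is
# ["Mabon", "McAdam", "Macbeth", "MacDonald", "McDonald", "Madden"]
```

       McAdam sorts before Macbeth because its key is "macadam", and
       MacDonald and McDonald receive identical keys, so they sit
       together in their original relative order.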

       Peter


