Re: Normalization Form KC for Linux

From: Dan (
Date: Sat Aug 28 1999 - 04:56:19 EDT

> > There is no immediate advantage from using combining characters.
> Certainly there is, if you're doing anything with linguistics, some bits of
> trivial math, an "unusual" language, etc. The presence of even simple
> combining mark capability in the display opens up possibilities for plenty of
> languages without any extra support being required beyond a typical Latin
> font.

And if you work with linguistics, "ä" cannot be decomposed when you
work with Swedish, as it is a single letter. The dots above are not an
accent or diacritic mark. So here is a case where you need to
be able to represent what looks like the same glyph, "an a with
two dots above", both as one character and as an a with combining dots.
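The Swedish case can be illustrated with Python's standard `unicodedata` module (a modern sketch, not part of the original discussion): ä exists both as a single precomposed code point and as a base letter plus a combining diaeresis, and normalization maps between the two forms.

```python
import unicodedata

# Swedish "ä" can be written two ways that usually render identically:
precomposed = "\u00E4"    # LATIN SMALL LETTER A WITH DIAERESIS
combining = "a\u0308"     # "a" followed by COMBINING DIAERESIS

# The two strings are not equal code-point-for-code-point...
assert precomposed != combining

# ...but normalization converts between the composed (NFC) and
# decomposed (NFD) forms:
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

Note that normalization erases the distinction the author wants to preserve: after NFC or NFD both spellings are the same string, so "ä as a letter" and "a with combining dots" can no longer be told apart.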

> And part of the whole point of this exercise is to have a single
> character encoding standard. Right?

Yes, but not to make it more difficult than needed. ISO 10646 is
good (including all 31 bits). Some of the things Unicode has added
are good, some are bad.

> Maybe I really should shut up... I guess I'm bitterly disappointed in how
> the Unix and Posix community has not grasped the Unicode textual concepts and
> progressed or led the way in all of this. The community seems so insular
> and fossilized, when there are so many good things about Unix that have been
> poorly imitated by other popular platforms. These days the industry is
> moving right along doing all kinds of interesting display and many scripts &
> languages, while the academic Unix (and Posix) folks are complaining that
> it's too hard or can't be done at all.[1]
> Oh sigh...

I could say the same, but for different reasons.

For example:
- having non-spacing combining characters after, instead of
before, the base character.
- not thinking about reality when constructing UTF-8 and UTF-8
readers/writers. It is simple to make a UTF-8 reader that
accepts ISO 8859-1, or ISO 8859-1 mixed with UTF-8.
It is simple to make a more compact version of UTF-8 by using the
base 256 character codes where possible (more compact for many
languages).
If I use most tools handling UTF-8 today, they will stupidly abort
reading my files, because they are all in ISO 8859-1. And they will
not write ISO 8859-1.
When I write my software that handles ISO 10646, it will be able to
read UTF-8, ISO 8859-1, and ISO 8859-1 with embedded UTF-8. And it
will be able to write ISO 8859-1 with embedded UTF-8, allowing
the data to work with ISO 8859-1-only tools while still
being able to handle full ISO 10646.
- not accepting that some glyphs that look like a combined
character are not combined characters, but characters
in themselves, and should have code values of their own.
- only thinking of ASCII when thinking about backward compatibility.
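The lenient reader described in the second point could be sketched roughly as follows (a hypothetical `decode_mixed` helper, not the author's actual software; the heuristic of preferring valid UTF-8 sequences over Latin-1 bytes is an assumption, and is inherently ambiguous for byte pairs that are valid in both encodings):

```python
def decode_mixed(data: bytes) -> str:
    """Decode bytes that may mix UTF-8 sequences with raw ISO 8859-1.

    Any byte run that forms a valid UTF-8 multi-byte sequence is decoded
    as UTF-8; any byte that does not is taken at its ISO 8859-1 value.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                    # plain ASCII, same in both encodings
            out.append(chr(b))
            i += 1
            continue
        # Try successively longer UTF-8 sequences starting at this byte.
        for length in (2, 3, 4):
            chunk = data[i:i + length]
            try:
                out.append(chunk.decode("utf-8"))
                i += length
                break
            except UnicodeDecodeError:
                continue
        else:
            out.append(chr(b))          # fall back: treat as ISO 8859-1
            i += 1
    return "".join(out)
```

For example, `decode_mixed(b"r\xe4ksm\xc3\xb6rg\xe5s")` handles a string where "ä" and "å" are raw Latin-1 bytes while "ö" is a UTF-8 sequence. A writer doing the reverse (emitting Latin-1 where possible, UTF-8 sequences elsewhere) would be similarly short, which is the author's point: the tools abort where a little tolerance would do.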


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT