Re: Normalization Form KC for Linux

From: Dan Oscarsson (Dan.Oscarsson@trab.se)
Date: Mon Aug 30 1999 - 03:25:54 EDT


>>> >For example:
>>> >- having non spacing combining characters after instead of
>>> >before base character.
>>>
>>> I understood that it's much better to have them after.
>
>I think it's a question of logic. One writes an o and then puts a ~ on top
>of it. On typewriters, they switched this around because of the mechanical
>movement of the carriage and nonspacing technology. But for instance a
>smart font would take the bounding box

It is one thing to write a character; it is another to handle it
in software. For software there are advantages to having combining
characters before the base character. The user interface can still
construct characters in a user-friendly way.
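
One way to see the processing advantage (a sketch only; next_char(),
is_combining() and emit() are hypothetical helpers, not existing
functions): when the marks come first, a one-pass reader knows a
character is complete the moment it sees the base, whereas with the
base first it needs lookahead to see whether more marks follow.

    /* Sketch: one-pass reading of mark-before-base text. */
    extern unsigned long next_char(void);   /* returns 0 at end of input */
    extern int is_combining(unsigned long c);
    extern void emit(unsigned long base,
                     const unsigned long *marks, int nmarks);

    void read_clusters(void)
    {
        unsigned long marks[8], c;
        int n = 0;
        while ((c = next_char()) != 0) {
            if (is_combining(c)) {
                if (n < 8)
                    marks[n++] = c;   /* collect the pending marks */
            } else {
                emit(c, marks, n);    /* the base completes the cluster:
                                         no lookahead is needed */
                n = 0;
            }
        }
    }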

>So what? Your answer shows exactly what I said: that ä and à differ only in
>the minds of Swedes.

Yes, and it shows that it is generally not natural to decompose every
glyph that looks like it can be decomposed.

But what I really wanted to say is that it is better for interoperability
to always use Normalization Form C of Unicode Technical Report #15 when
transmitting textual data.
If you do that you get a more compact form than the decomposed one, and
you avoid the discussion of precomposed versus decomposed. It is also
closest to the legacy forms in use.
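
For example, ä is a single code point in form C but two code points
decomposed, which costs an extra byte in UTF-8 (the byte values below
are the standard encodings):

    /* "ä" in UTF-8, Normalization Form C: U+00E4, two bytes */
    const unsigned char form_c[] = { 0xC3, 0xA4 };

    /* "ä" in UTF-8, decomposed: U+0061 U+0308, three bytes */
    const unsigned char decomposed[] = { 0x61, 0xCC, 0x88 };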

-
It will be a very long time before UTF-8 or some other form of
ISO 10646 becomes the norm on my system.
I cannot do as Markus Kuhn suggests and recode all my files from
ISO 8859-1 into UTF-8: I have hundreds of gigabytes, and few tools
work with UTF-8.

To change to UTF-8 as the default, the following will have to
happen:

1) First, all tools will have to be fixed so they can read and write
both ISO 8859-1 and UTF-8. And when, for example, a C compiler works
in UTF-8 mode, I must still be able to write:
if (ch == 'ä')
to compare my letter ä with a single character value (a sketch of
how that can work on UTF-8 input follows after this list).
This also means that, for example, saved e-mail needs to be saved
in UTF-8.
Before all tools are fixed, everything must normally write data in
ISO 8859-1 format. ISO 8859-1 with embedded UTF-8 would also
be ok. (Note: some of Markus Kuhn's objections to a base256
version of UTF-8, on the grounds that it lacks essential properties
of UTF-8, do not always matter. You can design an encoding that
fulfills most of UTF-8's properties. And things like preserving
UCS-4 byte order under sorting are not needed: real sorting cannot
be binary, since sorting has to account for the current locale, as
the strcoll() sketch after this list illustrates. Nor does UTF-8
help with case changing or case-insensitive matching. When doing
any serious work on UTF-8 data you need to decode it into something
more useful internally.)

2) When all tools are fixed, they can then be set to write UTF-8
encoded files. Full conversion of all files at once is still not
possible: there is too much data, and it is too difficult to
identify what is text and what is not. So expect that, more or
less forever, all tools will be required to identify and read
ISO 8859-1 encoded files mixed in with UTF-8 encoded ones (one
detection heuristic is sketched below).
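
To make the comparison point in 1) concrete, here is a minimal sketch
of how a UTF-8-reading tool can still test against a single character
value: decode one UTF-8 sequence into a UCS code point and compare it
with 0x00E4 (ä). The utf8_next() helper is hypothetical, not from any
existing library, and handles only the one- and two-byte forms needed
for the ISO 8859-1 repertoire.

    #include <stdio.h>

    /* Sketch: decode one UTF-8 sequence into a UCS code point,
       advancing *s. No error handling; longer forms omitted. */
    unsigned long utf8_next(const unsigned char **s)
    {
        unsigned long c = *(*s)++;
        if (c < 0x80)                       /* 0xxxxxxx: plain ASCII */
            return c;
        if ((c & 0xE0) == 0xC0)             /* 110xxxxx 10xxxxxx */
            return ((c & 0x1F) << 6) | (*(*s)++ & 0x3F);
        return 0xFFFD;                      /* not handled in this sketch */
    }

    int main(void)
    {
        const unsigned char text[] = { 0xC3, 0xA4, 0 };  /* "ä" in UTF-8 */
        const unsigned char *p = text;
        if (utf8_next(&p) == 0x00E4)        /* same test as ch == 'ä' */
            printf("found a-diaeresis\n");
        return 0;
    }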
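
And on the sorting point: strcmp() compares raw bytes while strcoll()
uses the current locale's collation, so the same two strings can order
differently. A small sketch (the locale name is an assumption about
what is installed on the system):

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>

    int main(void)
    {
        const char *a = "\xE4pple";   /* "äpple" in ISO 8859-1 */
        const char *b = "banan";

        /* Binary order: byte 0xE4 sorts after 'b' (0x62). */
        printf("strcmp:  %d\n", strcmp(a, b));

        /* German collation treats ä like a, so here a sorts before b;
           Swedish collation puts ä after z. No single binary order
           can match both rules. */
        setlocale(LC_COLLATE, "de_DE");    /* assumed locale name */
        printf("strcoll: %d\n", strcoll(a, b));
        return 0;
    }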
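
For the mixed reading in 2), the usual heuristic is that ISO 8859-1
text containing non-ASCII bytes is almost never well-formed UTF-8 by
accident, so a tool can validate first and fall back. A sketch
(four-byte forms and overlong-sequence checks are omitted):

    #include <stddef.h>

    /* Return 1 if buf[0..len) is well-formed UTF-8, else 0. */
    int looks_like_utf8(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char c = buf[i++];
            int need;
            if (c < 0x80)                need = 0;  /* ASCII byte */
            else if ((c & 0xE0) == 0xC0) need = 1;  /* 2-byte sequence */
            else if ((c & 0xF0) == 0xE0) need = 2;  /* 3-byte sequence */
            else                         return 0;  /* invalid lead byte */
            while (need-- > 0)
                if (i >= len || (buf[i++] & 0xC0) != 0x80)
                    return 0;                       /* bad continuation */
        }
        return 1;
    }

A file that passes can be read as UTF-8; one that fails can be assumed
to be ISO 8859-1. (Pure ASCII passes both readings, which is harmless
since the two encodings agree there.)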

    Dan


