Re: Slovak and Czech "CH" (was: Re:Mixed up priorities)

From: Trond Trosterud (
Date: Sat Oct 23 1999 - 16:43:37 EDT

Paul Keinänen:
>If we accept that the Slovak letter "CH" is encoded as <C><H>, then
>how should the sequence of letters C and H (e.g. in a foreign word or
>other occurrences of such sequences) be encoded that is not to be
>handled as the "CH" letter. Perhaps <C><ZERO WIDTH SPACE><H> or
>something similar, in order to get hyphenation and sorting to work
>correctly without a dictionary ?

But it should not work that way: it should not distinguish between the two
kinds of c-h.

My guess is that "Christchurch" (a NZ town) is found under CH (Ch?) and not
under C in any Slovak Atlas.
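
To make the atlas guess concrete, here is a minimal sketch of a sort key that treats every c+h pair as the single letter "ch", placed after "h". The alphabet is simplified (accented Slovak letters are left out), and the names ALPHABET, RANK and slovak_key are made up for this illustration; this is not how any particular collation library does it.

```python
# Simplified alphabet: basic Latin letters plus "ch" after "h".
# Real Slovak collation also orders accented letters; they are omitted here.
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def slovak_key(word):
    """Hypothetical sort key: every c+h pair counts as the letter "ch"."""
    s = word.lower()
    key = []
    i = 0
    while i < len(s):
        if s[i:i + 2] == "ch":
            key.append(RANK["ch"])
            i += 2
        else:
            # Characters outside the simplified alphabet sort last.
            key.append(RANK.get(s[i], len(ALPHABET)))
            i += 1
    return key


towns = ["Christchurch", "Hamilton", "Cork", "Inverness", "Dunedin"]
print(sorted(towns, key=slovak_key))
# ['Cork', 'Dunedin', 'Hamilton', 'Christchurch', 'Inverness']
```

No dictionary is consulted, so the foreign "Christchurch" lands under CH, between H and I, exactly like a native Slovak word would.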

The same problem may be spotted in Norwegian: In names (but not in ordinary
nouns), "å" (= IPA [o:]) is written <aa>, so the word "Åsen" (the Hill) is
more often than not written "Aasen" when it is a (not uncommon) name. Thus,
our telephone directories (and indeed the sorting algorithms on any
Norwegian-localised computer) sort "aa" as identical to "å", i.e. at the
end of the alphabet (which ends in "...æøå").

But then we have the Finnish immigrants, with their long vowels, and names
like "Alvar Aalto" (the architect), etc. ("aa" being quite common). Now, in
the user's guide at the front of the telephone directory we are told that
"aa" pronounced as "å" (Norwegian and Danish names, that is) is sorted as
"å", whereas "aa" pronounced as long a (in Finnish, Dutch and German names,
for example) is sorted as a string of two a-s.

The problem is that this does not work. Finnish names are sorted under "å".
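
A small sketch of why this happens, assuming a sort key that simply folds every "aa" into "å" with no knowledge of how the name is pronounced. The function name norwegian_key and the sample names are made up for the illustration.

```python
# Norwegian alphabet: a-z followed by æ, ø, å.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def norwegian_key(name):
    """Hypothetical key: "aa" is always treated as "å"."""
    s = name.lower().replace("aa", "å")
    # Characters outside the alphabet sort last.
    return [RANK.get(ch, len(ALPHABET)) for ch in s]


names = ["Aasen", "Aalto", "Berg", "Østby", "Andersen"]
print(sorted(names, key=norwegian_key))
# ['Andersen', 'Berg', 'Østby', 'Aalto', 'Aasen']
```

The key cannot tell the Norwegian "Aasen" from the Finnish "Aalto", so both end up after "Østby", at the very end of the list.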

Probably we do not want several sorting algorithms mixed (Welsh, Slovak and
English ch in place-names sorted according to three algorithms in the same
atlas, say).

A more serious problem with having a unique code for dz (or for ch) is
that you may run into trouble in information retrieval: you can never be
sure that no application has failed to give the hex value for dz, giving
d+z instead, and you will thus have to conduct two searches, one for dz and
one for d and z following each other. Not an ideal situation (luckily, the
Slovaks will not face that problem, despite Adam's suggestion).

But, as stated many times, this has nothing to do with the encoding of c
and h. Here, as a true friend of Plato's dialogues, I cannot but refer you
to Peter's dialogue (22.10, 08.51), which says what has to be said.

As for the latest discussion, Michael and Mark hit the mark [pun not
intended]: although they use different terminology, they say the same
thing. Gregg, on the other hand, is led astray on both the [[meaning]] and
the //form//.

If by the meaning of a grapheme we mean the sound it represents, then
(e.g.) [[aa]] differs in Norwegian and Finnish, and [[o]] in Norwegian and
German. The first case is relevant for sorting, although not for UCS
encoding (which is why the telephone book does not get it right when it has
to deal with mixed-language databases). Thus, meaning is not relevant.

Since Gregg defines "//X//" as "the form of character X", he is not correct
when he states that Michael "seems to say '//ch// is two characters'".
Michael, using the same character definition as we find in 10646 and in
Plato's last dialogue, does not equate "character" with "form, shape".
Again, look at the references cited earlier in the discussion.

Well, at least we have forced ourselves to have a second look at our
definitions (this is probably why we all are so interested in this
discussion). Adam has perhaps become a bit wiser as well :-)


Trond Trosterud t +47 7764 4763
Finsk institutt, Det humanistiske fakultet h +47 7767 3639
N-9037 Universitetet i Tromsø, Noreg f +47 7764 4239
