RE: Slovak and Czech "CH" (was: Re:Mixed up priorities)

From: Marco.Cimarosti@icl.com
Date: Fri Oct 22 1999 - 07:04:39 EDT


I don't have any strong opinion about whether to accept or not Adam's three
new Slovak characters (CH, ch and Ch), but I wish to add a few observations.

Adam insists that "CH is a CHARACTER in the Slovak alphabet"...
This is plain wrong!!! I learned the alphabet when I was 5 years old, but I
first saw the word "character" after my 17th birthday, when I started
reading the manual of my brand new Commodore 64.

Slovak or other alphabets do NOT have characters, they have LETTERS! So one
should rather say that CH is a LETTER in the Slovak alphabet.

Having made this distinction between characters and letters, and given that
Unicode is not a "LETTER encoding standard", the letter CH need not be
encoded as a character: in Slovak it will be a 2-character letter (just like
it is in English, Italian, and many other languages).
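To make the letter/character distinction concrete, here is a minimal Python
sketch. The helper name slovak_letters is made up for this illustration, and
it only knows about CH (it ignores other Slovak digraphs and assumes
pre-composed accented characters):

```python
def slovak_letters(word):
    """Split a word into letters, treating 'ch' (in any case) as ONE letter.

    Hypothetical helper for illustration only: it handles CH and nothing else.
    """
    letters = []
    i = 0
    while i < len(word):
        if word[i:i + 2].lower() == "ch":
            letters.append(word[i:i + 2])  # one letter, two characters
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


word = "chlieb"                      # Slovak for "bread"
print(len(word))                     # number of characters
print(len(slovak_letters(word)))     # number of letters
print(slovak_letters(word))
```

The word has six characters but only five letters, and nothing in the
encoding had to change to get that answer.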

But, but, but... Adam looked at the Unicode charts and saw many other
letters that could well have been composed of two simpler characters, so he
asked "Why them and not my CH?"

The most famous of these letter conjuncts is W, which is just a couple of V's
(or a couple of U's, as the English name suggests). Another well-known
example is the German "Eszett" (ß), which is just an "ss" or an "sz" sequence
(the first "s" being in the now extinct long form).

There is also the digraph Æ which, though in Latin it was a mere
typographical ligature of two letters, is now regarded as a single letter in
some modern languages.

But what probably lit Adam's national pride are the Croatian digraphs DZ,
DJ, etc...
One could say that W, Æ, and ß have, at least, a typographical appearance
that is slightly different from UU, AE, and ss. But the Croatian digraphs
don't, and they have been included anyway! So why not Slovak CH?

Reading this mailing list, I have learned two things about Unicode that
could help answer this question:

1) Unicode is all PRAGMATICS. Most of the theory and philosophy was added
later (probably to add some nice text to a book with too many charts :-).
These theories are not always solid, and may be adjusted when the need
arises.

2) There is an aspect of the history of Unicode that has the utmost
importance, both practical and theoretical, but is easily forgotten: SOURCE
STANDARDS. Unicode did not arise in a desert: it started as a collection of
entities taken from a set of miscellaneous pre-existing national standards,
such as ASCII (from USA), ISCII (from India), TIS (from Thailand), JIS (from
Japan), and so forth. Unicode tends to respect and incorporate the choices
made by these pre-existing "traditions", even when they conflicted with
Unicode guidelines. There are good reasons for this; the main one being
round-trip conversion from/to national standards.
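As a rough illustration of what round-trip conversion means, the sketch
below uses Python's built-in codecs. ISO 8859-2 is my own choice of example
here (it is not one of the standards named above); any legacy code page that
Unicode fully covers would do:

```python
# Every byte in the upper half of ISO 8859-2 (an assumed example charset).
legacy_bytes = bytes(range(0xA0, 0x100))

# Legacy standard -> Unicode -> legacy standard.
as_unicode = legacy_bytes.decode("iso8859_2")
back_again = as_unicode.encode("iso8859_2")

# The round trip loses nothing: each legacy code has its own Unicode character.
print(back_again == legacy_bytes)
print(len(set(as_unicode)) == len(legacy_bytes))
```

If two legacy codes had been merged into one Unicode character, or split
into a sequence that the legacy set cannot tell apart, the second conversion
could not reproduce the original bytes.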

Point (2) explains the choice between pre-composed characters and
de-composed sequences for <base letter + diacritic>. The de-composed
sequences are Unicode's choice; the pre-composed characters are the tax paid
to existing "traditions".
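The two spellings of one accented letter can be seen with Python's standard
unicodedata module. This is only a small sketch using c with caron as the
example; the variable names are mine:

```python
import unicodedata

precomposed = "\u010D"   # LATIN SMALL LETTER C WITH CARON, one character
decomposed = "c\u030C"   # 'c' followed by COMBINING CARON, two characters

# The two strings differ as raw character sequences...
print(precomposed == decomposed)

# ...but normalization converts each form into the other.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Normalizing both sides to the same form before comparing is what lets
software treat the two spellings as the same text.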

It also explains the presence of the Croatian digraphs: these were already
in a source character set from the former Yugoslavia. The former Yugoslav
standards introduced these conjuncts with the aim of allowing naive
conversions between the Latin ("Croatian") alphabet and the Cyrillic
("Serbian") alphabet, and to permit naive sorting of Serbo-Croatian text
(where DJX goes between DX and DA). At the end of 1999, we would rather do
these kinds of things using mapping tables and collation algorithms... But
the 70's were a different age.
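Here is a hedged sketch of the "mapping table" approach in Python. The first
part only inspects one of the digraph characters that Unicode inherited; the
second part is a deliberately tiny, hypothetical table (TO_CYRILLIC and
to_cyrillic are names I made up) that converts ordinary two-character
digraphs by trying the longest match first:

```python
import unicodedata

# One of the inherited digraph characters: a single code point.
digraph_char = "\u01C6"
print(unicodedata.name(digraph_char))

# Its compatibility normalization is the ordinary two-character spelling.
two_chars = unicodedata.normalize("NFKC", digraph_char)
print(len(two_chars))

# Hypothetical, very partial Latin -> Cyrillic table (illustration only).
TO_CYRILLIC = {
    "d\u017E": "\u045F",  # dz with caron -> dzhe
    "lj": "\u0459",       # lj -> lje
    "nj": "\u045A",       # nj -> nje
    "a": "\u0430", "b": "\u0431", "u": "\u0443", "v": "\u0432",
}


def to_cyrillic(text):
    """Convert text using TO_CYRILLIC, trying two-character keys first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in TO_CYRILLIC:
            out.append(TO_CYRILLIC[pair])
            i += 2
        else:
            # Characters missing from the table are passed through unchanged.
            out.append(TO_CYRILLIC.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_cyrillic("ljubav"))
```

A table like this cannot tell a true digraph from two letters that merely
happen to stand next to each other, which is presumably part of why the old
standards preferred dedicated codes.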

And, finally, it also explains why Unicode has no CH: it is because in
pre-existing standards from Slovakia (or Czechoslovakia) there was no such
thing! And the reason why there was no such thing is probably (as you
suggested) that Slovak programmers have always been particularly good at
their job. So, they did a good analysis back in the 70's, and decided to use
collation tables for sorting text.
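For what it is worth, a collation table for CH needs very little code. The
Python sketch below is my own simplification, not anything taken from a
Slovak standard: ALPHABET contains only the unaccented letters plus "ch"
placed after "h", and accented letters simply sort last.

```python
# Simplified alphabet (an assumption): a..h, then "ch", then i..z.
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: position for position, letter in enumerate(ALPHABET)}


def collation_key(word):
    """Return a list of letter ranks, reading 'ch' as a single letter."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            # Letters outside the simplified alphabet sort after everything.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return key


words = ["cena", "chata", "hora", "ihla", "dom"]
print(sorted(words))                      # plain code-point order
print(sorted(words, key=collation_key))   # "chata" now comes after "hora"
```

The text itself stays a plain sequence of C and H; only the sort key knows
that the pair counts as one letter.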

Ciao. Marco


