Re: Mixed up priorities

From: G. Adam Stanislav
Date: Thu Oct 21 1999 - 23:51:06 EDT

At 17:19 21-10-1999 -0700, John Hudson wrote:
>Are there separate CH, Ch and ch keys on Slovak keyboards?

Of course not. Keyboards were designed in America. Besides, keyboards are
glyph-oriented, not character oriented. I am not aware of any operating
system that can display two glyphs for a single character (not yet,
anyway). Are we here to accept the status quo, or to internationalize

>Many languages, including English, make use of digraphs and trigraphs to
>represent sounds which are represented in other orthographies by single

Oh, yeah, it's all about English. The rest of us are idiots. English does
not consider digraphs separate characters, and English is right. The rest
of us should just assimilate. Resistance is futile.

Well, fine. Then let's declare Unicode the English way of transcribing
languages, and not call it an international standard of character encoding.

>In some languages these digraphs are considered to be
>individual letters, with specific sorting and hyphenation rules associated
>with them, but is it true that these sorting and hyphenation rules
>_require_ encoding of these digraphs as precomposed characters?

If they are considered individual letters by any language, yes, they should
be encoded separately.

>In some non-Slavic language adaptations of the Cyrillic script, up to four
>letters may be combined to represent a single sound, and these
>'quadragraphs' are often listed as single letters of the alphabet and have
>specific sorting and hyphenation rules. Are you suggesting that each of
>these sequences _needs_ to be encoded as a precomposed character?

I am not talking about transliteration. I am talking about native use. If
some language natively considers a quadragraph a character in its own
right, then yes, we need to encode it. Or we need to stop referring to
Unicode as CHARACTER ENCODING. Either solution is acceptable.

>>The fact that it can be constructed from two glyphs, C and H, is
>>irrelevant, many other characters can be so constructed (e.g. N with caron
>>can be constructed from an N and a caron, yet it is a separate character).
>There are plenty of people on this list who would argue that it should not

But the fact is, it is. And as long as Unicode is to be thought of as
character encoding, it should be.
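
For anyone who wants to check the N-with-caron point, here is a minimal
sketch in Python, assuming only the standard unicodedata module. The
variable names are made up for illustration. It shows that N with caron has
its own code point yet decomposes into N plus a combining caron, that the DZ
digraph has its own code point too, and that CH can only be written as two
separate characters.

```python
import unicodedata

# N with caron is encoded as a single precomposed character.
n_caron = "\u0147"  # LATIN CAPITAL LETTER N WITH CARON
decomposed = unicodedata.normalize("NFD", n_caron)
print([f"U+{ord(c):04X}" for c in decomposed])  # N followed by COMBINING CARON

# Composing the two-character sequence gives back the precomposed character.
print(unicodedata.normalize("NFC", "N\u030C") == n_caron)

# The DZ digraph also has a code point of its own.
dz = "\u01F1"  # LATIN CAPITAL LETTER DZ
print(unicodedata.name(dz))
# It has no canonical decomposition, only a compatibility one.
print(unicodedata.normalize("NFD", dz) == dz)
print(unicodedata.normalize("NFKD", dz))

# CH has no code point of its own, so it is always two characters.
ch = "CH"
print(len(ch))
```

Whether that difference in treatment is justified is exactly what is being
argued here; the sketch only shows how the three cases are encoded.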

>Again, is it _necessary_ for this behaviour to be controlled by encoding
>these letters as individual, precomposed characters? If there are no CH, Ch
>and ch keys on Slovak keyboards -- as I suspect -- you would still require
>secondary text processing which would recognise the keying of c followed by
>h as ch. What have you actually gained?

Consistency. There is a DZ, for example. It is a character in several
languages (Slovak included). Consistency with Unicode being a character
encoding. Keyboards are not about characters, they are about glyphs. They
evolved from the typewriter, in which characters were not a concern. Glyphs

>Remember that Unicode is a standard for encoding _plain text_.

No, it is a standard for encoding _characters_. It states so quite explicitly.

> Unicode does
>not contain sorting rules for individual languages, nor does it contain
>hyphenation rules for individual languages.

I have never asked to have the CH encoded right after the H and before the
I. That would be sorting. I am not talking about sorting at all. I am
talking about a separate character, which just happens to consist of two

> Unicode provides a standard for
>encoding text which can then be properly handled by secondary text
>processing software, including dictionaries, language specific hyphenation
>algorithms, etc.. The kind of thing you are demanding belongs at this
>secondary level, not at the plain text level.

No, Unicode provides a standard for character encoding, not plain text

Yes, it is possible to encode the CH as the C followed by the H, and the N
caron by the N followed by some connection code followed by a caron. And it
is perfectly possible for software to handle it. But that would not be
CHARACTER encoding. Unicode clearly states its goal to be the encoding of
characters of all languages, existing and defunct. CH is a character in

