> My native language, Slovak, uses the digraph "ch", yet I cannot find it
> anywhere in the Unicode standard. Ch, as used in Slovak (and, I believe,
> Czech), is not just two characters typed after each other. It is a
Digraph "ch" is definitely a unit in Slovak, Czech, and a dozen other
languages (see below). Such digraph (and n-graph) combinations are not
coded in Unicode, with a very few compatibility exceptions. The reason is
simply that coding them would make text processing in these and other
languages more difficult. Check out section 2.1 of the Unicode book.
I'll append below a listing I once compiled of Latin digraphs/n-graphs. I'm
sure it is full of inaccuracies, but it should give the idea. This list
omits combinations with diacritical marks and with apostrophe treated as a
letter. There is a list of similar length for Cyrillic.
Note that "ch" and many other combinations are digraphs *in English*
(E ng l i sh); though they are never counted as *letters* of English, they
are treated as units in many processes. I'll append some interesting old
comments by Glenn Adams on the subject.
aa olddanish, oldnorwegian
ch spanish, portuguese, malay, indonesian, polish, oldhausa, slovak, czech,
wendish, catalan, breton, welsh, javanese, bugotu, swahili, zulu, navajo,
choctaw, nahuatl, quechua, guarani, aymara, ido, interlingua
cz polish, oldhungarian
dh albanian, irish
dj indonesian, slovene, javanese
dl zulu, navajo
dz polish, ewe, oldlatvian, navajo
gb yoruba, ewe
hl suto, zulu, chuana, choctaw
kh malay, zulu
lj serbocroatian, slovene, wendish
ll spanish, albanian, catalan, welsh, quechua
ng malay, indonesian, tagalog, visayan, welsh, javanese, maori, bugotu
nj indonesian, serbocroatian, albanian, slovene
ny malay, hungarian, catalan, zulu
ph welsh, zulu, interlingua
qu catalan, interlingua
rj slovene, wendish
rr spanish, albanian
sh malay, oldhausa, albanian, zulu, navajo, choctaw, ido
sz polish, hungarian
th albanian, welsh, bugotu, interlingua
tj indonesian, slovene
tl suto, chuana, nahuatl
ts wendish, ewe, malagash
tsh zulu, navajo
xh albanian, zulu, navajo
Date: 8 Dec 92 18:10:36 PST (Tuesday)
Subject: Spanish letters "ch" and "ll"
From: <Glenn Adams>
To: <Wayne Pollock>
> Date: Sun, 29 Nov 92 19:46:15 EST
> From: <Wayne Pollock>
> I just finished reading Denis Garneau's report, referenced in the Unicode
> standard, on searching and sorting to produce expected results depending on
> the culture of the user (i.e., the sort order of the same sequence of
> letters is different if you are American or French). And I learned
> something new: that in Spanish there are multi-character letters, namely
> "ch" and "ll". These are apparently not ligatures but true letters in the
> language, and 'ch' would sort differently than the letter 'c' followed by
> an 'h'.
> A quick perusal of the Unicode standard ASCII, Latin, and Extended Latin
> blocks reveals these Spanish letters are missing. Using two character
> codes (the 'c' and 'h', or two 'l's) instead of a single code for each of
> these letters doesn't seem to fit with what (little) I know of Unicode
> design; I thought all true letters from any script would merit their own
I just read your question on this topic. If it hasn't already been pointed
out, I might mention that Unicode doesn't necessarily encode the atomic
units of writing systems; rather, it encodes symbols which can be used to
form such units. It also is not the case that 'ch' and 'll' are "letters"
of Spanish writing systems, even though they operate as atomic units for
some collating sequences -- it is not even universally true that Spanish is
collated in this fashion.
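
To make the collation point concrete, here is a minimal Python sketch of a
sort key that treats 'ch' and 'll' as units while the text itself stays as
ordinary 'c' 'h' and 'l' 'l' codes. The alphabet table, the function names,
and the sample words are assumptions made up for this illustration; accented
vowels and other refinements of real Spanish collation are ignored.

```python
# Traditional Spanish alphabet order, with "ch" after "c" and "ll" after "l".
# This table is an assumption for the sketch, not taken from any standard.
TRADITIONAL_SPANISH = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {unit: i for i, unit in enumerate(TRADITIONAL_SPANISH)}


def collation_units(word):
    """Split a lowercase word into units, preferring the digraphs in RANK."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def traditional_key(word):
    """Sort key: one rank per unit; unknown symbols sort after 'z'."""
    return [RANK.get(u, len(RANK)) for u in collation_units(word.lower())]


words = ["curva", "chico", "llama", "luz", "cosa", "dama"]
by_code_point = sorted(words)
by_tradition = sorted(words, key=traditional_key)
```

With these sample words, plain code-point order puts "chico" before "cosa",
while the digraph-aware key puts it after "curva". Switching between the two
conventions only means switching the key function; the stored text does not
change.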
The term used by Unicode to describe these units -- 'ch' and 'll' -- is
"text element." The manner in which these elements are interpreted may
depend upon a number of factors, e.g., the operation being performed,
parameters of the operation, the language and orthography represented by the
data, and even the data itself. For example, the English words "cathouse"
and "cathode" treat the sequence 't' 'h' as two units in the first instance
and one unit in the second instance for the hyphenation operation.
Many other writing systems variously treat multiple symbols or forms as
representing one entity at some level of abstraction; for example,
Vietnamese often sorts 'ch', 'gi', 'gh', 'kh', 'ng', 'ngh', 'nh', 'th', and
'tr' as single units (depending on the dictionary).
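
As a rough sketch of how such units might be recovered from plain character
sequences, the following Python fragment segments a word by longest match
against the Vietnamese combinations listed above. The function name and the
unaccented sample words are assumptions for illustration only; tone marks
and vowel diacritics are left out.

```python
# The multi-letter units named above; "ngh" must win over "ng" when both fit.
VIETNAMESE_UNITS = {"ch", "gi", "gh", "kh", "ng", "ngh", "nh", "th", "tr"}


def segment(word, units=VIETNAMESE_UNITS):
    """Split a word into units, always taking the longest match available."""
    longest = max(len(u) for u in units)
    result = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, down to two symbols.
        for size in range(longest, 1, -1):
            candidate = word[i:i + size]
            if candidate in units:
                result.append(candidate)
                i += len(candidate)
                break
        else:
            # No multi-letter unit starts here; take a single symbol.
            result.append(word[i])
            i += 1
    return result


examples = {w: segment(w) for w in ["nghe", "nga", "tranh", "khong"]}
```

Here "nghe" comes out as 'ngh' 'e' and "nga" as 'ng' 'a'. Since the set of
units is a parameter, a dictionary that does not sort by these combinations
can pass an empty or different set.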
I hope this helps some. The problem of deciding what is a "letter" is not
always simple. However, in general, combinations of basic symbols such as
those mentioned here are not considered to be "letters" in any analysis.
They are usually called 'digraphs' or 'bigraphemes'; one can also have
trigraphs and n-graphs on the same principle.
By the way, what do you think about 'qu' and 'ch' in English writing
systems? They are true digraphs in English since they always have an atomic
phonological interpretation (although not always the same one, e.g., chin,
chivalry, chiropractor, yacht -- notice that the first three of these occur
in exactly the same context: 'chi-'). Should they have a single character
encoding? If no, then why not?
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:46 EDT