UCA Latin Recommendations

Public Review Issue #47
L2/04-407

Mark Davis, 2004-07-26

This document contains proposals for changes in the UCA to the weightings of a small number of Latin characters (out of 1870 entries in the UCA for Latin).

It is important that we ensure that the UCA weightings are as good as possible. Collation has application not merely in presenting lists of sorted strings to users, but also in database queries and in language-sensitive matching. It is crucial, for example, that user expectations about ordering are met. Consider the example of a German businessman making a database selection, such as summing up revenue in each of the cities from O... to P... for planning purposes. If behind his back all cities starting with Ö are excluded because the query selection is using a Swedish collation, there is going to be one very unhappy customer. Similarly, sorting "Søren" after "Sozar" in a long list, if that is not expected in the user's language, will cause problems. A user will look for "Søren" between "Sorem" and "Soret", not see it on the page, and assume it isn't there, fooled by the fact that it is on a completely different page. In matching, the same can occur, which can cause significant problems for software customers; and as with database selection, the user may not realize what he is missing.
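The O... to P... selection can be sketched in Python. This is a toy model under stated assumptions, not the UCA and not any real database collation: the city list, the helper names (`base_letters`, `german_like_primary`, `swedish_like_primary`, `select_range`) and the `AFTER_Z` weights are invented for illustration. The German-like key treats Ö as a variant of O at the primary level, while the Swedish-like key places å, ä, ö after z.

```python
import unicodedata

def base_letters(s):
    """Strip combining marks ('ö' -> 'o'); a toy stand-in for primary weights."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def german_like_primary(s):
    # Assumption: accented letters share the primary weight of their base.
    return [ord(c) for c in base_letters(s)]

# Assumption: hypothetical weights that put å, ä, ö after z.
AFTER_Z = {"å": ord("z") + 1, "ä": ord("z") + 2, "ö": ord("z") + 3}

def swedish_like_primary(s):
    weights = []
    for c in unicodedata.normalize("NFC", s.casefold()):
        if c in AFTER_Z:
            weights.append(AFTER_Z[c])
        else:
            weights.extend(ord(b) for b in base_letters(c))
    return weights

def select_range(names, lo, hi, key):
    """Return the names whose key falls in the half-open range [lo, hi)."""
    return [n for n in names if key(lo) <= key(n) < key(hi)]

cities = ["Nürnberg", "Oldenburg", "Öhringen", "Osnabrück", "Passau", "Quedlinburg"]
german = select_range(cities, "O", "Q", german_like_primary)
swedish = select_range(cities, "O", "Q", swedish_like_primary)
print(german)   # includes Öhringen
print(swedish)  # silently omits Öhringen
```

The same query returns a different set of cities depending only on the collation in effect, and nothing in the result tells the user that a row is missing.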

With Unicode being deployed so widely, this is even more important; multilingual data becomes the rule, not the exception. A French company with customers all over Europe is going to have names from many different languages — French, German, Polish, Swedish, etc. If a German employee sets the sorting (or matching/selecting) language to be German, then the names need to show up in the order appropriate for German, even though there will be many different accented characters that do not normally appear in German text.

Stability is important, and we want to consider changes carefully. However, we know that if the UCA is changed in any way, then any tailoring is affected in that it will produce a different ordering for some characters (any that it does not explicitly override). So any implementation's versioning scheme must take account of this. This will always be the case, unless we completely freeze the UCA, disallowing fixes for, say, Indic characters. But the UTC clearly has not agreed to freeze the UCA; while stability is very important, we have left ourselves the ability to make changes in the UCA when warranted.

And the only tailorings that would be affected for the worse are ones where the tailoring depends on inheriting the order from the UCA for the affected characters. In a great many of these cases, the UCA order must be tailored anyway for any of these characters that are needed in the languages. For example, Ø must be tailored for Danish (da.xml). Note that in CLDR, we explicitly do not depend on the UCA ordering for the following characters when they are considered separate letters in the language; for example, in Polish you will see explicit weighting of Ł (pl.xml).

1. Changing Æ to be an expansion

The character Æ (and its lowercase) should sort with a primary weight of AE, just like Œ sorts with a primary weight of OE currently.

Æ	00C6	LATIN CAPITAL LETTER AE
Ǽ	01FC	LATIN CAPITAL LETTER AE WITH ACUTE
Ǣ	01E2	LATIN CAPITAL LETTER AE WITH MACRON

Traditionally, except for a very few languages, Æ is considered to be a presentation variant of AE. You see that in the variation in spelling between words like hæmophilia and haemophilia, cæsium and caesium. In some languages (or variants, like American English*), the spelling has been reformed to convert to 'e' or another letter. But where the vast majority of people see Æ, they will consider it to behave like AE.

In English, the OED and Webster's Revised both show this Æ behaving as AE: ...Cady, Cæca, Cæcal, Cæcias, Cæcilian, Cæcum, Cænozoic, Caen stone, Cæsar, Cæsarean, Cæsarism, Cæsious, Cæsium, Cæspitose, Cæsura, Cæsural, Café...
In French, this is also the case: we have "Les digrammes soudés (ligatures) comme « æ » et « œ » sont classés avec les lettres doubles correspondantes, en les discriminant toutefois par un indice de priorité particulier, pour assurer la prévisibilité absolue du classement.... cadurcien, cæcum, caennais, cæsium, cafard" — RÈGLES DU CLASSEMENT ALPHABÉTIQUE EN LANGUE FRANÇAISE ET PROCÉDURE INFORMATISÉE POUR LE TRI, as well as major dictionaries.
For German, DIN 5007 requires this behavior: Æ as a secondary difference from AE.
The European ordering rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf) require this.
The Scandinavian languages are unaffected by this change, since they already have to tailor Æ to be a letter after z. All other languages would tailor it to be a secondary (or tertiary) difference from ae, to reflect alternate spellings like Cæsar or hæmoglobin.
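The effect of treating Æ as an expansion can be illustrated with a small Python sketch. It is a toy model, not the UCA: the `LIGATURES` table, the `expansion_key` function and the "ligature" secondary marker are assumptions made for this example. The word list is the French one quoted above.

```python
import unicodedata

# Assumption: a toy expansion table; real UCA data is far larger.
LIGATURES = {"æ": "ae", "œ": "oe"}

def expansion_key(word):
    """Two-level toy key: base letters first, then per-letter variants."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", word.casefold()):
        if ch in LIGATURES:
            # The ligature contributes the primary weights of both letters,
            # and a secondary marker so it still differs from the plain digraph.
            for letter in LIGATURES[ch]:
                primary.append(letter)
                secondary.append("ligature")
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            primary.append(decomposed[0])     # base letter
            secondary.append(decomposed[1:])  # combining marks, if any
    return (primary, secondary)

words = ["cafard", "cæsium", "caennais", "cæcum", "cadurcien"]
by_code_point = sorted(words)                 # æ lands after every "ca..." word
by_expansion = sorted(words, key=expansion_key)
print(by_code_point)
print(by_expansion)
```

With the expansion, cæcum and cæsium file among the other "cae..." words, as in the French rules, while "caesium" and "cæsium" still differ at the secondary level.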

* Note: of course, there may be more dramatic respellings than the American one, such as: caeisiam, cæsium, cäsiumn, cesi, cesio, césio, cesiom, cesiun, cesium, cesium, césiumm, cesiumn, cesiwm, cesyum, céz, cezm, cezi', cezij, cēzijs, cezio, cezis, cezium, cézium, kaishum, sesín, sesium, seziom, sezyum, siżjum, tseesium, tseziumu, xezi, xêzi, zäsium, zäsiumn, zesioa.

2. Changing characters with diacritics to secondary difference

The following characters (and their lowercases) should be made secondary differences from their bases in UCA 4.1. They are arranged in rough priority order, based on frequency of usage. The UCA should change at least the first group, although all of them are recommended.

			Languages on http://www.eki.ee/letter/
Ø	00D8	LATIN CAPITAL LETTER O WITH STROKE	da [Danish]; fo [Faroese]; kl [Greenlandic]; no [Norwegian];
Ǿ	01FE	LATIN CAPITAL LETTER O WITH STROKE AND ACUTE	(but included for consistency with O WITH STROKE)
Đ	0110	LATIN CAPITAL LETTER D WITH STROKE	bs [Bosnian]; hr [Croatian]; sami1 [Inari Sámi]; sami2 [North Sámi]; sami4 [Skolt Sámi]; sl [Slovenian]; vi [Vietnamese];
Ł	0141	LATIN CAPITAL LETTER L WITH STROKE	pl [Polish]; sorb1 [Lower Sorbian]; sorb2 [Upper Sorbian]; sla [Kashubian];
Ŀ	013F	LATIN CAPITAL LETTER L WITH MIDDLE DOT	ca [Catalan];
Ð	00D0	LATIN CAPITAL LETTER ETH	fo [Faroese]; is [Icelandic];
Ħ	0126	LATIN CAPITAL LETTER H WITH STROKE	mt [Maltese];
Ŧ	0166	LATIN CAPITAL LETTER T WITH STROKE	sami2 [North Sámi];
Ǥ	01E4	LATIN CAPITAL LETTER G WITH STROKE	sami2 [North Sámi];
Ŋ	014A	LATIN CAPITAL LETTER ENG	bm [Bambara]; ff [Fula]; sami1 [Inari Sámi]; sami2 [North Sámi]; sami4 [Skolt Sámi]; wo [Wolof]; dink [Dinka];
Ɓ	0181	LATIN CAPITAL LETTER B WITH HOOK	ha [Hausa]; ff [Fula]; or bm [Bambara];
Ɗ	018A	LATIN CAPITAL LETTER D WITH HOOK
Ƙ	0198	LATIN CAPITAL LETTER K WITH HOOK
Ɲ	019D	LATIN CAPITAL LETTER N WITH LEFT HOOK
Ƴ	01B3	LATIN CAPITAL LETTER Y WITH HOOK
Ƃ	0182	LATIN CAPITAL LETTER B WITH TOPBAR	No information
Ƈ	0187	LATIN CAPITAL LETTER C WITH HOOK
Ɖ	0189	LATIN CAPITAL LETTER AFRICAN D
Ƒ	0191	LATIN CAPITAL LETTER F WITH HOOK
Ɠ	0193	LATIN CAPITAL LETTER G WITH HOOK
Ɨ	0197	LATIN CAPITAL LETTER I WITH STROKE
Ƞ	0220	LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
Ƥ	01A4	LATIN CAPITAL LETTER P WITH HOOK
Ƭ	01AC	LATIN CAPITAL LETTER T WITH HOOK
Ʈ	01AE	LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
Ʋ	01B2	LATIN CAPITAL LETTER V WITH HOOK
Ƶ	01B5	LATIN CAPITAL LETTER Z WITH STROKE
Ȥ	0224	LATIN CAPITAL LETTER Z WITH HOOK

Users don't distinguish between types of accents. They do not understand why the default ordering of LATIN CAPITAL LETTER I WITH OGONEK makes it sort with I, while the default ordering of LATIN CAPITAL LETTER Z WITH HOOK makes it sort as a completely separate letter from Z.

	Į	012E	LATIN CAPITAL LETTER I WITH OGONEK
	Ȥ	0224	LATIN CAPITAL LETTER Z WITH HOOK

Even where a language distinguishes certain accented letters as separate letters for collation/matching, its users expect those letters to be treated uniformly. In Polish, the letters with diacritics Ą Ć Ę Ł Ń Ó Ś Ź Ż are sorted after the corresponding letters without. When queried, Polish users expect them either all to be separate letters or all to be sorted with their bases: they see no reason for singling out Ł for different treatment from the others.

And if a German customer is accessing a database full of European names, and expects to find Ę with E, and Ą with A and Ż with Z and Ł with L, then with the current UCA s/he will be right except for the last one. If s/he expects that a database SELECT of all client names starting with "L" will include the "Ł" names also, then s/he will get the wrong answer in a financial report, probably without realizing it is wrong. If s/he looks for a client name Słownik* within a page of Sl... and doesn't think to look 3 pages down after Sz, then s/he will get the wrong answer, probably without realizing it is wrong. If s/he searches for a name within a body of text using a weak language-sensitive match, and doesn't find it, then s/he will get the wrong answer, probably without realizing it is wrong.
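The SELECT example can be sketched as follows. This is a toy model, not the UCA: it approximates primary weights by stripping combining marks after canonical decomposition, and the client names, the `primary` and `names_starting_with` helpers, and the `STROKE_BASES` table (a hypothetical subset of the recommended mappings) are assumptions for illustration. Ę, Ą and Ż have canonical decompositions, so they fold to their bases; Ł does not, so it is left out unless it is mapped explicitly.

```python
import unicodedata

# Assumption: a hypothetical subset of the recommended base-letter mappings.
STROKE_BASES = {"ł": "l", "ø": "o", "đ": "d"}

def primary(s, extra=None):
    """Toy primary key: base letters only, plus optional explicit mappings."""
    extra = extra or {}
    out = []
    for ch in s.casefold():
        if ch in extra:
            out.append(extra[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.extend(c for c in decomposed if not unicodedata.combining(c))
    return "".join(out)

def names_starting_with(names, prefix, extra=None):
    """Primary-strength prefix match, like SELECT ... WHERE name LIKE 'L%'."""
    wanted = primary(prefix, extra)
    return [n for n in names if primary(n, extra).startswith(wanted)]

clients = ["Lange", "Łukasiewicz", "Lübke", "Müller", "Żak"]
without_mapping = names_starting_with(clients, "L")
with_mapping = names_starting_with(clients, "L", STROKE_BASES)
print(without_mapping)  # Łukasiewicz is missing
print(with_mapping)     # Łukasiewicz is included
```

Żak is found under Z in both cases, which is exactly the inconsistency described above: the user has no way to predict that Ł alone behaves differently.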

Again we see the same pattern of behavior:

The German DIN 5007 indeed says that letters with diacritics are sorted with the same primary weight (Section 5.1.1.3); it explicitly lists "overstrikes" as diacritics in 6.2.3.1.1, and gives Ł as an example.
The European ordering rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf)
etc.

Q & A

Q. Doesn't this propose to reverse the explicit design principles that went into the default tailorable template in the first place? Similar letters are near, but not interfiled with, similar letters. This is more than enough to give any casual user the functionality he needs, because only in initial position is there likely to be any confusion in real-life sorted word lists.

A. What we actually did was to put similar letters near other letters, and if their decompositions were the same we interfiled them. To users, however, there is little difference between Å, Ł, Ļ, Ñ, Ø, Ơ, and Ô that would cause a user to think that some should be interfiled and some should not. Å is seen as a separate letter in the languages that use it, but the UCA "interfiles" it. Ł is also seen as a separate letter, and the UCA doesn't. In some languages these would be seen as "separate letters" (e.g. with different primary weights) and in others not; but that does not line up in any particular way with what is in the UCA.

And making it a primary vs secondary difference can have some important consequences. Not all sorted lists are very small, with all affected characters within a few lines of each other on a single page, where placement doesn't matter too much. Near-but-separate placement doesn't work with large lists, database selection, matching (where the user won't see that something is missing), etc.

Q. O-slash is treated as a separate letter in the pronunciation guides of all IPA-based dictionaries, which constitute the majority of the world's usage, currently. So shouldn't it be left as a "separate letter"?

A. First, we don't know that UCA out of the box sorts IPA correctly — nor do we have much of an idea what constitutes the "correct" IPA sorting. The IPA specification itself does not appear to have any sorting requirement. Secondly, even in dictionaries, the entries are not normally sorted by the IPA, they are sorted by the words that the IPA is glossing. Thirdly — and much more importantly — the amount of sorted IPA data is going to be dwarfed by the amount of data sorted according to normal language conventions.

The fact that IPA uses these letters as being different is completely beside the point. Everyone agrees that for that purpose they are different characters: Å and A are different characters, but interleaved in UCA; Ł and L are different characters, but not interleaved in UCA.

Q. Won't this produce a visually disturbing effect, as in the following?

	Interleaved (Recommendation)	Separate (Current UCA)
1	ofofofo oføfofo øfofofo øfoføfø ofofofp	ofofofo ofofofp oføfofo øfofofo øfoføfø

A. This is a curious perception, since this is only one case out of 102 accented o's, where all the others are interleaved. And of course visual disturbance of multiple characters in such artificial examples with multiple marks has little to do with sorting/matching behavior.

	Interleaved (Current UCA)
2	ofofofo ofơfofo ơfofofo ơfofơfơ ofofofp
3	ofofofo ofõfofo õfofofo õfofõfõ ofofofp
	...
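The two orderings in row 1 can be reproduced with a short Python sketch. It is a toy model with made-up weights, not the actual UCA table: `interleaved_key` gives ø the primary weight of o plus a secondary marker, and `separate_key` gives ø its own primary weight between o and p. Both function names and the weights are assumptions for illustration.

```python
def interleaved_key(word):
    """ø shares o's primary weight; the stroke only counts at the secondary level."""
    primary = [ord("o") if ch == "ø" else ord(ch) for ch in word]
    secondary = [1 if ch == "ø" else 0 for ch in word]
    return (primary, secondary)

def separate_key(word):
    """ø is a separate letter with its own primary weight between o and p."""
    return [ord("o") + 0.5 if ch == "ø" else ord(ch) for ch in word]

words = ["ofofofp", "øfoføfø", "oføfofo", "ofofofo", "øfofofo"]
interleaved = sorted(words, key=interleaved_key)
separate = sorted(words, key=separate_key)
print(" ".join(interleaved))
print(" ".join(separate))
```

In the interleaved ordering all four strings that differ only by the stroke stay together ahead of ofofofp; in the separate ordering ofofofp moves up to second place.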