L2/04-031R

Re: UCA Revised Latin?
From: Mark Davis
Date: 2004-06-17

We should consider whether to make the following changes in the next version of the UCA.

1. Make alternate forms of letters be secondary differences from the 'base' letter. For example, the following would all be primary equivalents, and only differ on the secondary level.

Pros:
  1. If a language does not use those letters, they would be expected to be ordered as variants of a base. For example, a non-Scandinavian user would expect to see ø as a variant of o, and not have the ordering:
    1. sos...
    2. sot...
    3. sou...
    4. søs...
  2. If a language does use those letters, they are very likely tailored someplace else anyway.
  3. When a tailoring inserts letters, it is typically after the base. Suppose, for example, that a language tailors t as primary-greater than d (so that t sorts immediately after d). Without special consideration for the variant forms, what a user would see is:
    1. sod...
    2. sot...
    3. sođ...
    Instead of what the user would expect:
    1. sod...
    2. sođ...
    3. sot...
  4. Better compatibility with the European ordering rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf), for letters that are in the repertoire
Cons:
  1. stability -- not a small con, so we need to consider it carefully!
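
As an illustration of pros 1 and 3 above, here is a minimal Python sketch of a two-level comparison. It is not the UCA algorithm and uses no real DUCET weights: the helper names (build_order, tailor_after, sort_key) and the toy ranking are assumptions made only for this example. It models ø and đ either as separate primary letters placed right after their base, or as secondary variants of the base.

```python
import string

# Hypothetical (base, variant) pairs used only for this illustration.
VARIANTS = (("o", "ø"), ("d", "đ"))


def build_order(variants_as_secondary):
    """Return (primary_order, secondary_map) for a toy collation."""
    order = list(string.ascii_lowercase)
    secondary = {}
    for base, variant in VARIANTS:
        if variants_as_secondary:
            # Proposed behavior: same primary as the base, secondary difference.
            secondary[variant] = base
        else:
            # Variant is its own primary letter, directly after the base.
            order.insert(order.index(base) + 1, variant)
    return order, secondary


def tailor_after(order, base, letter):
    """Toy tailoring: move `letter` to sort immediately after `base`."""
    order = [c for c in order if c != letter]
    order.insert(order.index(base) + 1, letter)
    return order


def sort_key(word, order, secondary):
    """Compare all primaries first, then all secondaries."""
    rank = {ch: i for i, ch in enumerate(order)}
    primaries = tuple(rank[secondary.get(ch, ch)] for ch in word)
    secondaries = tuple(1 if ch in secondary else 0 for ch in word)
    return primaries, secondaries


def collate(words, order, secondary):
    return sorted(words, key=lambda w: sort_key(w, order, secondary))


# Pro 1: ø as a primary letter vs. as a secondary variant of o.
words = ["søs", "sou", "sot", "sos"]
order_p, sec_p = build_order(variants_as_secondary=False)
order_s, sec_s = build_order(variants_as_secondary=True)
print(collate(words, order_p, sec_p))
print(collate(words, order_s, sec_s))

# Pro 3: a tailoring that puts t right after d.
names = ["sođ", "sot", "sod"]
print(collate(names, tailor_after(order_p, "d", "t"), sec_p))
print(collate(names, tailor_after(order_s, "d", "t"), sec_s))
```

With the variants as separate primaries, the tailored t lands between d and đ; with the variants as secondary differences, đ stays attached to d wherever a tailoring places other letters.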

2. Make "æ" be a secondary difference from "ae".

Pros:
  1. consistency with the handling of "œ"
  2. currently all Latin-script languages have to tailor this character. Certain Scandinavian languages will tailor it to be a letter above z. All other languages would tailor it to be a secondary (or tertiary) difference from ae, to reflect alternate spellings like Cæsar or hæmoglobin.
  3. better compatibility with the European ordering rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf)
Cons:
  1. stability
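
To make the proposal concrete, the following Python sketch treats "æ" (and "œ") as an expansion to the primaries of its two base letters plus a secondary mark. The weights, the EXPANSIONS table, and the function names are hypothetical and exist only for illustration; case is simply folded away here, whereas the real UCA handles it at the tertiary level.

```python
import string

# Toy primary ranks for a-z; not real UCA weights.
RANK = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}

# Assumed expansions: each ligature gets the primaries of its letters.
EXPANSIONS = {"æ": "ae", "œ": "oe"}


def sort_key(word):
    """Two-level key: all primaries first, then all secondaries."""
    primaries, secondaries = [], []
    for ch in word.lower():
        expansion = EXPANSIONS.get(ch)
        if expansion:
            primaries.extend(RANK[c] for c in expansion)
            # Mark the first expanded element with a secondary difference.
            secondaries.extend([1] + [0] * (len(expansion) - 1))
        else:
            primaries.append(RANK[ch])
            secondaries.append(0)
    return tuple(primaries), tuple(secondaries)


def primary_equal(a, b):
    """True when two strings differ at most on the secondary level."""
    return sort_key(a)[0] == sort_key(b)[0]


print(primary_equal("Cæsar", "Caesar"))
print(primary_equal("hæmoglobin", "haemoglobin"))
print(sorted(["caf", "cæsar", "caesar", "cad"], key=sort_key))
```

Under this model a primary-strength search for "Caesar" also finds "Cæsar", and the two spellings sort next to each other rather than being separated by other words.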

For reference, here is an email related to the topic.

> ----- Original Message -----
> From: Åke Persson
> To: Mark Davis
> Sent: Wed, 2003 Dec 31 06:36
> Subject: ae << æ etc.
>
> Mark,
>
> I have browsed the latest ICU collations. Here are a few comments.
>
> The inclusion of ae << æ in several languages resembles my experience when I
> implemented the UCA in Mimer SQL. The next thing that came up was letters with
> stroke. For example, the Polish letter L-stroke, properly used in Polish names,
> did not match a Swedish or English search for names containing L. L-stroke is
> expected to be L with a stroke "accent", except for Polish (and Sorbian).
> <<Lodz.jpg>> is a snapshot from a Swedish encyclopædia (note also "oe"). To make
> a long story short, it all ended up in the European Ordering Rules (EOR)
> concept, where the base letters in the Latin alphabet are only A-Z. The first
> step was to create an EOR tailoring as the base. Languages with additional
> letters in their alphabets were tailored on top of the EOR tailoring. The next
> step was improvement of space and performance, by making EOR the default, and to
> create a tailoring for the default UCA instead (at least needed for the
> conformance test).
>
> Here's an overview of the tailorings:
> http://developer.mimer.com/collations/charts/tailorings.htm
>
> Please, take a closer look at:
> Catalan, Croatian, Faroese, Icelandic, Latvian, Lithuanian, Romanian, and Slovak
> compared to the corresponding ICU collations.
>
> My sources are documented here:
> http://developer.mimer.com/collations/charts/sources.htm
>
> The E-ogonek (old Sami and Icelandic Ä) as a variant of Ä in Faroese, Finnish,
> Greenlandic, Norwegian, and Swedish looks a bit goofy. I would rather expect a
> search match for E in Polish and Lithuanian names containing E-ogonek. I think
> it's better to have a specific locale for Sami.
>
> [before 1] is used extensively in the ICU collations. It's easier to read the
> collation definitions, if [before 1] is used only when necessary.
>
> Happy New Year!
> Åke Persson


Here are the Latin primary-different characters, for comparison (all non-primary-different characters have been suppressed in this list).