L2/04-031

Re: UCA Revised Latin?
From: Mark Davis
Date: 2004-01-23
 

We should consider whether to make the following changes in the next version of the UCA.

[For the meeting, please also print http://www.unicode.org/charts/collation/chart_Latin.html]

1. Make alternate forms of letters (like the following) be secondary differences from the 'base' letter.

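a: ɐ (U+0250), ɑ (U+0251), ɒ (U+0252)
b: ʙ (U+0299), ƀ (U+0180), ɓ (U+0253), Ɓ (U+0181), ƃ (U+0183), Ƃ (U+0182)
c: ƈ (U+0188), Ƈ (U+0187), ɕ (U+0255)
d: đ (U+0111), Đ (U+0110), ɖ (U+0256), Ɖ (U+0189), ɗ (U+0257), Ɗ (U+018A), ƌ (U+018C), Ƌ (U+018B), ð (U+00F0), Ð (U+00D0), ƍ (U+018D)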
etc.

Outliers: the following appear unrelated to the 'base' letter that they are after (in UCA order), so should be left where they are.

Ƣ (U+01A2), ƣ (U+01A3), ɤ (U+0264)
etc.
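
To make the effect of item 1 concrete, here is a minimal sketch of a two-level sort key. The weight tables are made up for illustration and are not DUCET values; the names CURRENT, PROPOSED, and sort_key are hypothetical. It only shows how ordering changes when ɓ stops being a separate primary letter after b and becomes a secondary variant of b.

```python
# Toy two-level collation: each character maps to (primary, secondary).
# All weights are invented for this sketch; they are not real UCA weights.
CURRENT = {
    "a": (1, 0), "b": (3, 0), "ɓ": (4, 0), "c": (5, 0),  # ɓ has its own primary
}
PROPOSED = {
    "a": (1, 0), "b": (3, 0), "ɓ": (3, 1), "c": (5, 0),  # ɓ = b + secondary mark
}

def sort_key(text, table):
    """Compare all primary weights first, then all secondary weights."""
    weights = [table[ch] for ch in text]
    primary = tuple(p for p, _ in weights)
    secondary = tuple(s for _, s in weights)
    return (primary, secondary)

words = ["ɓa", "bc", "ba", "ɓc"]
current_order = sorted(words, key=lambda w: sort_key(w, CURRENT))
proposed_order = sorted(words, key=lambda w: sort_key(w, PROPOSED))
print(current_order)   # ɓ-words sort as a block after all b-words
print(proposed_order)  # ɓ-words interleave with b-words
```

With the toy current table, every word starting with ɓ follows every word starting with b. With the toy proposed table, the following letter decides the order first, and ɓ versus b only breaks ties.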

2. Make "æ" be a secondary difference from "ae".

  1. Pros:
    1. consistency with the handling of "œ"
    2. currently all Latin languages have to tailor this character. Certain Scandinavian languages will tailor it to be a letter above z. All other languages would tailor it to be a secondary (or tertiary) difference from ae, to reflect alternate spellings like Cæsar or hæmoglobin.
    3. better compatibility with the European ordering rules (http://anubis.dkuug.dk/CEN/TC304/EOR/eor4r.pdf)
  2. Cons:
    1. stability
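
As a rough illustration of item 2, the sketch below contrasts two treatments of "æ": as a letter with its own primary weight between a and b, and as an expansion to the primaries of "ae" plus a secondary mark. The weights, the mode names, and the helper functions are all assumptions made for this example; they are not taken from the UCA tables.

```python
def base_primary(ch):
    # Made-up primary weights for a-z: a=2, b=4, ..., z=52.
    # The odd numbers in between are free for extra letters.
    return 2 * (ord(ch) - ord("a")) + 2

def elements(ch, ae_mode):
    """Return the (primary, secondary) collation elements for one character."""
    if ch == "æ":
        if ae_mode == "separate_letter":
            return [(3, 0)]  # own primary, between a (2) and b (4)
        # "secondary_of_ae": same primaries as "ae", marked at the secondary level
        return [(base_primary("a"), 1), (base_primary("e"), 0)]
    return [(base_primary(ch), 0)]

def sort_key(text, ae_mode):
    ces = [ce for ch in text for ce in elements(ch, ae_mode)]
    return (tuple(p for p, _ in ces), tuple(s for _, s in ces))

words = ["caz", "cæsar", "cab", "caesar"]
as_letter = sorted(words, key=lambda w: sort_key(w, "separate_letter"))
as_variant = sorted(words, key=lambda w: sort_key(w, "secondary_of_ae"))
print(as_letter)
print(as_variant)
```

In the second mode the two spellings "caesar" and "cæsar" sort next to each other and differ only at the secondary level, which is the behavior the non-Scandinavian tailorings mentioned above are after.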

For reference, here is an email related to the topic.

> ----- Original Message -----
> From: Åke Persson
> To: Mark Davis
> Sent: Wed, 2003 Dec 31 06:36
> Subject: ae << æ etc.
>
> Mark,
>
> I have browsed the latest ICU collations. Here are a few comments.
>
> The inclusion of ae << æ in several languages resembles my experience when I
> implemented the UCA in Mimer SQL. The next thing that came up was letters with
> stroke. For example, the Polish letter L-stroke, properly used in Polish names,
> did not match a Swedish or English search for names containing L. L-stroke is
> expected to be L with a stroke "accent", except for Polish (and Sorbian).
> <<Lodz.jpg>> is a snapshot from a Swedish encyclopædia (note also "oe"). To make
> a long story short, it all ended up in the European Ordering Rules (EOR)
> concept, where the base letters in the Latin alphabet are only A-Z. The first
> step was to create an EOR-tailoring as the base. Languages with additional
> letters in their alphabet were tailored on top of the EOR tailoring. The next
> step was improvement of space and performance, by making EOR the default, and to
> create a tailoring for the default UCA instead (at least needed for the
> conformance test).
>
> Here's an overview of the tailorings:
> http://developer.mimer.com/collations/charts/tailorings.htm
>
> Please, take a closer look at:
> Catalan, Croatian, Faroese, Icelandic, Latvian, Lithuanian, Romanian, and Slovak
> compared to the corresponding ICU collations.
>
> My sources are documented here:
> http://developer.mimer.com/collations/charts/sources.htm
>
> The E-ogonek (old Sami and Icelandic Ä) as a variant of Ä in Faroese, Finnish,
> Greenlandic, Norwegian, and Swedish looks a bit goofy. I would rather expect a
> search match for E in Polish and Lithuanian names containing E-ogonek. I think
> it's better to have a specific locale for Sami.
>
> [before 1] is used extensively in the ICU collations. It's easier to read the
> collation definitions, if [before 1] is used only when necessary.
>
> Happy New Year!
> Åke Persson
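
The layering described in the email (a base table in which L-stroke is L plus a secondary "stroke", with language tailorings applied on top) can be sketched roughly as follows. This is a toy model, not Mimer's or ICU's implementation: the weights and the names BASE, POLISH, primary_key, and primary_match are hypothetical, and accents are dropped at the primary level simply by skipping combining marks after NFD normalization.

```python
import unicodedata

def letter_primary(ch):
    # Made-up primary weights for a-z: a=2, b=4, ..., leaving odd gaps.
    return 2 * (ord(ch) - ord("a")) + 2

# Base table (EOR-like assumption): ł shares the primary of l, secondary mark 1.
BASE = {"ł": (letter_primary("l"), 1)}
# Hypothetical Polish tailoring: ł becomes its own letter between l and m.
POLISH = {"ł": (letter_primary("l") + 1, 0)}

def primary_key(text, tailoring=None):
    """Primary weights only; the tailoring overrides entries of the base table."""
    table = dict(BASE)
    table.update(tailoring or {})
    key = []
    for ch in unicodedata.normalize("NFD", text.lower()):
        if unicodedata.combining(ch):
            continue  # accents do not count at the primary level
        if ch in table:
            key.append(table[ch][0])
        else:
            key.append(letter_primary(ch))
    return tuple(key)

def primary_match(a, b, tailoring=None):
    return primary_key(a, tailoring) == primary_key(b, tailoring)

print(primary_match("Łódź", "Lodz"))          # base table
print(primary_match("Łódź", "Lodz", POLISH))  # with the Polish override
```

Under the base table a search for "Lodz" finds "Łódź"; with the Polish override layered on top it does not, because ł then carries a different primary weight.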