L2/01-343

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Thursday, September 13, 2001 3:33 PM

Comments on SC22 N3265 = L2/01-282 (European generic locales - Part 2: 
Narrative cultural specifications, POSIX locales, and repertoiremap)

*************************************************************************

Introduction for the benefit of the reader (not part of the US comments)

This document includes:

*  A repertoiremap of European characters that probably matches MES-2
(confirmation pending) and that uses the Danish mnemonics for
characters (e.g., <E<> for E-caron; <A=:> for Cyrillic A-with-diaeresis;
<y*;!> for Greek small eta with dasia and varia) and includes
ISO/IEC 10646/Unicode identifiers (Uxxxx) as comments.

*  A generic _EU locale that includes character classification data
(upper, lower, punct, etc.), a collation order, numeric formatting,
monetary formatting using the euro as the currency symbol, a generic
date/time section that uses numbers for all month and day names rather
than language-specific strings, and generic yes/no responses ("+" for
affirmative, "-" for negative). This file uses the Danish mnemonics
only; no Uxxxx identifiers.

*  A set of 14 country-specific narrative cultural specifications
that describe in words the contents of the accompanying POSIX locales.

*  A set of 14 country-specific POSIX locales. All these locales use
the generic _EU definitions for classification, collation, monetary,
and numeric information with no modifications. The only locale-specific
information is in LC_TIME, which lists language-specific month and
weekday names (but defaults to the generic locale for formatting
rules), and in the yes/no responses.

************************************************************************

The following comments address the repertoire map first, then the
_EU locale, and then the country-specific locales.

**************IN THE REPERTOIRE MAP:

*  The repertoire map says that it is MES-2, but there are multiple
characters in it that are not in the official definition of MES-2
(CWA 13873...MES-2). They are:

U02D6	Modifier Letter Plus Sign
U2113	Script Small L
U212E	Estimated Symbol
U2215	Division Slash
U2501	Box Drawing Heavy Horizontal
U2571	Box Drawing Light Diagonal Upper Right to Lower Left
U2572	Box Drawing Light Diagonal Upper Left to Lower Right
U25A1	White Square
U25AA	Black Small Square
U25AB	White Small Square
U25CF	Black Circle
U25E6	White Bullet
U25E2	Black Lower Right Triangle
U25E3	Black Lower Left Triangle

These should be removed from the repertoire map.

*  The repertoire map should not use the Danish mnemonics. It should
use only the Uxxxx identifiers. This would be consistent with ISO/IEC 14651
and with ISO/IEC 10646.

*  Near the end of the repertoire map, some characters are repeated,
but with different mnemonics. They are:

Character        1st mne.     2nd mne.
NUMBER SIGN      <Nb>         <H->
DOLLAR SIGN      <DO>         <!S>
COMMERCIAL AT    <At>         <@>        (also includes <Oa> as 3rd mne.)
CENT SIGN        <Ct>         <!C>
POUND SIGN       <Pd>         <L->
CURRENCY SIGN    <Cu>         <Xo>
YEN SIGN         <Ye>         <Y->
BROKEN BAR       <BB>         <!B>
SECTION SIGN     <SE>         <So>
NOT SIGN         <NO>         <7!>
PILCROW SIGN     <PI>         <9I>

Each character should be defined only once. Remove the repeated definitions.
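
A check of this kind is easy to automate. The following is a minimal
sketch in Python, not part of the reviewed document. It assumes each
repertoire map entry has the shape "<mnemonic> <Uxxxx> NAME" and that "/"
is the escape character inside mnemonics; the function name and the sample
lines are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed entry shape: "<mnemonic>  <Uxxxx>  CHARACTER NAME", with "/" as the
# escape character inside mnemonics (so "/>" is a literal ">").
ENTRY = re.compile(r"^<((?:/.|[^/>])+)>\s+<U([0-9A-Fa-f]{4,8})>")

def find_repeated_code_points(lines):
    """Return {code point: [mnemonics]} for code points defined more than once."""
    seen = defaultdict(list)
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            seen[m.group(2).upper()].append(m.group(1))
    return {cp: names for cp, names in seen.items() if len(names) > 1}

# Illustrative lines only; the real file layout may differ.
sample = [
    "<Nb> <U0023> NUMBER SIGN",
    "<DO> <U0024> DOLLAR SIGN",
    "<H-> <U0023> NUMBER SIGN",
]
print(find_repeated_code_points(sample))  # {'0023': ['Nb', 'H-']}
```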

*  At the very end of the repertoire map, there is a group of box
drawing characters. Earlier in the map, a larger group of such
characters is defined. At the end, it includes the same subset of
characters in the range U2500..U253C as were defined earlier, but here
adds U2501. It also adds U2571, U2572, U25E2, and U25E3, and then repeats
U266A. As noted previously, some of these characters are not part of the
official definition of MES-2 and so should be removed, but it also is
confusing that part of the box drawing section is repeated. These
characters should only be defined once. Remove the extra definitions.


***************IN THE GENERIC _EU LOCALE:

*  Multiple mnemonics in the locale do not exist in the repertoire map.
Latin letters-with-circumflex have names like <A//> in the locale, but
in the repertoire map, the naming convention is <A/>>. This error
exists in all letter-related classes and within the collation
definition. Thus:

In locale               Should be
<A//>                   <A/>> 
<E//>                   <E/>> 
<I//>                   <I/>> 
<U//>                   <U/>> 
<C//>                   <C/>> 
<u//>                   <u/>> 
<c//>                   <c/>> 
etc., etc.

Not all mnemonics of the form <*//> are wrong. This is the naming
convention for letters-with-stroke. Thus, a name like <O//> is
correct for the Scandinavian Ø (O-stroke). However, the mnemonic <O//>
appears twice in the upper class; first (incorrectly) in attempting
to identify Ô (O-circumflex); second (correctly) meaning Ø (O-stroke).

All incorrect mnemonics for letters-with-circumflex in the locale
must be fixed. Of course, as noted earlier, the best solution is to
use the Uxxxx names to improve consistency with ISO/IEC 14651 and
ISO/IEC 10646 rather than these extremely error-prone mnemonics.
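
To see why <A//> and <A/>> are different names, it helps to split the
locale text into mnemonics with the escape character applied. The Python
sketch below assumes "/" is the escape character declared in the locale
source, so "//" stands for a literal "/" and "/>" for a literal ">". The
function name and the two-entry repertoire set are hypothetical.

```python
def mnemonics(text, escape="/"):
    """Split locale source text into mnemonic names, honoring the escape char."""
    names, i = [], 0
    while i < len(text):
        if text[i] != "<":
            i += 1
            continue
        i += 1                  # step past the opening "<"
        name = []
        while i < len(text) and text[i] != ">":
            if text[i] == escape and i + 1 < len(text):
                i += 1          # skip the escape, keep the next char literally
            name.append(text[i])
            i += 1
        i += 1                  # step past the closing ">"
        names.append("".join(name))
    return names

print(mnemonics("<A//><A/>>"))  # ['A/', 'A>']

# Hypothetical excerpt of the repertoire map: A-circumflex and O-stroke.
repertoire = {"A>", "O/"}
upper_class = "<A//>;<O//>"
unknown = [n for n in mnemonics(upper_class) if n not in repertoire]
print(unknown)  # ['A/']
```

Run over a whole locale source, a check like this would list every
mnemonic that has no entry in the repertoire map.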

*  There also are errors with the Greek mnemonics not matching the names
in the repertoire map. This includes any name that starts with <A*/ or <W*/
or <Y*/ (e.g., <A*/;!J> or <W*/;J> or <Y*/;?J>). These probably should not
have the slash in them; the probably matching names in the repertoire map are
<A*;!J> or <W*;J> or <Y*;?J>. A better solution is to use the Uxxxx names
rather than the error-prone mnemonics.

*  The LC_COLLATE section defines collating symbols <a8>..<z8> for use in
defining the last character in a group of Latin letters. However, it also
uses <th8>, but does not define it as one of the collating symbols.

*  What authorities provided the Greek collation order?

*  What authorities provided the Cyrillic collation order?

*  ISO/IEC 14651 lists control characters and ASCII/Latin-1 punctuation
first in the common template. The generic _EU locale lists them after all
the Latin, Greek, and Cyrillic characters. Although they will sort to the
same location, since in both documents they are ignored on the first three
passes, it would be clearer to duplicate the 14651 order within the source
file.

*  There is an error in the LC_TIME list for the last month abbreviation:

   abmon "<0><1>";"<0><2>";"<0><3>";"<0><4>";/
	    "<0><5>";"<0><6>";"<0><7>";"<0><8>";/
	    "<0><9>";"<1><0>";"<1><1>";"<D><2>"

That should be <1><2>, not <D><2>.
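
Because the generic locale uses month numbers as abbreviations, this kind
of slip can be caught mechanically. The Python sketch below assumes that
every abmon entry should be the two-digit month number spelled as
single-character mnemonics; the function name is made up for illustration.

```python
import re

def check_numeric_abmon(abmon_source):
    """Return (month, found) pairs where the entry is not the two-digit month number."""
    entries = re.findall(r'"([^"]*)"', abmon_source)
    problems = []
    for month, entry in enumerate(entries, start=1):
        # Join the single-character mnemonics, e.g. "<0><1>" becomes "01".
        found = "".join(re.findall(r"<(.)>", entry))
        if found != f"{month:02d}":
            problems.append((month, found))
    return problems

abmon = '''"<0><1>";"<0><2>";"<0><3>";"<0><4>";/
    "<0><5>";"<0><6>";"<0><7>";"<0><8>";/
    "<0><9>";"<1><0>";"<1><1>";"<D><2>"'''
print(check_numeric_abmon(abmon))  # [(12, 'D2')]
```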

*  The Danish mnemonics in the LC_MESSAGES section are particularly
obscure:

   LC_MESSAGES
   yesexpr "<<(><+><)/>><.><*>"
   noexpr "<<(><-><)/>><.><*>"
   END LC_MESSAGES

It would be more helpful for the people trying to read and understand
this source file if the Uxxxx identifiers were used and a comment
explaining the meaning was added.
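
For illustration, the two expressions can be decoded with a few lines of
Python. This is a sketch under two assumptions that should be verified
against the repertoire map: that "/" is the escape character, and that
the mnemonics "<(" and ")>" stand for the left and right square brackets
(as in the RFC 1345 mnemonic set). The table and function names are
hypothetical.

```python
import re

# Assumed mnemonic-to-character table; verify against the repertoire map.
MNEMONIC_TO_CHAR = {"<(": "[", ")>": "]", "+": "+", "-": "-", ".": ".", "*": "*"}

# A name is a run of escaped pairs ("/" plus any character) or ordinary characters.
NAME = re.compile(r"<((?:/.|[^/>])+)>")

def decode(expr):
    """Turn a string of mnemonics into the characters they are assumed to name."""
    names = [re.sub(r"/(.)", r"\1", n) for n in NAME.findall(expr)]
    return "".join(MNEMONIC_TO_CHAR[n] for n in names)

print(decode("<<(><+><)/>><.><*>"))  # [+].*
print(decode("<<(><-><)/>><.><*>"))  # [-].*
```

Under those assumptions, yesexpr and noexpr are the regular expressions
"[+].*" and "[-].*", which is the sort of meaning a source comment
could state directly.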

*****************For the country-specific locales:

* All country-specific locales use the base collation definition defined
in the _EU locale's LC_COLLATE section. The need for a pan-European
collation definition is recognized, and there are no objections to the
way it has been defined in the _EU locale. However, it seems quite
inappropriate to use the pan-European collation in all of the
country-specific locales without tailoring.

The LC_COLLATE section collates letters-with-diacritics with the base
characters. Thus, letters like å, ø, ä (a-ring, o-stroke, a-diaeresis)
and others sort with the a's or o's. What Danish user would think it
correct to sort æ, ø, and å (ae, o-stroke, and a-ring) with the a's,
o's, and a's, respectively, rather than after z, as is the case with Danish?
Or how is it useful for Finnish to sort å, ä, and ö (a-ring,
a-diaeresis, o-diaeresis) with the a's and o's, rather than after z? Also,
there are no collating elements for the Spanish ch and ll. What Spanish
speakers would agree the default collation is correct? 

Perhaps the argument is that this generic locale is for pan-European support,
and each country is giving up a bit of its specific requirements for 
consistency across Europe. But if that is the case, why are there still
language-specific names for months and days in the country-specific locales?
For example, what Swedish user will *want* to see Swedish words in a date,
but want to use non-Swedish collation rules for å, ä, and ö? Either locales are
generic or they aren't. These are a combination of both and will probably
cause the most confusion for users.

Country-specific locales should be changed to include appropriate tailoring
for collation to match language-specific expectations. 
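
The difference is easy to demonstrate. The Python sketch below is a rough
illustration only: it imitates first-level comparison by folding letters
to their base characters (an approximation of the generic behavior
described above, not the _EU LC_COLLATE definition itself) and contrasts
it with a simplified Danish ordering that places æ, ø, and å after z.
All names and the word list are made up for illustration.

```python
# Rough stand-in for the generic order: fold letters to their base characters.
FOLD = str.maketrans({"å": "a", "ø": "o", "æ": "ae"})

def generic_key(word):
    return word.lower().translate(FOLD)

# Simplified Danish tailoring: æ, ø, å follow z (ignores "aa" and other rules).
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

def danish_key(word):
    # Characters outside the alphabet sort after all known letters.
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

words = ["ørn", "zebra", "abe", "ål", "æble"]
print(sorted(words, key=generic_key))  # ['abe', 'æble', 'ål', 'ørn', 'zebra']
print(sorted(words, key=danish_key))   # ['abe', 'zebra', 'æble', 'ørn', 'ål']
```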


End of comments