L2/01-343

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Thursday, September 13, 2001 3:33 PM

Comments on SC22 N3265 = L2/01-282
(European generic locales - Part 2: Narrative cultural specifications, POSIX locales, and repertoiremap)

*************************************************************************
Introduction for the benefit of the reader (not part of the US comments)

This document includes:

* A repertoiremap of European characters that probably matches MES-2 (confirmation pending), that uses the Danish mnemonics for characters (e.g., for E-caron; for Cyrillic A-with-diaeresis; for Greek small eta with dasia and varia), and that includes ISO/IEC 10646/Unicode identifiers (Uxxxx) as comments.

* A generic _EU locale that includes character classification data (upper, lower, punct, etc.), a collation order, numeric formatting, monetary formatting using the euro as the currency symbol, a generic date/time section that uses numbers for all month and day names rather than language-specific strings, and generic yes/no responses ("+" for affirmative, "-" for negative). This file uses the Danish mnemonics only; no Uxxxx identifiers.

* A set of 14 country-specific narrative cultural specifications that describe in words the contents of the accompanying POSIX locales.

* A set of 14 country-specific POSIX locales. All these locales use the generic _EU definitions for classification, collation, monetary, and numeric information with no modifications. The only locale-specific information is in LC_TIME, which lists language-specific month and weekday names (but defaults to the generic locale for formatting rules), and in the yes/no responses.

************************************************************************

The following comments address the repertoire map first, then the _EU locale, and then the country-specific locales.

**************IN THE REPERTOIRE MAP:

* The repertoire map says that it is MES-2, but there are multiple characters in it that are not in the official definition of MES-2 (CWA 13873...MES-2). They are:

  U02D6  Modifier Letter Plus Sign
  U2113  Script Small L
  U212E  Estimated Symbol
  U2215  Division Slash
  U2501  Box Drawing Heavy Horizontal
  U2571  Box Drawing Light Diagonal Upper Right to Lower Left
  U2572  Box Drawing Light Diagonal Upper Left to Lower Right
  U25A1  White Square
  U25AA  Black Small Square
  U25AB  White Small Square
  U25CF  Black Circle
  U25E6  White Bullet
  U25E2  Black Lower Right Triangle
  U25E3  Black Lower Left Triangle

  These should be removed from the repertoire map.

* The repertoire map should not use the Danish mnemonics. It should use only the Uxxxx identifiers. This would be consistent with ISO/IEC 14651 and with ISO/IEC 10646.

* Near the end of the repertoire map, some characters are repeated, but with different mnemonics. They are:

  Character 1st mne. 2nd mne.
  NUMBER SIGN
  DOLLAR SIGN
  COMMERCIAL AT <@> (also includes as 3rd mne.)
  CENT SIGN
  POUND SIGN
  CURRENCY SIGN
  YEN SIGN
  BROKEN BAR
  SECTION SIGN
  NOT SIGN <7!>
  PILCROW SIGN <9I>

  These should not be repeated. Remove them.

* At the very end of the repertoire map, there is a group of box drawing characters. Earlier in the map, a larger group of such characters is defined. At the end, it includes the same subset of characters in the range U2500..U253C as were defined earlier, but here adds U2501. It also adds U2571, U2572, U25E2, and U25E3, and then repeats U266A. As noted previously, some of these characters are not part of the official definition of MES-2 and so should be removed, but it is also confusing that part of the box drawing section is repeated. These characters should be defined only once. Remove the extra definitions.

***************IN THE GENERIC _EU LOCALE:

* Multiple mnemonics in the locale do not exist in the repertoire map.
  Latin letters-with-circumflex have names like in the locale, but in the repertoire map, the naming convention is >. This error exists in all letter-related classes and within the collation definition. Thus:

    In locale Should be > > > > > > > etc., etc.

  Not all mnemonics of the form <*//> are wrong. This is the naming convention for letters-with-stroke. Thus, a name like is correct for the Scandinavian Ø (O-stroke). However, the mnemonic appears twice in the upper class: first (incorrectly) in attempting to identify Ô (O-circumflex); second (correctly) meaning Ø (O-stroke).

  All incorrect mnemonics for letters-with-circumflex in the locale must be fixed. Of course, as noted earlier, the best solution is to use the Uxxxx names, which improve consistency with ISO/IEC 14651 and ISO/IEC 10646, rather than these extremely error-prone mnemonics.

* There are also Greek mnemonics in the locale that do not match the names in the repertoire map. This includes any name that starts with or or ). These probably should not have the slash in them; the probably-matching names in the repertoire map are or or . A better solution is to use the Uxxxx names rather than the error-prone mnemonics.

* The LC_COLLATE section defines collating symbols .. for use in defining the last character in a group of Latin letters. However, it also uses , but does not define it as one of the collating symbols.

* What authorities provided the Greek collation order?

* What authorities provided the Cyrillic collation order?

* ISO/IEC 14651 lists control characters and ASCII/Latin-1 punctuation first in the common template. The generic _EU locale lists them after all the Latin, Greek, and Cyrillic characters. Although they will sort to the same location, since in both documents they are ignored on the first three passes, it would be clearer to duplicate the 14651 order within the source file.
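
  The mnemonic mismatches noted in this section are the kind of error a short script could catch before a locale is circulated. The following is only a minimal sketch in Python, not part of N3265: it assumes that symbolic names are written <name>, that "/" is the escape character (so "/>" is a literal ">" and "//" a literal "/" inside a name), and that each repertoire map line starts with the mnemonic it defines. The function names and the sample data are hypothetical and chosen purely for illustration.

```python
import re

# A symbolic name is <...>. "/" is ASSUMED to be the escape character,
# so "/>" is a literal ">" and "//" a literal "/" inside a name.
MNEMONIC = re.compile(r"<((?:/.|[^>/])+)>")


def mnemonics(text):
    """Return the unescaped symbolic names found in text, in order."""
    return [re.sub(r"/(.)", r"\1", raw) for raw in MNEMONIC.findall(text)]


def undefined_mnemonics(locale_text, repertoire_text):
    """Return names used in the locale but never defined in the map.

    ASSUMPTION: the first <...> token on a repertoire map line is the
    mnemonic that the line defines.
    """
    defined = set()
    for line in repertoire_text.splitlines():
        names = mnemonics(line)
        if names:
            defined.add(names[0])
    return sorted(set(mnemonics(locale_text)) - defined)


# Hypothetical sample data, not copied from the actual files.
REPERTOIRE_SAMPLE = """\
<A>    <U0041> LATIN CAPITAL LETTER A
<O//>  <U00D8> LATIN CAPITAL LETTER O WITH STROKE
<O/>>  <U00D4> LATIN CAPITAL LETTER O WITH CIRCUMFLEX
"""
LOCALE_SAMPLE = "upper <A>;<O//>;<E//>"

if __name__ == "__main__":
    # Lists every name the locale uses that the map does not define.
    print(undefined_mnemonics(LOCALE_SAMPLE, REPERTOIRE_SAMPLE))
```

  A check of this kind reports only names that are absent from the map. It would not flag a name that exists in the map but denotes the wrong character, such as a letter-with-stroke name used where a letter-with-circumflex was intended; those cases still need review by eye.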

* There is an error in the LC_TIME list for the last month abbreviation:

  abmon "<0><1>";"<0><2>";"<0><3>";"<0><4>";/
        "<0><5>";"<0><6>";"<0><7>";"<0><8>";/
        "<0><9>";"<1><0>";"<1><1>";"<2>"

  That should be <1><2>, not <2>.

* The Danish mnemonics in the LC_MESSAGES section are particularly obscure:

  LC_MESSAGES
  yesexpr "<<(><+><)/>><.><*>"
  noexpr  "<<(><-><)/>><.><*>"
  END LC_MESSAGES

  It would be more helpful to people trying to read and understand this source file if the Uxxxx identifiers were used and a comment explaining the meaning were added.

*****************For the country-specific locales:

* All country-specific locales use the base collation definition defined in the _EU locale's LC_COLLATE section. The need for a pan-European collation definition is recognized, and there are no objections to the way it has been defined in the _EU locale. However, it seems quite inappropriate to use the pan-European collation in all of the country-specific locales without tailoring.

  The LC_COLLATE section collates letters-with-diacritics with the base characters. Thus, letters like å, ø, ä (a-ring, o-stroke, a-diaeresis) and others sort with the a's or o's. What Danish user would think it correct to sort æ, ø, and å (ae, o-stroke, and a-ring) with the a's, o's, and a's, respectively, rather than after z, as is the case with Danish? Or how is it useful for Finnish to sort å, ä, and ö (a-ring, a-diaeresis, o-diaeresis) with the a's and o's, rather than after z? Also, there are no collating elements for the Spanish ch and ll. What Spanish speakers would agree the default collation is correct?

  Perhaps the argument is that this generic locale is for pan-European support, and each country is giving up a bit of its specific requirements for consistency across Europe. But if that is the case, why are there still language-specific names for months and days in the country-specific locales?
  For example, what Swedish user will *want* to see Swedish words in a date, but want to use non-Swedish rules for å, ä, and ö? Either locales are generic or they aren't. These are a combination of both and will probably cause the most confusion for users.

  Country-specific locales should be changed to include appropriate tailoring for collation to match language-specific expectations.

End of comments