Re: LC_CTYPE locale category and character sets.

Date: Tue Jul 21 1998 - 15:01:38 EDT

I've been out of the office, so I'm late to this discussion
and see that by now it has wandered almost completely off
the original subject. I have learned a lot about quotation
mark usage, German spelling changes (or not!), and the
way certain words translate. Spellbinding as all this
trivia is, I'd like to go back to some of the original questions.

Christophe Pierret asked and Ken Whistler answered:

> Here are some questions regarding character properties and cultural
> preferences:
> * Does the character properties defined in a LC_CTYPE posix locale
> category
> depends only on the character set of the locale ?
   This is one of the issues driving the critique of the proposed
   ISO standard 14652, which attempts to expand POSIX locale constructs
   to cover 10646/Unicode.
   From the point of view of the universal character set (UCS), i.e.
   the Unicode Standard, character properties are properties of the
   characters. They are not locale-specific, but universal.
> * Is it meaningful to consider that a unicode (considered as a character
> set) LC_CTYPE
> locale category doesn't change with the cultural preferences ?
   Case-mappings between characters have a few well-known, culturally-specific
   preferences that must be accounted for. But case-mappings are *relations*
   between pairs (or triplets) of characters, and not character properties
   per se. The character properties themselves should be invariant, defined
   on the universal character set.
   Then against the background of that set of invariant character properties,
   engineers can do a better job of adjusting the kinds of behavior in
   software which *should* be culturally-specific and vary by locale.
> I can't imagine that LATIN CAPITAL LETTER A is not uppercase anymore !
   Nor can I. This is one of the reasons why it is meaningless to define
   an isupper class in an LC_CTYPE definition.
   LC_CTYPE was, in my opinion, basically a kludge to get around the fact
   that different (non-universal) character sets contained different
   repertoires, differently encoded. The use of LC_CTYPE enabled those
   differences to be encapsulated in the equivalent of locale-specific
   resource files in such a way that it basically allowed the API level
   isupper(), etc., to work in a locale- and character-set-independent manner.
   But such considerations are obsolete for Unicode-based implementations.
> But are there any known example of a LC_CTYPE character property
> (isalpha, isupper, tolower, isdigit, isxdigit ...)
> which changes or should change from one culture to another ?
   None of them should.

Ken answered quite clearly from the Unicode point of view, but
there is a large body of existing practice that implements things
a little differently. Christophe originally asked whether the
character properties in a locale depend on the character set
and Ken rightly answered that from Unicode's point of view,
such properties are universal rather than character set-specific.
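
To make the Unicode side of this concrete, here is a minimal sketch in
Python (chosen only for illustration; nothing in the thread depends on it).
It uses the standard unicodedata module, which reads the general category
from the Unicode Character Database without consulting any locale. The
helper name general_category is made up for this example.

```python
import unicodedata

def general_category(ch):
    # Looked up in the Unicode Character Database; no locale is consulted.
    return unicodedata.category(ch)

A_BREVE = "\u0103"   # LATIN SMALL LETTER A WITH BREVE
A_ACUTE = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE

# "Lu" = uppercase letter, "Ll" = lowercase letter, "Nd" = decimal digit.
for ch in ("A", A_BREVE, A_ACUTE, "7"):
    print(unicodedata.name(ch), general_category(ch))
```

The answers are the same whatever locale the process happens to be
running in, which is the sense in which the properties are "universal."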

But POSIX and Unicode are two different things.

In POSIX syntax, you combine a character map (charmap) with a
locale definition (localedef) to create a locale object. In most
practice, the characters that the localedef includes are also the
characters in the charmap, so if the charmap is for something like
ISO 8859-1 (Latin-1), the localedef will only refer to those
characters in the Latin-1 repertoire. Thus, while in Unicode
an <a-breve> is always a letter, <a-breve> is not part of the
Latin-1 repertoire, and so it typically does not appear in a
localedef that will be built with a Latin-1 charmap. It CAN
appear -- having a character in a localedef that does not
appear in the charmap typically causes a warning rather than an
error -- but most localedefs I've seen are pretty tightly coupled
to the charmap with which they will be built.
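
The coupling described above can be modeled roughly as follows. This is a
toy sketch in Python, not how any real localedef utility is implemented:
build_locale, latin1_charmap, and the dictionary layout are all
hypothetical, and the choice to drop an unmapped character after warning
is an assumption made for the example.

```python
def latin1_charmap():
    # Repertoire of Latin-1 as Python's "latin-1" codec defines it:
    # each of the 256 byte values maps to exactly one character.
    return set(bytes(range(256)).decode("latin-1"))

def build_locale(charmap, localedef):
    """Combine a charmap with a localedef (class name -> characters).

    Characters listed in the localedef but absent from the charmap
    produce a warning message instead of an error, and are left out
    of the resulting class (an assumption of this sketch).
    """
    locale_obj, messages = {}, []
    for cls, chars in localedef.items():
        missing = set(chars) - charmap
        if missing:
            names = ", ".join(sorted("U+%04X" % ord(c) for c in missing))
            messages.append("warning: <%s> lists unmapped %s" % (cls, names))
        locale_obj[cls] = set(chars) & charmap
    return locale_obj, messages

# Hypothetical localedef fragment listing both <a-acute> and <a-breve>.
localedef = {"alpha": "abc\u00e1\u0103", "digit": "0123456789"}
loc, msgs = build_locale(latin1_charmap(), localedef)

print("\u00e1" in loc["alpha"])   # <a-acute> is in the Latin-1 repertoire
print("\u0103" in loc["alpha"])   # <a-breve> is not
print(msgs)
```

Building the same localedef against a larger charmap would keep <a-breve>
in the alpha class, which is why the resulting locale depends on both
inputs.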

Ken noted that the POSIX syntax allowed for the varying repertoires
of other character sets, but that with Unicode, these considerations
are obsolete. True, but there are many companies that support both
Unicode and other encodings. Regardless of whether these companies
prefer the Unicode model or something more variable, they must
support both.

Christophe said he couldn't imagine that LATIN CAPITAL LETTER A
wouldn't be uppercase in some locales. As it turns out, neither
can POSIX. It requires that a small group of characters (think
ASCII, though the standards police will yell at me) be present
in *all* locales and that they have the same
properties/classifications in each. LATIN CAPITAL LETTER A is
one that must appear in all locales and it is always an
uppercase letter. Characters beyond ASCII, however, may or
may not appear in a given locale, and it is theoretically
possible that <a-acute> could appear in the <punct> class in
one locale. I'm not aware of this happening in actual practice.
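
As a rough illustration of that requirement, the check below verifies that
a locale's class tables keep the ASCII letters and digits where POSIX
expects them. It is only a sketch: ASCII stands in for the real portable
character set, only three classes are checked, and the function name
violations and the table layout are invented for the example.

```python
import string

def violations(locale_table):
    """List (class, character) pairs where a required ASCII character
    is missing from the class it must belong to."""
    required = {
        "upper": string.ascii_uppercase,
        "lower": string.ascii_lowercase,
        "digit": string.digits,
    }
    bad = []
    for cls, chars in required.items():
        members = locale_table.get(cls, set())
        bad.extend((cls, ch) for ch in chars if ch not in members)
    return bad

# A conforming table, and a deliberately broken one that drops "A".
good = {
    "upper": set(string.ascii_uppercase),
    "lower": set(string.ascii_lowercase),
    "digit": set(string.digits),
}
broken = dict(good, upper=set(string.ascii_uppercase) - {"A"})

print(violations(good))
print(violations(broken))
```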

Christophe also asked whether there is a property that changes
from one culture to another -- i.e., a character that is <alpha>
in one culture and <digit> in another. I can't think of any
characters like that. Rather, what tends to happen in POSIX
locales is that one locale will include a character in a
class, while another locale won't. So <a-acute> might be in
the <alpha> class in my "foo" locale, but not in my "bar" locale.
But <a-acute> won't be <alpha> in "foo" and <digit> in "bar."
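
The distinction between "absent from a class" and "in a different class"
can be sketched like this. The "foo" and "bar" tables are made up to match
the example above, and classes_of and conflicting are hypothetical helper
names, not part of any POSIX interface.

```python
A_ACUTE = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE

# Made-up class tables for the two example locales.
LOCALES = {
    "foo": {"alpha": set("abcABC") | {A_ACUTE}, "digit": set("0123456789")},
    "bar": {"alpha": set("abcABC"), "digit": set("0123456789")},
}

def classes_of(locale_name, ch):
    """Return the sorted names of the classes containing ch in a locale."""
    table = LOCALES[locale_name]
    return sorted(cls for cls, members in table.items() if ch in members)

def conflicting(ch):
    """True only if two locales both classify ch, but differently."""
    seen = {tuple(classes_of(name, ch)) for name in LOCALES}
    seen.discard(())   # being absent from a locale is not a conflict
    return len(seen) > 1

print(classes_of("foo", A_ACUTE))
print(classes_of("bar", A_ACUTE))
print(conflicting(A_ACUTE))
```

Here <a-acute> is alpha in "foo" and simply unclassified in "bar", so the
two locales differ in coverage without contradicting each other.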

                -- Sandra
Sandra Martin O'Donnell
