Re: LC_CTYPE locale category and character sets.

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 16 1998 - 15:34:28 EDT


Keld notes:

>
> Christophe PIERRET writes:
>
> > Here are some questions regarding character properties and cultural
> > preferences:
> >
> > * Does the character properties defined in a LC_CTYPE posix locale
> > category depends only on the character set of the locale ?
>
> In principle not, in practice possibly. It is advocated that
> all character properties stay the same across character sets
> and language/country/culture.

And this is as it should be.

> But in a culture there may be
> specific recommendations on what is considered eg. a letter, a digit,
> or a punctuation mark. In some cultures eg devanagari digits
> are recognised as digits, while in others these may just be
> considered some kind of strange special character.

Ignorance in one culture of cultural practices (and character
usage) in another is not the basis for assignment of character
properties in the Universal Character Set. Devanagari digits
are exactly that: Devanagari digits. If somebody, somewhere, turns
up some actual written practice where characters that look like
Devanagari digits are actually being used, for example, as letters
of an alphabet instead of as digits, then that is a reason for encoding
separate characters, not for changing properties in a LC_CTYPE definition.
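
A minimal sketch of the point above, assuming Python's standard
unicodedata module as a stand-in for the UCS property tables (it is
not a POSIX LC_CTYPE implementation). The helper name
digit_properties is hypothetical. It only shows that the digit
property comes from the character itself, with no locale consulted:

```python
import unicodedata

# DEVANAGARI DIGIT ZERO..NINE occupy U+0966..U+096F.
devanagari_digits = [chr(cp) for cp in range(0x0966, 0x0970)]

def digit_properties(ch):
    """Return (general category, decimal value) from the Unicode database."""
    return unicodedata.category(ch), unicodedata.decimal(ch)

# No locale is involved in this lookup: the properties belong to the character.
props = [digit_properties(ch) for ch in devanagari_digits]

for ch, (cat, val) in zip(devanagari_digits, props):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: category={cat} value={val}")
```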

By the way, just such a case is under consideration in WG2 right
now--where the Thaana script contains many letters whose shapes are
directly based on the series of Arabic digits.

> Also for
> punctuation marks, eg quotation marks vary widely from culture
> to culture.

Once again, variation in usage of quotation marks from one typographical
tradition to another does not change the underlying properties of
the characters which are encoded in the UCS as quotation marks.

See <http://www.unicode.org/unicode/uni2errate/QuoteErrata.html>
for the latest statement from UTC about the properties of quotation
marks, as well as information about culturally-specific usage that
will be added to the text of the Unicode Standard in the future.
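
As a rough illustration, again assuming Python's unicodedata module:
the general category of each quotation mark is a fixed property of the
character, however a given typographical tradition deploys it. The
values are whatever the Unicode database bundled with the interpreter
reports, which may differ from the 1998 data, and the name
quote_categories is hypothetical:

```python
import unicodedata

# A few characters encoded as quotation marks.
quotes = ["\u0022", "\u00AB", "\u00BB", "\u201C", "\u201D", "\u201E"]

# Look up the general category of each mark; no locale is consulted.
quote_categories = {ch: unicodedata.category(ch) for ch in quotes}

for ch, cat in quote_categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

German typography, for example, uses U+201C as a closing mark, yet the
lookup above still reports the same category for it.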

>
> > I can't imagine that LATIN CAPITAL LETTER A is not uppercase anymore !
> >
> > But are there any known example of a LC_CTYPE character property
> > (isalpha, isupper, tolower, isdigit, isxdigit ...)
> > which changes or should change from one culture to another ?
>
> isupper/islower for Turkish is a prime example.

I think Keld meant toupper/tolower for Turkish. The difference
is not in the case status of the characters, but in the mapping
from lowercase to uppercase and vice versa.
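
The distinction can be sketched as follows. The tables and the helpers
upper_tr and lower_tr are hypothetical names, not any library's API,
and they cover only the four i-like letters. The case status of each
character is the same everywhere; only the mapping is overridden for
Turkish:

```python
# Turkish-specific overrides applied on top of the default case mapping.
TR_UPPER = {"i": "\u0130", "\u0131": "I"}  # i -> I WITH DOT ABOVE, dotless i -> I
TR_LOWER = {"I": "\u0131", "\u0130": "i"}  # I -> dotless i, I WITH DOT ABOVE -> i

def upper_tr(s):
    """Uppercase s using Turkish rules for the i-like letters."""
    return "".join(TR_UPPER.get(c, c.upper()) for c in s)

def lower_tr(s):
    """Lowercase s using Turkish rules for the i-like letters."""
    return "".join(TR_LOWER.get(c, c.lower()) for c in s)

print(ascii("istanbul".upper()))    # default mapping
print(ascii(upper_tr("istanbul")))  # Turkish mapping
```

Under both mappings "i" and dotless i are lowercase, and "I" and
I WITH DOT ABOVE are uppercase; only the pairing between them differs.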

> Uppercase of initial "ij" in Dutch (becomes both uppercase)
> is another.

This is another good example of language-specific case-mapping.
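
A minimal sketch of the Dutch case, using a hypothetical helper
capitalize_nl that treats an initial "ij" as a unit. It assumes every
word-initial "ij" is the digraph, which a real implementation would
need to check:

```python
def capitalize_nl(word):
    """Capitalize the first letter of word, uppercasing an initial 'ij' as a pair."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]

print("ijsselmeer".capitalize())    # default, language-neutral mapping
print(capitalize_nl("ijsselmeer"))  # Dutch-specific mapping
```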

--Ken

>
> Keld Simonsen
>


