Re: UTF-8 and POSIX locales

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Fri Jun 25 1999 - 05:41:21 EDT


"Sandra O'donnell USG" wrote on 1999-06-24 13:35 UTC:
> Your subject line says you're talking about POSIX, but most of
> the examples you list here are not in POSIX locales. POSIX has
> nothing for measurement units, paper sizes, or terminology like
> "ZIP Code".

The strings in the locale environment variables can also be evaluated by
software to make educated guesses of what the appropriate cultural
conventions are for issues not covered by POSIX locales. For example, if
you have a word processor, it is perfectly reasonable to make the
paper size by default ISO A4 (used by all except two countries on this
planet), and if LANG contains the string "_US" or "_CA" then make the
default US-Letter instead. In both cases, this should affect only the
default value that is used as long as the user hasn't specifically
selected a paper size. Such behaviour (evaluating LANG or LC_CTYPE to
guess what appropriate default settings might be) is not uncommon, at
least for European POSIX applications. It has been my sad experience
over the years that US manufacturers have a tendency to believe that
US conventions are sufficiently good default values for everyone in
the world, which is why in spite of all i18n, we still have to manually
set Letter->A4, inches->mm, etc.
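
The guessing described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: the names
guess_default_paper, choose_paper and LETTER_TERRITORIES are hypothetical,
the Letter set contains just the "_US" and "_CA" cases mentioned above, and
the locale name is assumed to have the common language_TERRITORY.codeset
shape. LC_ALL is checked first because it overrides the other variables.

```python
import os
import re

# Hypothetical set, taken only from the "_US" / "_CA" example in the text.
LETTER_TERRITORIES = {"US", "CA"}


def guess_default_paper(environ=None):
    """Guess a default paper size from the locale environment variables."""
    env = os.environ if environ is None else environ
    # LC_ALL overrides LC_CTYPE, which overrides LANG; skip unset/empty ones.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            break
    else:
        return "A4"  # nothing set: fall back to the international default
    # Assumed shape: language_TERRITORY, optionally followed by .codeset
    # and/or @modifier. "C" and "POSIX" do not match and fall through to A4.
    match = re.match(r"^[A-Za-z]+_([A-Za-z]{2})(?:[.@].*)?$", value)
    if match and match.group(1).upper() in LETTER_TERRITORIES:
        return "Letter"
    return "A4"


def choose_paper(user_choice=None, environ=None):
    """An explicit user selection always wins; the guess is only a default."""
    if user_choice is not None:
        return user_choice
    return guess_default_paper(environ)
```

With LANG=de_DE.UTF-8 this yields "A4", with LANG=en_US.UTF-8 it yields
"Letter", and choose_paper("A5", ...) returns "A5" regardless of the locale,
so the guess never overrides what the user has selected.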

Some of the examples that I used (paper size) are actually being
included into POSIX via ISO 14652, but that was not primarily what I was
talking about.

> BTW, you
> describing the US conventions as "strange." Do you think it's
> appropriate for others to describe German or European conventions
> as "strange?" The reality of i18n is that it allows us to support
> different cultural and linguistic conventions. Not the "strange"
> ones and the "right" ones.

Oh yes:

#define FLAME_MODE on

If software insists on labelling the hours of the day by default with

12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
\__________________A________________/ \_________________P_________________/

just in order to avoid anyone getting the impression that there is an
unlucky 13th hour of the day (see my recent essay here on the set of
supernatural numbers used by US hotels and airlines), then I will
continue to call this a really strange convention compared to simply
counting them from 0 to 23. Similarly the date format in month-day-year
order, which is neither big-endian nor little-endian. Or did you
recently convert gallons into cubic inches? It is actually surprising
that the US don't use Roman numerals that much any more, a frightening
spread of modern conventions I'd say. Not only Europe got rid of all
this medieval stuff over the last century, and I encourage US software
developers to encourage their users to do the same. US conventions are
not related to any specific local culture, they are just medieval and
antique European practice that the cultures who came up with them
originally have long ago abandoned.

I believe that I prefer European conventions not just because I grew up
there, but because European conventions evolved in an environment of
intensive inter-cultural trade and communication. Europeans have
developed a tendency to quickly replace their old conventions with new
more practical and more efficient ones, as soon as they spot
international incompatibilities. In the US on the other hand, for not
entirely clear reasons, the oldest and most bizarre conventions around
seem to be preferred whenever there are several alternatives to choose
from. In some issues (e.g., numeric date notation) I am perfectly happy
to advocate the much more logical Asian big-endian tradition (now ISO
8601) over the old European dd.mm.yyyy for instance. I am also happy to
advocate the use of the English decimal dot over the Continental decimal
comma, because the comma is already frequently used to separate the
items in a list in sentences. I do prefer certain conventions, not
because they are the ones I grew up with, but because there are
objective criteria that make some more practical and useful than
others.
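
One such criterion can be checked mechanically: with the big-endian ISO
8601 notation, plain string sorting agrees with chronological order, which
is not true of dd.mm.yyyy or of the month-day-year order. A small sketch
(the sample dates and the helper names iso, european, us and
sorts_chronologically are made up for illustration):

```python
from datetime import date

# Arbitrary sample dates, chosen so that year, month and day all vary.
dates = [date(1999, 6, 25), date(1998, 12, 31), date(1999, 1, 2)]


def iso(d):
    return d.isoformat()            # yyyy-mm-dd (ISO 8601, big-endian)


def european(d):
    return d.strftime("%d.%m.%Y")   # dd.mm.yyyy (little-endian)


def us(d):
    return d.strftime("%m/%d/%Y")   # mm/dd/yyyy (neither)


def sorts_chronologically(fmt):
    """True if sorting the formatted strings gives chronological order."""
    strings = [fmt(d) for d in dates]
    return sorted(strings) == [fmt(d) for d in sorted(dates)]
```

For these sample dates, sorts_chronologically(iso) is True, while the same
check fails for both european and us.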

#define FLAME_MODE off

> Given that you complained about the "strange" US conventions, I'm
> surprised you'd want a UTF-8 locale that uses the C/POSIX cultural
> conventions. Those basically are what are in most en_US locales.

So why not call them POSIX.UTF-8 locales? This would make me a bit more
comfortable, because I do associate en_US locales with bizarre
conventions like using 12a as the first hour of the day, etc., and I'd
like to stay away from those.

In addition, the name "en_US" in these locales might lead some software
to believe that I use 279x216 mm paper, that my phone numbers are always
10 digits long, that my postal addresses contain a state field, that I
feel better after reading 20 pages of legalese licence contracts before
I can start using my software, that my encryption keys and passwords
have to be automatically deposited with the FBI, that my mailboxes
look like miniature aircraft hangars, etc.

(I don't want to offend anyone with the above. Just take my opinions as
example teaching material for i18n engineering courses in the chapter
"Crazy Europeans" ... :-)

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


