Re: UTF-8 and POSIX locales

From: Sandra O'donnell USG (
Date: Thu Jun 24 1999 - 09:39:31 EDT

   What I do not like at the moment about the locale mechanism is that it
   ties together the character encoding and the cultural conventions, which
   I think are two completely orthogonal things. Even worse, there are some
   systems out there that provide UTF-8 locales only tied to a cultural
   convention set, in the worst case they provide only en_US.UTF-8. So to
   get UTF-8, I would also have to accept the strange US date/time
   notations and in some programs even default settings for non-metric
   units and strange US paper sizes, US-specific terminology such as "ZIP
   code" instead of "postal code", etc., all derived from the "en_US" part
   of the locale name.

Your subject line says you're talking about POSIX, but most of
the examples you list here are not in POSIX locales. POSIX has
nothing for measurement units, paper sizes, or terminology like
"ZIP Code". You may be confusing POSIX with Microsoft locales. I
believe (but am not positive) that they include some of the items
you're lamenting about.

POSIX does include definitions for date/time formatting. BTW, you
describe the US conventions as "strange." Do you think it's
appropriate for others to describe German or European conventions
as "strange"? The reality of i18n is that it allows us to support
different cultural and linguistic conventions. Not the "strange"
ones and the "right" ones.

You also lament that POSIX ties together the character encoding
and the cultural conventions. I disagree. There is a charmap, which
defines the character encoding, and a separate localedef source file,
which defines how abstract characters are handled for a given locale.
You combine a charmap and a localedef to get a locale, so the two are
tied together at some point, but you could build three different locales
using one localedef and three different charmaps.
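
As a rough sketch of that last point: the POSIX localedef utility takes
a charmap with -f and a locale source file with -i, so one source can be
paired with several charmaps. The Python helper below only builds the
command lines and does not run them. The helper name, the "de_DE" source,
the charmap names, and the "source.charmap" naming scheme are all
assumptions for illustration; what is actually installed varies by system.

```python
import shlex


def localedef_commands(source, charmaps):
    """Return one localedef command line per charmap, all sharing one source.

    Hypothetical helper: it only formats the commands, it does not run them.
    """
    commands = []
    for charmap in charmaps:
        # Assumed naming scheme: <source>.<charmap>, e.g. de_DE.UTF-8
        name = f"{source}.{charmap}"
        # -f selects the charmap, -i selects the locale source file
        argv = ["localedef", "-f", charmap, "-i", source, name]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Charmap names here are assumptions; check what your system ships.
    for line in localedef_commands("de_DE", ["ISO-8859-1", "ISO-8859-15", "UTF-8"]):
        print(line)
```

Each printed line uses the same locale source and differs only in the
charmap and the resulting locale name.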

And you also lament about the systems that provide "UTF-8 locales only
tied to a cultural convention set." Ummmm, a locale has to include
SOME cultural conventions, or it isn't a locale, according to POSIX.
What would a UTF-8 locale that is independent of cultural conventions
look like?

   The standard should specify the name of some default locales that do
   specify an encoding, but that otherwise copy just the cultural
   conventions of the C or POSIX default locales.

Given that you complained about the "strange" US conventions, I'm
surprised you'd want a UTF-8 locale that uses the C/POSIX cultural
conventions. Those basically are what are in most en_US locales.

Why are the US conventions okay here, but not earlier?

   I also like the idea of a standard locale named "ISO.UTF-8" or
   "international" that uses UTF-8 and fills in other cultural conventions
   according to ISO standards, e.g. ISO 8601 for the date/time notation
   and ISO 31 for the formatting of monetary units (currency appended with
   a space behind the number, just like any SI unit).

Fine, build one like that. I think most people want something that
matches their cultural conventions more closely, but there's nothing
to prevent you or anyone else from building what you describe.
                -- Sandra
Sandra Martin O'Donnell
Compaq Computer Corporation
