RE: UTF-8 and POSIX locales

From: Chris Pratley (chrispr@microsoft.com)
Date: Fri Jun 25 1999 - 00:43:50 EDT


   POSIX has
   nothing for measurement units, paper sizes, or terminology like
   "ZIP Code". You may be confusing POSIX with Microsoft locales. I
   believe (but am not positive) that they include some of the items
   you're lamenting about.

In reference to Sandra's comment about Microsoft locales: On Windows systems
there is a difference between the user's locale and the system locale. No
encoding is associated with the user's locale, which controls such things as
date formats and currencies. There is an encoding associated with each
system locale, but (on NT) this is for backwards compatibility with existing
"ANSI" (non-Unicode) applications currently used in those markets. An
application developer is not required to use this encoding, since all Win32
APIs exist in Unicode forms as well. So the developer can use any encoding
they like - especially if they base their own code on Unicode and translate
to whatever legacy encoding they prefer (using available system services
such as WideCharToMultiByte). The same is possible on Win9x, although you
have to write a little more code if you want Unicode text display, and
certain things are not possible, such as Unicode filenames. "en_US.UTF-8"
is not related to any Microsoft locale naming system.
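
As a rough, platform-neutral illustration of the approach described above
(keep text as Unicode internally and convert to a legacy encoding only at
the boundary), here is a minimal Python sketch. It does not call Win32; the
helper name to_legacy and the sample code pages are assumptions chosen for
illustration, standing in for the conversion a Windows program would do
with WideCharToMultiByte.

```python
# Sketch: keep text as Unicode internally and convert to a legacy
# ("ANSI") encoding only where an older consumer requires it.
# to_legacy() is a hypothetical helper, not a Win32 API.


def to_legacy(text: str, codepage: str) -> bytes:
    """Encode Unicode text in the given code page.

    Characters the code page cannot represent are replaced with '?'.
    """
    return text.encode(codepage, errors="replace")


sample = "caf\u00e9"  # "cafe" with an acute accent on the final e

latin = to_legacy(sample, "cp1252")     # Western European code page
cyrillic = to_legacy(sample, "cp1251")  # Cyrillic code page, no e-acute
utf8 = to_legacy(sample, "utf-8")       # Unicode encoding, nothing lost

print(latin, cyrillic, utf8)
```

The point of the sketch is only that the choice of legacy encoding is made
at the edge of the program, independent of any date or currency settings.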

Chris Pratley
Lead Program Manager
Microsoft Office

-----Original Message-----
From: Sandra O'donnell USG [mailto:odonnell@zk3.dec.com]
Sent: June 24, 1999 6:35 AM
To: Unicode List
Cc: Markus Kuhn; odonnell@zk3.dec.com
Subject: Re: UTF-8 and POSIX locales

   What I do not like at the moment about the locale mechanism is that it
   ties together the character encoding and the cultural conventions, which
   I think are two completely orthogonal things. Even worse, there are some
   systems out there that provide UTF-8 locales only tied to a cultural
   convention set, in the worst case they provide only en_US.UTF-8. So to
   get UTF-8, I would also have to accept the strange US date/time
   notations and in some programs even default settings for non-metric
   units and strange US paper sizes, US-specific terminology such as "ZIP
   code" instead of "postal code", etc., all derived from the "en_US" part
   of the locale name.

Your subject line says you're talking about POSIX, but most of
the examples you list here are not in POSIX locales. POSIX has
nothing for measurement units, paper sizes, or terminology like
"ZIP Code". You may be confusing POSIX with Microsoft locales. I
believe (but am not positive) that they include some of the items
you're lamenting about.

POSIX does include definitions for date/time formatting. BTW, you
describe the US conventions as "strange." Do you think it's
appropriate for others to describe German or European conventions
as "strange?" The reality of i18n is that it allows us to support
different cultural and linguistic conventions. Not the "strange"
ones and the "right" ones.

Now...you also lament that POSIX ties together the character encoding
and the cultural conventions. I disagree. There is a charmap, which
defines the character encoding, and a separate localedef source file,
which defines how abstract characters are handled for a given locale.
You combine a charmap and a localedef to get a locale, so the two are
tied together at some point, but you could build three different locales
using one localedef and three different charmaps.
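
To make the charmap/localedef combination concrete, here is a small Python
sketch that only builds the command lines for the POSIX localedef utility
(-i names the locale source file, -f names the charmap). Nothing is
executed. The source name de_DE, the three charmap names, and the
resulting locale names are assumptions for illustration; what is actually
available varies from system to system.

```python
# Sketch: one localedef source combined with three charmaps yields
# three locales. Only the command lines are built; nothing is run.
# The source and charmap names below are illustrative assumptions.
from typing import List


def localedef_command(source: str, charmap: str) -> List[str]:
    """Return the argv for: localedef -i SOURCE -f CHARMAP NAME."""
    name = f"{source}.{charmap}"
    return ["localedef", "-i", source, "-f", charmap, name]


commands = [
    localedef_command("de_DE", charmap)
    for charmap in ("ISO-8859-1", "ISO-8859-15", "UTF-8")
]

for argv in commands:
    print(" ".join(argv))
```

The same cultural conventions (the -i argument) appear in all three
commands; only the encoding (the -f argument) changes.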

And you also lament about the systems that provide "UTF-8 locales only
tied to a cultural convention set." Ummmm, a locale has to include
SOME cultural conventions, or it isn't a locale, according to POSIX.
What would a UTF-8 locale that is independent of cultural conventions
look like?
  
   The standard should specify the name of some default locales that do
   specify an encoding, but that otherwise copy just the cultural
   conventions of the C or POSIX default locales.

Given that you complained about the "strange" US conventions, I'm
surprised you'd want a UTF-8 locale that uses the C/POSIX cultural
conventions. Those basically are what are in most en_US locales.

Why are the US conventions okay here, but not earlier?
  
   I also like the idea of a standard locale named "ISO.UTF-8" or
   "international" that uses UTF-8 and fills in other cultural conventions
   according to ISO standards, e.g. ISO 8601 for the date/time notation
   and ISO 31 for the formatting of monetary units (currency appended with
   a space behind the number, just like any SI unit).
  
Fine, build one like that. I think most people want something that
matches their cultural conventions more closely, but there's nothing
to prevent you or anyone else from building what you describe.
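
For a sense of what the proposed "ISO" conventions would produce, here is
a minimal Python sketch. It does not define a POSIX locale; it only
formats one date in ISO 8601 calendar notation and one amount with the
currency code appended after a space, as the quoted proposal describes.
The function names and the sample values are assumptions for illustration.

```python
# Sketch of the output an "ISO"-style convention set would give.
# This formats values directly; it does not build or load a locale.
from datetime import date
from decimal import Decimal


def format_date_iso(d: date) -> str:
    """Format a date in ISO 8601 calendar form, YYYY-MM-DD."""
    return d.isoformat()


def format_money_iso(amount: Decimal, currency: str) -> str:
    """Format an amount with the currency code after a space."""
    return f"{amount:.2f} {currency}"


example_date = format_date_iso(date(1999, 6, 25))
example_money = format_money_iso(Decimal("12.5"), "USD")

print(example_date)
print(example_money)
```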
  
                -- Sandra
-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
sandra.odonnell@compaq.com
odonnell@zk3.dec.com


