From: Doug Ewell (firstname.lastname@example.org)
Date: Mon Mar 07 2005 - 23:47:20 CST
Jukka K. Korpela <jkorpela at cs dot tut dot fi> wrote:
> Unfortunately many standards and recommendations that prescribe the
> use of special characters do not identify them by Unicode numbers or
> names or in any other unique manner.
The UN/LOCODE standard, which assigns an alphabetic code to 44,000
locations worldwide, seems particularly confused in this regard.
The Secretariat's Notes state that "UN/LOCODE... is produced mainly
using the United States character set 437," but this is not true since
the data includes characters that are not present in MS-DOS code page
437 (such as ã) and even some that are not in ISO 8859-1 (such as š).
In fact, the data is in Windows code page 1252.
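The coverage claims above can be checked mechanically. The following is a minimal sketch, assuming Python's built-in codecs named cp437, latin-1 and cp1252 are faithful to the code pages discussed here; the helper name encodable_in is made up for illustration.

```python
# Which of the candidate encodings can represent a given character?
# The codec names are Python's; encodable_in is a hypothetical helper.
CANDIDATES = ("cp437", "latin-1", "cp1252")

def encodable_in(ch, encodings=CANDIDATES):
    """Return the list of encodings that are able to encode ch."""
    ok = []
    for enc in encodings:
        try:
            ch.encode(enc)
            ok.append(enc)
        except UnicodeEncodeError:
            pass  # ch has no byte in this code page
    return ok

for ch in ("\u00e3", "\u0161"):  # a-with-tilde, s-with-caron
    print(repr(ch), encodable_in(ch))
```

Under those assumptions, only the Windows-1252 codec accepts both characters, which is consistent with the conclusion above.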
The Notes go on to state that "the sorting order follows the one
specified for that character set [CP437]," but this is also not true,
judging from the order in which these three locations in Argentina are
Here the o-with-acute is simply folded to o. Other entries, in Angola
for example, show that apostrophes are ignored in sorting. This is
probably good sorting practice, but contradicts the claim that straight
code point order is used (CP437 or otherwise).
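The difference between the two orderings can be sketched as follows. The place names are hypothetical, chosen only to show the effect, and the folding rule (drop combining marks, drop apostrophes, ignore case) is a guess at the observed behavior, not the Secretariat's documented procedure.

```python
import unicodedata

def cp437_key(name):
    """Straight code point order in CP437: compare the encoded bytes."""
    return name.encode("cp437")

def folded_key(name):
    """Fold accented letters to base letters, drop apostrophes, ignore case."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(
        c for c in decomposed
        if not unicodedata.combining(c) and c != "'"
    ).casefold()

# Hypothetical names; "Col\u00f3n" contains o-with-acute.
names = ["Colorado", "Col\u00f3n", "Colonia"]

print(sorted(names, key=cp437_key))   # accented name sorts last
print(sorted(names, key=folded_key))  # accented name sorts first
```

In CP437 the o-with-acute is byte 0xA2, above every ASCII letter, so a true code point sort would push the accented name after its unaccented neighbors.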
Apparently the presence of non-ASCII characters in the names of
worldwide locations has proved to be a noticeable problem for some
users, because each name is presented twice, once spelled correctly and
once "without diacritics." I don't know if "with diacritics" is the
correct term for a character like Ø, but that is what they call it.
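One way to see why Ø is a borderline case: Unicode gives it no canonical decomposition, so the usual "strip the combining marks" approach leaves it untouched. A small sketch using Python's unicodedata module (strip_marks is a hypothetical helper, and this is not necessarily how the UN/LOCODE "without diacritics" column is produced):

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

for ch in ("\u00e3", "\u0161", "\u00d8"):  # a-tilde, s-caron, O-with-stroke
    decomp = unicodedata.decomposition(ch) or "none"
    print(ch, "->", strip_marks(ch), "| decomposition:", decomp)
```

Mapping Ø to O therefore needs an explicit table entry rather than generic mark removal.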
The Notes state, "International ISO Standard character sets are laid
down in ISO 8859-1 (1987) and ISO10646-1 (1993). (The standard United
States character set (437), which conforms to these ISO standards, is
also widely used in trade data interchange)." I wonder how much data is
still interchanged in MS-DOS code pages. In any case, the mention of
ISO 10646 (an antique version, at that) seems curious since that
standard is not used.
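The claim that code page 437 "conforms to" ISO 8859-1 is easy to test on a single byte above 0x7F. A minimal sketch, again assuming Python's cp437 and latin-1 codecs match the code pages in question:

```python
import unicodedata

raw = b"\xe9"  # one byte, two very different readings

as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")

print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name(as_cp437))   # GREEK CAPITAL LETTER THETA

# The same letter sits at a different byte value in CP437.
print("\u00e9".encode("cp437"))     # b'\x82'
```

The two encodings agree only in the ASCII range, so data in one cannot be read as the other without conversion.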
UN/LOCODE could be improved by removing the references to CP437,
especially the one calling it "the standard United States character
set," and by correcting other errors related to characters and sorting.
They might even consider converting it to UTF-8 or another Unicode encoding.
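A conversion of that kind is mechanically simple. As a rough sketch, assuming the input really is Windows-1252 throughout (cp1252_to_utf8 is a hypothetical helper, not part of any UN/LOCODE tooling):

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8.

    Python's cp1252 codec raises UnicodeDecodeError on the five byte
    values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90,
    0x9D), which doubles as a sanity check on the input.
    """
    return data.decode("cp1252").encode("utf-8")

sample = b"\x9a\xe3"  # s-with-caron, a-with-tilde in Windows-1252
print(cp1252_to_utf8(sample))  # b'\xc5\xa1\xc3\xa3'
```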
I've been thinking about communicating this to the UN/ECE Secretariat,
but it seems I just did.