Re: double hyphen

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Mar 07 2005 - 23:47:20 CST

    Jukka K. Korpela <jkorpela at cs dot tut dot fi> wrote:

    > Unfortunately many standards and recommendations that prescribe the
    > use of special characters do not identify them by Unicode numbers or
    > names or in any other unique manner.

    The UN/LOCODE standard, which assigns an alphabetic code to 44,000
    locations worldwide, seems particularly confused in this regard.

    The Secretariat's Notes state that "UN/LOCODE... is produced mainly
    using the United States character set 437," but this is not true since
    the data includes characters that are not present in MS-DOS code page
    437 (such as ã) and even some that are not in ISO 8859-1 (such as š).
    In fact, the data is in Windows code page 1252.
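
The repertoire claims above can be checked mechanically. The following is a minimal sketch, assuming Python's built-in `cp437`, `latin-1` and `cp1252` codecs are acceptable stand-ins for MS-DOS code page 437, ISO 8859-1 and Windows code page 1252; the helper name `encodable` is made up for illustration.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the character can be represented in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

CODECS = ("cp437", "latin-1", "cp1252")

for ch in ("ã", "š"):
    # One row per character, showing which code pages contain it.
    print(ch, {codec: encodable(ch, codec) for codec in CODECS})

# ã {'cp437': False, 'latin-1': True, 'cp1252': True}
# š {'cp437': False, 'latin-1': False, 'cp1252': True}
```

Of the three, only code page 1252 can represent both characters, which is consistent with the observation about the data.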

    The Notes go on to state that "the sorting order follows the one
    specified for that character set [CP437]," but this is also not true,
    judging from the order in which these three locations in Argentina are
    presented:

    Concordia
    Córdoba
    Corrientes

    Here the o-with-acute is simply folded to o. Other entries, in Angola
    for example, show that apostrophes are ignored in sorting. This is
    probably good sorting practice, but contradicts the claim that straight
    code point order is used (CP437 or otherwise).
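
To illustrate the difference, here is a small sketch that sorts the three Argentine names both ways. The folding key (strip combining marks, drop apostrophes, ignore case) is only an assumption about what the published order might be doing, not the documented UN/LOCODE algorithm, and the function names are invented for this example.

```python
import unicodedata

names = ["Corrientes", "Córdoba", "Concordia"]

def cp437_key(name: str) -> bytes:
    # Straight code point order in CP437: compare the encoded bytes.
    return name.encode("cp437")

def folded_key(name: str) -> str:
    # Assumed folding: decompose, drop combining marks and apostrophes,
    # then compare case-insensitively.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(
        c for c in decomposed
        if not unicodedata.combining(c) and c not in "'\u2019"
    ).casefold()

by_code_point = sorted(names, key=cp437_key)
by_folding = sorted(names, key=folded_key)

print(by_code_point)  # ['Concordia', 'Corrientes', 'Córdoba']
print(by_folding)     # ['Concordia', 'Córdoba', 'Corrientes']
```

In CP437 the o-with-acute is byte 0xA2, which sorts after every ASCII letter, so true code point order would put Córdoba last. Only the folded order matches the published listing.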

    Apparently the presence of non-ASCII characters in the names of
    worldwide locations has proved to be a noticeable problem for some
    users, because each name is presented twice, once spelled correctly and
    once "without diacritics." I don't know if "with diacritics" is the
    correct term for a character like Ø, but that is what they call it.
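
The doubt about Ø has a concrete counterpart in Unicode: ó has a canonical decomposition into o plus a combining acute, while Ø has no decomposition at all. The sketch below shows that a generic "strip the combining marks" routine leaves Ø untouched, so a "without diacritics" column needs an explicit mapping for it. The mapping of Ø to O used here is a hypothetical choice for illustration, not taken from UN/LOCODE.

```python
import unicodedata

def strip_marks(text: str) -> str:
    # Decompose, then drop combining marks such as the acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Hypothetical extra mapping for letters that do not decompose.
EXTRA = str.maketrans({"Ø": "O", "ø": "o"})

def without_diacritics(text: str) -> str:
    return strip_marks(text).translate(EXTRA)

print(strip_marks("Córdoba"))        # Cordoba
print(strip_marks("Ø"))              # Ø (unchanged: no decomposition)
print(without_diacritics("Ø"))       # O
print(unicodedata.decomposition("ó"))  # 006F 0301
```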

    The Notes state, "International ISO Standard character sets are laid
    down in ISO 8859-1 (1987) and ISO10646-1 (1993). (The standard United
    States character set (437), which conforms to these ISO standards, is
    also widely used in trade data interchange)." I wonder how much data is
    still interchanged in MS-DOS code pages. In any case, the mention of
    ISO 10646 (an antique version, at that) seems curious since that
    standard is not used.
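
One way to see that code page 437 does not conform to ISO 8859-1 is to decode the same byte under both. This is a minimal sketch using Python's built-in `cp437` and `latin-1` codecs; the variable names are illustrative.

```python
raw = bytes([0xA2])

as_cp437 = raw.decode("cp437")     # how an MS-DOS system reads the byte
as_latin1 = raw.decode("latin-1")  # how an ISO 8859-1 system reads it

print(as_cp437)   # ó
print(as_latin1)  # ¢

# A name written in CP437 and read back as ISO 8859-1 is garbled.
misread = "Córdoba".encode("cp437").decode("latin-1")
print(misread)    # C¢rdoba
```

The two sets agree on the ASCII range but assign the upper 128 positions differently, so data cannot be exchanged between them without conversion.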

    UN/LOCODE could be improved by removing the references to CP437,
    especially the one calling it "the standard United States character
    set," and by correcting other errors related to characters and sorting.
    They might even consider converting it to UTF-8 or another Unicode
    format.
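
If the data really is in code page 1252, the conversion to UTF-8 is a short transcoding step. The following is a sketch under that assumption; the function name `cp1252_to_utf8` is made up, and a real conversion would also need to decide what to do with the few byte values that Python's `cp1252` codec treats as undefined.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    # Strict decoding: an undefined byte raises UnicodeDecodeError rather
    # than being silently replaced.
    return raw.decode("cp1252").encode("utf-8")

# 0xF3 is ó in code page 1252.
sample = b"C\xf3rdoba"
converted = cp1252_to_utf8(sample)

print(converted.decode("utf-8"))  # Córdoba
print(len(sample), len(converted))  # 7 8
```

The converted name is one byte longer because ó takes two bytes in UTF-8.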

    I've been thinking about communicating this to the UN/ECE Secretariat,
    but it seems I just did.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


