Re: Windows and Mac character encoding questions

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Mar 30 2004 - 18:34:44 EST

    > > I don't have access to ISO 8859-1 itself, but ECMA-94 (1986), which is
    > > supposed to be equivalent, doesn't actually define anything for
    > > 0x80..0x9F.
    > > So I think the term "superset" is in fact justified.

    ISO/IEC 8859-1:1998 does not define any characters or mappings
    for 0x80..0x9F (or for 0x00..0x1F, for that matter):

    "The graphic characters of this part of ISO/IEC 8859 constitute
    a single coded character set. However in accordance with ISO/IEC
    2022 and ISO/IEC 4873 the code table of this part of ISO/IEC 8859
    may be considered to consist of the following components:

    - The character SPACE represented by bit combination 02/00;

    - a 94-character G0 graphic character set represented by bit
      combinations 02/01 to 07/14;
      
    - a 96-character G1 graphic character set represented by bit
      combinations 10/00 to 15/15."
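    For readers unused to the column/row notation in that quote, here
    is a minimal Python sketch of the conversion to ordinary byte
    values. The function name is hypothetical, and the only assumption
    is the usual reading of "cc/rr" as column * 16 + row.

```python
def bit_combination_to_byte(notation: str) -> int:
    """Convert ISO 2022 style 'column/row' notation (decimal) to a byte value."""
    column_text, row_text = notation.split("/")
    column, row = int(column_text), int(row_text)
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must be in 0..15: {notation!r}")
    return column * 16 + row


# The components named in the quoted passage, as (label, first, last).
components = [
    ("SPACE", "02/00", "02/00"),
    ("G0", "02/01", "07/14"),
    ("G1", "10/00", "15/15"),
]

for label, first, last in components:
    low = bit_combination_to_byte(first)
    high = bit_combination_to_byte(last)
    size = high - low + 1
    print(f"{label}: 0x{low:02X}..0x{high:02X} ({size} code positions)")
```

    Under that reading, 08/14 and 08/15 come out as 0x8E and 0x8F,
    and nothing in the quoted list covers 0x80..0x9F.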

    >
    > ECMA-94 says nothing about the C1 control set, it specifies only the
    > G0 and G1 graphics sets, but ECMA-43 (ISO 4873) does. The octets
    > 08/14 and 08/15 if present are only allowed to be used for the SS2
    > and SS3 control functions according to ECMA-43. If ISO 8859 says
    > anything about the control sets, I think it is safe to say that at the very
    > least it references ISO 4873.

    It does.

    > In that case, the windows-1252 use of
    > 0x8E as LATIN CAPITAL LETTER Z WITH CARON would violate
    > that standard.

    Correct. But Microsoft does not claim that Windows code pages are
    conformant to ISO/IEC 4873.
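
    The difference is easy to see with Python's bundled codecs. This
    is only a sketch: it assumes Python's "cp1252" and "latin-1"
    codecs, and the "latin-1" codec follows the common convention of
    mapping 0x80..0x9F to C1 control code points, which is not
    something ISO/IEC 8859-1 itself specifies.

```python
import unicodedata

octet = bytes([0x8E])

# windows-1252 assigns a graphic character to this octet.
as_cp1252 = octet.decode("cp1252")

# Python's latin-1 codec maps the octet to the code point with the same
# number, which lands in the C1 control range.
as_latin1 = octet.decode("latin-1")

print(f"cp1252:  U+{ord(as_cp1252):04X} {unicodedata.name(as_cp1252)}")
print(f"latin-1: U+{ord(as_latin1):04X} category {unicodedata.category(as_latin1)}")
```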

    > Also RFC 1345 indicates that the standard C0 and C1
    > control sets of ISO 6429 (ECMA-48) are used with ISO 8859-1, but I
    > can't be certain if that is just the usual assumption or explicitly given
    > in ISO 8859.

    It has nothing to do with ISO/IEC 8859-1:1998, per se.

    It has to do with the Internet and general Unix usage of 8859-1.

    RFC 1345 is Keld Simonsen's definition of character mnemonics and
    mapping of character sets using them. It is informational, and
    also out of date (1992), predating the current versions of all
    parts of the 8859 standard. The mappings that Keld used in RFC 1345
    also have the following characteristic:

    "If the coded character set is a 96-character set, it is tabled
    with the relevant GL set (normally ISO-IR-6 [ASCII]) and with
    ISO 6429 as C0 and C1."

    In other words, Keld added the C0 and C1 mappings, to reflect
    general practice in usage. But that usage does not directly
    reflect the relevant standards themselves.

    The Unicode Consortium also has a mapping posted:

    http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

    This mapping table likewise includes the ISO 6429 C0/C1 codes, to
    reflect general usage.
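
    The same general-usage convention shows up in software. As a rough
    illustration (using Python's "latin-1" codec as a stand-in, not
    the mapping file above), every octet maps to the code point with
    the same number, so the control ranges come along for free:

```python
import unicodedata

# Decode each possible octet on its own with Python's latin-1 codec.
decoded = [bytes([value]).decode("latin-1") for value in range(256)]

# Each octet maps to the code point with the same numeric value.
is_identity = all(ord(char) == value for value, char in enumerate(decoded))

# Collect the octets whose code points are control characters (category Cc).
control_octets = [
    value for value, char in enumerate(decoded)
    if unicodedata.category(char) == "Cc"
]

print("identity mapping:", is_identity)
print("control octets:", len(control_octets))
print("C1 range all controls:", all(v in control_octets for v in range(0x80, 0xA0)))
```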

    >
    > In any case, windows-1252 is not ISO-2022 (ECMA-35) friendly, and
    > given the existence of LATIN CAPITAL LETTER Z WITH CARON
    > as 0x8E, it certainly does not fit nicely into the ECMA/ISO family
    > of interrelated character set standards, even if one overlooks the
    > fact that it uses graphics characters in the C1 control set.

    Well, yes. But I'm wondering what new ground we are breaking
    here. All significant PC code pages (IBM, Microsoft, Mac, and
    others now long dead) have violated ISO-2022 for twenty years
    now by using 0x80..0x9F for graphic characters. What exactly
    is the point of flogging Windows CP 1252 once again for not
    being "ISO-2022 friendly"?
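
    Anyone who wants to check this can do so with a short Python
    sketch. It assumes the codecs that ship with Python ("cp437" for
    the original IBM PC code page, "cp1252", and "mac_roman"); the
    helper name is hypothetical. It counts how many octets in
    0x80..0x9F decode to something other than a control character.

```python
import unicodedata


def graphic_count_in_c1_range(codec_name: str) -> int:
    """Count octets in 0x80..0x9F that the codec decodes to a non-control character."""
    count = 0
    for value in range(0x80, 0xA0):
        try:
            char = bytes([value]).decode(codec_name)
        except UnicodeDecodeError:
            # The code page leaves this octet unassigned.
            continue
        if unicodedata.category(char) != "Cc":
            count += 1
    return count


for codec_name in ("cp437", "cp1252", "mac_roman"):
    count = graphic_count_in_c1_range(codec_name)
    print(f"{codec_name}: {count} of 32 octets in 0x80..0x9F are graphic")
```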

    --Ken


