Re: Windows and Mac character encoding questions

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Mar 30 2004 - 11:27:19 EST

    From: "John Cowan" <cowan@ccil.org>
    > Mark Davis scripsit:
    >
    > > Some more details. Usually, by 'extension' one means a superset of
    > > the mappings. windows-1252 is formally disjoint from iso-8859-1 --
    > > not a superset -- since it has mappings for 0x80..0x9F which are
    > > different from iso-8859-1's mappings for the same bytes.
    >
    > I don't have access to ISO 8859-1 itself, but ECMA-94 (1986), which is
    > supposed to be equivalent, doesn't actually define anything for 0x80..0x9F.
    > So I think the term "superset" is in fact justified.

    The "superset" view is probably correct with respect to ECMA-94, but not for
    any ISO-8859-* set, since those assign C1 controls to positions 0x80..0x9F.
    So Windows-1252 cannot be viewed as a superset of ISO-8859-1, only of ECMA-94,
    and even then only if ECMA-94 assigns no C1 controls.

    Here is what I read in its reference text (second edition, published 6 June
    1986 as the approved proposal for further adoption by ISO):
    http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf,
    titled "Standard ECMA-94 - 8-Bit Single-Byte Coded Character Sets - Latin
    Alphabets No. 1 to 4":

    - There is a normative reference to ECMA-6 (ISO 646) for the 7-bit
    character set. Note however that this refers to the invariant set of ISO 646,
    which excludes positions hex 40, 5B-60, 7B-7E.

    - There are also references to ECMA-35 (code extension techniques), ECMA-43
    (8-bit coded character set - structure and rules) and ECMA-48 (control
    functions).

    Because it has enough room for extensions in the high part, ECMA-94 deprecates
    the "national variants" of the 7-bit lower part, which is now tightly tied to
    the US variant of ISO 646 (so the fixed position previously required for the
    international currency symbol in the 7-bit alphabet is no longer needed in
    the 8-bit encoding).

    ECMA-94 does not mandate any codes outside hex 20 to 7E and A0 to FF
    (called the G0 and G1 subsets). So both ISO-8859-1 and Windows-1252 are
    conforming implementations of ECMA-94, because they both implement the same G0
    and G1 subsets (the 94- and 96-character subsets).
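
    A quick way to check this is to compare how two codecs decode each single
    byte. The sketch below is in Python and assumes that its built-in "latin-1"
    and "cp1252" codecs are faithful stand-ins for ISO-8859-1 and Windows-1252;
    the helper name decode_or_none is purely illustrative.

```python
def decode_or_none(byte_value, codec):
    """Decode one byte with the given codec; return None if it is unassigned."""
    try:
        return bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return None


# Positions covered by ECMA-94 as described above: hex 20-7E and A0-FF.
G0_G1 = list(range(0x20, 0x7F)) + list(range(0xA0, 0x100))
# Positions that ECMA-94 leaves out of scope: hex 80-9F.
C1_RANGE = range(0x80, 0xA0)

# True when both codecs give the same character for every G0/G1 position.
same_in_g0_g1 = all(
    decode_or_none(b, "latin-1") == decode_or_none(b, "cp1252") for b in G0_G1
)

# Bytes in 80-9F where the two codecs disagree (including unassigned bytes).
differing_in_c1 = [
    b for b in C1_RANGE
    if decode_or_none(b, "latin-1") != decode_or_none(b, "cp1252")
]

print(same_in_g0_g1, len(differing_in_c1))
```

    Under those assumptions the two codecs agree on every position in 20-7E and
    A0-FF, and disagree only within 80-9F.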

    Note how the charts clearly distinguish the "unused" positions in G0 and G1
    ("shall not be used") from the other positions, which are out of scope of the
    standard: it makes no requirement at all on those code positions.

    Controls and other code positions are out of scope of ECMA-94; for them you
    should refer instead to ECMA-35, ECMA-43 and ECMA-48.
    The first one to read is ECMA-43 (third edition, December 1991): it
    describes the overall 8-bit coding structure, the positions used by C0,
    SPACE, DELETE and C1, and the extension mechanism that allows coding more
    characters than those in G0 and G1; it also defines conformance levels:
    http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-043.pdf
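
    As a rough illustration of that 8-bit structure, here is a small Python
    sketch that maps a code position to its area. It assumes the usual layout
    (C0 at 00-1F, SPACE at 20, G0 at 21-7E, DELETE at 7F, C1 at 80-9F, G1 at
    A0-FF with a 96-character set); the function name classify is hypothetical
    and not taken from the standard.

```python
def classify(byte_value):
    """Return the area of the 8-bit code table that a code position falls in."""
    if not 0x00 <= byte_value <= 0xFF:
        raise ValueError("not an 8-bit code position")
    if byte_value <= 0x1F:
        return "C0"
    if byte_value == 0x20:
        return "SPACE"
    if byte_value <= 0x7E:
        return "G0"
    if byte_value == 0x7F:
        return "DELETE"
    if byte_value <= 0x9F:
        return "C1"
    # Assumption: G1 holds a 96-character set, so A0 and FF belong to it.
    return "G1"


for position in (0x0E, 0x20, 0x41, 0x7F, 0x8E, 0xE9):
    print(f"{position:02X}: {classify(position)}")
```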

    ISO-8859-1 conforms to ECMA-43 at level 1 only (not at higher levels), because
    the higher levels involve the extension mechanism with G2 and G3 subsets
    (through SS2 and SS3 sequences or with LSxR shift modes).
    See also how ECMA-43 describes the controls coded at positions hex 0E-0F and
    8E-8F...
    Also ECMA-48 is the definitive reference for C0 controls.
    Some other mechanisms allow coding "ligatures" such as "Pts" with the GCC
    control function, coded in C1.
    ECMA-35 defines the role of the LS0 and LS1 controls (commonly named SI and SO
    in ASCII), but they are not to be used with 8-bit ECMA-43 (which places no
    particular requirement on them and leaves them to application-defined
    behavior in all conforming 8-bit coded charsets).

    ECMA-94 thus appears only as a complementary standard covering 4 particular
    subcases of ECMA-43, i.e. the use of ECMA-43 for the Latin Alphabets No. 1
    to 4. For controls, the more definitive European reference is
    ECMA-48 (fifth edition, June 1991):
    http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf
    The previous fourth edition was adopted by ISO/IEC as standard ISO 6429, but
    the fifth edition of ECMA-48 adds controls for bidirectional text handling.

    Note also that every C1 control can also be coded as a 7-bit-only sequence
    starting with ESC. Depending on the encoding announcement sequence, the
    two-byte encoding of C1 controls may be authorized, mandated or forbidden.
    ECMA-94 has no impact on them.
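
    The correspondence between the two forms can be sketched in Python as
    follows. This assumes the usual rule that a C1 control at 80-9F is
    equivalent to ESC followed by a byte in 40-5F (the C1 value minus hex 40);
    the function names c1_to_escape and escape_to_c1 are illustrative only.

```python
ESC = 0x1B


def c1_to_escape(c1_byte):
    """Return the 7-bit two-byte form (ESC + final byte) of an 8-bit C1 control."""
    if not 0x80 <= c1_byte <= 0x9F:
        raise ValueError("not a C1 control position")
    return bytes([ESC, c1_byte - 0x40])


def escape_to_c1(sequence):
    """Return the 8-bit C1 control equivalent to a two-byte ESC sequence."""
    if len(sequence) != 2 or sequence[0] != ESC or not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("not a two-byte C1 escape sequence")
    return sequence[1] + 0x40


# CSI is coded 9B in 8 bits, or ESC [ (1B 5B) in 7 bits.
print(c1_to_escape(0x9B))
# SS2 is coded 8E in 8 bits, or ESC N (1B 4E) in 7 bits.
print(hex(escape_to_c1(b"\x1bN")))
```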

    From all I have read, ECMA-48 is probably the most complete, accurate and
    comprehensive source of definitions for control functions, and one that is
    compatible with other ANSI and ISO standards. Still, ECMA-48 (like ECMA-94)
    cannot be used without an encoding framework, and ECMA-43 is the one that
    should be adhered to first.

    In that sense, ISO-8859-1 conforms to ECMA-43, but Windows-1252 does not...


