Re: Windows and Mac character encoding questions

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Mar 30 2004 - 11:27:19 EST

Next message: Ernest Cline: "Fixed Width Spaces (was: Printing and Displaying Dependent Vowels)"

Previous message: Asmus Freytag: "Re: Printing and Displaying Dependent Vowels"
In reply to: John Cowan: "Re: Windows and Mac character encoding questions"
Next in thread: Kenneth Whistler: "Re: Windows and Mac character encoding questions"

From: "John Cowan" <cowan@ccil.org>
> Mark Davis scripsit:
>
> > Some more details. Usually, by 'extension' one means a superset of
> > the mappings. windows-1252 is formally disjoint from iso-8859-1 --
> > not a superset -- since it has mappings for 0x80..0x9F which are
> > different from iso-8859-1's mappings for the same bytes.
>
> I don't have access to ISO 8859-1 itself, but ECMA-94 (1986), which is
> supposed to be equivalent, doesn't actually define anything for 0x80..0x9F.
> So I think the term "superset" is in fact justified.

A "superset" view is probably correct with respect to ECMA-94, but not for any
ISO-8859-* charset, since those assign C1 controls to positions 0x80..0x9F.
So Windows-1252 can't be viewed as a superset of ISO-8859-1, only of ECMA-94,
and only if ECMA-94 itself assigns no C1 controls.
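
To make the conflict in 0x80..0x9F concrete, here is a minimal sketch in
Python. It assumes that Python's "latin-1" codec follows the IANA ISO-8859-1
charset (with C1 controls at 0x80..0x9F) and that its "cp1252" codec follows
Microsoft's table; decode_or_none is a helper name invented for this
illustration.

```python
def decode_or_none(byte_value, codec):
    """Decode a single byte, returning None where the codec has no mapping."""
    try:
        return bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return None


# Positions 0x80..0x9F, where the two charsets are said to disagree.
c1_positions = range(0x80, 0xA0)

# Collect every position whose mapping differs between the two codecs
# (a position left undefined by cp1252 also counts as a difference).
conflicts = [
    b for b in c1_positions
    if decode_or_none(b, "latin-1") != decode_or_none(b, "cp1252")
]

print(len(conflicts))                               # 32
print(hex(ord(decode_or_none(0x80, "latin-1"))))    # 0x80
print(hex(ord(decode_or_none(0x80, "cp1252"))))     # 0x20ac
```

Under these assumptions all 32 positions differ, so neither mapping table
contains the other in that range.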

Here is what I read in the reference text itself (second edition, published
6 June 1986 as the approved proposal for further adoption by ISO), available at
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf and
titled "Standard ECMA-94 - 8-Bit Single-Byte Coded Character Sets - Latin
Alphabets No. 1 to 4":

- There's already a normative reference to ECMA-6 (ISO 646) for the 7 bit
character set. Note however that this refers to the invariant set of ISO-646,
which excludes positions hex 40, 5B-60, 7B-7E.

- Plus other references to ECMA-35 (code extension techniques), ECMA-43 (8-bit
coded character set - structure and rules), ECMA-48 (control functions)

Because ECMA-94 has enough extensions in the high part, it deprecates the
"national variants" of the 7-bit lower part, which now becomes tightly linked to
the US variant of ISO 646 (so the fixed position previously required for the
international currency symbol in the 7-bit alphabet is no longer needed for
8-bit encoding).

ECMA-94 does not mandate anything outside of codes hex 20 to 7E and A0 to FF
(called the G0 and G1 subsets). So both ISO-8859-1 and Windows-1252 are
conforming implementations of ECMA-94, because they both implement the same G0
and G1 subsets (the 94- and 96-character subsets).
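
The agreement on those two ranges can be checked with a short sketch. It
assumes Python's "latin-1" and "cp1252" codecs as stand-ins for ISO-8859-1 and
Windows-1252, and uses the ranges hex 20 to 7E and A0 to FF exactly as quoted
above; the variable names are mine.

```python
# Byte ranges taken from the text: hex 20..7E and hex A0..FF.
low_range = range(0x20, 0x7F)     # 0x20..0x7E inclusive
high_range = range(0xA0, 0x100)   # 0xA0..0xFF inclusive

data = bytes(list(low_range) + list(high_range))

# Both codecs should yield the same characters for every byte in these ranges.
same = data.decode("latin-1") == data.decode("cp1252")

print(len(data))   # 191 bytes checked (95 + 96)
print(same)        # True
```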

Read how the charts clearly distinguish "unused" positions in G0 and
G1 ("shall not be used") from other positions (out of scope of the standard,
which makes no requirement on these code positions).

Controls and other code positions are out of scope of ECMA-94. You should
instead refer to ECMA-35, ECMA-43 and ECMA-48 for them...
The first one to read should then be ECMA-43 (third edition, December 1991): it
describes the overall 8-bit coding structure, the positions used by C0,
SPACE, DELETE and C1, and even the extension mechanism that allows coding more
characters than those in G0 and G1; it also defines conformance levels:
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-043.pdf
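
As a rough illustration of that 8-bit structure, the sketch below classifies a
byte value into the areas named above. The function name classify and the
label strings are hypothetical, and the layout assumes a 96-character set in
the right-hand half (as with the ECMA-94 G1 sets), so that A0 and FF count as
graphic positions.

```python
def classify(byte_value):
    """Return the code-table area that a byte value falls into."""
    if not 0x00 <= byte_value <= 0xFF:
        raise ValueError("not an 8-bit value")
    if byte_value <= 0x1F:
        return "C0"        # control functions, columns 00-01
    if byte_value == 0x20:
        return "SPACE"
    if byte_value <= 0x7E:
        return "G0"        # 94 graphic positions, 0x21..0x7E
    if byte_value == 0x7F:
        return "DELETE"
    if byte_value <= 0x9F:
        return "C1"        # control functions, columns 08-09
    return "G1"            # 96 graphic positions, 0xA0..0xFF


# Count how many byte values land in each area.
counts = {}
for b in range(256):
    area = classify(b)
    counts[area] = counts.get(area, 0) + 1

print(counts)
# {'C0': 32, 'SPACE': 1, 'G0': 94, 'DELETE': 1, 'C1': 32, 'G1': 96}
```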

ISO-8859-1 conforms to ECMA-43 at level 1 (not at higher levels, because of
the extension mechanism with G2 and G3 subsets, through SS2 and SS3 sequences
or with LSxR shift modes)...
See also how ECMA-43 describes the controls coded at positions hex 0E-0F and
8E-8F...
Also ECMA-48 is the definitive reference for C0 controls.
Some other mechanisms allow coding "ligatures" such as "Pts" with the GCC
control function, coded in C1.
ECMA-35 defines the role of the LS0 and LS1 controls (commonly named SI and SO
in ASCII), but they are not to be used with 8-bit ECMA-43 (where they carry no
particular requirement, and are left to application-defined behavior for all
conforming 8-bit coded charsets).

ECMA-94 then appears only as a complementary standard for 4 particular
subcases of ECMA-43, i.e. the use of ECMA-43 for Latin, Greek, Cyrillic and
Arabic basic scripts... For controls, the more definitive European reference is
then ECMA-48 (fifth edition, June 1991):
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf
The previous fourth edition was adopted by ISO/IEC as standard ISO 6429, but
the fifth edition of ECMA-48 adds controls for bidirectional text handling.

Note also that all C1 controls can be coded using 7-bit only sequences
starting with ESC. Depending on the encoding announcement sequence, using the
two-byte encoding of C1 controls may be authorized, mandated or forbidden.
ECMA-94 has no impact on them.
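
A minimal sketch of that two-byte form: a C1 control at hex 80 to 9F
corresponds to ESC followed by a byte in hex 40 to 5F (the C1 value minus
hex 40). The function names c1_to_escape and escape_to_c1 are made up for this
example, and the announcement rules about which form is allowed are not
modelled.

```python
ESC = 0x1B


def c1_to_escape(c1_byte):
    """Return the 7-bit two-byte ESC sequence for an 8-bit C1 control."""
    if not 0x80 <= c1_byte <= 0x9F:
        raise ValueError("not a C1 control position")
    return bytes([ESC, c1_byte - 0x40])


def escape_to_c1(sequence):
    """Return the 8-bit C1 control coded by a two-byte ESC sequence."""
    if len(sequence) != 2 or sequence[0] != ESC:
        raise ValueError("not a two-byte ESC sequence")
    if not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("final byte outside hex 40..5F")
    return sequence[1] + 0x40


print(c1_to_escape(0x9B))              # b'\x1b[' (CSI as ESC [)
print(c1_to_escape(0x8E))              # b'\x1bN' (SS2 as ESC N)
print(hex(escape_to_c1(b"\x1b[")))     # 0x9b
```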

From all that I have read, ECMA-48 is probably the most complete, accurate and
comprehensive source of definitions for control functions, and it is compatible
with other ANSI and ISO standards. Still, ECMA-48 (as well as ECMA-94) cannot
be used without an encoding framework, and ECMA-43 is the one that should be
adhered to first.

In this case, ISO-8859-1 is conforming to ECMA-43, but not Windows-1252...


This archive was generated by hypermail 2.1.5 : Tue Mar 30 2004 - 12:14:40 EST