Re: Usage of CP1252 characters

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Mon Jul 07 1997 - 21:21:54 EDT


"Unicode Discussion" wrote on 1997-07-07 23:07 UTC:
> The addition of the Euro at 0x80 raises a separate but interesting
> issue for the IANA charset registry. If addition of the Euro at 0x80
> invalidates the identity of a charset--which I agree that technically
> it does (since it changes for one octet the way the charset is
> converted to characters, thereby creating a new MIME entity), then
> not only does CP1252 diverge (by one code point) from
> ISO-8859-1-windows-3.1-Latin-1, but also for each
> of the other Windows code pages getting the Euro (probably all of
> them), those code pages then diverge from the IANA charset registry
> in the same way.

Any registry that deserves this name should have well-established rules
for how cases like modification of registered objects are handled
in the registry. It should also have some rules about how the name
space of the registry is organized and what sources for registrations
are acceptable.
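
To make the modification under discussion concrete, here is a minimal
Python sketch of why changing a single octet yields a different charset.
It assumes Python's built-in "cp1252" codec as the table with the Euro,
and models the older table simply by dropping 0x80 from it. That second
table is a hypothetical simplification for illustration, not a faithful
copy of any registered charset; the helper names are made up as well.

```python
def decode_table(codec):
    """Map every octet that the codec can decode to its character."""
    table = {}
    for octet in range(256):
        try:
            table[octet] = bytes([octet]).decode(codec)
        except UnicodeDecodeError:
            pass  # octet is unassigned in this codec
    return table


def differing_octets(a, b):
    """Return the octets that the two tables map differently."""
    return sorted(o for o in set(a) | set(b) if a.get(o) != b.get(o))


# Table with the Euro sign at 0x80 (Python's cp1252 codec).
with_euro = decode_table("cp1252")

# Hypothetical older table: identical, except that 0x80 is unassigned.
without_euro = {o: ch for o, ch in with_euro.items() if o != 0x80}

for octet in differing_octets(with_euro, without_euro):
    print(f"0x{octet:02X}: {without_euro.get(octet)!r} -> {with_euro.get(octet)!r}")
```

Under these assumptions the two tables differ at exactly one octet, and
a receiver that only knows the charset name cannot tell which of the two
tables the sender meant.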

The IANA registry seems to be pretty sloppy with regard to some of these
requirements, I guess. The name space is highly inconsistent, there
is debate about what is a new character set and what is not, and the
sources for the Microsoft character sets are Hewlett-Packard laser
printer manuals. I am not exactly impressed.

ECMA on behalf of ISO also operates a character set registry. Their
name space is an integer number and an ESC sequence, and if I remember the
registry rules correctly, modifying a character set by adding a new character
requires a new number to be assigned, but not a new ESC sequence (but I
can't check the details right now).

The ISO character set registry is freely available from ECMA
(warning: these are several kilograms of paper!). The problem
with the ISO registry is that it lives in the ISO 2022 framework.
As ISO 2022 does not allow the C1 range to contain graphical
characters, character sets like CP1252 and CP437 are currently
not able to get registered in the ISO/ECMA registry. At least that is
what the IBM charset gurus told me last time I asked what the ISO ESC
sequence for CP437 is (for implementation in the Linux console driver).
ISO 2022 should definitely be upgraded to support 8-bit character sets
where C1 is populated by graphical characters.
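
The C1 problem can be checked with a short Python sketch. It counts how
many octets in the range 0x80..0x9F decode to something other than a
control character. The codec names are the ones Python ships; using
"latin-1" as the stand-in for a charset whose C1 range holds only
control characters is my assumption, and the function name is made up
for this example.

```python
import unicodedata


def graphic_c1_count(codec):
    """Count octets in 0x80..0x9F that decode to a non-control character."""
    count = 0
    for octet in range(0x80, 0xA0):
        try:
            ch = bytes([octet]).decode(codec)
        except UnicodeDecodeError:
            continue  # unassigned octet in this codec
        if unicodedata.category(ch) != "Cc":
            count += 1
    return count


for codec in ("latin-1", "cp1252", "cp437"):
    print(codec, graphic_c1_count(codec))
```

A count of zero is what the ISO 2022 framework expects. Any charset with
a non-zero count puts graphical characters where ISO 2022 reserves room
for C1 controls.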

ISO 2022 is also published as ECMA-35, available at <http://www.ecma.ch/>.

> What an unholy mess these MIME charsets are!

Yes. This registry seems to have been maintained by someone pretty
inexperienced in maintaining registries and name spaces. I would
have expected otherwise from IANA, or maybe it is chaos by design ...

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


