Re: Usage of CP1252 characters

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Tue Jul 08 1997 - 11:57:55 EDT


> Any registry that deserves this name should have well-established rules
> for how cases like modification of registered objects are handled
> in the registry. It should also have some rules about how the name
> space of the registry is organized and what sources for registrations
> are acceptable.
>
> The IANA registry seems to be pretty sloppy with regard to some of these
> requirements, I guess. The name space is highly inconsistent, there
> is debate about what is a new character set and what not, and the
> sources for Microsoft character sets are Hewlett-Packard laser printer
> manuals. I am not exactly impressed.
>
I'm not either. The IETF / IANA pretty much registers anything anybody
can dream up, which is not a great idea in general, and especially not
when dealing with character sets.

> ECMA on behalf of ISO also operates a character set registry. Their
> name space is an integer number and an ESC sequence, and if I remember
> the registry rules correctly, modifying a character set by adding a new
> character requires a new number to be assigned, but not a new ESC
> sequence (but I can't check the details right now).
>
> The ISO character set registry is freely available from ECMA
> (warning: these are several kilograms of paper!). The problem
> with the ISO registry is that it lives in the ISO 2022 framework.
>
I don't view that as a problem at all. These are character sets for
interchange, and therefore must adhere to a common structure. There is
no need to register corporate private character sets -- the fact is, they
should never appear on the wire, period. If I have a Macintosh, AViiON,
NeXT, etc., I should not need to know how to decode a Microsoft Windows
code page (any more than I should have to know how to decode a Word for
Windows document sent to me in MIME email). The MIME philosophy is
exactly the opposite of that of the "founding parents" of the Internet:
be profligate in what you send, and let the receiver be puzzled -- i.e.
as long as you have tagged what you sent, you've done your job.

> As ISO 2022 does not allow the C1 range to contain graphical
> characters...
>
For good reason. They are control positions. This design is to make
all character sets have the same shape so they work equally well in the
7-bit and 8-bit communication environments. Remember: many (most)
communication channels are not transparent to control characters.
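
The point about the C1 positions can be checked mechanically. The
following is a minimal Python sketch, assuming Python's standard
"latin-1" and "cp1252" codecs as stand-ins for ISO 8859-1 and the
Windows code page; the helper name c1_graphics is made up for this
illustration.

```python
import unicodedata

# Byte values 0x80-0x9F: the C1 area that ISO 2022 / ISO 4873 reserve
# for control functions.
C1_RANGE = range(0x80, 0xA0)

def c1_graphics(codec):
    """Return the C1 byte values that `codec` decodes to non-control characters."""
    found = []
    for b in C1_RANGE:
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # this codec leaves the position undefined
        if unicodedata.category(ch) != "Cc":
            found.append(b)
    return found

latin1_graphics = c1_graphics("latin-1")   # ISO 8859-1: controls only
cp1252_graphics = c1_graphics("cp1252")    # code page: graphics in C1

print("latin-1:", [hex(b) for b in latin1_graphics])
print("cp1252: ", [hex(b) for b in cp1252_graphics])
```

A channel that strips or acts on C1 controls would damage every byte
in the second list, which is the practical objection to sending such
a code page over the wire.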

Code pages are fine for internal use on a PC, but they have no business on
the wire in an open communications system, and no business being registered
as an international standard.

ISO 2022, 4873, and the Registry are incredibly well thought out, time
tested, proven, stable, and sensible. Unicoders like to disparage them,
but these standards and mechanisms have served us well for decades; their
only shortcoming is the limited size of their repertoire and to some extent
the complications of switching among different sets -- complications which
are by no means insurmountable.
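
As a rough illustration of what switching among sets looks like on the
wire, here is a minimal Python sketch. It assumes Python's
"iso2022_jp" codec (ISO-2022-JP, an Internet profile that uses ISO
2022 designation sequences) as a convenient stand-in; the sample text
is arbitrary.

```python
# Latin A, HIRAGANA LETTER A, Latin B
text = "A\u3042B"
encoded = text.encode("iso2022_jp")

# ESC $ B designates JIS X 0208 as the working set;
# ESC ( B designates ASCII again before the final "B".
print(encoded)

# Every byte stays in the 7-bit range, so the data survives
# a channel that is not 8-bit clean.
seven_bit = all(b < 0x80 for b in encoded)
print("7-bit only:", seven_bit)

# A receiver that understands the designations recovers the text.
round_trip = encoded.decode("iso2022_jp")
print("round trip ok:", round_trip == text)
```

The cost is that the decoder must track which set is currently
designated, which is the complication referred to above.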

Unicode was supposed to address these problems by offering a comprehensive
repertoire in a flat 16-bit space, but recent discussions are showing that
the simplicity and flatness are elusive, and the longing for subsets
serving this or that constituency makes us wonder about the universality.

To the extent that a particular character can be identified by a single
code point in Unicode, I find that Unicode is an extremely useful
reference tool for mapping between all other character sets, standard and
private.
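
Using Unicode as the reference point for such mappings can be sketched
in a few lines of Python. This assumes Python's standard "cp1252" and
"mac_roman" codecs as examples of two private character sets; the
function name transcode is hypothetical.

```python
def transcode(data, source, target):
    """Convert bytes between two character sets by way of Unicode code points."""
    return data.decode(source).encode(target)

# Curly double quotes around a word, as Windows code page 1252 bytes.
cp1252_bytes = b"\x93quoted\x94"

# The same characters, re-expressed in the Macintosh Roman set.
mac_bytes = transcode(cp1252_bytes, "cp1252", "mac_roman")
print(mac_bytes)

# The mapping only works where the target set has the character:
# the cp1252 quotes have no position in ISO 8859-1.
try:
    transcode(cp1252_bytes, "cp1252", "latin-1")
    has_latin1_mapping = True
except UnicodeEncodeError:
    has_latin1_mapping = False
print("representable in latin-1:", has_latin1_mapping)
```

Each set needs only one table to and from Unicode, rather than one
table per pair of sets.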

- Frank


