Re: CP1252 under Unix

From: Mark Davis (markdavis@ispchannel.com)
Date: Sat Mar 25 2000 - 20:16:27 EST


Frank da Cruz wrote:

> > I agree with you that the overall goal should be to move to UTF-8 for
> > transmission. However, ignoring 1252 and its cousins is both wrong and
> > shortsighted.
> >
> > 1. Let's start with the wrong part. There are already IANA registered
> > charsets that use the C1 area for graphic character sets.
> >
> The true question is whether IANA should have "registered" any of them.
> Again, private, proprietary character sets have no place in any standard,
> nor on the wire across the Internet. I really don't know what they were
> thinking when they started down this path. I fully agree with you that
> once they have registered Windows 1251, they have no reason not to
> register 1252 and every other code page that exists, and not just at IBM
> and Microsoft either.
>
> But please folks, let's not confuse the IANA registry with any kind
> of standard. Standards imply at the very least consensus among conflicting
> interests, and at best also some measure of quality control.

I agree. Registries are not standards -- they are just places to get standard *names* for things. The things themselves may be extremely odd.

> > 2. Now for the shortsighted part. The IANA registry is used for much more
> > than simply interchange on the web. A registry of charset names is needed
> > across all systems and platforms. That way, cross-platform programs can
> > identify the local charsets, and successfully and accurately translate those
> > to and from Unicode/10646 or specific other codesets.
> >
> Again: No, no, no! If you don't put private character sets on the wire,
> you don't need to know a thing about them.

We are coming from a different perspective. For cross-platform products, we have to look both at the wire and at what is used internally on those platforms. Even if we restrict ourselves to the wire, we have to look at what is actually being sent around, not just what we wish were sent around.

Ideally, everything would be in UTF-8 on the wire, and we wouldn't have to worry about this. I believe that we are moving in that direction, and accelerating. However, in the interim we have to pay attention to what is there.

> Does anybody who reads this list truly believe it is better to use private
> code pages for interchange than it is to use standard ones? That means I can
> send you ANYTHING AT ALL, even something you've never heard of, and it's your
> fault if you can't read it, not mine.

> If you are selling a Windows-based email client, HOW HARD IS IT to convert
> outgoing mail from the local code page to ISO 8859 or other standard character
> set? Ditto for Windows-based Web authoring tools.

I agree with you -- far better if people only sent a few standard code pages. But until UTF-8 started being widely supported, I can see why people would end up sending 1252 in mail -- that way their quotation marks didn't get munged.
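For the curious, here is a minimal sketch of that munging in Python (the sample bytes are hypothetical). Windows smart quotes sit at 0x93/0x94 in CP1252, squarely inside the C1 control range of ISO 8859-1, so a mislabeled message loses them:

    # Bytes as a Windows mail client might emit them (hypothetical sample).
    text = b"\x93quoted\x94"
    print(text.decode("cp1252"))   # curly quotes, as the sender intended
    print(text.decode("latin-1"))  # invisible C1 controls -- i.e. "munged"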

> The fact that this has not been done is no reason for the rest of the planet
> to drop what they are doing (presumably moving us along towards a Unicode
> based network) and bend over backwards to accommodate this kind of behavior.

I guess I don't see it as bending over backwards. If you have a UTF-8-capable system, then handling 1252 characters (IF properly identified) means a simple mapping of the characters to the UTF-8 equivalents on input, and the reverse on output. And the mapping tables for 1252 are far, far simpler than the ones for Asian standards. So the incremental cost is pretty low.
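To make that concrete, a minimal sketch of the round trip in Python (the sample bytes are hypothetical):

    # Input side: map properly labeled 1252 bytes to their Unicode
    # equivalents, then carry the text internally as UTF-8.
    incoming = b"r\xe9sum\xe9 in \x93quotes\x94"   # labeled windows-1252
    text = incoming.decode("cp1252")
    utf8 = text.encode("utf-8")

    # Output side: the reverse mapping, for a client that only takes 1252.
    outgoing = utf8.decode("utf-8").encode("cp1252")
    assert outgoing == incoming

The 1252 table is just 256 entries; the Asian standards need many thousands.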

I firmly agree with you that the fewer the better; the only question is whether the few most common sets need to be accommodated. And let's face it -- an email program or browser that accepts 1252 will be more attractive to the vast majority of customers than one that doesn't.

> > Our goal is to converge towards use of a single character set, but that
> > transition is easier if we can precisely identify those character sets that
> > ARE in use on the Web currently, rather than hiding our heads in the sand
> > and hoping they will go away.
> >
> This is a backwards view of the problem. It is the responsibility of IBM,
> Apple, Microsoft, and other companies with private character sets, or makers
> of software that use these private sets for interchange, to convert them to
> use standard sets, preferably UTF-8. That's where the problem is and that's
> where to fix it.

I think this vastly overstates the control that IBM, Apple and Microsoft have over the use of their character sets. It is not as if they can prevent, say, Eudora from sending and receiving the host character sets, or any other character set it wants to.

> Put yourself in the position of an ISV. I want to be a good world citizen.
> What must I do? Should I code my applications for Unicode? No, that's not
> enough. I have to code them to understand every character set that exits --
> or at least that is significant in the marketplace (which marketplace?).
>
> Does this promote the spread of Unicode?

We all agree on the need to go towards Unicode -- and IBM, Apple and Microsoft have consistently pushed in that direction. However, during the transition to Unicode, it is important to be able to interpret all legacy data properly, in whatever context it occurs.

In the position of an ISV (with, say, a server product), I would say (a rough sketch in code follows below):
a. accept Unicode input;
b. also accept and interpret whatever character sets are in wide use by my customers;
c. emit Unicode whenever the client side can handle it;
d. otherwise emit a legacy character set.

(b) and (d) will become less and less important over time.
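A minimal sketch of that policy in Python -- the function names and the charset negotiation are hypothetical, not any particular product's API:

    def decode_request(body, declared_charset):
        # (a) and (b): accept Unicode, but also interpret any properly
        # labeled legacy charset (windows-1252, iso-8859-1, ...).
        return body.decode(declared_charset)   # internal form is Unicode

    def encode_response(text, client_charsets):
        # (c): emit Unicode whenever the client side can handle it ...
        if "utf-8" in client_charsets:
            return text.encode("utf-8"), "utf-8"
        # (d): ... otherwise fall back to a legacy charset the client lists.
        charset = client_charsets[0]
        return text.encode(charset, errors="replace"), charset

As (b) and (d) fade, the fallback branch simply stops being exercised; nothing else changes.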

> - Frank


