Re: CP1252 under UNIX

From: Frank da Cruz (fdc@columbia.edu)
Date: Fri Mar 31 2000 - 11:48:53 EST


Doug Ewell <dewell@compuserve.com> wrote:

> Here's what I have to contribute to this hot topic.
>
It is time to let this discussion rest, but your message only adds fuel
to the fire.

> 2. "CP1252 is not a standard."
>
> Oh, but it is. True, it's not an ISO or ANSI standard, not a de jure
> standard, but it IS a de facto standard. It is an industry standard.
> It is used by a LOT of people.
>
Suppose a manufacturer of automobiles makes cars that are too wide
to fit in a lane of traffic, and for this and other reasons, they become
the dominant car seller. The new wider cars become the de facto standard,
even though they force "legacy" cars off the road, often causing them to
crash. If we took data communications standards as seriously as we take the
standards pertaining to more tangible forms of transportation, traffic on
the Internet would flow much more smoothly, and reach its destination
intact, and everybody would be happier, ourselves included.

> It was stated that "1252 violates the very basis for character set
> standards" and "All standard character sets comply with ISO 4873 and ISO
> 2022." This is based on the fact...
>
It's not based on any fact. It _is_ a fact :-)

> ... that terminal-host communication
> relies on character sets that comply with 4873 and 2022, and it implies,
> quite in contrast to the misguided who believe that terminals don't
> matter, that terminals are the ONLY thing that matter!
>
It all matters. But why should there be two different 8-bit character sets
for terminal-host communication and for Web browsing in the "Latin-1
languages", when a single standard one would do?

Some have suggested that those who read their email with terminals or
emulators should "upgrade" their email clients to "properly" handle CP1252
and other private code pages. That's absurd. If I write or buy an email
client that obeys all the rules, I should not need to change it constantly
as creative new ways are found to break the rules.

Others have suggested that it is the responsibility of end users to take
defensive measures to prevent nonstandard character sets from hanging or
confusing their terminals or emulators. That's not right either. Our
terminals and emulators *use* the C1 controls in valid, real-life,
standards-conforming applications. I do not know in advance if a particular
email message is going to fry my terminal, especially if its character set
is not announced. Anyway, what if I have a real terminal, such as a VT420?
In that case, there is no recourse -- whenever such a message arrives,
the terminal becomes useless and must be reset. The message cannot be
read unless you put the terminal into debug mode, in which case it will
show C1 Control Pictures in place of "smart quotes" (and for that matter,
C0 Control Pictures in place of CR, LF, Tab, etc).
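
To make the problem concrete, here is a minimal Python sketch that lists the
bytes of a message falling in the C1 range 0x80-0x9F. It is not taken from
any real mail client or emulator, and the function name c1_offenders is made
up for illustration. A standards-conforming terminal treats these bytes as
controls, while CP1252 assigns printable characters such as "smart quotes"
to most of them.

```python
def c1_offenders(raw: bytes):
    """Return (offset, byte value, CP1252 reading) for each byte in 0x80-0x9F."""
    hits = []
    for offset, value in enumerate(raw):
        if 0x80 <= value <= 0x9F:
            # CP1252 leaves a few positions (such as 0x81) undefined;
            # errors="replace" shows those as U+FFFD instead of raising.
            glyph = bytes([value]).decode("cp1252", errors="replace")
            hits.append((offset, value, glyph))
    return hits


# Hypothetical message body: curly quotes encoded as CP1252 bytes 0x93 and 0x94.
message = "He said \u201chello\u201d".encode("cp1252")
for offset, value, glyph in c1_offenders(message):
    print(f"offset {offset}: 0x{value:02X} (CP1252 reads it as {glyph!r})")
```

A check like this can only report that C1-valued bytes are present. Without
a character-set announcement it cannot tell whether they were meant as
controls or as CP1252 graphics.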

> By this metric, no EBCDIC code page could ever be a standard.
>
And indeed no EBCDIC code page is a standard, and I don't believe anybody
ever said one was or should be. IBM has done an excellent job of keeping
their private EBCDIC code pages private, and converting them to standard
character sets for interchange, and for that matter even publishing
official mappings, so ISVs don't have to guess and come up with incompatible
ones.

> Even UTF-8 could not be a standard, because of its use of characters in
> the 0x80-0x9F range. (Or are the ISO 2022 escape sequences mentioned in
> Annex R what make UTF-8 a standard?)
>
Now this is an interesting point. Personally, I think UTF-8 would have
been better if it did not include C1-valued bytes. This would allow it to
pass "bare" through devices that are sensitive to these byte values.

Now as far as I know, there are no UTF-8 terminals. But there are emulators
(we make one here). Assuming the UTF-8 data stream arrives intact at the
emulator (as it does on most kinds of connections, e.g. Telnet), the
emulator decodes the UTF-8 first and only then examines the data stream for
control sequences.

In other words, UTF-8 is a transformation of a standards-compliant data
stream.
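
The order of operations can be sketched in a few lines of Python. This is
only an illustration, not the code of any actual emulator, and both function
names are invented here. The letter U+00DB is encoded in UTF-8 as the bytes
C3 9B, and 9B is the C1 value of CSI, so a device that looks at raw bytes
sees a control where none was intended. Decoding first makes the false
control disappear, while a genuine CSI (U+009B, sent as C2 9B) survives.

```python
C1_RANGE = range(0x80, 0xA0)


def c1_bytes_on_wire(data: bytes):
    """C1-valued bytes seen by a device that inspects the raw byte stream."""
    return [b for b in data if b in C1_RANGE]


def c1_controls_after_decoding(data: bytes):
    """C1 controls that remain once the UTF-8 layer has been decoded."""
    text = data.decode("utf-8")
    return [ord(ch) for ch in text if ord(ch) in C1_RANGE]


wire = "S\u00dbR".encode("utf-8")        # bytes 53 C3 9B 52
print(c1_bytes_on_wire(wire))            # raw view: one byte with value 0x9B
print(c1_controls_after_decoding(wire))  # decoded view: no controls at all

real_csi = "\u009b".encode("utf-8")      # a genuine CSI travels as C2 9B
print(c1_controls_after_decoding(real_csi))
```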

> That said, CP1252 is not supported by everyone (any more than UTF-8 is,
> at least yet) and you can make your text available to a great many more
> people by encoding it in ISO 8859-1 instead.
>
Amen.
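
As a rough sketch of how an author might check a text before sending it,
the Python below reports which characters cannot be represented in ISO
8859-1. The function name outside_latin1 is hypothetical, and the sketch
relies on the fact that ISO 8859-1 covers exactly the code points U+0000
through U+00FF.

```python
def outside_latin1(text: str):
    """Return the distinct characters of text that ISO 8859-1 cannot represent."""
    # ISO 8859-1 maps one-to-one onto code points U+0000 through U+00FF.
    return sorted({ch for ch in text if ord(ch) > 0xFF})


sample = "na\u00efve \u201cquotes\u201d"
for ch in outside_latin1(sample):
    print(f"U+{ord(ch):04X} is not in ISO 8859-1")

# Text with no offenders can be encoded safely.
safe = "na\u00efve".encode("iso-8859-1")
```

The accented letter passes, and only the two curly quotes are flagged. What
to substitute for them (plain ASCII quotes, for instance) is left to the
author.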

> 3. "If you support 1252 you have to support the hundreds of private
> character sets being created every day."
>
> They are? In Western Europe and North America the REALLY COMMON 8-bit
> character sets are ASCII, ISO 8859-1, CP1252, MacRoman, and maybe a
> smattering of CP437. Are any others so common that they present the
> kind of headache we are talking about? In Central and Eastern Europe,
> of course, there is a lot more diversity in encoding, but this has
> nothing to do with 8859-1 vs. 1252.
>
If Windows-based email and web-authoring tools generate CP1250, CP1251,
etc, instead of the appropriate parts of ISO 8859 (or UTF-8), then we have
what we all can agree is a real problem.

On the other hand, if they do not, then we must wonder why they generate
CP1252 in the "Latin-1 speaking" environment.

- Frank


