Re: CP1252 under Unix

From: Frank da Cruz (fdc@columbia.edu)
Date: Sat Mar 25 2000 - 15:50:48 EST


Markus wrote:
> Frank da Cruz wrote on 2000-03-25 17:23 UTC:
> > Then Markus went on to list the graphics in the 0x80-0x9F range of
> > CP1252. Now, I was reading his message in a terminal window (in
> > Windows, by the way, not Unix) that conforms to ISO 2022, 4873, and
> > 6429. Here's what happened:
> >
> > 0x9F: LATIN CAPITAL LETTER Y WITH DIAERESIS
> > This is C1 control APC (Application Program Command). It makes any
> > ANSI X3.64 / ISO-6429 compliant terminal hang forever waiting for
> > the rest of the APC sequence, which never comes. Thank goodness for
> > the reset button.
>
> Taking my spare-time character-set fanatic hat off and putting my
> day-job computer security hat on for a moment, I'd strongly advise Frank
> (and developers of email software that runs in VT100 emulators) to
> ensure that only the following C0/C1 characters received from outside
> the ivory tower ever be forwarded to the terminal...
> ... The "less" pager under Unix usually does a fine job of that for
> instance (and starting with version 346 it even supports UTF-8!).
>
So every application on earth that was originally coded to conform to
standards must be recoded not only to support UTF-8, which doesn't break
any rules, but also CP1252, which tramples all over them. I can think of
about 1000 better ways to spend my time.

I usually do have a sense of humor and appreciate a good prank, but really!

Markus, what does your own UTF-8 XFree86 VT220-emulating xterm do? On the
one hand you can use it to log in to VMS, which makes serious use of C1
controls. On the other hand you can use it to read mail that contains
"smart quotes". Which of these works? They can't both work. If it works
for "smart quotes", then you must have deliberately broken the standards-
conforming aspect that lets it emulate VT320 and above and therefore work
with VMS, which means it can no longer claim to be a VT220.

ISO 6429 data streams are entirely analogous to UTF-8. One passes them
through a mindless finite-state automaton which understands their (well-
designed and consistent) structure before looking at the data. This
confers numerous benefits, not the least of which is the ability to deal
with (e.g. discard) unknown escape sequences (others include character-set
switching). So adding new sequences (e.g. for a new terminal model, like
VT420 after VT320) doesn't hurt terminals that don't know about the new
sequences. The structure is everything. These FSAs are built into
countless software programs and computer chips.
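
To make the point about structure concrete, here is a minimal sketch of such
an automaton in Python. It is an illustration only, not the parser of any
real terminal: the names parse and visible_text are hypothetical, the set of
"known" sequences is an assumption (just SGR, final byte "m"), and the byte
ranges are the ISO 6429 / ECMA-48 ones (CSI final bytes 0x40-0x7E, string
controls DCS/OSC/PM/APC ended by ST). Cancel characters and other details
are left out.

```python
ESC, CSI, ST = 0x1B, 0x9B, 0x9C
STRING_INTRODUCERS = {0x90, 0x9D, 0x9E, 0x9F}  # DCS, OSC, PM, APC


def parse(data: bytes):
    """Split a byte stream into (kind, payload) events using structure alone."""
    events, state, buf, intro = [], "ground", bytearray(), 0

    def start_c1(c1):
        # Enter the state that a C1 control (8-bit, or ESC Fe in 7-bit form) opens.
        nonlocal state, intro
        buf.clear()
        if c1 == CSI:
            state = "csi"
        elif c1 in STRING_INTRODUCERS:
            state, intro = "string", c1
        else:
            events.append(("c1", bytes([c1])))
            state = "ground"

    for b in data:
        if state == "ground":
            if b == ESC:
                buf.clear()
                state = "escape"
            elif 0x80 <= b <= 0x9F:
                start_c1(b)
            else:
                events.append(("byte", bytes([b])))
        elif state == "escape":
            if 0x20 <= b <= 0x2F:              # intermediate byte, keep collecting
                buf.append(b)
            elif not buf and 0x40 <= b <= 0x5F:  # ESC Fe is the 7-bit form of a C1
                start_c1(b + 0x40)
            else:                              # final byte of a plain escape sequence
                events.append(("esc", bytes(buf) + bytes([b])))
                state = "ground"
        elif state == "csi":
            if 0x40 <= b <= 0x7E:              # final byte ends the control sequence
                events.append(("csi", bytes(buf) + bytes([b])))
                state = "ground"
            else:                              # parameter or intermediate byte
                buf.append(b)
        elif state == "string":
            if b == ST:
                events.append(("string", bytes([intro]) + bytes(buf)))
                state = "ground"
            elif b == ESC:
                state = "string_esc"
            else:
                buf.append(b)
        else:                                  # "string_esc": ESC \ is 7-bit ST
            if b == 0x5C:
                events.append(("string", bytes([intro]) + bytes(buf)))
                state = "ground"
            else:
                buf += bytes([ESC, b])
                state = "string"
    return events


def visible_text(data: bytes, known_finals: bytes = b"m"):
    """Return (bytes to display, sequences discarded because they are unknown)."""
    shown, discarded = bytearray(), []
    for kind, payload in parse(data):
        if kind == "byte":
            shown += payload
        elif kind == "csi" and payload[-1] in known_finals:
            pass  # a real terminal would act on the sequence here
        else:
            discarded.append((kind, payload))
    return bytes(shown), discarded
```

Fed b"A\x1b[1mB\x1b[?25hC\x9fjunk\x9cD", visible_text shows b"ABCD" and
discards the CSI sequence it does not know, along with the APC string,
without having to understand either. Fed a lone 0x9F that is never followed
by ST, parse emits nothing further: that is the hang described earlier.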

And so it is with UTF-8. On platforms that implement Unicode as UCS-2 or
UTF-16 internally, the incoming UTF-8 stream must pass through a mindless
finite-state automaton before any application-level code can see it. The
FSA relies entirely on the structure of UTF-8. If new characters are
added to Unicode, it doesn't break the UTF-8 parser. But if somebody came
along and "enhanced" UTF-8, e.g. to allow single-byte Latin-1 characters
(as so many readers have proposed), this would break every UTF-8 parser
that followed the rules.
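
The same idea can be sketched for UTF-8. The decoder below is a minimal
illustration in Python, assuming the present-day definition of UTF-8
(sequences of at most four bytes, nothing above U+10FFFF, no surrogates);
the name decode_utf8 is hypothetical. It looks only at byte structure, never
at whether a code point has been assigned.

```python
def decode_utf8(data: bytes):
    """Return a list of code points; raise ValueError on any structural violation."""
    out = []
    need = 0   # continuation bytes still expected
    cp = 0     # code point being assembled
    low = 0    # smallest value allowed for this sequence length (rejects overlongs)
    for b in data:
        if need == 0:
            if b < 0x80:
                out.append(b)
            elif 0xC2 <= b <= 0xDF:
                need, cp, low = 1, b & 0x1F, 0x80
            elif 0xE0 <= b <= 0xEF:
                need, cp, low = 2, b & 0x0F, 0x800
            elif 0xF0 <= b <= 0xF4:
                need, cp, low = 3, b & 0x07, 0x10000
            else:
                raise ValueError(f"byte 0x{b:02X} cannot start a sequence")
        else:
            if not 0x80 <= b <= 0xBF:
                raise ValueError(f"byte 0x{b:02X} is not a continuation byte")
            cp = (cp << 6) | (b & 0x3F)
            need -= 1
            if need == 0:
                if cp < low or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                    raise ValueError(f"U+{cp:04X} is overlong or out of range")
                out.append(cp)
    if need:
        raise ValueError("truncated sequence at end of input")
    return out
```

A code point the application has never heard of still decodes, because only
the structure is examined. A bare Latin-1 or CP1252 byte does not: in
b"caf\xe9 au lait" the 0xE9 announces a three-byte sequence that never
arrives, and in b"\x93quoted\x94" the 0x93 is a continuation byte with
nothing to continue. Both raise ValueError.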

Let's hope the same cavalier attitude that prevails towards ISO standards in
this forum does not carry over to UTF-8. I mean really, if I can make more
money by enhancing UTF-8 than by following the rules, why shouldn't I do it?
Will the standards police throw me in jail?

- Frank


