RE: Conformance (was UTF, BOM, etc)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 10:50:24 CST


    Peter Kirk wrote:

    > ... indicates this encoding (as permitted but not encouraged in the
    > Unicode standard) by preceding the string of characters with a BOM, which
    > is not one of those characters. But your definition of "plain text" seems
    > rather different, more like a string of arbitrary bytes which is supposed
    > to have some interpretation as characters but whose encoding is unknown
    > at this level, rather like the serialised data passed across my interface
    > D above. This is perhaps more like what Unix does in practice. But I
    > don't think it is helpful to define "plain text" in this way.

    Actually, what I meant by "plain text" is the text minus the BOM. And if I
    understand you correctly, we agree.

    My claim that there is no "plain text" on Windows was somewhat sarcastic:
    plain text files are so rare, command-line utilities are so rare, you
    sometimes even have problems exporting data to plain text, and you
    sometimes even have problems copying text from dialogs to the clipboard.

    So most of the text files are really text documents, created by Notepad and
    the like. Hence my claim that Windows doesn't even need a plain text editor.

    > ... system's default code page. This cannot be
    > UTF-8, and so these files cannot start with a BOM

    Actually, they're not that far from it. Try "mode CON CP SELECT=65001" (the
    UTF-8 code page). It is unsupported. Why?

    OK, if you use any SBCS codepage, the very least you get is preservation of
    stdin. Text might not be interpreted or displayed correctly, but simple
    processing like piping through the 'more' command will at least not drop
    data. The problem with any legacy encoding is that not all Unicode
    characters can be represented, which causes data to be dropped on output
    whenever the input is Unicode. Granted, stdin cannot be Unicode, but the
    input in this case can be argv, or a list of files obtained from the system
    via the Unicode API, or the contents of a Unicode file, assuming the
    application knows how to read it (fopen won't do). By the way, (f)printf is
    buggy to the extent that it does not just drop unconvertible codepoints, it
    aborts the output: fputc signaled an error, and fprintf didn't know what to
    do with the signal. Unicode typically says to signal an error in the
    conversion, and fputc did exactly that. One could argue that fprintf and
    fputc did not agree on what they were doing. Well, that will happen quite
    often. In some cases it is better to define behavior on invalid data than
    to simply pass the problem on by cleverly stating that the error has been
    signaled.

    Now consider that the user's (!) default code page is UTF-8 (so 65001). You
    would get proper output and no dropping for Unicode data. But what happens
    is that applications start dropping data on stdin, because invalid sequences
    are dropped. And by "dropped" I make no distinction between skipping them
    and replacing them with U+FFFD: either way it is dropping data.
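
    To make that concrete, here is a deliberately minimal toy in C (not a real
    UTF-8 decoder; the names are mine) that does what such converters do with
    invalid input:

        #include <stdio.h>

        /* ASCII bytes pass through; in this toy, everything else stands in
         * for "invalid sequence" and is replaced with U+FFFD.  The point:
         * distinct inputs such as "\xC0\xAF" and "\xFF\xFF" come out as the
         * same U+FFFD U+FFFD, so the original bytes are unrecoverable,
         * whether you call it skipping or replacing. */
        static size_t decode_lossy(const unsigned char *in, size_t n,
                                   unsigned int *out)
        {
            size_t o = 0;
            for (size_t i = 0; i < n; i++)
                out[o++] = (in[i] < 0x80) ? in[i] : 0xFFFD;
            return o;
        }

        int main(void)
        {
            unsigned int buf[8];
            size_t n = decode_lossy((const unsigned char *)"a\xC0\xAF", 3, buf);
            for (size_t i = 0; i < n; i++)
                printf("U+%04X ", buf[i]);
            printf("\n");    /* prints: U+0061 U+FFFD U+FFFD */
            return 0;
        }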

    It would be nice to have UTF-8 as a default code page, wouldn't it? Someone
    must have realized that dropping data on stdin is more than users would be
    willing to accept. Well, we can wait a couple of years to get all the
    out-of-band data sorted out. Or clutter everything with BOMs. Maybe then
    we'll know when the data is UTF-8 and when it is not. Maybe we will, maybe
    we won't. How about defining how to convert invalid UTF-8 sequences to
    codepoints? Things would start working; admittedly no better than they work
    today, but then the "current code page" concept did not differentiate
    between encodings either. Why should we differentiate UTF-8 from the rest?
    Of course it would be useful, but can it be done reliably? Can it be done
    in the near future?
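
    One conceivable convention (my own illustration, not something the standard
    defines): map each byte of an invalid sequence to an otherwise unused
    codepoint, for example U+DC00 plus the byte value, so the conversion can be
    reversed exactly:

        /* Escape an invalid byte (always 0x80..0xFF in practice) into
         * 0xDC80..0xDCFF so the original byte stream can be reconstructed. */
        static unsigned int escape_invalid_byte(unsigned char b)
        {
            return 0xDC00u + b;
        }

        /* Returns 1 and stores the raw byte if cp was an escaped byte,
         * 0 if cp is an ordinary character. */
        static int unescape_codepoint(unsigned int cp, unsigned char *b)
        {
            if (cp >= 0xDC80u && cp <= 0xDCFFu) {
                *b = (unsigned char)(cp - 0xDC00u);
                return 1;
            }
            return 0;
        }

    Which codepoints to use for the escapes is of course a separate question;
    the point is only that the round trip becomes possible.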

    > This actually makes it difficult to create batch files with Notepad (the
    > extension has to be changed manually), but it is still only a partial
    > answer.

    Notepad's handling of extensions is poor, yes. But I was thinking about
    something else the other day. When I suggested that Notepad should emit a
    BOM only for .txt files and no others, I of course immediately saw the
    problem of renaming. Then I said to myself, how about Explorer stripping or
    adding the BOM as the file is renamed? Very Windows-like, much like hiding
    the extensions in Explorer. Anyway, a frightening thought. But then, why
    change the actual contents at all? Should the filesystem really support
    out-of-band data, it could simply always consume the BOM (all BOMs?) and
    the CRs written, and automatically feed them back to applications opening
    the file in text mode. Of course newer applications would gain the ability
    to read the out-of-band data and would open the file in binary mode. Older
    applications would get double the overhead (though the system can do its
    part pretty efficiently). Newer ones would get rid of the overhead, and we
    would get rid of the clutter in the files. In interchange, the binary image
    would typically be used, and UNIX would be happy.
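
    As a rough sketch of what that text-mode layer could do (the function name
    is my own; in the scheme above this would live in the filesystem or the
    CRT, not in each program):

        #include <stdio.h>
        #include <string.h>

        /* Open a file for reading and transparently swallow a leading UTF-8
         * BOM, so the application never sees it.  CR handling would follow
         * the same pattern and is omitted here. */
        static FILE *fopen_text_skip_bom(const char *path)
        {
            FILE *f = fopen(path, "rb");
            if (f != NULL) {
                unsigned char bom[3];
                size_t n = fread(bom, 1, 3, f);
                if (n != 3 || memcmp(bom, "\xEF\xBB\xBF", 3) != 0)
                    rewind(f);   /* no BOM: hand every byte back untouched */
            }
            return f;
        }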

    There is of course the problem of labeling all the data. But after all, is
    it not the data that should be labeled as text, rather than the application
    being the one that decides which data to open as text? I am starting to
    like the OOB approach. But I still think we cannot rely solely on it to
    address the problems.

    Lars


