RE: Conformance (was UTF, BOM, etc)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 10:50:24 CST


    Peter Kirk wrote:

    > ... indicates this encoding (as permitted but not encouraged in the
    > Unicode standard) by preceding the string of characters with a BOM, which
    > is not one of those characters. But your definition of "plain text" seems
    > rather different, more like a string of arbitrary bytes which is supposed
    > to have some interpretation as characters but whose encoding is unknown
    > at this level, rather like the serialised data passed across my interface
    > D above. This is perhaps more like what Unix does in practice. But I
    > don't think it is helpful to define "plain text" in this way.

    Actually, what I meant by "plain text" is the text minus the BOM. And if I
    understand you correctly, we agree.

    My claim that there is no "plain text" on Windows was somewhat sarcastic:
    plain text files are so rare, command-line utilities are so rare, you
    sometimes even have problems exporting data to plain text, and you
    sometimes even have problems copying text from dialogs to the clipboard.

    So most of the text files are really text documents, created by Notepad and
    the like. Hence my claim that Windows doesn't even need a plain text editor.

    > ... system's default code page. This cannot be
    > UTF-8, and so these files cannot start with a BOM

    Actually, they're not that far from it. Try "mode CON CP SELECT=65001" (the
    UTF-8 code page). It is unsupported. Why?

    OK, if you use any SBCS codepage, the very least you get is preservation of
    stdin. Text might not be interpreted or displayed correctly, but simple
    processing like piping through the 'more' command will at least not drop
    data. The problem with any legacy encoding is that not all Unicode
    characters can be represented, which causes data to be dropped on output
    whenever the input is Unicode. Granted, stdin cannot be Unicode, but the
    input in this case can be argv, or a list of files obtained from the system
    via the Unicode API, or the contents of a Unicode file, assuming the
    application knows how to read it (fopen won't do). By the way, (f)printf is
    buggy to the extent that it does not just drop unconvertible codepoints, it
    aborts the output: fputc signaled an error, and fprintf didn't know what to
    do with the signal. Unicode typically says to signal an error in the
    conversion, and fputc did exactly that. One could argue that fprintf and
    fputc did not agree on what they were doing. Well, that will happen quite
    often. In some cases it is better to define behavior on invalid data than
    to simply pass the problem on by cleverly stating that the error has been
    signaled.

    Now consider that the user's (!) default code page is UTF-8 (so 65001). You
    would get proper output and no dropping for Unicode data. But what happens
    is that applications start dropping data on stdin, because invalid sequences
    are dropped. And by "dropped" I make no distinction between skipping them
    and replacing them with U+FFFD: either way it is dropping data.
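
    To make that concrete, here is a deliberately minimal toy in C (not a real
    UTF-8 decoder; the names are mine) that does what such converters do with
    invalid input:

        #include <stdio.h>

        /* ASCII bytes pass through; in this toy, everything else stands in
         * for "invalid sequence" and is replaced with U+FFFD.  The point:
         * distinct inputs such as "\xC0\xAF" and "\xFF\xFF" come out as the
         * same U+FFFD U+FFFD, so the original bytes are unrecoverable,
         * whether you call it skipping or replacing. */
        static size_t decode_lossy(const unsigned char *in, size_t n,
                                   unsigned int *out)
        {
            size_t o = 0;
            for (size_t i = 0; i < n; i++)
                out[o++] = (in[i] < 0x80) ? in[i] : 0xFFFD;
            return o;
        }

        int main(void)
        {
            unsigned int buf[8];
            size_t n = decode_lossy((const unsigned char *)"a\xC0\xAF", 3, buf);
            for (size_t i = 0; i < n; i++)
                printf("U+%04X ", buf[i]);
            printf("\n");    /* prints: U+0061 U+FFFD U+FFFD */
            return 0;
        }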

    It would be nice to have UTF-8 as a default code page, wouldn't it? Someone
    must have realized that dropping data on stdin is more than users would be
    willing to accept. Well, we can wait a couple of years to get all the
    out-of-band data sorted out. Or clutter everything with BOMs. Maybe then
    we'll know when the data is UTF-8 and when it is not. Maybe we will, maybe
    we won't. How about defining how to convert invalid UTF-8 sequences to
    codepoints? Things would start working; admittedly no better than they work
    today, but then the "current code page" concept did not differentiate
    between encodings either. Why should we differentiate UTF-8 from the rest?
    Of course it would be useful, but can it be done reliably? Can it be done
    in the near future?
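
    One conceivable convention (my own illustration, not something the standard
    defines): map each byte of an invalid sequence to an otherwise unused
    codepoint, for example U+DC00 plus the byte value, so the conversion can be
    reversed exactly:

        /* Escape an invalid byte (always 0x80..0xFF in practice) into
         * 0xDC80..0xDCFF so the original byte stream can be reconstructed. */
        static unsigned int escape_invalid_byte(unsigned char b)
        {
            return 0xDC00u + b;
        }

        /* Returns 1 and stores the raw byte if cp was an escaped byte,
         * 0 if cp is an ordinary character. */
        static int unescape_codepoint(unsigned int cp, unsigned char *b)
        {
            if (cp >= 0xDC80u && cp <= 0xDCFFu) {
                *b = (unsigned char)(cp - 0xDC00u);
                return 1;
            }
            return 0;
        }

    Which codepoints to use for the escapes is of course a separate question;
    the point is only that the round trip becomes possible.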

    > This actually makes it difficult to create batch files with Notepad (the
    > extension has to be changed manually), but it is still only a partial
    > answer.

    Notepad's handling of extensions is poor, yes. But I was thinking about
    something else the other day. When I suggested that Notepad should emit a
    BOM only for .txt files and no others, I of course immediately saw the
    problem of renaming. Then I said to myself, how about Explorer stripping or
    adding the BOM as the file is renamed? Very Windows-like, much like hiding
    the extensions in Explorer. Anyway, a frightening thought. But then, why
    change the actual contents at all? Should the filesystem really support
    out-of-band data, it could simply always consume the BOM (all BOMs?) and
    the CRs written, and automatically feed them back to applications opening
    the file in text mode. Of course newer applications would gain the ability
    to read the out-of-band data and would open the file in binary mode. Older
    applications would get double the overhead (though the system can do its
    part pretty efficiently). Newer ones would get rid of the overhead, and we
    would get rid of the clutter in the files. In interchange, the binary image
    would typically be used, and UNIX would be happy.
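
    As a rough sketch of what that text-mode layer could do (the function name
    is my own; in the scheme above this would live in the filesystem or the
    CRT, not in each program):

        #include <stdio.h>
        #include <string.h>

        /* Open a file for reading and transparently swallow a leading UTF-8
         * BOM, so the application never sees it.  CR handling would follow
         * the same pattern and is omitted here. */
        static FILE *fopen_text_skip_bom(const char *path)
        {
            FILE *f = fopen(path, "rb");
            if (f != NULL) {
                unsigned char bom[3];
                size_t n = fread(bom, 1, 3, f);
                if (n != 3 || memcmp(bom, "\xEF\xBB\xBF", 3) != 0)
                    rewind(f);   /* no BOM: hand every byte back untouched */
            }
            return f;
        }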

    There is of course the problem of labeling all the data. But after all, is
    it not the data that should be labeled as text, rather than the application
    being the one that decides which data to open as text? I am starting to
    like the OOB approach. But I still think we cannot rely solely on it to
    address the problems.

    Lars


