Re: Conformance (was UTF, BOM, etc)

From: Peter Kirk
Date: Sat Jan 22 2005 - 11:58:46 CST

    On 22/01/2005 16:50, Lars Kristan wrote:

    > ...
    > > ... system's default code page. This cannot be
    > > UTF-8, and so these files cannot start with a BOM
    > Actually, they're not that far from it. Try "mode CON CP
    > SELECT=65000". It is unsupported. Why?
    > ...
    > Now consider that user's (!) default code page is UTF-8 (so 65000).
    > You would get proper output and no dropping for Unicode data. But what
    > happens is that applications start dropping data on the stdin. Because
    > invalid sequences are dropped. And with dropped I make no distinction
    > between skipping them and replacing them with U+FFFD. It is dropping data.
    > It would be nice to have UTF-8 as a default code page, wouldn't it?
    > Someone must have realized that dropping data on the stdin is more
    > than users would be willing to accept. Well, we can wait a couple of
    > years to get all the out of band data sorted out. Or clutter
    > everything with BOMs. Maybe then we'll know when the data is UTF-8 and
    > when it is not. Maybe we will, maybe we won't. How about defining how
    > to convert invalid UTF-8 sequences to codepoints? It would start
    > working. Indeed no better than things work today. But the "current
    > code page" concept did not differentiate between different encodings.
    > Why should we differentiate UTF-8 from the rest? Of course it would be
    > useful, but can it be done reliably? Can it be done in near future?
    This is interesting speculation. But with any code page there are bytes
    or combinations of bytes which are illegal or undefined in that code
    page. When Windows NT/2000/XP, which is internally Unicode (represented
    as UTF-16), reads a code page file as text, it converts the contents to
    Unicode.
    The correct behaviour when an illegal or undefined byte is found is to
    replace it with U+FFFD, and I think this is what Windows does. This you
    might also call dropping of data, although in fact it is not data but
    garbage, or data wrongly labelled and so misinterpreted as garbage.
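
    A minimal sketch of that replacement behaviour, using Python's "cp1252"
    codec as a stand-in for a code page conversion. This is an assumption
    made for illustration: it shows what the Python codec does with an
    unassigned byte, not what Windows itself does.

```python
# Illustration only: Python's "cp1252" codec stands in for a code page
# conversion; Windows itself may map these bytes differently.
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# 0xE9 is 'é' in Windows-1252; 0x81 has no assignment in Python's cp1252 table.
raw = b"caf\xe9 \x81 ok"

# errors="replace" substitutes U+FFFD for each byte the codec cannot map.
text = raw.decode("cp1252", errors="replace")

# Print with escapes so the result is visible regardless of console encoding.
print(text.encode("unicode_escape").decode("ascii"))
```

    The defined byte is converted normally and only the undefined one is
    replaced.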

    And if, speculatively, Windows were to support UTF-8 as a code page, the
    situation would be unchanged. Byte sequences which are illegal UTF-8 are
    garbage in that code page and so would correctly be replaced by U+FFFD.
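
    The same thing can be sketched for UTF-8, again with Python's decoder
    as an assumed stand-in rather than any actual Windows code page
    support. It also shows why the replacement is lossy: the original
    bytes cannot be recovered from the decoded text.

```python
# Illustration only: Python's UTF-8 decoder, not a Windows code page.
raw = b"abc\xff\xfedef"  # 0xFF and 0xFE never occur in well-formed UTF-8

# Each ill-formed byte is replaced by U+FFFD instead of raising an error.
text = raw.decode("utf-8", errors="replace")

# Re-encoding yields the bytes of U+FFFD (EF BF BD), not the original bytes.
round_trip = text.encode("utf-8")
lossless = (round_trip == raw)

print(text.encode("unicode_escape").decode("ascii"))
print("lossless:", lossless)
```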

    But then even if UTF-8 were supported as a code page I think I would
    keep Windows 1252 as my system code page. There is too much Windows 1252
    legacy data around which would be treated as garbage if UTF-8 were my
    system code page. The code page is used only by obsolescent legacy
    applications, and by modern applications reading legacy data. Windows
    Unicode support is adequate without trying to reinterpret legacy data as
    Unicode. And rather than try to trick old applications into supporting
    Unicode through UTF-8, the Windows strategy has rightly been to update
    the applications for proper Unicode support.
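
    To illustrate the legacy data problem, here is a small sketch in which
    text saved as Windows 1252 is read back as if it were UTF-8. The sample
    string and the use of Python's codecs are assumptions for the example.

```python
# Illustration only: a hypothetical legacy file saved in Windows-1252.
legacy_bytes = "na\u00efve caf\u00e9".encode("cp1252")  # b"na\xefve caf\xe9"

# Read under the wrong assumption: every accented letter is ill-formed UTF-8.
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

# Read under the right assumption: the text comes back intact.
as_cp1252 = legacy_bytes.decode("cp1252")

print(as_utf8.encode("unicode_escape").decode("ascii"))
print(as_cp1252.encode("unicode_escape").decode("ascii"))
```

    Each accented letter is a single byte above 0x7F, which is never
    well-formed UTF-8 on its own, so both are replaced by U+FFFD.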


    > ... Very Windows-like. Much like hiding the extensions in Explorer. ...

    This is optional. An option which anyone who knows anything much about
    computers should switch off.

    Peter Kirk