RE: Conformance (was UTF, BOM, etc)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 12:41:23 CST

    Peter Kirk wrote:

    > This is interesting speculation. But with any code page there are bytes
    > or combinations of bytes which are illegal or undefined in that code
    > page.
    In most SBCS encodings, there are none. Where there are some, they
    typically do not occur in practice. This is why we're not used to that
    problem, and why we won't be happy if it suddenly pops up every day. Once
    you attempt to mix UTF-8 and legacy encodings, yes, illegal sequences pop
    up immediately.
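
To make the contrast concrete, here is a minimal sketch in Python (not part of the original message). It uses Python's own codec tables, which may differ from the Windows code page tables; the helper names `undefined_bytes` and `is_valid_utf8` are made up for illustration.

```python
def undefined_bytes(encoding):
    """Return the byte values that `encoding` cannot decode on their own."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


def is_valid_utf8(data):
    """Report whether `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Single-byte code pages leave few or no byte values undefined
# (according to Python's codec tables).
latin1_gaps = undefined_bytes("latin-1")
cp1252_gaps = undefined_bytes("cp1252")

# Legacy text with one accented letter is already illegal as UTF-8.
legacy = "café".encode("cp1252")
legacy_is_utf8 = is_valid_utf8(legacy)

print(len(latin1_gaps), len(cp1252_gaps), legacy_is_utf8)
```

The point of the sketch is only that a single accented letter in legacy data is enough to produce an illegal UTF-8 sequence, whereas the same bytes are unremarkable in a single-byte code page.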

    > When Windows (NT/2000/XP and so internally Unicode, represented as
    > UTF-16) reads code page files as text, they are converted to Unicode.
    > The correct behaviour when an illegal or undefined byte is found is to
    > replace it with U+FFFD, and I think this is what Windows does.

    I very much doubt it. The conversion not only does not do that, it is also
    buggy in other ways, if I remember correctly. I think it also consumes
    _valid_ characters following invalid sequences. Maybe I don't have all the
    patches, but...
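
For reference, the replacement behaviour Peter describes can be sketched as follows. This shows Python's decoder, not the Windows conversion routines, so it says nothing about what Windows actually does; `decode_lossy` is a hypothetical helper name.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_lossy(data, encoding):
    """Decode `data`, substituting U+FFFD for bytes that cannot be decoded."""
    return data.decode(encoding, errors="replace")


# 0xE9 starts a multi-byte UTF-8 sequence but is followed by plain ASCII,
# so it is illegal here. A well-behaved decoder replaces only that byte
# and keeps the valid character that follows it.
mixed = b"a\xe9b"
text = decode_lossy(mixed, "utf-8")

print(text == "a" + REPLACEMENT + "b")
print(text.endswith("b"))
```

Note that the replacement is lossy: once 0xE9 has become U+FFFD, the original byte cannot be recovered from the decoded text.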

    > And if, speculatively, Windows were to support UTF-8 as a code page, the
    > situation would be unchanged. Byte sequences which are illegal UTF-8 are
    > garbage in that code page and so would correctly be replaced by U+FFFD.

    Which is exactly what needs to be changed. 128 codepoints, remember?
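
The "128 codepoints" remark presumably refers to reserving one code point for each byte value 0x80 to 0xFF, so that invalid bytes survive a round trip instead of collapsing into U+FFFD. The details of that proposal are not given in this message. As a loosely analogous illustration only, Python's `surrogateescape` error handler (PEP 383) maps each undecodable byte to a lone surrogate in the range U+DC80 to U+DCFF, which is a different mechanism from assigning new code points. The helper names below are hypothetical.

```python
def decode_keep_bytes(data):
    """Decode UTF-8, escaping each invalid byte as U+DC80..U+DCFF."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_keep_bytes(text):
    """Encode to UTF-8, turning escaped code points back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


legacy = b"caf\xe9"  # 0xE9 is not valid UTF-8 in this position
text = decode_keep_bytes(legacy)

# Code points in the decoded text that stand in for invalid bytes.
escapes = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Unlike U+FFFD replacement, the original bytes come back unchanged.
round_trip = encode_keep_bytes(text)
print(round_trip == legacy, len(escapes))
```

The design trade-off is that the decoded string now contains code points that are not ordinary text, so every consumer has to agree on how to treat them.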

    > But then even if UTF-8 were supported as a code page I think I would
    > keep Windows 1252 as my system code page. There is too much Windows 1252
    > legacy data around which would be treated as garbage

    As long as you believe it should be treated as garbage. Perhaps if this is
    changed, then you would find it useful enough for your needs.

    Microsoft can provide all UTF-16 applications. But the console can only be
    improved by using UTF-8. This is the only solution that also works with
    existing applications.

    Lars


