From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 12:41:23 CST
Peter Kirk wrote:
> This is interesting speculation. But with any code page there are bytes
> or combinations of bytes which are illegal or undefined in that code
> page.
In most SBCS encodings, there are none. Where they do exist, they typically
do not occur in practice. This is why we're not used to that problem, and why
we won't be happy if it suddenly pops up every day. Once you attempt to mix
UTF-8 and legacy encodings, yes, illegal sequences pop up immediately.
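A minimal sketch of that contrast, in Python purely for illustration (the thread itself names no language). It assumes ISO 8859-1 as the example SBCS, where all 256 byte values are assigned; the sample string is hypothetical.

```python
# Illustrative only: ISO 8859-1 stands in for "most SBCS encodings".
sample = "café".encode("latin-1")  # legacy single-byte data, ends in byte 0xE9

# Every one of the 256 byte values decodes in ISO 8859-1: nothing is illegal.
all_bytes = bytes(range(256))
decoded_all = all_bytes.decode("latin-1")

# The same legacy bytes read as UTF-8 fail at once: 0xE9 is the lead byte
# of a three-byte sequence, and no continuation bytes follow it.
try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Here `utf8_ok` ends up false even though the text is perfectly ordinary legacy data.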
> When Windows (NT/2000/XP and so internally Unicode, represented as
> UTF-16) reads code page files as text, they are converted to Unicode.
> The correct behaviour when an illegal or undefined byte is found is to
> replace it with U+FFFD, and I think this is what Windows does.
I very much doubt it. The conversion not only does not do that, it is also
buggy in other ways, if I remember correctly. I think it also consumes
_valid_ characters following invalid sequences. Maybe I don't have all the
patches, but...
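To make "consuming valid characters" concrete, here is a small sketch of the behaviour one would want instead. It uses Python's UTF-8 decoder, not the Windows conversion under discussion, and the input bytes are a hypothetical example: a truncated multi-byte sequence followed by a valid ASCII letter.

```python
# Hypothetical input: 0xE2 0x82 starts a three-byte sequence that is cut
# short, and the valid ASCII letter "A" follows immediately.
data = b"\xe2\x82A"

# With errors="replace" the bad bytes become U+FFFD, and the decoder
# resumes at the "A" instead of swallowing it.
decoded = data.decode("utf-8", errors="replace")
valid_char_survives = decoded.endswith("A")
```

A converter with the bug described above would lose the `A` as well.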
> And if, speculatively, Windows were to support UTF-8 as a code page, the
> situation would be unchanged. Byte sequences which are illegal UTF-8 are
> garbage in that code page and so would correctly be replaced by U+FFFD.
Which is exactly what needs to be changed. 128 codepoints, remember?
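One existing mechanism that resembles the 128-codepoint idea is Python's `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to one of the 128 code points U+DC80..U+DCFF. It is not the proposal made in this thread, only a sketch of how such a mapping round-trips where U+FFFD replacement does not. The sample bytes are hypothetical.

```python
# A stray Latin-1 byte 0xE9, a space, then valid UTF-8 for U+20AC.
raw = b"caf\xe9 \xe2\x82\xac"

# U+FFFD replacement: readable, but the original byte is gone for good.
lossy = raw.decode("utf-8", errors="replace")

# Escape mapping: the invalid byte 0xE9 becomes the code point U+DCE9.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Only the escaped form can be turned back into the original bytes.
back_from_lossy = lossy.encode("utf-8")
back_from_escaped = escaped.encode("utf-8", errors="surrogateescape")
```

After this runs, `back_from_escaped` equals `raw`, while `back_from_lossy` does not.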
>
> But then even if UTF-8 were supported as a code page I think I would
> keep Windows 1252 as my system code page. There is too much Windows 1252
> legacy data around which would be treated as garbage
As long as you believe it should be treated as garbage. Perhaps if this is
changed, then you would find it useful enough for your needs.
Microsoft can provide all UTF-16 applications. But the console can only be
improved by using UTF-8. This is the only solution that also works with
existing applications.
Lars