RE: Conformance (was UTF, BOM, etc)

From: Lars Kristan ([email protected])
Date: Sat Jan 22 2005 - 12:41:23 CST

Next message: Peter Kirk: "Re: Conformance (was UTF, BOM, etc)"

Previous message: Doug Ewell: "UCData (was: Re: The "JDGI" file grows)"
Maybe in reply to: Arcane Jill: "Conformance (was UTF, BOM, etc)"
Next in thread: Doug Ewell: "Re: Conformance (was UTF, BOM, etc)"
Reply: Doug Ewell: "Re: Conformance (was UTF, BOM, etc)"
Reply: Jon Hanna: "RE: Conformance (was UTF, BOM, etc)"
Reply: Peter Kirk: "Re: Conformance (was UTF, BOM, etc)"
Reply: Doug Ewell: "Re: Conformance (was UTF, BOM, etc)"

Peter Kirk wrote:

> This is interesting speculation. But with any code page there are bytes
> or combinations of bytes which are illegal or undefined in that code
> page.
In most SBCS encodings, there are none. Where such bytes do exist, they
typically do not occur in real data. This is why we're not used to that
problem, and why we won't be happy if it suddenly pops up every day. Once you
attempt to mix UTF-8 and legacy encodings, yes, illegal sequences pop up
immediately.
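
To make the contrast concrete, here is a minimal sketch in Python. It only illustrates the point; it is not something from the original discussion. The helper names (undefined_bytes, is_valid_utf8) are made up for this example, and the results reflect how Python's bundled codecs define these encodings, which may differ in detail from what Windows does.

```python
def undefined_bytes(encoding):
    """Return the single-byte values that the given codec refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


def is_valid_utf8(data):
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A pure single-byte encoding: every one of the 256 byte values is defined.
latin1_gaps = undefined_bytes("latin-1")

# Windows-1252 as defined by Python's codec: only a handful of gaps.
cp1252_gaps = undefined_bytes("cp1252")

# Ordinary legacy text, here one accented letter, is already illegal as UTF-8.
legacy_text = "café".encode("latin-1")

print("latin-1 undefined bytes:", [hex(b) for b in latin1_gaps])
print("cp1252 undefined bytes:", [hex(b) for b in cp1252_gaps])
print("legacy text valid as UTF-8?", is_valid_utf8(legacy_text))
```

The gaps in a single-byte code page are few and rarely show up in real files, whereas almost any non-ASCII legacy text fails strict UTF-8 validation.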

> When Windows (NT/2000/XP and so internally Unicode, represented as
> UTF-16) reads code page files as text, they are converted to Unicode.
> The correct behaviour when an illegal or undefined byte is found is to
> replace it with U+FFFD, and I think this is what Windows does.

I very much doubt it. The conversion not only does not do that, it is also
buggy in other ways, if I remember correctly. I think it also consumes
_valid_ characters following invalid sequences. Maybe I don't have all the
patches, but...

> And if, speculatively, Windows were to support UTF-8 as a code page, the
> situation would be unchanged. Byte sequences which are illegal UTF-8 are
> garbage in that code page and so would correctly be replaced by U+FFFD.

Which is exactly what needs to be changed. 128 codepoints, remember?
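
A rough sketch of the difference, in Python. This is an assumption-laden illustration, not the proposal itself: it takes the 128 codepoints to mean one reserved code point per byte value 0x80..0xFF, and it uses Python's "surrogateescape" error handler as a stand-in, which maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF rather than to newly assigned code points. The sample bytes are invented for the example.

```python
# Bytes that are not valid UTF-8: 0xE9 starts a sequence that never
# completes, and 0xFF can never appear in UTF-8 at all.
data = b"abc\xe9\xff"

# Replacement with U+FFFD: the text is readable, but the original bytes
# can no longer be recovered.
replaced = data.decode("utf-8", errors="replace")
replaced_roundtrip = replaced.encode("utf-8")

# One distinct code point per invalid byte: the original bytes survive
# a decode/encode round trip unchanged.
escaped = data.decode("utf-8", errors="surrogateescape")
escaped_roundtrip = escaped.encode("utf-8", errors="surrogateescape")

print("U+FFFD round trip lossless?  ", replaced_roundtrip == data)
print("escaped round trip lossless? ", escaped_roundtrip == data)
```

With U+FFFD the information about which byte was there is gone; with a per-byte mapping it is preserved, which is the property that matters for legacy data passing through a UTF-8 interface.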

> But then even if UTF-8 were supported as a code page I think I would
> keep Windows 1252 as my system code page. There is too much Windows 1252
> legacy data around which would be treated as garbage

As long as you believe it should be treated as garbage. Perhaps if this is
changed, then you would find it useful enough for your needs.

Microsoft can provide all UTF-16 applications. But the console can only be
improved by using UTF-8. This is the only solution that also works with
existing applications.

Lars

This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 12:43:23 CST