From: Peter Kirk (peterkirk@qaya.org)
Date: Sat Jan 22 2005 - 11:58:46 CST
On 22/01/2005 16:50, Lars Kristan wrote:
> ...
>
> > ... system's default code page. This cannot be
> > UTF-8, and so these files cannot start with a BOM
>
> Actually, they're not that far from it. Try "mode CON CP
> SELECT=65001". It is unsupported. Why?
>
> ...
>
> Now consider that user's (!) default code page is UTF-8 (so 65001).
> You would get proper output and no dropping for Unicode data. But what 
> happens is that applications start dropping data on the stdin. Because 
> invalid sequences are dropped. And with dropped I make no distinction 
> between skipping them and replacing them with U+FFFD. It is dropping data.
>
> It would be nice to have UTF-8 as a default code page, wouldn't it? 
> Someone must have realized that dropping data on the stdin is more 
> than users would be willing to accept. Well, we can wait a couple of 
> years to get all the out of band data sorted out. Or clutter 
> everything with BOMs. Maybe then we'll know when the data is UTF-8 and 
> when it is not. Maybe we will, maybe we won't. How about defining how 
> to convert invalid UTF-8 sequences to codepoints? It would start 
> working. Indeed no better than things work today. But the "current 
> code page" concept did not differentiate between different encodings. 
> Why should we differentiate UTF-8 from the rest? Of course it would be 
> useful, but can it be done reliably? Can it be done in near future?
>
>
This is interesting speculation. But every code page has bytes or
combinations of bytes which are illegal or undefined in that code page.
When Windows (NT/2000/XP, which are internally Unicode, represented as
UTF-16) reads code page files as text, it converts them to Unicode. The
correct behaviour when an illegal or undefined byte is found is to
replace it with U+FFFD, and I think this is what Windows does. You might
also call this dropping of data, although in fact it is not data but
garbage, or data wrongly labelled and so misinterpreted as garbage.

And if, speculatively, Windows were to support UTF-8 as a code page, the
situation would be unchanged. Byte sequences which are illegal UTF-8 are
garbage in that code page and so would correctly be replaced by U+FFFD.

But even if UTF-8 were supported as a code page, I think I would keep
Windows-1252 as my system code page. There is too much Windows-1252
legacy data around which would be treated as garbage if UTF-8 were my
system code page. The code page is used only by obsolescent legacy
applications, and by modern applications reading legacy data. Windows
Unicode support is adequate without trying to reinterpret legacy data as
Unicode. And rather than try to trick old applications into supporting
Unicode through UTF-8, the Windows strategy has rightly been to update
the applications for proper Unicode support.
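
As a rough illustration of the replacement behaviour discussed above,
here is a minimal Python sketch. It uses Python's own codecs as a
stand-in, so it shows the principle and not necessarily what the Windows
conversion routines do. The helper name decode_lossy and the sample
bytes are made up for the example.

```python
# Illustrative sketch only: Python codecs stand in for the Windows
# conversion routines, which may behave differently.


def decode_lossy(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, substituting U+FFFD for anything undecodable."""
    return raw.decode(encoding, errors="replace")


# Byte 0x81 is undefined in Python's cp1252 table, so it is replaced.
cp1252_result = decode_lossy(b"caf\xe9 \x81", "cp1252")

# Windows-1252 text wrongly read as UTF-8: 0xE9 announces a three-byte
# sequence but is followed by a space, so it is an illegal sequence.
legacy_as_utf8 = decode_lossy(b"caf\xe9 au lait", "utf-8")

# After replacement the original byte cannot be recovered.
round_trip = legacy_as_utf8.encode("utf-8")

print(repr(cp1252_result))
print(repr(legacy_as_utf8))
print(round_trip != b"caf\xe9 au lait")
```

The second case is the one that matters for the choice of system code
page: perfectly good Windows-1252 data turns into U+FFFD when it is
labelled as UTF-8, and encoding the result again does not give back the
original bytes.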
...
> ... Very Windows-like. Much like hiding the extensions in Explorer. ...
>
This is optional. An option which anyone who knows anything much about 
computers should switch off.
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/