From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jan 20 2005 - 05:48:12 CST
On 20/01/2005 01:55, Hans Aberg wrote:
> ...
>
>It is just that it is in effect a file encoding format, not a character
>encoding format, originally tied to the MS OS. Unicode should not promote
>any specific OS over another. Plain text files do not have a BOM, period.
>
>
>
On the contrary, the Unicode standard specifies that a BOM should be used
at the start of a plain text file under certain circumstances.
> ...
>
>So then one in effect has to rewrite the whole UNIX operative system, in
>order to ensure that and UTF-8 compliance. ...
>
Well, hardly the whole OS. One approach, if a system locale is UTF-8, is
to rewrite the file handling only so that any file opened in text mode
starting with the BOM signature in any of the standard UTFs is converted
to BOM-less UTF-8 before being presented to higher levels. The
implication of this is that any text data within the system is UTF-8.
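To make the approach concrete, here is a minimal sketch in Python of the conversion step only. The function name to_bomless_utf8 is hypothetical, and the sketch assumes the whole file is already in memory as bytes; a real file-handling layer would do this on a stream as it is opened in text mode.

```python
import codecs

# Signatures are checked longest-first, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_bomless_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: return BOM-less UTF-8 for data that starts with a BOM.

    Data without a BOM is passed through unchanged, since nothing
    identifies its encoding.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the signature, decode from the detected UTF,
            # and re-encode as plain UTF-8.
            return raw[len(bom):].decode(encoding).encode("utf-8")
    return raw
```

Checking the longer signatures first means that UTF-16 LE text whose first character after the BOM is U+0000 would be taken for UTF-32 LE. I am assuming that case is rare enough to ignore in a sketch.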
> ...
>
>Right now, this is so. But clearly implementors of UNIX will not rewrite the
>whole OS just to accommodate a single inhouse file format on another
>platform, ...
>
We are not talking about any inhouse file format. We are talking about a
standard file format specified in the Unicode standard. You may not like
this format being in the standard. You could perhaps even campaign for
it to be removed, although of course that doesn't change legacy data.
But while it is in the standard, it is a standard file format.
>... just as MS would not have rewritten its OS if Unicode dictated
>that the \r\n combination to be illegal in UTF-8 files..
>
>
What makes you think this? I guess MS would have opposed this change,
and for good reasons. But MS has a good record of implementing the
standard as specified, and it would not have been hard for them to
support an alternative line breaking sequence, at least in files which
are known to be UTF-8, e.g. from a BOM.
I wish MS would support alternative line breaking sequences, and treat
\n or \r in isolation as equivalent to \r\n. But Unicode has chosen not
to get involved in this issue. And I wish that Unix would similarly
support MS and other line breaking sequences.
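As an illustration of what I mean by treating the sequences as equivalent, here is a short Python sketch. The function name normalize_line_breaks is hypothetical, and the choice of \n as the internal form is only an assumption for the example; a system could equally settle on \r\n.

```python
import re

# \r\n must come first in the alternation so that it is matched as one
# line break rather than as a \r followed by a separate \n.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_line_breaks(text: str, newline: str = "\n") -> str:
    """Hypothetical helper: map \\r\\n, lone \\r and lone \\n to one sequence."""
    return _LINE_BREAK.sub(newline, text)
```

With this, "a\r\nb\rc\nd" becomes "a\nb\nc\nd", and passing newline="\r\n" gives the MS form instead.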
>
>
> ...
>
>>Well, maybe, or maybe as something like "the sequence <i diaeresis,
>>guillemet, inverted question mark> ’ÄúˆØ ¬ª ¬ø’Äù ", ...
>>
I note from this mojibake that your system does not support UTF-8
properly even without a BOM. By the way, at this point I was assuming
that your "lexers that are made for ASCII data" would not choke
completely on bytes above 0x7F but would at least implicitly interpret
them according to some code page, because otherwise there is no way
that they can get anywhere near supporting UTF-8.
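For reference, this is what such a program would see if it read the UTF-8 BOM bytes (EF BB BF) one byte per character. The sketch below assumes the code page is ISO 8859-1 (Latin-1), and the helper name bom_seen_as is hypothetical; other code pages give different characters.

```python
import codecs
import unicodedata


def bom_seen_as(code_page: str = "latin-1") -> list[str]:
    """Hypothetical helper: name the characters a one-byte-per-character
    program would see in place of the UTF-8 BOM."""
    # Decode the three BOM bytes as if each were a character in the code page.
    misread = codecs.BOM_UTF8.decode(code_page)
    return [unicodedata.name(ch) for ch in misread]


for name in bom_seen_as():
    print(name)
```

Under Latin-1 the three names are LATIN SMALL LETTER I WITH DIAERESIS, RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK and INVERTED QUESTION MARK, which is the sequence I described.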
>>... to quote the same page of
>>the Unicode standard. If so, I'm sorry to say, so much for your old
>>program, you need to upgrade to the world of Unicode.
>>
>>
>
>But UNIX programs should not need to be updated because of an MS inhouse
>file format.
>
>
>
Indeed. But they do need to be updated because of a Unicode standard
file format.
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/