Re: Subject: Re: 32'nd bit & UTF-8

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jan 20 2005 - 05:48:12 CST

    On 20/01/2005 01:55, Hans Aberg wrote:

    > ...
    >
    >It is just that it is in effect a file encoding format, not a character
    >encoding format, originally tied to the MS OS. Unicode should not promote
    >any specific OS over another. Plain text files do not have a BOM, period.
    >
    >
    >
    On the contrary, the Unicode standard specifies that a BOM may be used
    at the start of a plain text file under certain circumstances.

    > ...
    >
    >So then one in effect has to rewrite the whole UNIX operative system, in
    >order to ensure that and UTF-8 compliance. ...
    >

    Well, hardly the whole OS. One approach, if a system locale is UTF-8, is
    to rewrite the file handling only so that any file opened in text mode
    starting with the BOM signature in any of the standard UTFs is converted
    to BOM-less UTF-8 before being presented to higher levels. The
    implication of this is that any text data within the system is UTF-8.
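
    A minimal sketch of this kind of conversion layer, in Python rather
    than in the OS file handling itself. The helper name to_bomless_utf8
    is hypothetical, and the sketch assumes that data without a signature
    is already UTF-8, as it would be under a UTF-8 locale:

```python
import codecs

# BOM signatures and the encodings they announce. The UTF-32 entries come
# first because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE
# BOM (FF FE) and would otherwise be misdetected.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def to_bomless_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: return the data as UTF-8 with no BOM."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the signature, decode, and re-encode as plain UTF-8.
            text = raw[len(bom):].decode(encoding)
            return text.encode("utf-8")
    # No signature found: assume the data is already UTF-8 (assumption).
    return raw
```

    One known ambiguity: a UTF-16 LE file whose first character after the
    BOM is U+0000 begins with the same four bytes as a UTF-32 LE signature,
    so a real implementation would need a policy for that case.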

    > ...
    >
    >Right now, this is so. But clearly implementors of UNIX will not rewrite the
    >whole OS just to accommodate a single inhouse file format on another
    >platform, ...
    >

    We are not talking about any inhouse file format. We are talking about a
    standard file format specified in the Unicode standard. You may not like
    this format being in the standard. You could perhaps even campaign for
    it to be removed, although of course that doesn't change legacy data.
    But while it is in the standard, it is a standard file format.

    >... just as MS would not have rewritten its OS if Unicode dictated
    >that the \r\n combination to be illegal in UTF-8 files..
    >
    >

    What makes you think this? I guess MS would have opposed this change,
    and for good reasons. But MS has a good record of implementing the
    standard as specified, and it would not have been hard for them to
    support an alternative line breaking sequence, at least in files which
    are known to be UTF-8 e.g. from a BOM.

    I wish MS would support alternative line breaking sequences, and treat
    \n or \r in isolation as equivalent to \r\n. But Unicode has chosen not
    to get involved in this issue. And I wish that Unix would similarly
    support MS and other line breaking sequences.
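
    As an illustration of what treating the three sequences as equivalent
    could look like, here is a small Python sketch. The function name
    normalize_newlines is hypothetical, and it operates on already decoded
    text rather than inside any OS or editor:

```python
import re

# Match \r\n first so a CRLF pair is treated as one line break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Hypothetical helper: replace \\r\\n, lone \\r and lone \\n with one form."""
    # A function is used as the replacement so that `newline` is inserted
    # literally, without any escape processing by the re module.
    return _LINE_BREAK.sub(lambda match: newline, text)
```

    For example, normalize_newlines("a\r\nb\rc\nd") gives "a\nb\nc\nd", and
    passing newline="\r\n" produces the MS convention instead.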

    >
    >
    > ...
    >
    >>Well, maybe, or maybe as something like "the sequence <i diaeresis,
    >>guillemet, inverted question mark> ’ÄúˆØ ¬ª ¬ø’Äù ", ...
    >>

    I note from this mojibake that your system does not support UTF-8
    properly even without a BOM. By the way, at this point I was assuming
    that your "lexers that are made for ASCII data" would not choke
    completely on bytes above 0x7F but would at least implicitly interpret
    them according to some code page, because otherwise there is no way
    that they can get anywhere near supporting UTF-8.
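
    The three-character sequence named in the quoted text can be
    reproduced with a short Python sketch. It assumes the code page in
    question is Latin-1 (ISO 8859-1); this is an assumption for
    illustration, not something stated about any particular lexer:

```python
import codecs
import unicodedata

# The UTF-8 signature is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Interpreting those bytes one at a time in Latin-1 (assumed code page)
# yields three separate characters instead of one invisible U+FEFF.
misread = bom.decode("latin-1")

# Look up the Unicode names of the characters a reader would see.
names = [unicodedata.name(ch) for ch in misread]

for ch, name in zip(misread, names):
    print(f"U+{ord(ch):04X} {name}")
```

    Decoded as UTF-8 instead, the same three bytes give the single
    character U+FEFF.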

    >>... to quote the same page of
    >>the Unicode standard. If so, I'm sorry to say, so much for your old
    >>program, you need to upgrade to the world of Unicode.
    >>
    >>
    >
    >But UNIX programs should not need to be updated because of an MS inhouse
    >file format.
    >
    >
    >
    Indeed. But they do need to be updated because of a Unicode standard
    file format.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/

