Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg
Date: Thu Jan 20 2005 - 08:14:27 CST

    On 2005/01/20 12:48, Peter Kirk wrote:

    >> It is just that it is in effect a file encoding format, not a character
    >> encoding format, originally tied to the MS OS. Unicode should not promote
    >> any specific OS over another. Plain text files do not have a BOM, period.

    > On the contrary, the Unicode standard defines that a BOM should be used
    > at the start of a plain text file under certain circumstances.

    Then this is a Unicode text file, not a plain text file. For Unicode to call
    it a plain text file just adds to the confusion.

    It appears that Unicode takes old concepts and alters their definitions
    but retains the old names. This is going to create a mess. When introducing
    new concepts, Unicode should choose new names.

    >> So then one in effect has to rewrite the whole UNIX operating system, in
    >> order to ensure that and UTF-8 compliance. ...

    > Well, hardly the whole OS. One approach, if a system locale is UTF-8, is
    > to rewrite the file handling only so that any file opened in text mode
    > starting with the BOM signature in any of the standard UTFs is converted
    > to BOM-less UTF-8 before being presented to higher levels. The
    > implication of this is that any text data within the system is UTF-8.

    This would solve just one of the problems. The web page cited earlier lists
    others.
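
    As an illustration of the file-handling approach quoted above, here is a
    minimal Python sketch: look for the signature of one of the standard UTFs
    at the start of the data and, if one is found, re-encode the rest as
    BOM-less UTF-8. The helper name to_bomless_utf8 and the choice to pass
    unsigned data through untouched are assumptions made for the example, not
    something specified in this thread or in the Unicode standard.

```python
import codecs

# Signatures of the standard UTFs. The UTF-32 entries come first because
# the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_bomless_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8 without a BOM (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Drop the signature, decode the rest, and re-encode as UTF-8.
            text = raw[len(bom):].decode(encoding)
            return text.encode("utf-8")
    # No signature found: assume the data is already BOM-less UTF-8.
    return raw
```

    The ordering of the table matters, and it is still a heuristic: a UTF-16LE
    file whose first character after the BOM is U+0000 starts with the same
    four bytes as a UTF-32LE file and would be misdetected here.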

    >> Right now, this is so. But clearly implementors of UNIX will not rewrite the
    >> whole OS just to accommodate a single inhouse file format on another
    >> platform, ...

    > We are not talking about any inhouse file format. We are talking about a
    > standard file format specified in the Unicode standard.

    The motivation for introducing it into Unicode was that a single MS text
    processor used it to identify file contents, a practice not found on
    other platforms. So although it is clearly part of the Unicode standard,
    it remains a de facto MS inhouse file format until others start to use
    it.

    >You may not like
    > this format being in the standard. You could perhaps even campaign for
    > it to be removed, although of course that doesn't change legacy data.
    > But while it is in the standard, it is a standard file format.

    So I do suggest that it be removed. If file content indicators are needed,
    they should be of a general type. One could, for example, decide that
    \xFFFE and \xFFFF delimit file/stream format indicators. Then there would
    be no requirement to have them, but suitable software could require them.
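
    To make the suggestion concrete, here is one possible reading of it as a
    Python sketch: an optional indicator at the very start of a stream, opened
    by the noncharacter U+FFFE and closed by the noncharacter U+FFFF. The
    payload between the delimiters (for instance an encoding name) and the
    function name split_indicator are invented for illustration; no such
    format is defined by Unicode or proposed in detail in this thread.

```python
START = "\ufffe"  # noncharacter U+FFFE, assumed here to open an indicator
END = "\uffff"    # noncharacter U+FFFF, assumed here to close an indicator


def split_indicator(text: str):
    """Return (indicator, body); indicator is None when none is present."""
    if text.startswith(START):
        end = text.find(END, 1)
        if end != -1:
            # Everything between the delimiters is the indicator payload.
            return text[1:end], text[end + 1:]
    # No complete indicator: the text is returned unchanged.
    return None, text
```

    Software that does not care about indicators would simply never see one,
    while software that requires them could reject input for which the first
    element of the result is None.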

    As a standard, Unicode will have to fight for recognition. Introducing things
    like the UTF-8 BOM requirement makes it more difficult for Unicode to earn
    that recognition.

    >> ... just as MS would not have rewritten its OS if Unicode dictated
    >> that the \r\n combination be illegal in UTF-8 files.

    > What makes you think this?

    You probably need to study this company a bit more. :-)

    >I guess MS would have opposed this change,
    > and for good reasons.

    Just as I, and others, will oppose the UTF-8 BOM requirement for good
    reasons.

    >But MS has a good record of implementing the
    > standard as specified,

    Another poster just said the opposite.

    >and it would not have been hard for them to
    > support an alternative line breaking sequence, at least in files which
    > are known to be UTF-8 e.g. from a BOM.
    > I wish MS would support alternative line breaking sequences, and treat
    > \n or \r in isolation as equivalent to \r\n. But Unicode has chosen not
    > to get involved in this issue. And I wish that Unix would similarly
    > support MS and other line breaking sequences.

    You are taking this analogy too far, because it is fairly easy to fix the
    \r\n problem, whereas the BOM problem runs deeper. The latter changes the
    very paradigm for file representation.
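
    As a rough indication of how small the \r\n fix is, the following Python
    sketch treats \r\n, a lone \r, and a lone \n as the same line break and
    rewrites each to one chosen sequence. The function name
    normalize_newlines and its default target of \n are assumptions made for
    the example.

```python
import re

# \r\n is listed first so that it is consumed as one line break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Replace every \\r\\n, lone \\r, or lone \\n with the given sequence."""
    return _LINE_BREAK.sub(lambda _match: newline, text)
```

    The change is local to each line break, which is why it can live in a
    single filter. A signature at the start of a file is different in kind:
    every tool that reads the beginning of a file or concatenates files has to
    know about it.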

    >>> Well, maybe, or maybe as something like "the sequence <i diaeresis,
    >>> guillemet, inverted question mark> ", ...
    > I note from this mojibake that your system does not support UTF-8
    > properly even without a BOM.

    This is not a quote from me. My mail should be in ASCII, as is a usual
    requirement of technical lists.
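
    For reference, the sequence named in the quoted text is what the three
    bytes of the UTF-8 signature (EF BB BF) look like when they are read as
    ISO 8859-1 instead of UTF-8. A short Python check, using only the standard
    codecs and unicodedata modules, shows the mapping; the variable names are
    chosen for the example.

```python
import codecs
import unicodedata

# The UTF-8 signature is the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8

# ISO 8859-1 maps each byte to the code point with the same value.
misread = bom_bytes.decode("latin-1")

# Print the code point and Unicode name of each resulting character.
for ch in misread:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```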

    >By the way, at this point I was assuming
    > that your "lexers that are made for ASCII data" would not choke
    > completely on bytes above 0x7F but would at least implicitly interpret
    > them according to some code page, because otherwise there is no way
    > that they can get anywhere near supporting UTF-8.

    As for Flex, one can choose a 7-bit or an 8-bit lexer generation mode. Quite
    naturally, one would choose the 8-bit mode for UTF-8 parsing.
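
    The point that a byte-oriented, 8-bit-clean scanner can handle UTF-8 can
    be sketched without Flex itself. The Python example below is a stand-in,
    not Flex syntax: it uses a regular expression over bytes, with one
    alternative per row of the well-formed UTF-8 byte sequence table, to split
    input into characters. The names UTF8_CHAR and scan are made up for the
    example.

```python
import re

# One well-formed UTF-8 sequence, written as byte ranges in the way a rule
# for an 8-bit scanner could be written.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7f]"                          # 1 byte (ASCII)
    rb"|[\xc2-\xdf][\x80-\xbf]"              # 2 bytes
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"          # 3 bytes, no overlong forms
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"   # 3 bytes
    rb"|\xed[\x80-\x9f][\x80-\xbf]"          # 3 bytes, no surrogates
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"       # 4 bytes, no overlong forms
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"           # 4 bytes
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2}"       # 4 bytes, up to U+10FFFF
)


def scan(data: bytes):
    """Split data into the byte sequences of its UTF-8 characters."""
    pos = 0
    tokens = []
    while pos < len(data):
        match = UTF8_CHAR.match(data, pos)
        if match is None:
            raise ValueError(f"ill-formed UTF-8 at byte offset {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens
```

    Every alternative involves bytes above 0x7F except the first, which is why
    a 7-bit mode cannot express these rules at all.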

    >> But UNIX programs should not need to be updated because of an MS inhouse
    >> file format.

    > Indeed. But they do need to be updated because of a Unicode standard
    > file format.

    As the Unicode standard stands now, yes, given that Unicode has adopted
    an MS inhouse file format as part of its standard. But Unicode should not
    favor a particular platform this way.

      Hans Aberg
