Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg
Date: Wed Jan 19 2005 - 17:51:29 CST

    On 2005/01/19 21:37, Peter Kirk wrote:

    >>> Maybe. Nevertheless, they exist, not only as a result of unintelligent
    >>> conversion from UTF-16 or UTF-32 to UTF-8, but also because at least one
    >>> UTF-8 editor, Notepad on Windows 2000 (and XP?), always emits a BOM at
    >>> the start of a UTF-8 file.

    >> Well, it seems easier to change that single editor, then. ...

    > It's not easy to change a program with an installed base in the hundreds
    > of millions worldwide! But I suppose it could be done as part of a
    > Windows service pack etc.

    It would be strange if MS couldn't provide an upgrade for such a small
    software change, especially since all its other software gets updated.

    > But that assumes that everyone would agree that this change would be a
    > good idea. Oliver doesn't, and he makes a good point.

    Well, isn't that a problem for MS then? BOM's screw up the UNIX platforms,
    so they are not going to be honored there anyway.

    >> ... Or write a program
    >> that removes it at need. Note however that most tools will just act on byte
    >> streams. If there is a generated lexer involved, if correctly written, it
    >> will generate an error for anything that is not correct. On the BOM
    >> question, some fellows simply want the BOM's to be ignored.
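
    A minimal sketch of such a BOM-removing program, assuming Python and
    working on raw bytes, since that is what most tools see. The function
    name strip_utf8_bom is hypothetical; codecs.BOM_UTF8 is the standard
    library constant for the bytes EF BB BF.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        # Drop only the first three bytes; anything later is left alone.
        return data[len(codecs.BOM_UTF8):]
    return data

print(strip_utf8_bom(b"\xef\xbb\xbfhello"))  # b'hello'
print(strip_utf8_bom(b"hello"))              # b'hello'
```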

    > I thought everyone was required to ignore BOM's, as soon as the encoding
    > has been determined.

    The problem is that UNIX software looks at the first bytes of a file, the
    "#!" marker, to determine if it is a shell script. This relies on the
    special property of the original UTF-8 that it is the identity on ASCII
    data. If a BOM is required, UTF-8 no longer has this ASCII compatibility
    property. And lexers that are made for ASCII data will most likely treat a
    BOM as an error.
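
    To illustrate the point, here is a small sketch, assuming Python. The
    function looks_like_script is a hypothetical stand-in for the check a
    UNIX system makes on the first two bytes; it is not the real kernel code.
    The "utf-8-sig" codec is Python's UTF-8 variant that writes a BOM.

```python
def looks_like_script(data: bytes) -> bool:
    # Hypothetical stand-in for the UNIX check: the file must begin
    # with the two ASCII bytes "#!".
    return data.startswith(b"#!")

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")         # ASCII-only text: UTF-8 bytes == ASCII bytes
with_bom = script.encode("utf-8-sig")  # same text with EF BB BF prepended

print(plain == script.encode("ascii"))  # True: identity on ASCII data
print(looks_like_script(plain))         # True
print(looks_like_script(with_bom))      # False: the BOM hides the "#!"
```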

      Hans Aberg
