Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 06:51:10 CST

    On 2005/01/20 05:54, Peter Constable at petercon@microsoft.com wrote:

    >> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    >> Behalf Of Hans Aberg

    >> It is just that it is in effect a file encoding format, not a character
    >> encoding format, originally tied to the MS OS. Unicode should not promote
    >> any specific OS over another. Plain text files do not have a BOM, period.
    >
    > I've generally been deleting all this blather -- seems like every year
    > and a half or so someone comes along raising a ruckus about UTF-8 -- so
    > perhaps this has been said; if so, please forgive the duplication.
    >
    > The suggestion that Unicode is promoting a specific OS, specifically
    > Windows, based on statements in the standard related to UTF-8 is hard to
    > take seriously given that that OS does not itself use UTF-8 in its file
    > system, in its shell, nor by default in any of its internal operations
    > or APIs (some APIs, such as WideCharToMultiByte, can be coerced into
    > passing UTF-8).

    I think this is wrong: UNIX versions are now appearing that in effect
    process UTF-8 without the BOM. Only relatively minor changes are needed, as
    these OSes already know how to process 8-bit bytes.

    > As for whether plain text files can have a BOM, that is one of the few
    > unending debates that arise with certain (fortunately not too frequent)
    > regularity, each time with vociferous expressions of deeply-held beliefs
    > but never any resolution.

    So why do you keep it, when relatively minor changes to the standard would
    make people happy and make all those discussions, as well as all the extra
    programmer work that Unicode now requires, go away?

    > I'll just observe that the formal grammar for
    > XML does not make reference to a BOM, yet the XML spec certainly assumes
    > that a well-formed XML document may begin with a UTF-8 BOM (or a BOM in
    > any Unicode encoding form/scheme). Rather than have a philosophical
    > debate about the definition of "plain text file", I suggest a more
    > pragmatic approach: for better or worse, plain text processes that
    > support UTF-8 are going to encounter UTF-8 data beginning with a BOM:
    > learn to live with it!

    If the question were only about XML browsers and text editors, then the BOM
    would probably be little of a problem. But UNIX programs will not execute
    properly with it.
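
    A minimal sketch of the concern, in Python. It assumes a loader that
    recognizes a script only when the very first two bytes of the file are
    "#!", so a leading UTF-8 BOM (the bytes EF BB BF) hides the marker. The
    helper names strip_utf8_bom and looks_like_script are hypothetical and
    exist only for this illustration.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


def looks_like_script(data: bytes) -> bool:
    """Assumed check: a script must start with '#!' at byte offset 0."""
    return data[:2] == b"#!"


plain = b"#!/bin/sh\necho hello\n"
with_bom = UTF8_BOM + plain

print(looks_like_script(plain))
print(looks_like_script(with_bom))
print(looks_like_script(strip_utf8_bom(with_bom)))
```

    Under that assumption the BOM-prefixed file is not recognized until the
    BOM is stripped, which is the kind of extra work every byte-oriented tool
    would have to take on.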

      Hans Aberg


