RE: Subject: Re: 32'nd bit & UTF-8

From: Peter Constable
Date: Wed Jan 19 2005 - 22:54:19 CST


    > From: []
    > Behalf Of Hans Aberg

    > It is just that it is in effect a file encoding format, not a
    > encoding format, originally tied to the MS OS. Unicode should not promote
    > any specific OS over another. Plain text files do not have a BOM,

    I've generally been deleting all this blather -- seems like every year
    and a half or so someone comes along raising a ruckus about UTF-8 -- so
    perhaps this has been said; if so, please forgive the duplication.

    The suggestion that Unicode is promoting a specific OS, specifically
    Windows, based on statements in the standard related to UTF-8 is hard to
    take seriously given that that OS does not itself use UTF-8 in its file
    system, in its shell, nor by default in any of its internal operations
    or APIs (some APIs, such as WideCharToMultiByte, can be coerced into
    passing UTF-8).

    As for whether plain text files can have a BOM, that is one of the few
    unending debates that arise with certain (fortunately not too frequent)
    regularity, each time with vociferous expressions of deeply-held beliefs
    but never any resolution. I'll just observe that the formal grammar for
    XML does not make reference to a BOM, yet the XML spec certainly assumes
    that a well-formed XML document may begin with a UTF-8 BOM (or a BOM in
    any Unicode encoding form/scheme). Rather than have a philosophical
    debate about the definition of "plain text file", I suggest a more
    pragmatic approach: for better or worse, plain text processes that
    support UTF-8 are going to encounter UTF-8 data beginning with a BOM:
    learn to live with it!
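
    To make "living with it" concrete, here is a minimal sketch in Python
    of a text process that tolerates a leading UTF-8 BOM. The function
    names are hypothetical, chosen only for illustration; codecs.BOM_UTF8
    and the "utf-8-sig" codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper for illustration; input without a BOM is returned
    unchanged.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8 bytes, accepting input with or without a leading BOM."""
    # The "utf-8-sig" codec skips a leading BOM when decoding and
    # otherwise behaves like plain "utf-8".
    return data.decode("utf-8-sig")
```

    Either function accepts input with or without the BOM, so the caller
    does not need to know which kind of file it was handed.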

    (Now I'll give advance notice: I'll probably resume deleting this thread
    on first sight, so don't take it personally if I don't respond to a

    Peter Constable
