Re: Names for UTF-8 with and without BOM

From: Tex Texin (tex@i18nguy.com)
Date: Sat Nov 02 2002 - 17:24:17 EST


    Thanks, Doug. I had looked at the standard, not at the appendix.

    I think that (non-normative) appendix is unfortunate. To my mind it
    implies that if other character sets define BOMs, it is OK to use
    them as XML signatures.
    My reasoning is that the standard itself says only that UTF-16 must
    have a signature and that everything else except UTF-8 must declare
    its encoding. The standard doesn't say whether other encodings should
    or should not be allowed to use signatures. Appendix F, by defining
    the other Unicode signatures, implies they are acceptable (without
    specifically stating so).
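
    To make the signatures in question concrete, here is a rough Python
    sketch of BOM sniffing along the lines of Appendix F.1. The names
    SIGNATURES and sniff_signature are my own, and the table covers only
    the common byte orders, so treat it as an illustration rather than
    what a conforming parser must do.

```python
from typing import Optional

# Byte-order-mark signatures, checked in this order so that the UCS-4
# little-endian mark (FF FE 00 00) is tested before the shorter
# UTF-16 little-endian mark (FF FE) that it begins with.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4 (big-endian)"),
    (b"\xff\xfe\x00\x00", "UCS-4 (little-endian)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (big-endian)"),
    (b"\xff\xfe", "UTF-16 (little-endian)"),
]


def sniff_signature(data: bytes) -> Optional[str]:
    """Return the encoding family implied by a leading BOM, or None."""
    for mark, name in SIGNATURES:
        if data.startswith(mark):
            return name
    return None  # no recognized signature; fall back to the declaration
```

    Nothing in such a table says where the list of acceptable signatures
    ends, which is the gap I am pointing at.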

    The text of the standard, however, doesn't even suggest that UCS-4
    would use a signature: it doesn't mention UCS-4 alongside UTF-16 when
    it talks about requiring a BOM, and it specifically gives the name to
    use for UCS-4 in the declaration, as with other encodings.

    However, that leaves open the question of whether only the Unicode
    transform signatures are acceptable or whether other signatures are
    also allowed. So if a vendor defines a code page and defines a
    signature for it (perhaps mapping BOM/ZWNBSP specifically to some
    code point or byte string), does that then become acceptable?
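
    As a hypothetical illustration of that question, the Python sketch
    below asks a few codecs what byte string U+FEFF maps to. The helper
    signature_bytes is my own name, and the codecs listed (gb18030 and
    latin-1 in particular) are only examples I picked; nothing here
    claims that XML treats any of these byte strings as a signature.

```python
from typing import Optional

BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE / byte order mark


def signature_bytes(encoding: str) -> Optional[bytes]:
    """Return the bytes U+FEFF maps to in `encoding`, or None if unmappable."""
    try:
        return BOM.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Example codecs only; a vendor code page could be substituted here.
    for enc in ("utf-8", "utf-16-be", "utf-16-le", "gb18030", "latin-1"):
        print(enc, signature_bytes(enc))
```

    Any encoding that can represent U+FEFF yields a candidate byte
    string, while one that cannot (latin-1 here) yields None.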

    Of course we hope not, and I am sure the authors did not intend it,
    but without a statement about which signatures are or are not allowed
    beyond UTF-16, I think a can of worms has been opened.

    OK, having raised the issue, I'll take it up with the W3C i18n group
    to get their understanding, and then with the XML group if needed.

    tex

    Doug Ewell wrote:
    >
    > Tex Texin <tex at i18nguy dot com> wrote:
    >
    > > I didn't think the XML standard allowed for utf-8 files to have a BOM.
    > > The standard is quite clear about requiring 0xFEFF for utf-16.
    > > I would have thought a proper parser would reject a non-utf-16 file
    > > beginning with something other than "<".
    >
    > The standard explicitly allows UCS-4, UTF-16, and UTF-8 files to begin
    > with a BOM. See Appendix F.1, "Detection Without External Encoding
    > Information":
    >
    > http://www.w3.org/TR/REC-xml#sec-guessing
    >
    > -Doug Ewell
    > Fullerton, California

    -- 
    -------------------------------------------------------------
    Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
    Xen Master                          http://www.i18nGuy.com
                             
    XenCraft		            http://www.XenCraft.com
    Making e-Business Work Around the World
    -------------------------------------------------------------
    

