Re: Names for UTF-8 with and without BOM

From: Tex Texin (tex@i18nguy.com)
Date: Sat Nov 02 2002 - 19:03:53 EST

    Hi John,
    I meant the character "<".

    As for notepad, I should either have stated my point more completely or
    bitten my tongue. My point is that where a standard is in place (and is
    unambiguous), the mistakes of particular products shouldn't hold sway
    unless they amount to a de facto standard. I (personally) don't
    hold notepad in that class. In particular with respect to Michka's
    comment that parsers should upgrade to accommodate notepad's BOM, I
    rather thought notepad should be changed. But I certainly don't want to
    get into a debate on notepad's influence on the market, so let's pretend
    I bit my tongue in the last mail, and once again in this mail. ;-)

    tex

    John Cowan wrote:
    >
    > Tex Texin scripsit:
    >
    > > I didn't think the XML standard allowed for utf-8 files to have a BOM.
    >
    > This capability was never actually excluded, and was added by erratum
    > (and force-majeure, when it became clear that BOMful UTF-8 was going to
    > start becoming common). XML files are intended to be plain text, and
    > if a large source of plain text insists on a BOM, so be it.
    >
    > > The standard is quite clear about requiring 0xFEFF for utf-16.
    > > I would have thought a proper parser would reject a non-utf-16 file
    > > beginning with something other than "<".
    >
    > If by "<" you mean the *character* "<", then yes. If you mean the *byte*
    > 0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32),
    > 0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16
    > in BE order), or 0xFF (UTF-16 in LE order). In principle they could begin with
    > some other byte: 0x2B in UTF-7, e.g.
    >
    > > (The fact that notepad puts it there should be irrelevant.)
    >
    > Actual practice is never quite irrelevant.
    >
    > --
    > John Cowan jcowan@reutershealth.com http://www.reutershealth.com
    > "Mr. Lane, if you ever wish anything that I can do, all you will have
    > to do will be to send me a telegram asking and it will be done."
    > "Mr. Hearst, if you ever get a telegram from me asking you to do
    > anything, you can put the telegram down as a forgery."
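
    The first-byte cases John lists in the quoted text can be sketched roughly
    as follows. This is only an illustration: the byte patterns are the ones
    given in the encoding autodetection appendix of XML 1.0, while the function
    name sniff_xml_encoding and the returned labels are made up for this
    example. A real parser would go on to read the encoding declaration rather
    than stop at the first few bytes.

```python
# Hypothetical helper: guess an encoding family from the first bytes of an
# XML file. The name and the labels are assumptions made for this sketch.

_SIGNATURES = [
    # Four-byte patterns come first so that FF FE 00 00 (UTF-32LE BOM)
    # is not mistaken for FF FE (UTF-16LE BOM).
    (b"\x00\x00\xfe\xff", "UTF-32BE (BOM)"),
    (b"\xff\xfe\x00\x00", "UTF-32LE (BOM)"),
    (b"\x00\x00\x00\x3c", "UTF-32BE (no BOM)"),
    (b"\x3c\x00\x00\x00", "UTF-32LE (no BOM)"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE (no BOM)"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE (no BOM)"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC family"),      # "<?xm" in EBCDIC
    (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
    (b"\xfe\xff", "UTF-16BE (BOM)"),
    (b"\xff\xfe", "UTF-16LE (BOM)"),
    (b"\x3c", "ASCII-compatible (no BOM)"),      # the byte 0x3C, i.e. "<"
]


def sniff_xml_encoding(prefix: bytes) -> str:
    """Return a label for the first matching signature, or "unknown"."""
    for signature, label in _SIGNATURES:
        if prefix.startswith(signature):
            return label
    # Anything else (e.g. 0x2B for UTF-7) is outside this sketch.
    return "unknown"
```

    For instance, sniff_xml_encoding(b"\xef\xbb\xbf<?xml") reports the
    BOM-prefixed UTF-8 case, which is the one notepad produces.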

    -- 
    -------------------------------------------------------------
    Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
    Xen Master                          http://www.i18nGuy.com
                             
    XenCraft                            http://www.XenCraft.com
    Making e-Business Work Around the World
    -------------------------------------------------------------
    

