Re: Names for UTF-8 with and without BOM

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Nov 01 2002 - 17:28:28 EST


    That is not sufficient. The first three bytes could represent a real content
    character (ZWNBSP), or they could be a BOM. The label doesn't tell you which.
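
As a rough illustration of the ambiguity (a sketch, not anything proposed in this thread): the same byte stream can be read both ways under the one label "UTF-8". The codec names "utf-8" and "utf-8-sig" are Python's own labels, used here only as stand-ins for the two interpretations, and the helper name starts_with_feff is hypothetical.

```python
# Sketch only: shows that inspecting the bytes cannot settle whether
# EF BB BF is a signature (BOM) or a content character (ZWNBSP).
UTF8_FEFF = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def starts_with_feff(data: bytes) -> bool:
    """Return True if the stream begins with EF BB BF (hypothetical helper)."""
    return data.startswith(UTF8_FEFF)


payload = UTF8_FEFF + "abc".encode("utf-8")

# Interpretation 1: the three bytes are content, so U+FEFF is kept.
as_content = payload.decode("utf-8")

# Interpretation 2: the three bytes are a signature, so they are dropped.
as_signature = payload.decode("utf-8-sig")

# The byte-level check gives the same answer in both cases.
looks_marked = starts_with_feff(payload)
```

Both decodes succeed on identical input, so only an out-of-band name for the encoding can say which result the sender intended.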

    This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
    represents a BOM, and is not part of the content. In the second case, it
    does *not* represent a BOM -- it represents a ZWNBSP, and must not be
    stripped. The difference is that with UTF-16 the encoding name tells you
    exactly what the situation is.
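
A minimal sketch of the UTF-16 case, assuming Python's built-in codecs behave as its documentation describes: "utf-16" consumes a leading BOM to pick the byte order, while "utf-16-be" treats the same two bytes as content. The variable names are illustrative only.

```python
# Sketch only: the same four bytes decoded under two encoding names.
data = b"\xfe\xff\x00A"  # FE FF followed by big-endian 'A'

# "UTF-16": FE FF is a BOM; it selects big-endian and is not content.
under_utf16 = data.decode("utf-16")

# "UTF-16BE": byte order is fixed by the name, so FE FF is U+FEFF (ZWNBSP)
# and stays in the decoded text.
under_utf16be = data.decode("utf-16-be")
```

Here the name alone determines whether the leading U+FEFF survives, which is the distinction that a single "UTF-8" label cannot express.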

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Murray Sargent" <murrays@exchange.microsoft.com>
    To: "Joseph Boyle" <Boyle@siebel.com>
    Cc: <unicode@unicode.org>
    Sent: Friday, November 01, 2002 12:42
    Subject: RE: Names for UTF-8 with and without BOM

    > Joseph Boyle says: "It would be useful to have official names to
    > distinguish UTF-8 with and without BOM."
    >
    > To see if a UTF-8 file has no BOM, you can just look at the first three
    > bytes. Is this a problem? Typically when you care about a file's
    > encoding form, you plan to read the file.
    >
    > Thanks
    > Murray


