Re: detecting encoding in plain text (related to utf8)

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Jan 14 2004 - 02:41:46 EST


    Deepak Chand Rathore <deepakr at aztec dot soft dot net> wrote:

    > But there is one concern. In some cases the UTF-8 byte stream starts
    > with a BOM (e.g., when we read bytes from a text file saved with
    > Notepad's UTF-8 option on Windows 2000, the actual text starts only
    > after the first few bytes; I suppose the first 3 bytes).
    > So how do we detect whether the byte stream starts with a BOM or
    > not? In other words, do the first few bytes represent a BOM or the
    > actual text?

    What you are asking is, if a UTF-8 byte stream starts with the character
    U+FEFF, should that character be treated as a signature (BOM) or as a
    zero-width no-break space?

    You'll probably get different responses to this, having to do with
    tagging or streams broken in the middle. My view is that a zero-width
    no-break space has *no business* appearing at the start of a text
    stream. With no character to precede it, what would it prevent a break
    between? U+FEFF, or specifically the bytes EF BB BF, at the true start
    of a UTF-8 stream should always be interpreted as a signature.
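As a rough illustration of that rule, here is a minimal Python sketch that treats a leading EF BB BF as a signature and strips it, while leaving any later U+FEFF alone. The function name split_utf8_signature is made up for this example, and the sketch assumes the bytes really are the true start of a UTF-8 stream, not a fragment cut from the middle of one.

```python
import codecs
from typing import Tuple


def split_utf8_signature(data: bytes) -> Tuple[bool, bytes]:
    """Return (had_signature, payload) for bytes assumed to start a UTF-8 stream.

    Hypothetical helper: a leading EF BB BF is treated as a signature (BOM)
    and removed; everything else is returned unchanged.
    """
    bom = codecs.BOM_UTF8  # b'\xef\xbb\xbf'
    if data.startswith(bom):
        return True, data[len(bom):]
    return False, data


# A stream as a signature-writing editor might save it: signature, then text.
with_signature = codecs.BOM_UTF8 + "hello".encode("utf-8")
had_signature, payload = split_utf8_signature(with_signature)
text = payload.decode("utf-8")

# U+FEFF that is not at the start is left in place as an ordinary character.
_, untouched = split_utf8_signature("a\ufeffb".encode("utf-8"))
```

Python's built-in "utf-8-sig" codec appears to apply the same policy when decoding: it drops one leading signature if present and otherwise behaves like plain "utf-8".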

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/
     I don't speak for the Unicode Consortium.


