Re: Detecting encoding in Plain text

From: jon@hackcraft.net
Date: Thu Jan 08 2004 - 07:09:01 EST

    > I am writing a small tool to get text from a txt file into an edit box.
    > This txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac Roman,
    > Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc.
    > My problem is that I can distinguish between UTF-8 and UTF-16 using the BOM,
    > but how do I auto-detect the others?
    > Any kind of help will be appreciated.

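    There is no foolproof way of differentiating between some of these encodings.
    UTF-16 or UTF-8 text that starts with a BOM "stands out" as being unlikely to
    be in any other encoding (though such files don't necessarily start with a
    BOM, by the way), but the others are more troublesome. The single-byte
    encodings in your list (Mac Roman, Windows ANSI, ISO-8859-1) accept almost any
    byte sequence, so the same file can decode without error in each of them,
    just to different characters.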
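
    As a rough illustration of the BOM check, here is a minimal Python sketch.
    The function name sniff_bom is made up for this example, and it only
    covers UTF-8 and UTF-16; a None result means "no BOM found", not "not
    Unicode".

```python
import codecs

# BOM byte sequences mapped to Python codec names that consume the BOM.
# Only UTF-8 and UTF-16 are covered here; UTF-32 is deliberately left out.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),   # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None
```

    If sniff_bom returns a name, data.decode(name) strips the BOM and, for
    UTF-16, uses it to pick the byte order.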

    Whether or not there is a source of encoding information (such as you get
    with XML declarations, HTTP headers and such), it may be best to offer your
    users the ability to select the encoding, perhaps with the default choice
    based on locale settings.
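
    A minimal Python sketch of that approach, assuming a hypothetical helper
    named decode_text: it uses the encoding the user picked if there is one,
    and otherwise falls back to the locale's preferred encoding. It returns
    the encoding it used so the tool can show it as the current selection.

```python
import locale
from typing import Optional, Tuple


def decode_text(data: bytes, user_choice: Optional[str] = None) -> Tuple[str, str]:
    """Decode data with the user's chosen encoding, else the locale default.

    Returns (text, encoding_used).
    """
    encoding = user_choice or locale.getpreferredencoding(False)
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of
    # raising, so a wrong guess still shows something and the user can then
    # pick a different encoding.
    return data.decode(encoding, errors="replace"), encoding
```

    For example, decode_text(raw) gives the initial view, and
    decode_text(raw, "shift_jis") re-decodes the same bytes after the user
    selects Shift-JIS from the list.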

    --
    Jon Hanna
    <http://www.hackcraft.net/>
    *Thought provoking quote goes here*
    

