Re: Detecting encoding in Plain text

From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Jan 14 2004 - 12:52:41 EST

    On 14/01/2004 09:25, Mark Davis wrote:

    >I'm not sure which "one suggested heuristic method" you are referring to, ...
    >
    Basically, the one that relies on UTF-16 text being likely to contain many
    zero bytes in either the odd or the even positions.
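
As an illustration only, the zero-byte heuristic could be sketched as
follows. The function name, the 0.3 threshold and the 2:1 ratio are
assumptions made for this example, not values anyone in the thread proposed.

```python
from typing import Optional

def guess_utf16_by_zero_bytes(data: bytes, threshold: float = 0.3) -> Optional[str]:
    """Guess UTF-16 byte order from where the zero bytes fall.

    Returns "UTF-16BE", "UTF-16LE", or None when the pattern is inconclusive.
    The threshold is an arbitrary illustrative value.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None
    # In UTF-16BE the high byte of each code unit sits at offsets 0, 2, 4, ...
    # In UTF-16LE the high byte sits at offsets 1, 3, 5, ...
    zeros_first = data[0:pairs * 2:2].count(0)
    zeros_second = data[1:pairs * 2:2].count(0)
    first_ratio = zeros_first / pairs
    second_ratio = zeros_second / pairs
    if first_ratio >= threshold and first_ratio > 2 * second_ratio:
        return "UTF-16BE"
    if second_ratio >= threshold and second_ratio > 2 * first_ratio:
        return "UTF-16LE"
    return None
```

Text that is mostly ASCII gives a clear answer, but text drawn entirely from
a block such as Thai (U+0E00 to U+0E7F) has no zero bytes in either position,
so this sketch returns None for it.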

    >... but
    >you are jumping to conclusions. For example, one of the heuristics is to judge
    >which characters are more common when the bytes are interpreted as if they were in
    >different encoding schemes. When picking between UTF-16BE and UTF-16LE, U+0020 is
    >*still* much more common than U+2000, even in Thai.
    >
    Not necessarily. In certain texts neither might occur at all, so the
    heuristic fails.
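
A rough sketch of the U+0020 versus U+2000 comparison, with a hypothetical
function name and a plain majority rule chosen only for illustration: the
byte pair 00 20 reads as U+0020 in UTF-16BE but as U+2000 in UTF-16LE, and
the pair 20 00 reads the other way round.

```python
from typing import Optional

def guess_utf16_by_space_count(data: bytes) -> Optional[str]:
    """Pick the byte order under which U+0020 is more common than U+2000.

    Returns "UTF-16BE", "UTF-16LE", or None when neither pair occurs
    or the counts are tied.
    """
    space_if_be = 0  # pairs 00 20: U+0020 as big-endian, U+2000 as little-endian
    space_if_le = 0  # pairs 20 00: U+0020 as little-endian, U+2000 as big-endian
    for i in range(0, len(data) - 1, 2):
        high, low = data[i], data[i + 1]
        if high == 0x00 and low == 0x20:
            space_if_be += 1
        elif high == 0x20 and low == 0x00:
            space_if_le += 1
    if space_if_be > space_if_le:
        return "UTF-16BE"
    if space_if_le > space_if_be:
        return "UTF-16LE"
    return None
```

When the text contains neither character, both counts are zero and the
sketch returns None, which is the failure case described above.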

    I agree with Mark S and others that more sophisticated methods are
    likely to be safer.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

