Re: Detecting encoding in Plain text

From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Jan 14 2004 - 12:52:41 EST

Next message: Philippe Verdy: "Re: corporate/users PUA ranges"

Previous message: Edward H. Trager: "Re: German characters not correct in output webform"
In reply to: Mark Davis: "Re: Detecting encoding in Plain text"
Next in thread: Frank Yung-Fong Tang: "Re: Detecting encoding in Plain text"

On 14/01/2004 09:25, Mark Davis wrote:

>I'm not sure which "one suggested heuristic method" you are referring to, ...
>
Basically the one that relies on UTF-16 text being likely to contain
many zero bytes in either the odd or the even positions.
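
To make that heuristic concrete, here is a minimal Python sketch. The
function name guess_utf16_byte_order and the 0.3 threshold are assumptions
chosen for illustration, not anything proposed in this thread. It counts
zero bytes at even and odd offsets and guesses a byte order only when one
side clearly dominates.

```python
def guess_utf16_byte_order(data: bytes, threshold: float = 0.3):
    """Guess 'UTF-16BE', 'UTF-16LE', or None from zero-byte positions.

    Hypothetical helper: the threshold is an arbitrary illustrative value.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None

    # In UTF-16BE the high byte comes first, so characters below U+0100
    # put a zero at even offsets; in UTF-16LE the zero lands at odd offsets.
    even_zeros = sum(1 for i in range(0, pairs * 2, 2) if data[i] == 0)
    odd_zeros = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)

    even_ratio = even_zeros / pairs
    odd_ratio = odd_zeros / pairs

    if even_ratio >= threshold and even_ratio > 2 * odd_ratio:
        return "UTF-16BE"
    if odd_ratio >= threshold and odd_ratio > 2 * even_ratio:
        return "UTF-16LE"
    return None  # too few zero bytes, or no clear winner


if __name__ == "__main__":
    print(guess_utf16_byte_order("Hello, world".encode("utf-16-be")))
    print(guess_utf16_byte_order("Hello, world".encode("utf-16-le")))
    # Thai text without spaces or Latin characters has no zero bytes at all.
    print(guess_utf16_byte_order("สวัสดี".encode("utf-16-be")))
```

For text with no characters below U+0100 the sketch returns None, since
there are no zero bytes to count.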

>... but
>you are jumping to conclusions. For example, one of the heuristics is to judge
>which characters are more common when the bytes are interpreted as if they were
>in different encoding schemes. When picking between UTF-16BE and UTF-16LE,
>U+0020 is *still* much more common than U+2000, even in Thai.
>
Not necessarily. In certain texts neither might occur at all, so the
heuristic fails.
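
A rough Python sketch of the U+0020 versus U+2000 comparison shows where it
has nothing to go on. The function name guess_by_space_count is a
hypothetical one chosen for illustration. The byte pair 00 20 reads as
U+0020 in big-endian order and as U+2000 in little-endian order, and the
pair 20 00 is the reverse, so the sketch counts both pairs and picks the
byte order that yields more ordinary spaces.

```python
def guess_by_space_count(data: bytes):
    """Guess 'UTF-16BE', 'UTF-16LE', or None by counting U+0020 candidates.

    Hypothetical helper for illustration only.
    """
    be_spaces = 0  # pairs 00 20: U+0020 if big-endian, U+2000 if little-endian
    le_spaces = 0  # pairs 20 00: U+0020 if little-endian, U+2000 if big-endian

    for i in range(0, len(data) - 1, 2):
        pair = (data[i], data[i + 1])
        if pair == (0x00, 0x20):
            be_spaces += 1
        elif pair == (0x20, 0x00):
            le_spaces += 1

    if be_spaces == le_spaces:
        return None  # includes the case where neither pair occurs at all
    return "UTF-16BE" if be_spaces > le_spaces else "UTF-16LE"


if __name__ == "__main__":
    print(guess_by_space_count("สวัสดี ครับ".encode("utf-16-le")))
    # The same Thai words written without a space give no evidence either way.
    print(guess_by_space_count("สวัสดีครับ".encode("utf-16-le")))
```

When neither pair appears in the data, the sketch returns None rather than
guessing.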

I agree with Mark S and others that more sophisticated methods are
likely to be safer.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

