Re: Detecting encoding in Plain text

From: John Delacour (JD@BD8.COM)
Date: Thu Jan 08 2004 - 07:49:27 EST

Next message: jon@hackcraft.net: "Re: Detecting encoding in Plain text"

Previous message: Otto Stolz: "Long S in Germany (was: 0364 COMBINING LATIN SMALL LETTER E)"
Maybe in reply to: Brijesh Sharma: "Detecting encoding in Plain text"
Next in thread: Patrick Andries: "Re: Detecting encoding in Plain text"
Reply: Patrick Andries: "Re: Detecting encoding in Plain text"

At 12:09 pm +0000 8/1/04, jon@hackcraft.net wrote:

>There is no foolproof way of differentiating between some of the
>encodings. While UTF-16 or UTF-8 with a BOM (such files don't
>necessarily start with a BOM by the way) "stand out" as being
>unlikely to be in any other encoding others are more troublesome.

Given any sizeable chunk of text, it ought to be possible to estimate
the statistical likelihood of its being in a certain
encoding/[language] even if it's in an unspecified 8859-* encoding.
It would be quite an interesting exercise, but I'd be surprised if
someone hasn't done it before. Perhaps someone here knows.
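
For what it's worth, here is a rough sketch of how such an estimate might be made. It is only an illustration under stated assumptions, not an established detector. Each candidate encoding/language pair gets a byte-frequency profile built from a short sample sentence, and an unknown chunk is ranked by its smoothed log-likelihood under each profile. The three sample sentences, the choice of candidates, and the function names (build_profiles, log_likelihood, rank) are all invented for this example; real profiles would need large corpora and probably byte pairs rather than single bytes.

```python
import math
from collections import Counter

# Tiny training samples, invented for this sketch. A usable detector would
# build its profiles from large corpora for every candidate pair.
SAMPLES = {
    ("iso8859_1", "French"): (
        "L'encodage d'un texte peut être estimé à partir de la "
        "fréquence des lettres, si le texte est assez long."
    ),
    ("iso8859_5", "Russian"): (
        "Определить кодировку текста можно по частоте букв, если "
        "текст достаточно длинный и написан на известном языке."
    ),
    ("iso8859_7", "Greek"): (
        "Η κωδικοποίηση ενός κειμένου μπορεί να εκτιμηθεί από τη "
        "συχνότητα των γραμμάτων, αν το κείμενο είναι αρκετά μεγάλο."
    ),
}


def build_profiles(samples):
    """Map each (encoding, language) pair to a Counter of byte frequencies."""
    return {key: Counter(text.encode(key[0])) for key, text in samples.items()}


def log_likelihood(data, profile):
    """Log-probability of data under a single-byte frequency model.

    Add-one smoothing over the 256 possible byte values keeps bytes that
    never occurred in the sample from producing a probability of zero.
    """
    total = sum(profile.values()) + 256
    return sum(math.log((profile[b] + 1) / total) for b in data)


def rank(data, profiles):
    """Return the candidate pairs ordered from most to least likely."""
    scores = {key: log_likelihood(data, prof) for key, prof in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)


profiles = build_profiles(SAMPLES)
unknown = "Привет, это простой тест на русском языке.".encode("iso8859_5")
print(rank(unknown, profiles)[0])
```

With these samples the Russian profile should come out on top for the input shown, since most of its bytes are frequent in the Russian sample and rare or absent in the other two. Telling apart languages that share one 8859 part (say French and German in 8859-1) would need richer statistics than single-byte counts.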

This archive was generated by hypermail 2.1.5 : Thu Jan 08 2004 - 08:31:24 EST