Re: Detecting encoding in Plain text

From: Tex Texin (tex@i18nguy.com)
Date: Thu Jan 08 2004 - 12:39:23 EST

Next message: Chris Pratley: "RE: Detecting encoding in Plain text"

Previous message: Patrick Andries: "Re: Detecting encoding in Plain text"
In reply to: D. Starner: "Re: Detecting encoding in Plain text"
Next in thread: Jungshik Shin: "Re: Detecting encoding in Plain text"
Reply: Jungshik Shin: "Re: Detecting encoding in Plain text"

There were also papers on the subject at past Unicode conferences.
Look for one by Martin Duerst from several years ago and one by Kat Momoi of
Netscape from only a few years back.
I think both are on the web.

Also look at the Netscape open source code. I believe it does some detection.

However, accuracy can be greatly improved if you or the end-user can supply
some information about the likely nature of the data (language, platform, most
likely encoding possibilities, file formats, data format, or content
information such as field of expertise).
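
As a rough illustration of how such hints can narrow the search, here is a
minimal Python sketch. It is not taken from any of the tools or papers
mentioned in this thread. The function name guess_encoding and its
parameters are hypothetical: the caller supplies a prioritized list of
likely encodings and, optionally, characters typical of the expected
language. Candidates that fail a strict decode are discarded, and the
remaining ones are ranked by how many expected characters they produce.

```python
from typing import Optional, Sequence


def guess_encoding(data: bytes,
                   candidates: Sequence[str],
                   expected_chars: str = "") -> Optional[str]:
    """Pick the most plausible encoding from a caller-supplied list.

    candidates:     encodings to try, most likely first (the user's hint).
    expected_chars: characters typical of the expected language (optional).
    Returns None if no candidate decodes the data without errors.
    """
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = data.decode(enc)  # strict decoding rejects invalid bytes
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable data or unknown encoding name
        # Count occurrences of the characters we expect to see.
        score = sum(text.count(ch) for ch in expected_chars)
        # Strict '>' keeps the earlier (higher-priority) candidate on ties.
        if score > best_score:
            best, best_score = enc, score
    return best
```

For German text stored as ISO-8859-1, calling this with the candidates
["utf-8", "iso-8859-1", "iso-8859-5"] and expected characters such as
"äöüß" should select iso-8859-1: the UTF-8 decode fails, and the Cyrillic
reading contains none of the expected letters. Real detectors use much
richer statistics, so treat this only as a starting point.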

tex

"D. Starner" wrote:
>
> > Given any sizeable chunk of text, it ought to be possible to estimate
> > the statistical likelihood of its being in a certain
> > encoding/[language] even if it's in an unspecified 8859-* encoding.
> > It would be quite an interesting exercise, but I'd be surprised if
> > someone hasn't done it before. Perhaps someone here knows.
>
> http://www.let.rug.nl/~vannoord/TextCat/ has a paper on the subject
> and an implementation in Perl. http://mnogosearch.org has an alternate
> implementation in compiled code (called mguesser).

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------


This archive was generated by hypermail 2.1.5 : Thu Jan 08 2004 - 13:27:00 EST