Re: Detecting encoding in Plain text

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Jan 14 2004 - 12:25:13 EST

Next message: Michael Everson: "Re: New MS Mac Office and Unicode?"

Previous message: Markus Scherer: "Re: detecting encoding in plain text (related to utf8)"
In reply to: Peter Kirk: "Re: Detecting encoding in Plain text"
Next in thread: Peter Kirk: "Re: Detecting encoding in Plain text"
Reply: Peter Kirk: "Re: Detecting encoding in Plain text"
Reply: Frank Yung-Fong Tang: "Re: Detecting encoding in Plain text"

I'm not sure which "one suggested heuristic method" you are referring to, but
you are jumping to conclusions. For example, one of the heuristics is to judge
which characters are more common when the bytes are interpreted as if they were
in different encoding schemes. When picking between UTF-16BE and UTF-16LE,
U+0020 is *still* much more common than U+2000, even in Thai.
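
As a rough illustration of that heuristic (a minimal sketch, not any
particular detector's implementation), the Python function below counts how
often each 16-bit unit in a buffer has the byte pattern 00 20 versus 20 00.
The first pattern is U+0020 under UTF-16BE and U+2000 under UTF-16LE; the
second is the reverse. The function name and the simple majority rule are
assumptions made for this example, and it looks only at SPACE, whereas a real
detector would weigh many characters.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from how often U+0020 would occur.

    Returns 'utf-16-be', 'utf-16-le', or 'unknown' (hypothetical helper).
    """
    be_spaces = 0  # units that are U+0020 if big-endian (U+2000 if little-endian)
    le_spaces = 0  # units that are U+0020 if little-endian (U+2000 if big-endian)

    # Walk the buffer one 16-bit code unit at a time; a trailing odd byte is ignored.
    for i in range(0, len(data) - 1, 2):
        unit = data[i:i + 2]
        if unit == b"\x00\x20":
            be_spaces += 1
        elif unit == b"\x20\x00":
            le_spaces += 1

    # U+0020 is assumed to be far more common than U+2000 in real text,
    # so the reading that yields more U+0020 wins.
    if be_spaces > le_spaces:
        return "utf-16-be"
    if le_spaces > be_spaces:
        return "utf-16-le"
    return "unknown"  # no evidence either way; fall back to other heuristics
```

Text with no spaces at all gives 'unknown' here, which is why this check
would be only one of several heuristics rather than the whole test.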

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message -----
From: "Peter Kirk" <peterkirk@qaya.org>
To: "John Burger" <john@mitre.org>
Cc: <unicode@unicode.org>
Sent: Wed, 2004 Jan 14 08:12
Subject: Re: Detecting encoding in Plain text

> On 14/01/2004 07:16, John Burger wrote:
>
> > ...
> > By the way, I still don't quite understand what's special about Thai.
> > Could someone elaborate?
> >
> I mentioned Thai because it is the only language I know of which does
> not use SPACE, U+0020. It also has at least some of its own
> punctuation. So a Thai text need not include any characters U+00xx -
> which rules out one suggested heuristic method.
>
> --
> Peter Kirk
> peter@qaya.org (personal)
> peterkirk@qaya.org (work)
> http://www.qaya.org/
>
>
>
>


This archive was generated by hypermail 2.1.5 : Wed Jan 14 2004 - 12:57:56 EST