Re: Detecting encoding in Plain text

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Wed Jan 14 2004 - 18:34:17 EST


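    Consider CR and LF too. U+000D and U+000A can be counted the same way as
    U+0020, and they turn up even in text that contains no spaces.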

    Mark Davis wrote on 1/14/2004, 9:25 AM:

    > I'm not sure which "one suggested heuristic method" you are referring to,
    > but you are jumping to conclusions. For example, one of the heuristics is
    > to judge which characters are more common when the bytes are interpreted
    > as if they were in different encoding schemes. When picking between
    > UTF-16BE and UTF-16LE, U+0020 is *still* much more common than U+2000,
    > even in Thai.
    >
    > Mark
    > __________________________________
    > http://www.macchiato.com
    > ► शिष्यादिच्छेत्पराजयम् ◄
    >
    > ----- Original Message -----
    > From: "Peter Kirk" <peterkirk@qaya.org>
    > To: "John Burger" <john@mitre.org>
    > Cc: <unicode@unicode.org>
    > Sent: Wed, 2004 Jan 14 08:12
    > Subject: Re: Detecting encoding in Plain text
    >
    >
    > > On 14/01/2004 07:16, John Burger wrote:
    > >
    > > > ...
    > > > By the way, I still don't quite understand what's special about Thai.
    > > > Could someone elaborate?
    > > >
    > > I mentioned Thai because it is the only language I know of which does
    > > not use SPACE, U+0020. It also has at least some of its own
    > > punctuation. So a Thai text need not include any characters U+00xx -
    > > which rules out one suggested heuristic method.
    > >
    > > --
    > > Peter Kirk
    > > peter@qaya.org (personal)
    > > peterkirk@qaya.org (work)
    > > http://www.qaya.org/
    > >
    > >
    > >
    > >
    >
    >
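
The heuristic discussed above (count how often common characters appear under each byte order, and include CR and LF alongside SPACE) might be sketched roughly as follows. This is only an illustration, not code from anyone in this thread: the names COMMON, count_common and guess_utf16_order are hypothetical, the choice of exactly three code points is an assumption, and a real detector would check a BOM first and weigh many more signals.

```python
from typing import Optional

# Code units counted as "common": SPACE, LF, CR.
# (An illustrative assumption; a real detector would use a larger table.)
COMMON = {0x0020, 0x000A, 0x000D}


def count_common(data: bytes, byteorder: str) -> int:
    """Count 16-bit code units in COMMON when data is read with byteorder."""
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    return sum(
        1
        for i in range(0, usable, 2)
        if int.from_bytes(data[i:i + 2], byteorder) in COMMON
    )


def guess_utf16_order(data: bytes) -> Optional[str]:
    """Return 'utf-16-be' or 'utf-16-le', or None if the counts are tied."""
    be = count_common(data, "big")
    le = count_common(data, "little")
    if be == le:
        return None  # no evidence either way
    return "utf-16-be" if be > le else "utf-16-le"
```

With this sketch, a Thai sample that contains no U+0020 but does contain a line break is still classified, because CR and LF read in the wrong byte order become U+0D00 and U+0A00, which are not counted. A sample with no spaces and no line breaks gives a tie, and the function returns None.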


