Re: Detecting encoding in Plain text

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Jan 14 2004 - 12:25:13 EST


    I'm not sure which "one suggested heuristic method" you are referring to, but
    you are jumping to conclusions. For example, one of the heuristics is to judge
    which characters are more common when the bytes are interpreted as if they
    were in different encoding schemes. When picking between UTF-16BE and
    UTF-16LE, U+0020 is *still* much more common than U+2000, even in Thai.
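
    A minimal Python sketch of that comparison, offered only as an
    illustration and not as any particular detector's implementation. The
    function name guess_utf16_byte_order is made up for this example, and
    it assumes BOM-less input and counts only U+0020. The byte pair 00 20
    reads as U+0020 in big-endian order and as U+2000 in little-endian
    order, so whichever reading yields more spaces is taken as the likelier
    byte order.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of BOM-less UTF-16 data from SPACE counts.

    Hypothetical helper: returns 'utf-16-be', 'utf-16-le', or 'unknown'.
    """
    be_spaces = 0  # code units that read as U+0020 if big-endian
    le_spaces = 0  # code units that read as U+0020 if little-endian

    # Walk the data one 16-bit code unit (two bytes) at a time.
    for i in range(0, len(data) - 1, 2):
        pair = data[i:i + 2]
        if pair == b"\x00\x20":
            be_spaces += 1  # U+0020 as BE, but U+2000 as LE
        elif pair == b"\x20\x00":
            le_spaces += 1  # U+0020 as LE, but U+2000 as BE

    if be_spaces > le_spaces:
        return "utf-16-be"
    if le_spaces > be_spaces:
        return "utf-16-le"
    return "unknown"  # no spaces at all, or a tie


# Thai text that separates phrases with U+0020.
thai = "สวัสดี ครับ ยินดี"
print(guess_utf16_byte_order(thai.encode("utf-16-be")))
print(guess_utf16_byte_order(thai.encode("utf-16-le")))
```

    A text with no U+0020 at all comes back as 'unknown' here, so a real
    detector would presumably weigh other common characters as well.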

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄

    ----- Original Message -----
    From: "Peter Kirk" <peterkirk@qaya.org>
    To: "John Burger" <john@mitre.org>
    Cc: <unicode@unicode.org>
    Sent: Wed, 2004 Jan 14 08:12
    Subject: Re: Detecting encoding in Plain text

    > On 14/01/2004 07:16, John Burger wrote:
    >
    > > ...
    > > By the way, I still don't quite understand what's special about Thai.
    > > Could someone elaborate?
    > >
    > I mentioned Thai because it is the only language I know of which does
    > not use SPACE, U+0020. It also has at least some of its own
    > punctuation. So a Thai text need not include any characters U+00xx -
    > which rules out one suggested heuristic method.
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >



    This archive was generated by hypermail 2.1.5 : Wed Jan 14 2004 - 12:57:56 EST