Re: Detecting encoding in Plain text

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Jan 13 2004 - 06:22:27 EST


    On 13/01/2004 02:40, Marco Cimarosti wrote:

    >Peter Kirk wrote:
    >
    >
    >>This one also looks dangerous.
    >>
    >>
    >
    >What do you mean by "dangerous"? This is a heuristic algorithm, so it is
    >not supposed to work always, but only in some lucky cases.
    >
    >If lucky cases average to, say, 20% or less, then it is a bad and useless
    >algorithm; if they average to, say, 80% or more, then it is good and
    >useful. But you can't ask that it work in 100% of cases, or it
    >wouldn't be heuristic anymore.
    >
    >
    >
    I would not consider an 80% algorithm to be very good, though that
    depends on the circumstances. If, for example, 20% of my incoming e-mails
    were detected with the wrong encoding and appeared on my screen as junk,
    so that I had to adjust the encoding manually, I would not be very happy.
    I would probably prefer a manual selection method, e.g. from a list.

    >>Some scripts include their own
    >>digits and punctuation; not all scripts use spaces; and controls are not
    >>necessarily used, if U+2028 LINE SEPARATOR is used for new lines.
    >>
    >>
    >
    >Yes, but *all* these circumstances must occur together in order for the
    >algorithm to be totally useless for *that* language.
    >
    >If a certain Unicode plain text file uses ASCII punctuation OR spaces OR
    >end-of-line characters, AND the file is neither too short nor very oddly
    >formatted, then the algorithm should work.
    >
    >
    >
    True. But there may be certain languages (perhaps Thai?) for which all
    of these circumstances regularly occur together. It would be very
    inconvenient for users of these languages if programs regularly
    attributed the wrong encoding to their text.
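
    The heuristic itself is not quoted in this message. As a minimal sketch,
    assuming it works by counting 0x00 bytes at even and odd offsets (the
    high bytes of ASCII punctuation, spaces and end-of-line characters in
    UTF-16), the failure case described above might look like this. The
    function name and the min_ratio threshold are hypothetical, chosen only
    for illustration:

```python
from typing import Optional


def guess_utf16_byte_order(data: bytes, min_ratio: float = 0.1) -> Optional[str]:
    """Guess 'BE' or 'LE' for UTF-16 data; return None if evidence is weak.

    Assumption (not from the original thread): the heuristic counts 0x00
    bytes at even and odd offsets. Characters below U+0100 have a zero
    high byte, which comes first in big-endian and second in little-endian.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None

    zeros_even = sum(1 for i in range(0, pairs * 2, 2) if data[i] == 0)
    zeros_odd = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)

    # Too few zero bytes (min_ratio is an arbitrary, hypothetical cutoff),
    # or a tie: decline to guess rather than pick an encoding at random.
    if max(zeros_even, zeros_odd) < min_ratio * pairs:
        return None
    if zeros_even == zeros_odd:
        return None
    return "BE" if zeros_even > zeros_odd else "LE"


# English text with ASCII spaces, punctuation and a newline gives a signal.
english = "Hello, world.\n"

# Thai text with no spaces, no ASCII punctuation, and U+2028 as the only
# line break contains no zero bytes at all in UTF-16.
thai = "ภาษาไทย\u2028ภาษาไทย"

print(guess_utf16_byte_order(english.encode("utf-16-be")))
print(guess_utf16_byte_order(english.encode("utf-16-le")))
print(guess_utf16_byte_order(thai.encode("utf-16-le")))
```

    Under these assumptions the Thai sample yields no guess at all, whatever
    its length, which is the situation where a manual selection list would
    still be needed.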

    >>But there may be some characters U+??00 which are used rather
    >>commonly in a particular script and so occur commonly in
    >>some text files.
    >>
    >>
    >
    >And those text files will not be detected correctly, particularly if they
    >are very short: that's part of the game.
    >
    >
    >
    Even if they are very long, if they don't use any Latin-1 characters at
    all, as above. At least this shouldn't be a problem for Thai, as U+0E00
    is not used.
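
    To illustrate the U+??00 point, here is a small sketch using Python's
    standard unicodedata module. The helper names are hypothetical, and the
    sample code points other than U+0E00 are my own choice of examples, not
    ones named in this thread. It checks which U+??00 code points are
    assigned and shows where their zero byte falls in big-endian UTF-16:

```python
import unicodedata


def is_assigned(code_point: int) -> bool:
    """True if the code point is assigned in Python's Unicode database."""
    return unicodedata.category(chr(code_point)) != "Cn"


def low_byte_zero_chars(text: str) -> list:
    """Return the BMP characters in text of the form U+??00 (not U+0000)."""
    return [c for c in text if 0 < ord(c) <= 0xFFFF and ord(c) & 0xFF == 0]


# U+0E00 is the unassigned first position of the Thai block; U+3000
# (IDEOGRAPHIC SPACE) and U+4E00 (a common CJK ideograph) are assigned.
for cp in (0x0E00, 0x3000, 0x4E00):
    print(f"U+{cp:04X} assigned: {is_assigned(cp)}")

# In big-endian UTF-16 a U+??00 character puts its zero byte second,
# which is where a Latin-1 character puts it in little-endian.
print("\u3000".encode("utf-16-be"))
print(low_byte_zero_chars("ภาษาไทย"))
```

    A text that uses such characters often, and no Latin-1 characters,
    would push a zero-byte count towards the wrong byte order; Thai text
    has no such characters to begin with.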

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

