RE: Detecting encoding in Plain text

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Jan 13 2004 - 07:10:36 EST


    Peter Kirk wrote:
    > >What do you mean by "dangerous"? This is an heuristic
    > >algorithm, so it is only supposed to work always [...]

    (I meant: "it is not supposed to work always")

    > I would not consider an 80% algorithm to be very good -
    > depending on the circumstances etc. But if for example 20% of
    > my incoming e-mails were detected with the wrong encoding and
    > appeared on my screen as junk, [...]

    In this case (as in most other similar cases), you should rather blame the
    people who send you e-mail without an encoding declaration.

    Auto-detection should be the last resort, when you have no safer way of
    determining the encoding.
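
    To make the "last resort" ordering concrete, here is a minimal sketch in
    Python. It assumes the message is available as raw bytes and uses the
    standard library's email parser; choose_encoding and guess_encoding are
    hypothetical names, and guess_encoding is only a placeholder for whatever
    heuristic one prefers.

```python
from email import message_from_bytes


def guess_encoding(body: bytes) -> str:
    """Placeholder heuristic (an assumption): always answers UTF-8."""
    return "utf-8"


def choose_encoding(raw_message: bytes, guess=guess_encoding):
    """Return (encoding, source), preferring the declared charset."""
    msg = message_from_bytes(raw_message)
    declared = msg.get_content_charset()  # None when nothing is declared
    if declared:
        return declared, "declared"
    # Last resort: nothing safer is available, so fall back to guessing.
    body = msg.get_payload(decode=True) or b""
    return guess(body), "guessed"
```

    The heuristic is consulted only when the sender declared nothing, so a
    wrong guess can never override a correct declaration.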

    > >Yes, but *all* these circumstances must occur together in
    > >order for the algorithm to be totally useless for *that*
    > >language. [...]
    > >
    > True. But there may be certain languages (perhaps Thai?) for
    > which all of these circumstances regularly occur together.

    I don't think that Thai would be such a case. Thai normally uses European
    digits (the usage scope of Thai digits is probably similar to that of Roman
    numerals in Western languages), some European punctuation (parentheses,
    exclamation marks, hyphens, quotes), and spaces (although a Thai space has
    the strength -- and hence the frequency -- of a Western semicolon).

    As a minimum, all languages should use carriage return and/or line feed as
    line terminators, as Unicode's line and paragraph separators never caught on.
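
    As an illustration of why those few ASCII characters matter, here is a
    rough sketch in Python. It is not the algorithm discussed in this thread:
    it only assumes that the choice is between UTF-16LE and UTF-16BE, and it
    counts how many spaces, digits, line terminators and common punctuation
    marks appear under each reading. The names COMMON, score and
    guess_byte_order are made up for this example.

```python
# Characters that most scripts still borrow from ASCII (an assumed list).
COMMON = set(" \r\n0123456789()!-\"'")


def score(data: bytes, encoding: str) -> int:
    """Count common ASCII characters when data is read as `encoding`."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1  # not even decodable: worst possible score
    return sum(ch in COMMON for ch in text)


def guess_byte_order(data: bytes, candidates=("utf-16-le", "utf-16-be")):
    """Pick the candidate under which the most common characters appear."""
    return max(candidates, key=lambda enc: score(data, enc))
```

    With the wrong byte order, a space (U+0020) is read as U+2000 and a digit
    such as "2" (U+0032) as U+3200, so even a handful of digits or line
    terminators in a Thai text is enough to tip the count.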

    _ Marco


