RE: Detecting encoding in Plain text

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Jan 13 2004 - 07:10:36 EST

Next message: Peter Kirk: "Re: Detecting encoding in Plain text"

Previous message: Peter Kirk: "Re: Detecting encoding in Plain text"
Maybe in reply to: Brijesh Sharma: "Detecting encoding in Plain text"
Next in thread: Peter Kirk: "Re: Detecting encoding in Plain text"
Reply: Peter Kirk: "Re: Detecting encoding in Plain text"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Peter Kirk wrote:
> >What do you mean by "dangerous"? This is an heuristic
> >algorithm, so it is only supposed to work always [...]

(I meant: "it is not supposed to work always")

> I would not consider an 80% algorithm to be very good -
> depending on the circumstances etc. But if for example 20% of
> my incoming e-mails were detected with the wrong encoding and
> appeared on my screen as junk, [...]

In this case (as in most other similar cases), you should rather blame the
people who send you e-mail without an encoding declaration.

Auto-detection should be the last resort, when you have no safer way of
determining the encoding.
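
To make the "last resort" ordering concrete, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration, not part
of any particular mail client: the function name decode_text, the optional
detect callback (standing in for whatever heuristic detector is available),
and the choice of Latin-1 as the final fallback.

```python
def decode_text(data, declared=None, detect=None):
    """Decode bytes, trusting a declared encoding before any guessing.

    Returns (text, source), where source says which step succeeded.
    `detect` is a hypothetical callback: bytes -> encoding name or None.
    """
    # 1. A declared encoding, when present and usable, always wins.
    if declared:
        try:
            return data.decode(declared), "declared"
        except (LookupError, UnicodeDecodeError):
            pass  # unknown name, or the declaration was wrong

    # 2. Only now fall back on heuristic auto-detection.
    if detect is not None:
        guess = detect(data)
        if guess:
            try:
                return data.decode(guess), "detected"
            except (LookupError, UnicodeDecodeError):
                pass

    # 3. Assumed last-ditch fallback: Latin-1 never fails to decode,
    #    although the result may well be junk.
    return data.decode("latin-1"), "fallback"
```

The heuristic is consulted only when the declaration is missing or
unusable, so its error rate only affects messages that had no better
information to begin with.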

> >Yes, but *all* these circumstances must occur together in
> >order for the algorithm to be totally useless for *that*
> >language. [...]
> >
> True. But there may be certain languages (perhaps Thai?) for
> which all of these circumstances regularly occur together.

I don't think that Thai would be such a case. Thai normally uses European
digits (the usage scope of Thai digits is probably similar to that of Roman
numerals in Western languages), some European punctuation (parentheses,
exclamation marks, hyphens, quotes), and spaces (although a Thai space has
the strength -- and hence the frequency -- of a Western semicolon).

At a minimum, all languages should use line feed and/or new line as line
terminators, since Unicode's line and paragraph separators never caught on.
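
As a rough illustration of why these characters matter, here is a small
Python sketch. It assumes the heuristic chooses between candidate
encodings by counting how many European digits, common punctuation marks,
spaces and line terminators each candidate decoding yields; the names
COMMON, score and guess_encoding, and the choice of UTF-16LE versus
UTF-16BE as candidates, are mine and only illustrative.

```python
# Characters expected in almost any language's plain text:
# European digits, space, some punctuation, CR and LF.
COMMON = set("0123456789 ()!-\"'\r\n")

def score(data, encoding):
    """Count common characters when `data` is decoded as `encoding`."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1  # not even valid in this encoding
    return sum(1 for ch in text if ch in COMMON)

def guess_encoding(data, candidates=("utf-16-le", "utf-16-be")):
    """Return the best-scoring candidate, or None if nothing matched."""
    scores = {enc: score(data, enc) for enc in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With a Thai sample such as "ราคา 25 บาท (ลด)\n", the spaces, digits,
parentheses and the line feed are enough to tell the two byte orders
apart; a text containing none of these characters gives None, which is
the "totally useless" case discussed above.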

_ Marco

This archive was generated by hypermail 2.1.5 : Tue Jan 13 2004 - 07:45:43 EST