From: D. Starner (firstname.lastname@example.org)
Date: Tue Jan 13 2004 - 20:55:57 EST
----- Original Message -----
From: Peter Kirk <email@example.com>
Date: Tue, 13 Jan 2004 09:03:48 -0800
To: Doug Ewell <firstname.lastname@example.org>
Subject: Re: Detecting encoding in Plain text
> On 13/01/2004 08:34, Doug Ewell wrote:
> >Peter Kirk <peterkirk at qaya dot org> wrote:
> >>>If a certain Unicode plain text file uses ASCII punctuation OR spaces
> >>>OR end-of-line characters, AND the file is neither too short nor
> >>>very oddly formatted, then the algorithm should work.
> >>True. But there may be certain languages (perhaps Thai?) for which all
> >>of these circumstances regularly occur together. It would be very
> >>inconvenient for users of these languages if programs regularly
> >>attribute the wrong encoding to their text.
> >Whether this is specifically true for Thai or not -- and I doubt that
> >the "short file or odd formatting" condition could ever be considered
> >language-dependent -- I would say an otherwise-good heuristic that
> >performs badly for Thai ought to have special cases built in for Thai,
> >rather than being discarded.
> I may have confused you with what I wrote, but my "all of these
> circumstances" referred not to the "short file or odd formatting"
> condition, but to Marco's "*all* these circumstances", which you
> snipped, and which were originally:
> >Some scripts include their own digits and punctuation; not all
> >scripts use spaces; and controls are not necessarily used, if
> >U+2028 LINE SEPARATOR is used for new lines.
> I agree that heuristics should be adjusted for Thai. But problems may
> arise if they have to be adjusted individually, and without regression
> errors, for all 6000+ world languages.
> Peter Kirk
> email@example.com (personal)
> firstname.lastname@example.org (work)