Re: Detecting encoding in Plain text

From: Doug Ewell (
Date: Tue Jan 13 2004 - 11:34:32 EST

    Peter Kirk <peterkirk at qaya dot org> wrote:

    >> If a certain Unicode plain text file uses ASCII punctuation OR spaces
    >> OR end-of-line characters, AND the file is neither too short nor very
    >> oddly formatted, then the algorithm should work.
    > True. But there may be certain languages (perhaps Thai?) for which all
    > of these circumstances regularly occur together. It would be very
    > inconvenient for users of these languages if programs regularly
    > attribute the wrong encoding to their text.

    Whether this is specifically true for Thai or not -- and I doubt that
    the "short file or odd formatting" condition could ever be considered
    language-dependent -- I would say an otherwise-good heuristic that
    performs badly for Thai ought to have special cases built in for Thai,
    rather than being discarded.
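
    To make the "special case" idea concrete, here is a rough sketch in
    Python. It is not the algorithm under discussion, whose details are
    not spelled out in this message; it only assumes a heuristic that
    tells UTF-16LE from UTF-16BE by counting ASCII spaces, punctuation,
    and end-of-line characters, and falls back to a Thai-specific check
    when that evidence is tied or absent. The function name, the
    MIN_UNITS threshold, and the Thai fallback are all hypothetical.

```python
import string

# ASCII characters the assumed heuristic relies on:
# spaces, end-of-line characters, and punctuation.
MARKERS = frozenset((" \t\r\n" + string.punctuation).encode("ascii"))

MIN_UNITS = 16          # assumed cutoff below which a file is "too short"
THAI_HIGH_BYTE = 0x0E   # Thai block is U+0E00..U+0E7F


def guess_utf16_byte_order(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None if the evidence is too weak."""
    units = len(data) // 2
    if units < MIN_UNITS:
        return None

    # Split the data into two-byte code units, ignoring a trailing odd byte.
    pairs = [(data[i], data[i + 1]) for i in range(0, units * 2, 2)]

    # General rule: an ASCII marker is (marker, 0) in LE and (0, marker) in BE.
    le = sum(1 for first, second in pairs if second == 0 and first in MARKERS)
    be = sum(1 for first, second in pairs if first == 0 and second in MARKERS)

    if le == be:
        # Special case for Thai: when the ASCII markers are tied or absent,
        # see which byte position holds the Thai block's high byte.
        le = sum(1 for first, second in pairs if second == THAI_HIGH_BYTE)
        be = sum(1 for first, second in pairs if first == THAI_HIGH_BYTE)

    if le == be:
        return None
    return "utf-16-le" if le > be else "utf-16-be"
```

    The general rule stays intact and the Thai check runs only when the
    general rule cannot decide, so the special case cannot make results
    worse for text that already has plenty of ASCII markers.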

    -Doug Ewell
     Fullerton, California
