Re: Detecting encoding in Plain text

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Jan 13 2004 - 12:03:48 EST

    On 13/01/2004 08:34, Doug Ewell wrote:

    >Peter Kirk <peterkirk at qaya dot org> wrote:
    >
    >>>If a certain Unicode plain text file uses ASCII punctuation OR spaces
    >>>OR end-of-line characters, AND the file is not too short or has a
    >>>very odd formatting, then the algorithm should work.
    >>>
    >>True. But there may be certain languages (perhaps Thai?) for which all
    >>of these circumstances regularly occur together. It would be very
    >>inconvenient for users of these languages if programs regularly
    >>attribute the wrong encoding to their text.
    >
    >Whether this is specifically true for Thai or not -- and I doubt that
    >the "short file or odd formatting" condition could ever be considered
    >language-dependent -- I would say an otherwise-good heuristic that
    >performs badly for Thai ought to have special cases built in for Thai,
    >rather than being discarded.
    I may have confused you with what I wrote, but my "all of these
    circumstances" referred not to the "short file or odd formatting"
    condition but to Marco's "*all* these circumstances", which you
    snipped. They were originally:

    >Some scripts include their own digits and punctuation; not all
    >scripts use spaces; and controls are not necessarily used, if
    >U+2028 LINE SEPARATOR is used for new lines.

    I agree that heuristics should be adjusted for Thai. But problems may
    arise if they have to be adjusted individually, and without
    regressions, for each of the world's 6000+ languages.
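
    To make Marco's point concrete, here is a minimal Python sketch. It
    is not the algorithm under discussion; the anchor set and the
    function name count_ascii_anchors are assumptions chosen for
    illustration. It counts the ASCII punctuation, spaces and
    end-of-line controls that such a heuristic would depend on, and
    shows a text that contains none of them.

```python
import string

# Hypothetical anchor set: the ASCII characters an encoding-detection
# heuristic of this kind might look for (punctuation, space, tab, CR, LF).
ASCII_ANCHORS = set(string.punctuation) | {" ", "\t", "\r", "\n"}


def count_ascii_anchors(text: str) -> int:
    """Return how many characters of `text` belong to ASCII_ANCHORS."""
    return sum(1 for ch in text if ch in ASCII_ANCHORS)


# Thai letters separated by an ASCII space and ended by an ASCII line feed.
sample_with_anchors = "\u0e20\u0e32\u0e29\u0e32 \u0e44\u0e17\u0e22\n"

# The same letters with no space, U+2028 LINE SEPARATOR as the line break,
# and Thai digits instead of ASCII digits.
sample_without_anchors = (
    "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22\u2028\u0e51\u0e52\u0e53"
)

print(count_ascii_anchors(sample_with_anchors))
print(count_ascii_anchors(sample_without_anchors))
```

    The second sample gives the heuristic nothing to hold on to, however
    long the file is, which is the case I am worried about.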

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

