Re: Detecting encoding in Plain text

From: Peter Kirk
Date: Wed Jan 14 2004 - 07:33:34 EST

    On 13/01/2004 18:05, D. Starner wrote:

    >Peter Kirk writes:
    >>I agree that heuristics should be adjusted for Thai. But problems may
    >>arise if they have to be adjusted individually, and without regression
    >>errors, for all 6000+ world languages.
    >Thai is hard because of the writing system. But most writing systems weren't
    >encoded pre-Unicode, so if they were typed into a computer, it was with
    >a Latin (or Cyrillic?) transliteration that probably used spaces and new lines,
    >and in fact was probably ASCII.
    >More cynically, those who use obscure character sets or font encodings have
    >trouble viewing them; that is one of the reasons for Unicode. That this tool
    >may to some extent be an example of that problem is a simple fact of life,
    >and doesn't call for it to be thrown out.

    Either you are confused or I am. I was not referring to pre-Unicode
    legacy encodings. I was referring to Unicode plain text data which may
    (when Unicode includes all the necessary characters) be in any one of
    6000+ languages, some of which have a variety of scripts and spelling
    conventions. The problem is not that people are using obscure legacy
    encodings, but that they are not specifying which UTF they are using.
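
    To illustrate what is left to guesswork when the UTF is not declared,
    here is a minimal sketch in Python. It is not taken from any tool
    discussed in this thread: the function name sniff_utf and the fallback
    to a strict UTF-8 decode are assumptions made for illustration only.

```python
import codecs

# Longer BOMs come first: the UTF-32LE BOM starts with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_utf(data: bytes):
    """Guess the UTF of `data` (hypothetical helper, not a library API).

    Returns a codec name, or None when no guess can be made.
    """
    # 1. A byte order mark, when present, is an explicit declaration.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Without a BOM, fall back to "does it decode as strict UTF-8?".
    #    This is only a heuristic and says nothing about the language.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

    Note that Thai text in UTF-16BE without a BOM consists entirely of
    bytes below 0x80, so a check like this one would wrongly report it as
    UTF-8. Only an explicit declaration removes the guesswork.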

    Peter Kirk
