RE: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?

From: Jungshik Shin (jshin@mailaps.org)
Date: Sat May 10 2003 - 22:04:39 EDT

    On Sat, 10 May 2003, Maurice Bauhahn wrote:

    > It would appear to be a three step process:
    >
    > (1) First, detect whether there are patterns reflecting single or multiple
    > byte encoding and separate the text into apparent units. Hence work out
    > for the last two). I'm not aware of Shift-JIS, Big5, or EUC encoding
    > patterns, but presumably there are some characters for these. The units

       SJIS/Big5/JOHAB/GBK/GB18030 form a class of multibyte CJK encodings
    that are not ISO 2022 compliant, while EUC-JP, EUC-KR, EUC-CN and EUC-TW
    are ISO 2022 compliant multibyte CJK encodings. ISO-2022-JP(-x),
    ISO-2022-KR and ISO-2022-CN belong to another class of ISO 2022
    compliant encodings, one that uses ISO 2022 escape sequences. HZ is in
    a class of its own. For details, see Ken Lunde's CJKV Information
    Processing.
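
    The classes above leave different byte-level fingerprints, which a
    first detection pass can exploit. What follows is only a rough sketch
    under stated assumptions, not a real detector: the function name, the
    class labels and the parity rule for runs of high bytes are all
    illustrative. It assumes that escape-based encodings contain ESC (0x1B),
    that HZ is pure 7-bit text containing "~{", and that EUC-style text uses
    only bytes 0xA1-0xFE in pairs, plus the single shifts 0x8E and 0x8F.

```python
# Hypothetical class labels, invented for this sketch.
ESCAPE = "iso-2022 with escape sequences"
HZ = "hz"
ASCII = "ascii"
EUC = "euc-style (iso 2022 compliant, 8-bit)"
NON_2022 = "not iso 2022 compliant (sjis/big5/gbk/johab-like)"


def is_gr(b: int) -> bool:
    """True for bytes in the 0xA1-0xFE range used by EUC encodings."""
    return 0xA1 <= b <= 0xFE


def structural_class(data: bytes) -> str:
    """Guess the structural class of a byte string from byte patterns alone."""
    if b"\x1b" in data:
        return ESCAPE
    if all(b < 0x80 for b in data):
        return HZ if b"~{" in data else ASCII
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            i += 1
        elif b in (0x8E, 0x8F):
            # Single shift: must be followed by at least one high byte.
            # The length of the shifted run differs per encoding, so it
            # is skipped here without a parity check.
            i += 1
            if i >= n or not is_gr(data[i]):
                return NON_2022
            while i < n and is_gr(data[i]):
                i += 1
        elif is_gr(b):
            # Outside a single shift, EUC characters are two high bytes,
            # so a run of high bytes must have even length.
            start = i
            while i < n and is_gr(data[i]):
                i += 1
            if (i - start) % 2:
                return NON_2022
        else:
            # 0x80-0x8D, 0x90-0xA0 and 0xFF never occur in EUC text.
            return NON_2022
    return EUC
```

    Note that Big5 or GBK text whose bytes all happen to fall in 0xA1-0xFE
    is indistinguishable from EUC at this level, which is why the frequency
    comparison in the later steps is still needed.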

    > (2) Second, compare this list against a hash of reference frequencies versus

    > (3) Third, with a generous bit of fuzzy logic (!!), test against the most
    > likely encodings (normalising the assumed code points to Unicode) and run

     This is all good advice. As already mentioned, the final touch would
    be to let the user override whatever your program comes up with. Web
    browsers also need this kind of encoding detection (there are numerous
    unlabelled or mislabelled web pages and email messages), and Mozilla has
    a couple of detectors: a 'universal' one and language/script-specific
    ones. Needless to say, the latter have a higher chance of getting it
    right than the former. Take a look at intl/unichardet in Mozilla's CVS.
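
    The user-override idea can be sketched as follows. This is not how
    Mozilla's detectors work; it swaps in plain trial decoding with Python's
    built-in codecs to keep the example short. The function name, the
    JAPANESE candidate list and the Latin-1 fallback are assumptions made
    for illustration. A short, language-specific candidate list plays the
    role of the lang/script-specific detector mentioned above.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical language-specific candidate list, most restrictive first.
JAPANESE = ("iso2022_jp", "euc_jp", "shift_jis")


def decode_with_override(
    data: bytes,
    candidates: Sequence[str],
    override: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (encoding name, decoded text) for a byte string."""
    if override is not None:
        # The user's choice always wins, even if it decodes poorly;
        # undecodable bytes become U+FFFD instead of raising.
        return override, data.decode(override, errors="replace")
    for name in candidates:
        try:
            # Strict decoding: the first candidate that accepts every
            # byte is taken as the guess.
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    # Assumed fallback: Latin-1 accepts any byte sequence.
    return "latin-1", data.decode("latin-1")
```

    For example, decode_with_override(raw, JAPANESE) makes a guess, while
    decode_with_override(raw, JAPANESE, override="shift_jis") applies the
    user's choice regardless of what the guess would have been.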

      Jungshik


