RE: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?

From: Jungshik Shin (jshin@mailaps.org)
Date: Sat May 10 2003 - 22:04:39 EDT

Next message: Jungshik Shin: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"

Previous message: Allen Haaheim: "Re: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?"
In reply to: Maurice Bauhahn: "RE: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?"
Next in thread: John Delacour: "Re: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?"

On Sat, 10 May 2003, Maurice Bauhahn wrote:

> It would appear to be a three step process:
>
> (1) First, detect whether there are patterns reflecting single or multiple
> byte encoding and separate the text into apparent units. Hence work out
> for the last two). I'm not aware of Shift-JIS, Big5, or EUC encoding
> patterns, but presumably there are some characters for these. The units

SJIS, Big5, JOHAB, GBK and GB18030 form a class of multibyte CJK
encodings that are not ISO 2022 compliant, while EUC-JP, EUC-KR, EUC-CN
and EUC-TW are ISO 2022 compliant multibyte CJK encodings. ISO-2022-JP(-x),
ISO-2022-KR and ISO-2022-CN belong to another class of ISO 2022 compliant
encodings, one that uses ISO 2022 escape sequences. HZ is pretty much a
class of its own. For details, see Ken Lunde's CJKV Information Processing.
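
To make the grouping above a bit more concrete, here is a rough sketch
that sorts a byte string into one of those structural classes by looking
only at byte ranges. The function name and the class labels are made up
for illustration, and the test is only a necessary condition: Big5 or GBK
text can happen to pass the EUC check, so treat the result as a hint for
picking candidates, not as a detection result.

```python
def classify_byte_pattern(data: bytes) -> str:
    """Return a coarse structural class for `data` (hypothetical labels)."""
    high = [b for b in data if b >= 0x80]
    if not high:
        # Pure 7-bit data: look for ISO 2022 escapes or HZ shift markers.
        if b"\x1b" in data:
            return "7bit-with-escapes"  # ISO-2022-JP/-KR/-CN candidates
        if b"~{" in data:
            return "7bit-hz-like"       # HZ switches to GB with "~{"
        return "7bit-plain"
    # EUC-style encodings keep every non-ASCII byte in 0xA1-0xFE,
    # apart from the single-shift bytes SS2 (0x8E) and SS3 (0x8F).
    if all(0xA1 <= b <= 0xFE or b in (0x8E, 0x8F) for b in high):
        return "8bit-euc-like"
    # Anything else (for example lead bytes in 0x81-0x9F) points towards
    # SJIS/Big5/JOHAB/GBK/GB18030 or a non-CJK encoding.
    return "8bit-other"
```

For instance, the EUC-KR bytes C7 D1 B1 B9 come out as "8bit-euc-like",
while the Shift-JIS bytes 93 FA 96 7B come out as "8bit-other" because of
the lead byte 0x93.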

> (2) Second, compare this list against a hash of reference frequencies versus

> (3) Third, with a generous bit of fuzzy logic (!!), test against the most
> likely encodings (normalising the assumed code points to Unicode) and run

This is all good advice. As already mentioned, the final touch would
be to let the user override whatever your program comes up with. Web
browsers also need this kind of encoding detection (there are numerous
unlabelled or mislabelled web pages and email messages), and Mozilla has
a couple of detectors: a 'universal' one and language/script-specific
ones. Needless to say, the latter have a higher chance of getting it
right than the former. Take a look at intl/unichardet in Mozilla's CVS.
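
As a minimal sketch of the user-override idea, the function below lets an
explicit choice win over whatever the program guesses. The guessing part
is plain trial decoding over a candidate list, which is much cruder than
the frequency-based approach discussed in this thread and is not how
Mozilla's detectors work; the function name, parameters and default list
are assumptions for illustration only. Passing a shorter,
language-specific candidate list narrows the guess in the same spirit as
a language/script-specific detector.

```python
from typing import Optional, Sequence

# Hypothetical default list; order matters, stricter encodings first.
DEFAULT_CANDIDATES = ("ascii", "utf-8", "euc_kr", "shift_jis", "big5")


def guess_encoding(
    data: bytes,
    candidates: Sequence[str] = DEFAULT_CANDIDATES,
    override: Optional[str] = None,
) -> Optional[str]:
    """Return the user's override if given, else the first candidate
    that decodes `data` without error, else None."""
    if override is not None:
        return override  # the user always has the last word
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A caller would first show the guessed result and then call the function
again with override set if the user picks something else from a menu.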

Jungshik

This archive was generated by hypermail 2.1.5 : Sat May 10 2003 - 22:46:08 EDT