Re: encoding sniffing

From: Philippe Verdy
Date: Mon Jul 14 2003 - 18:12:03 EDT

    On Monday, July 14, 2003 11:42 PM, Patrick Andries wrote:

    > In any case, I believe Peter has an idea how these libraries work and
    > their limitations, he is rather looking for one with its limitations.

    Including the limitations with Chinese? It gets tricky when handling traditional or scientific texts with many rare ideographs, because it is difficult to build an exhaustive morphological analysis for Chinese, even with the three-step approach. So a simple recognizer without any morphological or lexical database is even more likely to fail unless it is given hints about the language, or at least the main script (for example, excluding the Han script from the statistical results).

    With the GB18030 encoding, this would be a real challenge because of its even larger overlap with the ASCII range. However, it's quite easy to determine which encoding a Chinese text uses from just the byte or double-byte statistics.
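
    To illustrate what "double-byte statistics" could look like, here is a rough sketch in Python. It is not taken from any existing library: the names PROFILES, score and guess_encoding are made up for this example, and the "frequent hanzi" lead-byte ranges (0xB0-0xD7 for GB2312 level 1, 0xA4-0xC6 for Big5) are a crude stand-in for the per-character frequency tables a real detector would use. GB18030, with its four-byte forms, is deliberately left out.

```python
# Hypothetical sketch: guess GB2312 vs Big5 from double-byte statistics.
# Assumption: frequent hanzi cluster in the lead-byte ranges below; a real
# detector would use per-character frequency tables instead.
PROFILES = {
    "gb2312": {"lead": (0xB0, 0xD7), "trail": [(0xA1, 0xFE)]},
    "big5": {"lead": (0xA4, 0xC6), "trail": [(0x40, 0x7E), (0xA1, 0xFE)]},
}


def score(data: bytes, profile: dict) -> float:
    """Share of high-byte units that form a pair with a 'frequent' lead byte."""
    lo, hi = profile["lead"]
    hits = total = 0
    i = 0
    while i < len(data):
        if data[i] < 0x80:          # plain ASCII byte, not counted
            i += 1
            continue
        total += 1
        nxt = data[i + 1] if i + 1 < len(data) else None
        valid_trail = nxt is not None and any(
            a <= nxt <= b for a, b in profile["trail"]
        )
        if valid_trail:
            if lo <= data[i] <= hi:
                hits += 1
            i += 2                  # consume the whole pair
        else:
            i += 1                  # stray high byte, resynchronise
    return hits / total if total else 0.0


def guess_encoding(data: bytes):
    """Return the best-scoring profile name, or None if nothing matched."""
    scores = {name: score(data, p) for name, p in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_encoding(b"\xd6\xd0\xce\xc4"))  # "中文" encoded as GB2312
    print(guess_encoding(b"\xa4\xa4\xa4\xe5"))  # "中文" encoded as Big5
```

    On a two-character sample this is obviously fragile; the statistic only becomes meaningful over a reasonably long text, and pure ASCII input yields no guess at all.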

