Re: encoding sniffing

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 14 2003 - 18:12:03 EDT

Next message: Kenneth Whistler: "Re: Nu Shu script"

Previous message: Peter Kirk: "Re: [Private Use Area] Audio Description, Subtitle, Signing"
In reply to: Patrick Andries: "Re: encoding sniffing"
Next in thread: Patrick Andries: "Re: encoding sniffing"
Reply: Patrick Andries: "Re: encoding sniffing"

On Monday, July 14, 2003 11:42 PM, Patrick Andries <Patrick.Andries@xcential.com> wrote:

> In any case, I believe Peter has an idea how these libraries work and
> their limitations, he is rather looking for one with its limitations.

Including the limitations for Chinese? Things become tricky with traditional or scientific texts containing many rare ideographs, because it is difficult to build an exhaustive morphological analysis for Chinese, even with the three-step approach. A simple recognizer with no morphological or lexical database would therefore be even more likely to fail unless it is given hints about the language, or at least about the main script (for example, excluding the Han script from the statistical results).

With the GB18030 encoding, this would be a real challenge because of its even larger overlap with the ASCII range. However, it's quite easy to determine which encoding a Chinese text uses from byte or double-byte statistics alone.
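
To illustrate what such double-byte statistics could look like, here is a minimal Python sketch. It is not taken from any particular library: the function names and the 0.05 threshold are assumptions made for illustration only. It relies on one known property: in GB2312 (EUC-CN) the trail byte of a double-byte character is always 0xA1 or above, whereas Big5 frequently uses trail bytes in 0x40-0x7E. It only separates those two encodings and deliberately ignores GBK and GB18030, whose trail bytes also fall in the ASCII range.

```python
from collections import Counter


def double_byte_stats(data: bytes) -> Counter:
    """Count (lead, trail) byte pairs, treating any byte >= 0x81 as a lead byte."""
    pairs = Counter()
    i = 0
    while i < len(data):
        if data[i] >= 0x81 and i + 1 < len(data):
            pairs[(data[i], data[i + 1])] += 1
            i += 2  # skip the trail byte we just consumed
        else:
            i += 1  # plain ASCII byte (or a dangling lead byte at the end)
    return pairs


def low_trail_ratio(data: bytes) -> float:
    """Fraction of double-byte pairs whose trail byte lies in 0x40-0x7E."""
    pairs = double_byte_stats(data)
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    low = sum(n for (_, trail), n in pairs.items() if 0x40 <= trail <= 0x7E)
    return low / total


def guess_chinese_encoding(data: bytes, threshold: float = 0.05) -> str:
    """Hypothetical heuristic: many ASCII-range trail bytes suggest Big5.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return "big5" if low_trail_ratio(data) > threshold else "gb2312"


if __name__ == "__main__":
    print(guess_chinese_encoding("这是一个人".encode("gb2312")))
    print(guess_chinese_encoding("這是一個人".encode("big5")))
```

A real detector would compare full byte-pair frequency tables against reference distributions for each candidate encoding; the single ratio above is only the simplest statistic that already separates these two cases.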

This archive was generated by hypermail 2.1.5 : Mon Jul 14 2003 - 18:53:01 EDT