Re: encoding sniffing

From: Patrick Andries (Patrick.Andries@xcential.com)
Date: Mon Jul 14 2003 - 18:51:41 EDT

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>

    > On Monday, July 14, 2003 11:42 PM, Patrick Andries
    > <Patrick.Andries@xcential.com> wrote:
    >
    > > In any case, I believe Peter has an idea how these libraries work and
    > > their limitations; he is simply looking for one, limitations included.
    >
    > Including the Chinese limitations? It will become tricky when dealing with
    > traditional or scientific texts with many rare ideographs, because it's
    > difficult to create an exhaustive morphological analysis with Chinese,

    This product does no morphological analysis; it uses a hidden Markov model.
    Did you try it? (I just checked http://www.gov.tw/sars/ with
    http://quebec.alis.com/castil/essai_silc.cgi, and it gave me Chinese, Big-5.)

    Obviously the model is stochastic, but it can be fine-tuned by supplying a
    larger (domain-specific if needed) tagged corpus.
    An improved version is used by Netscape (at least this was my impression
    when I left Alis).
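
    To make the "tagged corpus" idea concrete, here is a minimal sketch in
    Python. It is not the algorithm of the product discussed above: it
    swaps in a plain first-order Markov chain over bytes (not a hidden
    Markov model), with one model per (language, encoding) tag, and picks
    the tag whose model gives the input the highest log-likelihood. All
    names (ByteMarkovModel, train_models, guess, TAGGED_CORPUS) and the
    three tiny training samples are hypothetical, chosen for illustration.

```python
import math
from collections import defaultdict


class ByteMarkovModel:
    """First-order Markov chain over bytes with add-one smoothing.

    Illustrative assumption only; a real detector would use a richer model.
    """

    def __init__(self):
        self.pair_counts = defaultdict(int)  # (prev_byte, next_byte) -> count
        self.prev_counts = defaultdict(int)  # prev_byte -> count

    def train(self, data: bytes) -> None:
        # Count every adjacent byte pair in the tagged sample.
        for prev, nxt in zip(data, data[1:]):
            self.pair_counts[(prev, nxt)] += 1
            self.prev_counts[prev] += 1

    def log_likelihood(self, data: bytes) -> float:
        # Sum of log P(next | prev), smoothed over the 256 possible bytes.
        total = 0.0
        for prev, nxt in zip(data, data[1:]):
            num = self.pair_counts.get((prev, nxt), 0) + 1
            den = self.prev_counts.get(prev, 0) + 256
            total += math.log(num / den)
        return total


def train_models(tagged_corpus):
    """Build one model per tag; several samples may share the same tag."""
    models = {}
    for label, sample in tagged_corpus:
        models.setdefault(label, ByteMarkovModel()).train(sample)
    return models


def guess(models, data: bytes):
    """Return the tag whose model scores the data highest."""
    return max(models, key=lambda label: models[label].log_likelihood(data))


# Hypothetical toy corpus: (language, encoding) tag plus raw bytes.
TAGGED_CORPUS = [
    (("Chinese", "Big5"), "嚴重急性呼吸道症候群防治資訊".encode("big5")),
    (("Chinese", "GB2312"), "严重急性呼吸道综合征防治信息".encode("gb2312")),
    (("French", "ISO-8859-1"), "Textes Unicode en français".encode("latin-1")),
]

if __name__ == "__main__":
    models = train_models(TAGGED_CORPUS)
    print(guess(models, "防治資訊".encode("big5")))
```

    With samples this small the guess is only reliable on input that
    closely resembles the training text. Supplying a larger or
    domain-specific tagged corpus means adding more (tag, bytes) pairs to
    the list passed to train_models, which is the tuning step described
    above.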

    P. Andries
    - o - 0 - o -
    Textes Unicode en français
    http://pages.infinit.net/hapax


