RE: Normalization in panlingual application

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Sep 19 2007 - 19:21:03 CDT


    Asmus Freytag wrote:
    > What is an invalid distinction is defined by your application. If you
    > case-fold, case is an invalid distinction. If your goal is to be able to
    > represent text faithfully, then the "K" series of normalizations has no
    > place in your design (It's too haphazard - for example, 5¼ would
    > be turned into 51/4, which is decidedly not the same thing).

    This is only a problem if the converted text has to be plain text. If the
    target of the project is to allow building rich-text documents from the
    corpus, then conversion using NFKC becomes possible. For example, using some
    XML annotation, the conversion would be something like
    "5<fraction>1/4</fraction>", which will be in NFKC form (and so also in NFC
    form by definition, as well as, here, in NFKD and NFD forms).
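
    To make this concrete, here is a minimal Python sketch of such an
    annotating conversion. The function name annotate_fractions and the
    <fraction> element are only illustrative assumptions, not an existing
    schema, and replacing U+2044 FRACTION SLASH with an ASCII "/" is a choice
    made here to match the example above. It relies on the standard
    unicodedata module, whose decomposition() call reports the "<fraction>"
    compatibility tag.

```python
import unicodedata


def annotate_fractions(text: str) -> str:
    """Fold text to NFKC, but keep vulgar fractions recoverable by
    wrapping them in a hypothetical <fraction> element first."""
    pieces = []
    for ch in text:
        # decomposition() returns e.g. '<fraction> 0031 2044 0034' for U+00BC
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            folded = unicodedata.normalize("NFKC", ch)
            # Assumption: use ASCII '/' instead of U+2044 FRACTION SLASH
            folded = folded.replace("\u2044", "/")
            pieces.append(f"<fraction>{folded}</fraction>")
        else:
            pieces.append(ch)
    # The remaining compatibility characters are folded as usual
    return unicodedata.normalize("NFKC", "".join(pieces))


plain = unicodedata.normalize("NFKC", "5\u00bc")  # digits run together
marked = annotate_fractions("5\u00bc")            # fraction stays delimited
print(plain, marked)
```

    With plain NFKC the integer part and the numerator become adjacent digits,
    whereas the annotated form keeps the boundary and, being pure ASCII, is
    left unchanged by all four normalization forms.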

    Such annotation changes the nature of the texts: they are no longer linear
    sequences of characters, because a structure is mapped on top of them. Given
    the expected usage, parsing the texts to map a structure on top of them will
    probably be a bonus, as it will ease later reuse of the converted corpus in
    different contexts.


