Re: Normalization in panlingual application

From: Asmus Freytag (
Date: Wed Sep 19 2007 - 19:41:24 CDT


    On 9/19/2007 5:21 PM, Philippe Verdy wrote:
    > Asmus Freytag wrote:
    >> What is an invalid distinction is defined by your application. If you
    >> case-fold, case is an invalid distinction. If your goal is to be able to
    >> represent text faithfully, then the "K" series of normalizations has no
    >> place in your design (It's too haphazard - for example, 5¼ would
    >> be turned into 51/4, which is decidedly not the same thing).
    > This is only a problem if the converted text has to be plain-text. If the
    > target of the project is to allow building rich-text documents from the
    > corpus, then conversion using NFKC becomes possible,
    A conversion that takes plain text to rich text, per definition, is not
    one of the Unicode Normalization Forms, but, at best, a higher-level
    protocol built on top of a Normalization Form. More on that below.
    > for example using some
    > XML annotation, the conversion would be something like
    > "5<fraction>1/4</fraction>" and will be in NFKC form (also in NFC form by
    > definition, as well as, here, in NFKD and NFD forms).
    Since none of the common libraries that implement normalization forms
    perform the necessary mappings to markup out of the box, anyone
    contemplating such a scheme would be forced to implement either a
    pre-processing step, or their own normalization logic. This is a
    downright scary suggestion, since such an approach would lose the
    benefit of using a well-tested implementation. Normalization is tricky
    enough that one should try not to implement it from scratch if at all possible.
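
    To make the "5¼" case concrete, here is a minimal sketch of what a
    stock library does. It assumes Python's standard unicodedata module,
    which is my choice for illustration and not something named in this
    thread; other libraries should behave the same way for this input.

```python
import unicodedata

# "5¼": DIGIT FIVE followed by U+00BC VULGAR FRACTION ONE QUARTER
text = "5\u00bc"

nfc = unicodedata.normalize("NFC", text)
nfkc = unicodedata.normalize("NFKC", text)

# NFC leaves the fraction character alone.
print(nfc == text)

# NFKC expands it to DIGIT ONE, FRACTION SLASH, DIGIT FOUR with no markup,
# so the result reads as the digits 5 and 1 followed by "/4".
print([unicodedata.name(ch) for ch in nfkc])
```

    Nothing in the library output marks where the fraction began, which
    is why the markup would have to come from a separate step.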

    You realize, also, that it is not (in the general case) possible to
    apply normalization piece-meal. Because of that, breaking the text into
    runs and then normalizing can give different results (in some cases),
    which makes pre-processing a dicey option.
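
    A small sketch of the piece-meal problem, again assuming Python's
    unicodedata module purely for illustration. The split point is chosen
    deliberately to fall between a base letter and its combining mark.

```python
import unicodedata

def nfc(s):
    """Normalize a string to NFC using the standard library."""
    return unicodedata.normalize("NFC", s)

# The run boundary falls between "a" and U+0301 COMBINING ACUTE ACCENT.
left, right = "a", "\u0301"

whole = nfc(left + right)           # normalize the full text once
piecewise = nfc(left) + nfc(right)  # normalize each run, then concatenate

print(whole == piecewise)           # False: the results differ
print(len(whole), len(piecewise))   # 1 2
```

    Normalizing the whole string composes the pair into U+00E1, while
    the concatenation of the normalized runs is not itself in NFC.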

    > Such annotation changes the nature of texts: they are no longer linear
    > sequences of characters, because a structure is mapped on top of them.
    Precisely the point I made above.
    > Given the
    > expected usage, parsing texts to map a structure on top of them will probably
    > be a bonus, as it will ease later reuse of the converted corpus, within
    > different contexts.
    That may be, but the better choice is to normalize according to NFC/NFD
    and then apply particular foldings, and/or substitutions of markup. The
    "K" series of normalization forms, by default, just "k"orrupt the data.
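
    As a rough sketch of that ordering: normalize to NFC with a tested
    library first, then apply a small, explicit substitution table. The
    table, the function name to_marked_up, and the use of Python's
    unicodedata and html modules are all assumptions made for this
    illustration; only the <fraction> element comes from the quoted
    example, and a real table would be defined by the application.

```python
import html
import unicodedata

# Hypothetical, application-defined substitutions: only the distinctions
# this application wants to turn into markup are listed.
FRACTION_MARKUP = {
    "\u00bc": "<fraction>1/4</fraction>",
    "\u00bd": "<fraction>1/2</fraction>",
    "\u00be": "<fraction>3/4</fraction>",
}

def to_marked_up(text):
    """NFC-normalize the whole text, then substitute markup per character."""
    normalized = unicodedata.normalize("NFC", text)
    return "".join(
        # Characters without a table entry are escaped so that "&" and "<"
        # in the source cannot be mistaken for markup.
        FRACTION_MARKUP.get(ch) or html.escape(ch, quote=False)
        for ch in normalized
    )

print(to_marked_up("5\u00bc"))  # 5<fraction>1/4</fraction>
```

    Because the substitution runs after normalization of the complete
    text, the normalization step itself stays inside the library.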


