RE: Normalization in panlingual application

From: Philippe Verdy
Date: Wed Sep 19 2007 - 20:04:00 CDT

  • Next message: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

    Asmus Freytag [] wrote:
    > Sent: Thursday, September 20, 2007 02:41
    > To:
    > Cc: 'Jonathan Pool';
    > Subject: Re: Normalization in panlingual application
    > On 9/19/2007 5:21 PM, Philippe Verdy wrote:
    > > Asmus Freytag wrote:
    > >
    > >> What is an invalid distinction is defined by your application. If you
    > >> case-fold, case is an invalid distinction. If your goal is to be able
    > >> to represent text faithfully, then the "K" series of normalizations
    > >> has no place in your design (it's too haphazard - for example, also,
    > >> 5¼ would be turned into 51/4, which is decidedly not the same thing).
    > >>
    > >
    > > This is only a problem if the converted text has to be plain text. If
    > > the target of the project is to allow building rich-text documents
    > > from the corpus, then conversion using NFKC becomes possible,
    > A conversion that takes plain text to rich text is, by definition, not
    > one of the Unicode Normalization Forms, but, at best, a higher-level
    > protocol built on top of a Normalization Form. More on that below.
    > > for example using some XML annotation: the conversion would produce
    > > something like "5<fraction>1/4</fraction>", which will be in NFKC
    > > form (also, by definition, in NFC form, as well as, here, in NFD and
    > > NFKD forms).
    > >
    > Since none of the common libraries that implement normalization forms
    > perform the necessary mappings to markup out of the box, anyone
    > contemplating such a scheme would be forced to implement either a
    > pre-processing step or their own normalization logic. This is a
    > downright scary suggestion, since such an approach would lose the
    > benefit of using a well-tested implementation. Normalization is tricky
    > enough that one should try not to implement it from scratch if at all
    > possible.
    > You realize, also, that it is not (in the general case) possible to
    > apply normalization piecemeal. Because of that, breaking the text into
    > runs and then normalizing can give different results (in some cases),
    > which makes pre-processing a dicey option.

    That's not my opinion. At least the first step of the conversion (converting
    to NFC) is very safe and preserves differences, using standard programs
    (which are widely available, so this step presents no risk). Detecting
    compatibility characters and mapping them to annotated forms can be applied
    after this step in a very straightforward way. But I would not recommend
    blindly applying an NFKC/D transformation to a large corpus of texts in
    different formats with different conventions, without asking ourselves why
    some differences were encoded that way, using compatibility characters (not
    in NFKC/D form).
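A minimal sketch of this two-step approach, using Python's standard unicodedata module (the function name and sample string are just illustrative): normalize to NFC first, which is lossless, then flag the characters whose NFKC form differs so they can be reviewed or annotated rather than silently folded.

```python
import unicodedata

def nfc_then_flag(text):
    """Return the NFC form of text, plus the set of compatibility
    characters it still contains (those changed by NFKC)."""
    nfc = unicodedata.normalize("NFC", text)
    compat = {ch for ch in nfc
              if unicodedata.normalize("NFKC", ch) != ch}
    return nfc, compat

# "5¼ km²": the text itself is untouched by NFC,
# while "¼" and "²" are flagged for review.
nfc, compat = nfc_then_flag("5\u00bc km\u00b2")
```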

    > > Such annotation changes the nature of the texts, by turning them from
    > > linear sequences of characters into data with a structure mapped on
    > > top.
    > Precisely the point I made above.
    > > Given the expected usage, parsing texts to map a structure on top of
    > > them will probably be a bonus, as it will ease later reuse of the
    > > converted corpus within different contexts.
    > >
    > >
    > That may be, but the better choice is to normalize according to NFC/NFD
    > and then apply particular foldings, and/or substitutions of markup. The
    > "K" series of normalization forms, by default, just "k"orrupt the data.

    That's exactly what I suggested: avoiding blind usage of the K series on
    large corpora of texts. The particular foldings need to be applied by first
    examining how much they affect the corpus, so this should be done by testing
    the differences, counting them, extracting them to check the results, and
    then applying the folding to the selected sub-corpus, step by step. Such a
    process can't be fully automated, as it requires some knowledge of the
    language or subject of the texts, and review by humans. Using NFKD blindly
    is very risky on a large corpus, as there's no way to revert the lossy
    transformation.

    Computing statistics about the usage of the compatibility characters
    remaining in the converted text will help manage the work to be done, by
    helping to make sub-selections. Also, if you keep in the corpus some other
    metadata indicating each text's source, usage, or subject, this can help
    make sub-selections in a more relevant way. This way, a given mapping need
    not be applied to the whole corpus, but only where it is relevant. It's
    quite easy to compute such statistics, because the compatibility characters
    are known exhaustively, and there are not so many of them in Unicode. A
    very basic filter can be made to count their occurrences in the indexed
    texts.
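Such a basic filter could be sketched as follows (the corpus layout and names are my own illustration): count occurrences of compatibility characters, i.e. those changed by NFKD, per document, so that foldings can be reviewed and applied sub-corpus by sub-corpus.

```python
import unicodedata
from collections import Counter

def compat_stats(corpus):
    """Count occurrences of each compatibility character per document.

    corpus: mapping of document id -> text.
    Returns a Counter keyed by (doc_id, character)."""
    counts = Counter()
    for doc_id, text in corpus.items():
        for ch in text:
            # A character is "compatibility" here if NFKD changes it.
            if unicodedata.normalize("NFKD", ch) != ch:
                counts[(doc_id, ch)] += 1
    return counts

# "x² + y²" uses superscript two; "ﬁne" uses the fi-ligature U+FB01.
stats = compat_stats({"doc1": "x\u00b2 + y\u00b2", "doc2": "\ufb01ne"})
```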

    NFKC has its uses, but in limited contexts: as a last-chance step used when
    human review is not possible, such as in full-text search engines, to
    extend searches to nearly equivalent texts with approximate matches. In
    such usage, no text is actually produced and stored in NFKC; the corpus is
    not modified, but NFK(C/D) forms are still used only to generate indexing
    keys. I think this is its only safe application, provided that exact
    searches of canonically equivalent strings remain possible in a separate
    search index.
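This indexing-only use of NFKC can be illustrated with a toy sketch (a real engine would tokenize; the names here are hypothetical): the stored text is never modified, and NFKC plus case folding is applied only when computing the lookup key.

```python
import unicodedata

index = {}

def add_document(doc_id, text):
    """Index a document under its folded NFKC key; the original
    text is stored elsewhere, unmodified."""
    key = unicodedata.normalize("NFKC", text).casefold()
    index.setdefault(key, []).append(doc_id)

def search(query):
    """Fold the query the same way, so compatibility variants match."""
    key = unicodedata.normalize("NFKC", query).casefold()
    return index.get(key, [])

add_document("d1", "\ufb01le")   # "ﬁle", with the fi-ligature U+FB01
search("FILE")                   # matches d1 despite ligature and case
```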

    This archive was generated by hypermail 2.1.5 : Wed Sep 19 2007 - 20:05:39 CDT