Re: Normalization in panlingual application

From: Asmus Freytag
Date: Thu Sep 20 2007 - 07:07:25 CDT

  • Next message: John D. Burger: "Re: Normalization in panlingual application"

    On 9/19/2007 6:04 PM, Philippe Verdy wrote:
    > Asmus Freytag [] wrote:
    >> You realize, also, that it is not (in the general case) possible to
    >> apply normalization piece-meal. Because of that, breaking the text into
    >> runs and then normalizing can give different results (in some cases),
    >> which makes pre-processing a dicey option.
    > That's not my opinion.
    The fact that, for many strings s and t, NFxx(s) + NFxx(t) != NFxx(s +
    t) is not a matter of opinion. You cannot normalize such strings
    separately and then concatenate them, and expect the result to be the
    normalized form of the concatenation. UAX #15 is quite clear about that.
    > At least the first step of the conversion (converting
    > to NFC) is very safe and preserves differences, using standard programs
    > (which are widely available, so this step represents no risk). Detecting
    > compatibility characters and mapping them to annotated forms can be
    > applied after this step in a very straightforward way.
    I had written:
    > > Since none of the common libraries that implement normalization forms
    > > perform the necessary mappings to markup out of the box, anyone
    > > contemplating such a scheme would be forced to implement either a
    > > pre-processing step, or their own normalization logic. This is a
    > > downright scary suggestion, since such an approach would lose the
    > > benefit of using well-tested implementations. Normalization is tricky
    > > enough that one should try not to implement it from scratch if at all
    > > possible.
    Your approach confirms what I suspected. By suggesting an approach like
    this, you are advocating a de-novo implementation of the normalization
    transformation. By the way, NFC would be a poor starting point for your
    scheme, since all normalization forms start with an (implied) first step
    of applying *de*composition. But you can't even start with NFD, because
    the minute you decompose any compatibility characters in your following
    step, you can in principle create sequences that denormalize the
    existing NFD string around them. The work to handle these exceptions
    amounts, logically speaking, to a full implementation of normalization.
    In other words, you've lost the benefit of your library.
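    To illustrate the trap with one case (my own example): splicing a raw
    compatibility mapping into a string that a library has already
    normalized can leave the result denormalized. This uses
    unicodedata.is_normalized, which requires Python 3.8 or later:

```python
import unicodedata

# U+FB01 (the fi ligature) is a compatibility character; followed by a
# combining acute accent, the whole string is already in NFC.
s = "\ufb01\u0301"
assert unicodedata.is_normalized("NFC", s)

# Naively replacing just the compatibility character with its
# compatibility mapping "fi" -- without re-running normalization over
# the surrounding text -- yields a string that is no longer in NFC:
# the "i" now composes with the accent that follows it.
spliced = "fi\u0301"
print(unicodedata.is_normalized("NFC", spliced))   # False
print(ascii(unicodedata.normalize("NFC", spliced)))  # 'f\xed'
```

    Detecting and repairing every such interaction is, in effect, writing
    a normalizer.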

    It is precisely because normalization is unexpectedly tricky in its
    details that anything other than using an established library to apply
    NFC or NFD should not be contemplated in a *practical* implementation.

    Ken's suggestions on how to deal with the overall situation were much
    less speculative and much more to the point. Further, they spoke from
    real experience. His suggestion, to recap it here, was to apply NFC for
    internal storage, and to apply the necessary foldings when analyzing or
    presenting the data (depending on the view). By using foldings that are
    themselves designed to preserve NFC (see UTS #30), an implementer can be
    in control of how the data is massaged, without having to re-implement
    and re-test normalization.
    > But I won't recommend blindly applying
    > an NFKC/D transformation ...
    Well, at least we agree on that much.


    This archive was generated by hypermail 2.1.5 : Thu Sep 20 2007 - 07:09:53 CDT