Re: Normalization in panlingual application

From: Mark Davis (mark.davis@icu-project.org)
Date: Thu Sep 20 2007 - 12:44:58 CDT

    A few observations.

    1. IDNA does use NFKC. The mappings are duplicated in the spec,
    because case mapping is also applied, and they are filtered because
    characters that are disallowed before or after don't need to have
    mappings. NFKC works well for identifiers, even ones that are more
    "human language like", since the characters that behave oddly are
    typically not allowed anyway.

    2. It's important to be clear that folding for *matching* is a
    different kind of process than "normalization". When you are matching,
    you don't actually alter the text that you store; instead, you
    (logically) transform both the search text and the indexed text so
    that a binary comparison erases distinctions that are less relevant to
    matching. Matching may also be language dependent -- thus you may want
    to match a-ring against aa for Danish. You also want to match the
    other cases that are not canonical or compatibility equivalences, such
    as curly quote marks against straight quote marks. So while NFC is a
    starting point for matching text, it isn't enough.

    3. Matching for search can be tolerant of a certain degree of
    imprecision. You could alter the mapping for ½ to <space>1/2<space>,
    but it simply doesn't matter much if 5½ folds to 51/2 for
    searching, since you won't get any appreciable number of false
    positives, and users will just skip over the vanishingly small number
    that are found.

    4. What we've found is that using most of the NFKC mappings, plus case
    folding, plus some of the UCA mappings, plus a few others, gives a
    pretty good result for the language-independent matching.
    (Language-dependent matching is more complicated.)
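
    To make points 2 through 4 concrete, here is a minimal sketch in
    Python of a folding key used only for comparison, never for storage.
    The names fold_key, matches, and EXTRA_FOLDS are hypothetical, and
    the few extra mappings (curly quotes, fraction slash, an optional
    Danish a-ring fold) are illustrative assumptions, not the actual
    mapping set referred to above.

```python
import unicodedata

# Assumption: a tiny, illustrative set of folds that are neither canonical
# nor compatibility equivalences. A real table would be much larger.
EXTRA_FOLDS = str.maketrans({
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote
    "\u201C": '"',   # left double curly quote
    "\u201D": '"',   # right double curly quote
    "\u2044": "/",   # fraction slash (what NFKC leaves behind for U+00BD)
})


def fold_key(text: str, danish: bool = False) -> str:
    """Return a comparison key; the caller keeps the original text as is."""
    # Compatibility normalization, then case folding, then normalize again
    # because case folding can produce unnormalized sequences.
    key = unicodedata.normalize("NFKC", text)
    key = key.casefold()
    key = unicodedata.normalize("NFKC", key)
    # Extra, non-Unicode-equivalence folds (quotes, fraction slash).
    key = key.translate(EXTRA_FOLDS)
    if danish:
        # Language-dependent tailoring: treat a-ring like "aa".
        key = key.replace("\u00e5", "aa")
    return key


def matches(query: str, stored: str, danish: bool = False) -> bool:
    """Fold both sides the same way, then do a plain substring comparison."""
    return fold_key(query, danish) in fold_key(stored, danish)
```

    With this sketch, matches("51/2", "It is 5½ inches") is true, which
    is the harmless imprecision described in point 3, and the stored
    string is left untouched throughout.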


