Re: Case folding

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jun 09 2006 - 01:34:07 CDT


    From: "Mike" <mike-list@pobox.com>
    >>> Some of the recent discussions have led me to
    >>> question my implementation of upper/lowercasing
    >>> and case folding. Currently I simply iterate
    >>> through a string exchanging characters with
    >>> their replacements. I don't first normalize
    >>> to any form, or do any reordering of combining
    >>> marks afterward.
    >>>
    >>> My question is, should I be doing these things?
    >>
    >> To answer this question, ask yourself what would happen if you
    >> uppercased the string "Straße" this way.
    >
    > I think I would get the right answer, "STRASSE" (if
    > that is the "sharp S" I learned about a few weeks
    > ago). I didn't mean I only did single-character
    > replacement; I meant I didn't normalize the
    > string before or after replacing the characters.

    Wrong. The simple case folding of the sharp s is the sharp s itself, and the full case folding maps it to "ss" (lowercase). Standard case folding never converts a letter to uppercase.
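    A quick check in Python, whose str.casefold() implements the full case folding:

        # Full case folding (CaseFolding.txt, status F) maps U+00DF to "ss";
        # case folding never produces uppercase letters.
        assert "ß".casefold() == "ss"
        assert "Straße".casefold() == "strasse"
        assert "STRASSE".casefold() == "strasse"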

    I have applications that perform case-insensitive searches this way:
    closure{NFD, toLowerCase}( toUpperCase(NFD( string )) )

    It works for the special cases of German (sharp s), Turkish/Azeri (dotted/dotless i), and Greek (ypogegrammeni).
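    A minimal sketch of that pipeline in Python, using the standard unicodedata module; the closure{NFD, toLowerCase} step is approximated here by iterating to a fixed point rather than by a precomputed map:

        import unicodedata

        def case_insensitive_fold(s):
            # toUpperCase(NFD(string)): full uppercasing expands U+00DF
            # ("ß") to "SS" and U+0345 (ypogegrammeni) to a capital Iota.
            s = unicodedata.normalize("NFD", s).upper()
            # closure{NFD, toLowerCase}: reapply until the result is stable.
            while True:
                t = unicodedata.normalize("NFD", s.lower())
                if t == s:
                    return s
                s = t

    For example, case_insensitive_fold("Straße") and case_insensitive_fold("STRASSE") both yield "strasse".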

    In fact, the whole function is implemented with a precomputed map for the character closure, plus some special handling for certain combining characters.

    ----
    Note that if you compare case-insensitively and don't care about other variations (at the secondary collation level or higher), you can greatly reduce the complexity of the algorithm and get much faster results with the following:
    toLowerCase( toUpperCase(filter(NFKD( string ))) )
    where the filter() function eliminates all combining characters with a combining class greater than zero. In fact, the filter(NFKD()) composition can easily be precomputed by computing the closure of all decomposition mappings from the UCD main file and eliminating all combining characters with non-zero combining classes.
    For the same reason, this function can be optimized to compute only the lowercase conversion, since the intermediate conversion to uppercase is only there to handle the very few special cases such as the German sharp s; these can be handled specially or folded into the precomputed conversion table.
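    A sketch of this simplified primary-level fold, again in Python with unicodedata; the filter drops combining marks on the fly rather than using a precomputed table:

        import unicodedata

        def primary_fold(s):
            # filter(NFKD(string)): decompose compatibly, then drop every
            # character whose canonical combining class is non-zero.
            decomposed = unicodedata.normalize("NFKD", s)
            filtered = "".join(c for c in decomposed
                               if unicodedata.combining(c) == 0)
            # toLowerCase(toUpperCase(...)): the round trip through uppercase
            # catches expansions such as sharp s -> "SS" -> "ss".
            return filtered.upper().lower()

    With this, primary_fold("ç") == primary_fold("c") == "c", and primary_fold("Straße") == "strasse".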
    In fact, my filter() function also removes all format controls (such as Bidi overrides or embedding controls) and all collation-ignorable characters. Such a function is close to what NamePrep uses for IDN.
    If you have a library implementing NamePrep for IDN, it could be used to create such a simplified case folding at the primary level. Be warned, though, that it will unify all spaces, all dashes, and many punctuation signs, and of course it will unify 'ç' and 'c', or 'ø' and 'o'; i.e., it will remove attached or overlay diacritics as well as approximately similar letters, even though these are considered distinct in collation at the primary level under the default DUCET rule. But there are many applications where this is exactly what is expected for caseless searches (for example, when looking for similar words to detect typos and orthographic errors in texts, or in many plain-text search engines).
    The exact case folding can then be used at a secondary level when ordering the search hits by relevance (in plain-text search engines, for example), and the collation's secondary level can be used as a last step, when ordering results of similar relevance by their letter case. If you start doing that, it's best to simply use a tailored UCA collation rule and compute the UCA collation keys once, with a single scan of the text; this matters when the text to scan is large (for example, when searching for text across all files on a whole hard disk, possibly in various formats: plain text, PDF, DOC, HTML files, ...).
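    As an illustration of precomputing the keys, a sketch with PyICU; the choice of PyICU, the German locale, and the primary strength are assumptions for the example, not requirements:

        from icu import Collator, Locale  # PyICU bindings for ICU

        # Build the collator once, then compare precomputed sort keys
        # instead of re-collating the strings on every comparison.
        collator = Collator.createInstance(Locale("de"))
        collator.setStrength(Collator.PRIMARY)  # caseless, accent-less

        words = ["Straße", "STRASSE", "résumé", "resume"]
        keys = {w: collator.getSortKey(w) for w in words}
        print(keys["Straße"] == keys["STRASSE"])  # True at primary strength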
    Correctly implementing collation, optimizing it, and supporting tailorable collation rules would be a great feature for a text-handling library.
    And finally, having in the library a function that takes a collation key and computes one possible Unicode string with that collation key would be extremely useful. Such an inverse function is not described in Unicode for now, as there is no clear indication of which Unicode string should be returned: there exists an infinite number of Unicode strings sharing exactly the same UCA collation key (under a tailored collation rule or even the default DUCET), so the choice of one Unicode string is necessarily based on arbitrary rules (such as not inserting any collation-ignorable character, selecting the characters with the smallest code points, or selecting the shortest clusters), independently of the various canonical equivalents (and the possibly arbitrarily chosen normalization form for the result).
    

