Plain-text search algorithms: normalization, decomposition, case mapping, word breaks

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jun 27 2003 - 07:46:20 EDT


    In order to implement a plain-text search algorithm, in a language-neutral way that still works with all scripts, I am looking for advice on how this can be done "safely" (notably for automated search engines), to allow searching for text matching some basic encoding styles.

    My first approach to the problem is to simplify the text into an indexable form that unifies "similar" characters.
    So I'd like to have comments about possible issues in modern languages if I perform the following "search canonicalization":

    - Decompose the string into NFKD (this removes font-related information and isolates combining marks).
    - Remove all combining characters (with combining class > 0), including Hebrew and Arabic cantillation.
     (Are there significant combining vowel signs that should be kept?)
    - Apply case folding using the Unicode standard (preferably to lowercase).
    - Possibly perform a closure of the above three transforms.
    - Remove all controls, except TAB, CR, LF, VT, FF.
    - Replace all dashes with the ASCII hyphen-minus.
    - Replace all spacing characters with an ASCII space.
    - Replace all other punctuation with spaces.
    - Canonicalize the remaining spaces (no leading or trailing spaces, and all other sequences replaced with a single space).
    - (Maybe) recompose Korean Hangul syllables?
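    The steps above could be sketched roughly as follows in Python, using only the standard unicodedata module. This is a minimal sketch, not a definitive implementation: the function name is hypothetical, str.casefold only approximates Unicode case folding, and mapping dashes/punctuation by general category (Pd, P*) is an assumption about what "dashes" and "punctuation" should mean here.

    ```python
    import re
    import unicodedata

    def search_canonicalize(text: str) -> str:
        """Sketch of the proposed search-canonicalization steps."""
        # 1. Compatibility decomposition (NFKD) strips font-related
        #    distinctions and isolates combining marks.
        text = unicodedata.normalize("NFKD", text)
        # 2. Drop combining characters (combining class > 0), which removes
        #    accents as well as Hebrew/Arabic cantillation marks.
        text = "".join(c for c in text if unicodedata.combining(c) == 0)
        # 3. Case folding (approximated by str.casefold).
        text = text.casefold()
        # 4. Remove controls, except TAB, CR, LF, VT, FF.
        keep = "\t\r\n\x0b\x0c"
        text = "".join(
            c for c in text
            if c in keep or unicodedata.category(c) != "Cc"
        )
        # 5-7. Map dashes (category Pd) to hyphen-minus, spacing characters
        #      to an ASCII space, and all other punctuation to spaces.
        out = []
        for c in text:
            cat = unicodedata.category(c)
            if cat == "Pd":
                out.append("-")
            elif cat == "Zs" or c in keep:
                out.append(" ")
            elif cat.startswith("P"):
                out.append(" ")
            else:
                out.append(c)
        text = "".join(out)
        # 8. Canonicalize spaces: collapse runs, strip leading/trailing.
        text = re.sub(r"\s+", " ", text).strip()
        # 9. (Maybe) recompose Hangul syllables via canonical composition.
        return unicodedata.normalize("NFC", text)
    ```

    For example, "Éléphant!" would come out as "elephant", with the acute accents stripped as combining marks and the punctuation mapped to (then trimmed) spaces.
    
    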

    What are the possible caveats, notably for Japanese, Korean and Chinese, which traditionally do not use spaces?

    How can we improve the algorithm for searches in Thai without using a dictionary, so that word breaks can be detected more easily (and marked by inserting an ASCII space)?

    Should I insert a space when there is a change of script (for example in Japanese, between Hiragana, Katakana, Latin and Kanji ideographs)?
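    A crude version of such script-change splitting could look like the sketch below. The block ranges are a simplifying assumption (a real implementation would use the Unicode Script property rather than hard-coded ranges), and both function names are hypothetical.

    ```python
    # Hypothetical helper: a coarse script classifier based on a few
    # well-known block ranges (assumption; not the full Script property).
    def script_of(c: str) -> str:
        cp = ord(c)
        if 0x3040 <= cp <= 0x309F:
            return "Hiragana"
        if 0x30A0 <= cp <= 0x30FF:
            return "Katakana"
        if 0x4E00 <= cp <= 0x9FFF:
            return "Han"
        if c.isascii() and c.isalpha():
            return "Latin"
        return "Other"

    def insert_script_breaks(text: str) -> str:
        """Insert an ASCII space at each script transition."""
        out = []
        prev = None
        for c in text:
            s = script_of(c)
            # Break only between two identified scripts, not around
            # unclassified characters such as spaces or digits.
            if prev is not None and s != prev and "Other" not in (s, prev):
                out.append(" ")
            out.append(c)
            prev = s
        return "".join(out)
    ```

    On "Unicodeとは何か" this would yield "Unicode とは 何 か" — which also illustrates the limitation: without a dictionary, the Han/Hiragana transitions over-segment compared to real word boundaries.
    
    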

    Is there an existing, documented conversion table used in plain-text search engines?

    Is Unicode working on such a search-canonicalization algorithm?

    Thanks for the comments.

    -- Philippe.



    This archive was generated by hypermail 2.1.5 : Fri Jun 27 2003 - 08:31:16 EDT