RE: Plain-text search algorithms: normalization, decomposition, case mapping, word breaks

From: Jony Rosenne (rosennej@qsm.co.il)
Date: Fri Jun 27 2003 - 09:36:48 EDT


    For Hebrew and Arabic, add a step: find the root, removing prefixes, suffixes
    and other grammatical elements to obtain the base form of the word.

    Hardly anyone does this, so searches in these languages are less useful than
    comparable searches in other languages.

    Jony
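
    A minimal sketch of where such a root-extraction step could slot into an
    indexing pipeline, in Python. The extract_root() routine and its prefix list
    are deliberately naive stand-ins of my own (stripping a few one-letter Hebrew
    prefixes, which real morphology will sometimes contradict); a usable
    implementation needs a proper morphological analyzer for Hebrew or Arabic.

        # Toy illustration only: marks where root extraction belongs in the
        # pipeline; the stripping rules are far too crude for real use.
        HEBREW_PREFIXES = "והבלמשכ"   # and, the, in, to, from, that, like

        def extract_root(word: str) -> str:
            """Very naive stand-in for a real root/stem extractor."""
            while len(word) > 2 and word[0] in HEBREW_PREFIXES:
                word = word[1:]
            return word

        def index_form(word: str) -> str:
            # ... normalization, mark removal, case folding would go here ...
            return extract_root(word)

        print(index_form("והספר"))   # strips ו and ה, indexes ספר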

    > -----Original Message-----
    > From: unicode-bounce@unicode.org
    > [mailto:unicode-bounce@unicode.org] On Behalf Of Philippe Verdy
    > Sent: Friday, June 27, 2003 1:46 PM
    > To: unicode@unicode.org
    > Subject: SPAM: Plain-text search algorithms: normalization,
    > decomposition, case mapping, word breaks
    >
    >
    > In order to implement a plain-text search algorithm in a
    > language-neutral way that still works with all scripts, I am
    > looking for advice on how this can be done "safely" (notably
    > for automated search engines), so that a search can match text
    > across basic encoding variations.
    >
    > My first approach to the problem is to simplify the text into
    > an indexable form that unifies "similar" characters. So I would
    > like comments about possible issues in modern languages if I
    > perform the following "search canonicalization":
    >
    > - Decompose the string into NFKD (this will remove
    > font-related information and isolate combining marks)
    > - Remove all combining characters (with combining class > 0),
    > including Hebrew and Arabic cantillation. (are there
    > significant combining vowel signs that should be kept?)
    > - apply case folding using the Unicode standard (to lowercase
    > preferably)
    > - possibly perform a closure of the above three transforms
    > - remove all controls, except TAB, CR, LF, VT, FF
    > - replace all dashes with the standard ASCII hyphen-minus
    > - replace all spacing characters with an ASCII space
    > - replace all other punctuation with spaces.
    > - canonicalize the remaining spaces (no leading or trailing
    > spaces, and all other sequences replaced with a single space).
    > - (maybe) recompose Korean Hangul syllables?
    >
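
    A minimal sketch of the canonicalization steps listed above, using Python's
    unicodedata module. The mapping of General Categories (combining class for
    marks, C* for controls, Pd for dashes, Zs for spaces, remaining P* for other
    punctuation) and the final NFC pass to recompose Hangul are my reading of
    the list, not a Unicode-defined algorithm:

        import unicodedata

        KEEP_CONTROLS = "\t\r\n\x0b\x0c"        # TAB, CR, LF, VT, FF

        def canonicalize_for_search(text: str) -> str:
            # Compatibility decomposition: drops font/width variants and
            # separates base letters from their combining marks.
            text = unicodedata.normalize("NFKD", text)

            out = []
            for ch in text:
                if unicodedata.combining(ch):          # combining class > 0
                    continue                           # accents, points, cantillation
                cat = unicodedata.category(ch)
                if cat.startswith("C") and ch not in KEEP_CONTROLS:
                    continue                           # other control/format characters
                if cat == "Pd":
                    out.append("-")                    # any dash -> ASCII hyphen-minus
                elif cat == "Zs" or ch in "\t\x0b\x0c":
                    out.append(" ")                    # any spacing character -> ASCII space
                elif cat.startswith("P"):
                    out.append(" ")                    # other punctuation -> space
                else:
                    out.append(ch.casefold())          # full Unicode case folding
            text = "".join(out)

            # Canonicalize spaces: trim and collapse runs to a single space.
            text = " ".join(text.split())

            # Optional: recompose Hangul syllables (little else recomposes,
            # since the combining marks were removed above).
            return unicodedata.normalize("NFC", text)

        print(canonicalize_for_search("Ça — c'est «FLAMBÉ» !"))
        # -> 'ca - c est flambe'

    For the "closure" step, repeating the NFKD-plus-case-fold pass until the
    string stops changing approximates Unicode's caseless compatibility matching.
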
    > What are the possible caveats, notably for Japanese, Korean and
    > Chinese, which traditionally do not use spaces?
    >
    > How can we improve the algorithm for searches in Thai without
    > using a dictionary, so that word breaks could be more easily
    > detected (and marked by inserting an ASCII space)?
    >
    > Should I insert a space when there is a change of script
    > (for example in Japanese, between Hiragana, Katakana, Latin
    > and Kanji ideographs)?
    >
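
    On the script-change question, here is a rough sketch of inserting a space
    wherever adjacent characters belong to different scripts. The hard-coded
    block ranges are an assumption for illustration only; a real implementation
    would use the Unicode Script property (Scripts.txt) or a word-boundary
    algorithm such as UAX #29, and Thai still needs a dictionary or statistical
    model to find useful word breaks:

        # Illustrative only: a few hard-coded ranges standing in for the
        # full Unicode Script property.
        SCRIPT_RANGES = [
            (0x3040, 0x309F, "Hiragana"),
            (0x30A0, 0x30FF, "Katakana"),
            (0x4E00, 0x9FFF, "Han"),
            (0xAC00, 0xD7A3, "Hangul"),
            (0x0E00, 0x0E7F, "Thai"),
        ]

        def script_of(ch: str) -> str:
            cp = ord(ch)
            for lo, hi, name in SCRIPT_RANGES:
                if lo <= cp <= hi:
                    return name
            return "Other"               # Latin, digits, punctuation, ...

        def break_on_script_change(text: str) -> str:
            out = []
            prev = None
            for ch in text:
                if ch.isspace():
                    out.append(ch)
                    prev = None          # an existing space already breaks the run
                    continue
                cur = script_of(ch)
                if prev is not None and cur != prev:
                    out.append(" ")
                out.append(ch)
                prev = cur
            return "".join(out)

        print(break_on_script_change("Unicodeとは文字コードです"))
        # -> 'Unicode とは 文字 コード です'  (breaks only at script changes)

    Script changes are only a heuristic: in Japanese they often, but not always,
    coincide with word boundaries (okurigana, for example, attach Hiragana
    directly to a Kanji stem within a single word).
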
    > Is there an existing, documented conversion table used by
    > plain-text search engines?
    >
    > Is Unicode working on such a search-canonicalization algorithm?
    >
    > Thanks for the comments.
    >
    > -- Philippe.
    >
    >
    >


