Re: Plain-text search algorithms: normalization, decomposition, case mapping, word breaks

From: Philippe Verdy (
Date: Fri Jun 27 2003 - 09:43:26 EDT

    On Friday, June 27, 2003 3:36 PM, Jony Rosenne <> wrote:

    > For Hebrew and Arabic, add a step: Find the root, remove prefixes,
    > suffixes and other grammatical artifacts and obtain the base form of
    > the word.

    Removing common suffixes is a separate issue: it requires unifying lexically similar words, and can simply consist of adding multiple search tokens obtained by removing 1, 2 or 3 letters from the end of the word.
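
    A minimal sketch of that suffix-truncation idea follows. The function name and the min_stem guard (which stops very short stems from being produced) are assumptions added for illustration; the message only specifies removing 1, 2 or 3 trailing letters.

```python
def suffix_variants(word, max_strip=3, min_stem=3):
    """Return the word followed by copies with 1..max_strip trailing letters removed.

    min_stem is a hypothetical guard (not in the original message): stems
    shorter than this many letters are not generated.
    """
    variants = [word]
    for n in range(1, max_strip + 1):
        if len(word) - n < min_stem:
            break  # the remaining stem would be too short to be selective
        variants.append(word[:-n])
    return variants


if __name__ == "__main__":
    print(suffix_variants("searching"))
```

    Each word then contributes at most four search tokens, which keeps the token count per word bounded without any language knowledge.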

    However, I would like to have information about the common prefixes used in Hebrew or Arabic, and whether they can be detected in a language-neutral way (using only the script information), because this combines with the simple suffix-removal technique above, and I don't want to create too many tokens from a single word.

    I suppose this was clear in the rest of my message: I would like to avoid dictionary-based approaches in contexts where the language is unknown and only the script information (i.e. the encoded plain text) is available (including for Thai), so that the method can be documented and implemented as a minimal tokenizing algorithm that is simple to implement across platforms and programming languages.

    > Nearly nobody does it, and searches in these languages are less
    > useful than parallel searches in other languages.

    I looked at some related projects, such as Jakarta Lucene, and this does not seem to be documented there; users are left to write their own "Analyzer" class to tokenize text.

    I don't want a system that will create the best tokens, only a system that can produce reasonably good and sufficient search tokens, letting users add a few tokens for known variants in their plain-text searches if needed, or letting users insert simplified keywords in their plain-text documents so that they become easily indexable and searchable.

    Some examples: in German, compound words typically have no separator. Without an actual German dictionary, and without recognizing that the text is actually in German, it is quite hard to get all the best tokens from a single compound word. However, I expect that the components of these words will also appear split somewhere in the document (for example, prefixes agglutinated to an infinitive verb).

    Can I reasonably expect the same thing with text in agglutinating languages like Finnish or Hungarian?

    Now comes the real difficulty: can I reasonably split a Thai or Chinese sentence into tokens that may not exactly match an actual word, but that may still contain enough information to allow filtering relevant texts containing those sequences?

    My first idea for Chinese was to split long sequences of Han ideographic characters into overlapping sequences of 2 or 3 Han characters, starting at each position in a run of characters with the same general category and the same script property. This would compensate for the absence of spaces, while also giving good selectivity for searches.
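
    The Han splitting described above can be sketched as follows. Python's standard library does not expose the Unicode Script property, so this sketch approximates "script = Han" with a few code point ranges (HAN_RANGES); that approximation, the function names, and the choice to keep a lone Han character as its own token are all assumptions, not part of the original proposal.

```python
from itertools import groupby

# Assumption: approximate the Han script with these blocks
# (CJK Extension A, CJK Unified Ideographs, CJK Extension B).
HAN_RANGES = ((0x3400, 0x4DBF), (0x4E00, 0x9FFF), (0x20000, 0x2A6DF))


def is_han(ch):
    """True if the character falls in one of the assumed Han ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HAN_RANGES)


def han_ngrams(text, sizes=(2, 3)):
    """Emit overlapping n-grams (n in sizes) from every run of Han characters."""
    tokens = []
    for han, group in groupby(text, key=is_han):
        if not han:
            continue  # non-Han runs are left to another tokenizer
        run = "".join(group)
        if len(run) < min(sizes):
            tokens.append(run)  # run too short for any n-gram: keep it whole
            continue
        for n in sizes:
            for i in range(len(run) - n + 1):
                tokens.append(run[i:i + n])
    return tokens


if __name__ == "__main__":
    print(han_ngrams("abc 一二三四 def"))
```

    A run of k Han characters yields (k - 1) bigrams and (k - 2) trigrams, so the token count grows linearly with the run length.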

    Note that the documents I need to index are not extremely long: most of them will be below 1KB and will consist of descriptive paragraphs for a longer document, or of an extract of the first 4KB of the plain-text document, which generally contains an introduction and reasonably descriptive titles. I will therefore limit the number of indexable tokens, keeping the longest ones or the least frequent ones (with the help of a global statistics database in which indexed tokens are hashed and counted globally across documents that are part of the same collection).
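
    One possible way to cap the number of tokens per document is sketched below. It assumes the global statistics are available as a plain in-memory mapping from token to count (the message describes a database of hashed tokens, which is not modelled here), and it combines the two criteria by ranking rare tokens first and breaking ties by length; that combination and the function name are assumptions.

```python
from collections import Counter


def select_tokens(tokens, global_counts, limit):
    """Keep at most `limit` distinct tokens, preferring rare, then long ones.

    global_counts maps a token to its count across the collection;
    tokens never seen before are treated as having count 0.
    """
    unique = set(tokens)
    ranked = sorted(
        unique,
        key=lambda t: (global_counts.get(t, 0), -len(t), t),
    )
    return ranked[:limit]


if __name__ == "__main__":
    counts = Counter({"the": 100, "text": 40, "unicode": 3, "normalization": 1})
    doc = ["the", "unicode", "normalization", "text", "the"]
    print(select_tokens(doc, counts, 2))
```

    The final sort key on the token itself only makes the result deterministic when two tokens have the same count and length.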
