Re: Plain-text search algorithms: normalization, decomposition, case mapping, word breaks

From: Philippe Verdy (
Date: Fri Jun 27 2003 - 11:14:14 EDT


    On Friday, June 27, 2003 4:44 PM, Ben Dougall <> wrote:

    > i'm a bit confused. i thought that this type of thing was already
    > pretty well covered by the various unicode resources? (i guess there's
    > a strong chance not, if you're asking this question).

    I'm not discussing how Unicode describes the algorithms, and I am
    not attempting to change them, but how to use them on actual languages.

    > i don't see how language differences come into this. the japanese no
    > space thing you mention: if someone types in a particular phrase, in
    > japanese (therefore without spaces, if that is actually the case),
    > then the search will not try and use spaces. and the text that they're
    > searching will not be using spaces as it'll also be in japanese.

    My problem is that I cannot correctly predict the actual language of the
    indexed document (so I cannot build and use a dictionary-based stemmer
    that would work for all languages). I just want to see the impact of such
    canonicalization of text (based only on its encoded script) on actual

    > all that 'remove' and 'replace' part - you don't have to transform the
    > text, surely you just have to set up rules (or filters) within the
    > code that says for example "a or any number of tabs + a or any number
    > of spaces = 1 space". and if you apply those rules *throughout*, to
    > the text being searched, and the text strings that are inputted and
    > searched for, then all'll be cool (?) maybe.
    > > - replace all dashes with a standard ASCII minus-hyphen
    > like that part. i wouldn't replace or change any text in any way. i'd
    > just say in the code that any dash amounts to any other dash (and 'any
    > dash' = what you mean by 'all dashes')
    > basically i wouldn't go about changing characters. just allowing them
    > to represent an array of characters (including nothing/no characters
    > in some cases maybe)
    > so it's 2 main basic things: convert to base format throughout, and
    > set up rules / filters for characters (which will make heavy use of
    > data, (is it the 'properties' data? - for character grouping and
    > mappings) from unicode, plus a bit more of your own such as saying a
    > variable long line of any white space amounts to one space, if you'd
    > want things with variable amounts of space in to match that is.

    The additional steps are required because any search system must
    use the same analysis algorithm for both the document indexer
    that generates the index and the parser that creates the search
    string to be matched later against the index.
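
    A minimal sketch of that requirement, using only Python's standard
    library. The function name analyze and the exact set of folding rules
    (NFKC, case folding, any dash to ASCII hyphen-minus, whitespace runs
    to one space) are assumptions for illustration, not taken from any
    Unicode specification; the point is only that the indexer and the
    query parser call the same function.

```python
import re
import unicodedata


def _fold_char(ch):
    """Fold one character (hypothetical rule set, for illustration only)."""
    cat = unicodedata.category(ch)
    if cat == "Pd":                      # any dash punctuation -> ASCII hyphen-minus
        return "-"
    if cat == "Zs" or ch in "\t\n\r\f\v":  # space separators and ASCII whitespace controls
        return " "
    return ch


def analyze(text):
    """Script-agnostic canonicalization shared by the indexer and the query parser."""
    text = unicodedata.normalize("NFKC", text)   # compatibility normalization
    text = text.casefold()                       # language-independent case folding
    text = unicodedata.normalize("NFKC", text)   # case folding can denormalize
    text = "".join(_fold_char(ch) for ch in text)
    return re.sub(r" +", " ", text).strip()      # collapse runs of spaces


# The indexer and the search client must both go through analyze().
indexed_key = analyze("Foo\t \u2013  BAR")   # text as found in a document
query_key = analyze("foo - bar")             # text as typed by a user
```

    With these rules both strings reduce to the same key, so the query
    matches without the original document text being rescanned.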

    If I want the search to be performed independently of the
    index actually used, I need a convention for how the index is
    computed. I don't want to rescan the indexed documents each time
    a search string comes in, and I want the indexer to be physically
    separate from the search client (there will be distinct implementations
    of the client for the same preindexed database of documents, and
    additional indexes will come later).

    Of course, the unmodified search string can be sent to the indexer,
    which will use the same rules as those used for its database, but
    this does not solve the problem of selecting which index to use
    when there are many indexes precompiled from other sources,
    because I want to be able to distribute the index locally to the
    clients, rather than rely on a central indexer.
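
    One way to picture the index-selection problem: each distributed
    index carries an identifier naming the convention used to compute its
    keys, and a client only queries indexes whose identifier matches the
    convention it implements. This is a sketch under assumptions: the
    Index class, the analyzer_id field and the identifier strings are all
    hypothetical and do not correspond to any existing specification.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    """A precompiled index plus the name of the convention that built it."""
    analyzer_id: str                               # hypothetical convention identifier
    postings: dict = field(default_factory=dict)   # analyzed key -> set of document ids


def usable_indexes(indexes, client_analyzer_id):
    """Keep only the indexes built with the convention the client implements."""
    return [ix for ix in indexes if ix.analyzer_id == client_analyzer_id]


def search(indexes, key, client_analyzer_id):
    """Look up an already-analyzed key in every compatible index."""
    hits = set()
    for ix in usable_indexes(indexes, client_analyzer_id):
        hits |= ix.postings.get(key, set())
    return hits


# Two indexes from different sources, built with different conventions.
local = Index("nfkc-casefold/1", {"foo - bar": {"doc1", "doc7"}})
other = Index("stemmed-en/3", {"foo - bar": {"doc99"}})

result = search([local, other], "foo - bar", "nfkc-casefold/1")
```

    Here only the first index is consulted, because the client cannot
    know how keys in the second one were derived.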

    And I don't know how to distribute the index without forcing
    clients to use the same indexer algorithm, either through a
    specification or through a downloaded applet that will not work on
    all client platforms... And I don't want to write all possible client
    applets, just one for one platform...

    So what I am asking here is whether there exists some specification
    that works across all languages supported by Unicode, but without
    knowledge of the indexed language.
