Re: Plain-text search algorithms: normalization, decomposition, case mapping, word breaks

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jun 27 2003 - 11:14:14 EDT

Next message: John Cowan: "Re: [cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]"

Previous message: Michael Everson: "Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)"
In reply to: Ben Dougall: "Re: Plain-text search algorithms: normalization, decomposition, case mapping, word breaks"

On Friday, June 27, 2003 4:44 PM, Ben Dougall <bend@freenet.co.uk> wrote:

> i'm a bit confused. i thought that this type of thing was already
> pretty well covered by the various unicode resources? (i guess there's
> a strong chance not, if you're asking this question).

I'm not discussing how Unicode describes the algorithms, and I am
not attempting to change them; I am asking how to use them on actual
languages.

> i don't see how language differences come into this. the japanese no
> space thing you mention: if someone types in a particular phrase, in
> japanese (therefore without spaces, if that is actually the case),
> then the search will not try and use spaces. and the text that they're
> searching will not be using spaces as it'll also be in japanese.

My problem is that I cannot reliably predict the actual language of the
indexed document (so I cannot build and use a dictionary-based stemmer
that would work for all languages). I just want to see the impact of such
canonicalization of text (based only on its encoded script) on actual
languages.
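
To make "based only on its encoded script" concrete, here is a rough
sketch in Python. It is only an illustration under stated assumptions:
the standard library does not expose the Unicode Script property, so the
hypothetical helper rough_script() approximates it with the first word
of the character name, which is not the same thing and will be wrong
for some characters.

```python
import unicodedata
from itertools import groupby


def rough_script(ch: str) -> str:
    """Crude stand-in for the Unicode Script property (an assumption).

    Letters are labelled with the first word of their character name;
    everything else (spaces, digits, punctuation) is lumped as "Common".
    """
    if not ch.isalpha():
        return "Common"
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def script_runs(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of characters sharing a rough script."""
    return [
        (script, "".join(chars))
        for script, chars in groupby(text, key=rough_script)
    ]
```

For example, script_runs("Tokyo 東京") gives the runs ("LATIN", "Tokyo"),
("Common", " ") and ("CJK", "東京"), without any knowledge of the
language of the text.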

> all that 'remove' and 'replace' part - you don't have to transform the
> text, surely you just have to set up rules (or filters) within the
> code that says for example "a or any number of tabs + a or any number
> of spaces = 1 space". and if you apply those rules *throughout*, to
> the text being searched, and the text strings that are inputted and
> searched for, then all'll be cool (?) maybe.
>
> > - replace all dashes with a standard ASCII minus-hyphen
>
> like that part. i wouldn't replace or change any text in any way. i'd
> just say in the code that any dash amounts to any other dash (and 'any
> dash' = what you mean by 'all dashes')
>
> basically i wouldn't go about changing characters. just allowing them
> to represent an array of characters (including nothing/no characters
> in some cases maybe)
>
>
> so it's 2 main basic things: convert to base format throughout, and
> set up rules / filters for characters (which will make heavy use of
> data, (is it the 'properties' data? - for character grouping and
> mappings) from unicode, plus a bit more of your own such as saying a
> variable long line of any white space amounts to one space, if you'd
> want things with variable amounts of space in to match that is.

The additional steps are required because any search system must use
the same analysis algorithm for both the document indexer that
generates the index and the parser that creates the search string
to be matched later against the index.
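
As a minimal sketch of that constraint (Python, with assumptions of my
own): analyze(), build_index() and search() are hypothetical names, and
treating general category Pd as "all dashes" is just one possible
reading. The only point is that the indexer and the query parser both
go through the same analyze() function.

```python
import re
import unicodedata

# Assumption: "all dashes" means general category Pd (Dash_Punctuation).
_DASHES = dict.fromkeys(
    (cp for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Pd"),
    "-",
)


def analyze(text: str) -> str:
    """Shared canonicalization used for documents and for queries."""
    text = unicodedata.normalize("NFKD", text).casefold()
    text = text.translate(_DASHES)            # any dash -> ASCII hyphen-minus
    return re.sub(r"\s+", " ", text).strip()  # any white space run -> 1 space


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Indexer side: map each analyzed term to the documents containing it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for term in analyze(text).split(" "):
            if term:
                index.setdefault(term, set()).add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Client side: analyze the query the same way, then intersect postings."""
    terms = [t for t in analyze(query).split(" ") if t]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "Well\u2013Known   FACT", 2: "unknown fact"}
print(search(build_index(docs), "well-known\tfact"))  # prints {1}
```

Splitting on spaces is of course useless for text written without
spaces (the Japanese case mentioned above); the sketch only shows the
shared-analysis requirement, not a solution to word breaking.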

If I want the search to be performed independently of the index
actually used, I need a convention about how the index is
computed. I don't want to rescan the indexed documents each time
a search string comes in, and I want the indexer to be physically
separate from the search client (there will be distinct implementations
of the client for the same preindexed database of documents, and
additional indexes will come later).

Of course the unmodified search string can be sent to the indexer,
which will apply the same rules it used for its database, but
this does not solve the problem of selecting which index to use
when there are many precompiled from other sources,
because I want to be able to distribute the index locally to the
clients rather than keep it on a central indexer.
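
One way to picture the selection problem, purely as a sketch: suppose
each distributed index carried a label naming the convention used to
compute it. The PackagedIndex type, the analyzer_id field and the
label strings below are all hypothetical; no such specification exists
that I know of, which is precisely my question.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PackagedIndex:
    """A precompiled index as it might be shipped to a client (hypothetical)."""
    source: str
    analyzer_id: str  # hypothetical label for the analysis convention
    postings: dict[str, set[int]] = field(default_factory=dict)


def usable_indexes(
    indexes: list[PackagedIndex], supported: set[str]
) -> list[PackagedIndex]:
    """Keep only the indexes whose convention this client implements."""
    return [ix for ix in indexes if ix.analyzer_id in supported]


available = [
    PackagedIndex("site-a", "convention-1"),
    PackagedIndex("site-b", "convention-2"),
]
# A client implementing only "convention-1" can safely query site-a only.
print([ix.source for ix in usable_indexes(available, {"convention-1"})])
```

A client that does not implement the convention of a given index cannot
produce search strings that match it, so it has to skip that index.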

And I don't know how to distribute the index without forcing
clients to use the same indexing algorithm, either through a
specification or through a downloaded applet that will not work on all
client platforms... And I don't want to write every possible client
applet, just one for one platform...

So what I am asking here is whether there exists some specification
that works across all languages supported by Unicode, without
knowledge of the indexed language.


This archive was generated by hypermail 2.1.5 : Fri Jun 27 2003 - 11:50:51 EDT