From: Ben Dougall (bend@freenet.co.uk)
Date: Fri Jun 27 2003 - 10:44:15 EDT
i'm a bit confused. i thought this type of thing was already pretty 
well covered by the various unicode resources? (i guess there's a 
strong chance it isn't, if you're asking this question.)
this is the way i see it:
it's for you to decide which format you normalise to internally (i'm 
not even sure that's the right word), i.e. which specific *base 
format* you adhere to. (i'm talking about things like whether you 
treat text in composed or decomposed form, for example.) it doesn't 
matter which internal base format you choose, so long as you stick to 
it and never try to compare two texts in different 'base formats'. then 
on top of that you'd also need a way to make use of character 
mappings, for when you get various versions of a character that amount 
to the same meaning. (there are different levels to that and decisions 
for you to make, no right or wrong: the extent to which you allow 
various characters to amount to the same one. this obviously includes 
case mappings, for example.)
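
to make the 'base format' idea concrete, here's a minimal python sketch. 
it assumes decomposed form (NFD) as the base format and full case 
folding as the only character mapping; both are just my picks for 
illustration, and `to_base_form` is a made-up helper name, not anything 
from the unicode resources.

```python
import unicodedata

def to_base_form(text: str, form: str = "NFD") -> str:
    """Bring text into one agreed 'base format' before any comparison.

    `form` is an assumption: any normalization form works, as long as
    the same one is used for every string that gets compared.
    """
    # Normalise, case-fold, then normalise again, because case folding
    # can produce sequences that are no longer in the chosen form.
    folded = unicodedata.normalize(form, text).casefold()
    return unicodedata.normalize(form, folded)

composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)                              # False
print(to_base_form(composed) == to_base_form(decomposed))  # True
print(to_base_form("STRASSE") == to_base_form("stra\u00dfe"))  # True
```

the first comparison fails only because the two strings are in 
different base formats, which is exactly the thing to avoid.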
i don't see how language differences come into this. the japanese no 
space thing you mention: if someone types in a particular phrase, in 
japanese (therefore without spaces, if that is actually the case), then 
the search will not try and use spaces. and the text that they're 
searching will not be using spaces as it'll also be in japanese.
all that 'remove' and 'replace' part - you don't have to transform the 
text, surely you just have to set up rules (or filters) within the code 
that say, for example, "one or more tabs + one or more spaces = 1 
space". and if you apply those rules *throughout*, to the text being 
searched and to the strings that are typed in and searched for, then 
all'll be cool (?) maybe.
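
a rough python sketch of that kind of rule, where the stored text is 
never rewritten and the rule lives in the matcher instead. 
`space_tolerant_pattern` is a hypothetical helper of mine, and treating 
"white space" as whatever python's `\s` matches is an assumption.

```python
import re

def space_tolerant_pattern(query: str) -> re.Pattern:
    """Build a matcher in which any run of white space equals any other."""
    # Split the query on runs of white space, escape each piece so it is
    # matched literally, then rejoin with "one or more white space chars".
    parts = [re.escape(piece) for piece in query.split()]
    return re.compile(r"\s+".join(parts))

text = "plain\t\ttext   search"   # left exactly as it was stored
match = space_tolerant_pattern("plain text  search").search(text)

print(match is not None)          # True
print(match.group(0) == text)     # True: the original text is untouched
```

note that an empty query gives an empty pattern that matches anywhere, 
so a real search would want to reject that case first.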
> - replace all dashes with a standard ASCII minus-hyphen
like that part. i wouldn't replace or change any text in any way. i'd 
just say in the code that any dash amounts to any other dash (and 'any 
dash' = what you mean by 'all dashes')
basically i wouldn't go about changing characters. just allowing them 
to represent an array of characters (including nothing/no characters in 
some cases maybe)
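
something like this, as a python sketch. the assumptions are all mine: 
'any dash' is taken to mean every character with the unicode general 
category Pd (which leaves out U+2212 MINUS SIGN, as that one is 
category Sm), and the soft hyphen U+00AD is the character allowed to 
stand for 'nothing'. `dash_tolerant_pattern` is a made-up name.

```python
import re
import sys
import unicodedata

# Every code point whose general category is Pd (dash punctuation).
DASHES = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pd"
)
DASH_CLASS = "[" + re.escape(DASHES) + "]"

def dash_tolerant_pattern(query: str) -> re.Pattern:
    """Let each dash in the query stand for the whole set of dashes."""
    pieces = []
    for ch in query:
        if unicodedata.category(ch) == "Pd":
            pieces.append(DASH_CLASS)     # any dash matches any dash
        else:
            pieces.append(re.escape(ch))  # everything else is literal
    # Allow zero or more soft hyphens between characters, i.e. a
    # character in the text that amounts to no character at all.
    return re.compile("\u00ad*".join(pieces))

pattern = dash_tolerant_pattern("co-operate")
print(pattern.search("co\u2010operate") is not None)          # HYPHEN
print(pattern.search("co\u2013op\u00aderate") is not None)    # EN DASH + soft hyphen
print(pattern.search("co operate") is not None)               # a space is not a dash
```

neither the query nor the text is changed; only the comparison treats 
the dashes as one group.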
so it's 2 main basic things: convert to a base format throughout, and 
set up rules / filters for characters. the rules will make heavy use of 
data from unicode (is it the 'properties' data? - for character 
grouping and mappings), plus a bit more of your own, such as saying 
that a run of white space of any length amounts to one space, if you 
want things with variable amounts of space in them to match, that is.
On Friday, June 27, 2003, at 12:46  pm, Philippe Verdy wrote:
> In order to implement a plain-text search algorithm, in a language 
> neutral way that would still work with all scripts, I am searching for 
> advice on how this can be done "safely" (notably for automated 
> search engines), to allow searching for text matching some basic 
> encoding styles.
>
> My first approach to the problem is to try to simplify the text into an 
> indexable form that would unify "similar" characters.
> So I'd like to have comments about possible issues in modern languages 
> if I perform the following "search canonicalization":
>
> - Decompose the string into NFKD (this will remove font-related 
> information and isolate combining marks)
> - Remove all combining characters (with combining class > 0), 
> including Hebrew and Arabic cantillation.
>  (are there significant combining vowel signs that should be kept?)
> - apply case folding using the Unicode standard (to lowercase 
> preferably)
> - possibly perform a closure of the above three transforms
> - remove all controls, excepting TAB, CR, LF, VT, FF
> - replace all dashes with a standard ASCII minus-hyphen
> - replace all spacing characters with an ASCII space
> - replace all other punctuation with spaces.
> - canonicalize the remaining spaces (no leading and trailing spaces, 
> and all other sequences replaced with a single space).
> - (may be) recompose Korean Hangul syllables?
>
> What are the possible caveats, notably for Japanese, Korean and 
> Chinese which traditionally do not use spaces ?
>
> How can we improve the algorithm for searches in Thai without using a 
> dictionary, so that word breaks could be more easily detected (and 
> marked by inserting an ASCII space) ?
>
> Should I insert a space when there's a change of script type (for 
> example in Japanese, between Hiragana, Katakana, Latin and Kanji 
> ideographs) ?
>
> Is there an existing and documented conversion table used in 
> plain-text search engines ?
>
> Is Unicode working on such search-canonicalization algorithm ?
>
> Thanks for the comments.
>
> -- Philippe.
>
>