Re: Plain-text search algorithms: normalization, decomposition, case mapping, word breaks

From: Ben Dougall
Date: Fri Jun 27 2003 - 10:44:15 EDT


    i'm a bit confused. i thought this type of thing was already pretty
    well covered by the various unicode resources? (i guess there's a
    strong chance it isn't, if you're asking this question.)

    this is the way i see it:

    it's for you to decide which format you internally normalise to (i'm
    not even sure that's the right word) - which specific *base format*
    you decide to adhere to. (i'm talking about things like whether you
    treat text in a composed or decomposed form, for example.) it doesn't
    matter which internal base format you choose, so long as you stick to
    it and never try to compare two texts in different 'base formats'.
    then on top of that you'd also need to apply a way to make use of
    character mappings - for when you get various versions of characters
    amounting to the same meaning. there are different levels to that,
    and decisions for you to make - no right or wrong answer - about the
    extent to which you allow various characters to amount to the same
    one (this includes case mappings, for example, obviously).
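    (a minimal sketch of what i mean, in python - pick one base format
    and stick to it. the function name is mine, not anything standard:)

```python
# compare two strings in one agreed 'base format': normalise both to the
# same form (NFD here, but NFC would do equally well) and case-fold
# before comparing.
import unicodedata

def canonical_eq(a: str, b: str) -> bool:
    """True if a and b match after normalisation and case folding."""
    na = unicodedata.normalize("NFD", a).casefold()
    nb = unicodedata.normalize("NFD", b).casefold()
    return na == nb

# precomposed "é" (U+00E9) vs "e" + combining acute (U+0301):
print(canonical_eq("\u00e9", "e\u0301"))   # True
print(canonical_eq("Caf\u00e9", "cafe\u0301"))  # True
```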

    i don't see how language differences come into this. take the
    japanese no-space thing you mention: if someone types in a particular
    phrase in japanese (therefore without spaces, if that is actually the
    case), then the search will not try to use spaces, and the text being
    searched will not use spaces either, as it'll also be in japanese.

    as for all that 'remove' and 'replace' part - you don't have to
    transform the text, surely? you just have to set up rules (or
    filters) within the code that say, for example, "one or more tabs
    plus one or more spaces = 1 space". and if you apply those rules
    *throughout* - to the text being searched and to the text strings
    that are inputted and searched for - then all'll be cool (?) maybe.
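    (something like this, say - a sketch of the "rules, not rewrites"
    idea: the stored text is never edited, the same filter is just
    applied to the query and to the text at search time:)

```python
# a run of any number of tabs and/or spaces counts as one space, on both
# sides of the comparison.
import re

_WS_RUN = re.compile(r"[ \t]+")

def ws_filter(s: str) -> str:
    """Collapse every run of tabs/spaces to a single space."""
    return _WS_RUN.sub(" ", s)

haystack = "hello\t\t  world"
query = "hello world"
print(ws_filter(query) in ws_filter(haystack))  # True
```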

    > - replace all dashes with a standard ASCII minus-hyphen

    take that part, for example: i wouldn't replace or change any text
    in any way. i'd just say in the code that any dash amounts to any
    other dash (where 'any dash' = what you mean by 'all dashes').

    basically i wouldn't go about changing characters - just allowing
    them to represent an array of characters (including nothing/no
    character in some cases, maybe).
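    (for instance, 'any dash amounts to any other dash' as a rule in the
    matching code rather than a rewrite of the text. the dash set here
    is just illustrative - hyphen-minus, U+2010 hyphen, en dash, em dash
    - not the full unicode Pd category:)

```python
# turn a literal query into a pattern in which every dash character
# matches every other dash character; nothing in the searched text is
# ever modified.
import re

DASHES = "-\u2010\u2013\u2014"

def query_to_pattern(q: str) -> re.Pattern:
    parts = []
    for ch in q:
        if ch in DASHES:
            parts.append("[" + DASHES + "]")  # any dash matches any dash
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

pat = query_to_pattern("plain-text")
print(bool(pat.search("plain\u2013text search")))  # True
```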

    so it's 2 main basic things: convert to a base format throughout,
    and set up rules / filters for characters. those filters will make
    heavy use of data from unicode (is it the 'properties' data? - for
    character grouping and mappings), plus a bit more of your own, such
    as saying that a run of white space of any length amounts to one
    space - if you want things with variable amounts of space in them to
    match, that is.
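    (the two things together might look something like this - one
    function, applied identically to the indexed text and the query. the
    name and the exact choices (NFKD, case fold, drop combining marks)
    are my own sketch, not a standard:)

```python
# build a comparable 'search key': one base format (NFKD + case fold),
# then character filters applied consistently.
import re
import unicodedata

def search_key(s: str) -> str:
    s = unicodedata.normalize("NFKD", s).casefold()
    # drop combining marks (combining class > 0)
    s = "".join(ch for ch in s if unicodedata.combining(ch) == 0)
    # any run of white space amounts to one space
    s = re.sub(r"\s+", " ", s).strip()
    return s

print(search_key("  Caf\u00e9   NA\u00cfVE\t") == search_key("cafe naive"))  # True
```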

    On Friday, June 27, 2003, at 12:46 pm, Philippe Verdy wrote:

    > In order to implement a plain-text search algorithm, in a language
    > neutral way that would still work with all scripts, I am searching for
    > advice on how this can be done "safely" (notably for automated
    > search engines), to allow searching for text matching some basic
    > encoding styles.
    > My first approach to the problem is to try to simplify the text
    > into an indexable form that would unify "similar" characters.
    > So I'd like to have comments about possible issues in modern languages
    > if I perform the following "search canonicalization":
    > - Decompose the string into NFKD (this will remove font-related
    > information and isolate combining marks)
    > - Remove all combining characters (with combining class > 0),
    > including Hebrew and Arabic cantillation.
    > (are there significant combining vowel signs that should be kept?)
    > - apply case folding using the Unicode standard (to lowercase
    > preferably)
    > - possibly perform a closure of the above three transforms
    > - remove all controls, excepting TAB, CR, LF, VT, FF
    > - replace all dashes with a standard ASCII minus-hyphen
    > - replace all spacing characters with an ASCII space
    > - replace all other punctuation with spaces.
    > - canonicalize the remaining spaces (no leading and trailing spaces,
    > and all other sequences replaced with a single space).
    > - (may be) recompose Korean Hangul syllables?
    > What are the possible caveats, notably for Japanese, Korean and
    > Chinese, which traditionally do not use spaces?
    > How can we improve the algorithm for searches in Thai without
    > using a dictionary, so that word breaks could be more easily
    > detected (and marked by inserting an ASCII space)?
    > Should I insert a space when there's a change of script type (for
    > example in Japanese, between Hiragana, Katakana, Latin and Kanji
    > ideographs) ?
    > Is there an existing and documented conversion table used in
    > plain-text search engines ?
    > Is Unicode working on such search-canonicalization algorithm ?
    > Thanks for the comments.
    > -- Philippe.
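    (for what it's worth, the steps listed in the quoted message could
    be sketched roughly like this - reading 'all dashes' as unicode
    general category Pd and 'other punctuation' as the other P*
    categories; the function name and those readings are mine:)

```python
# NFKD, strip combining marks, case fold, unify dashes, punctuation and
# spacing characters to spaces, then collapse the remaining spaces.
import re
import unicodedata

def canonicalize(s: str) -> str:
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if unicodedata.combining(ch) == 0)
    s = s.casefold()
    out = []
    for ch in s:
        cat = unicodedata.category(ch)
        if cat == "Pd":
            out.append("-")   # all dashes -> ASCII hyphen-minus
        elif cat.startswith("P"):
            out.append(" ")   # other punctuation -> space
        elif cat == "Zs" or ch in "\t\r\n\v\f":
            out.append(" ")   # all spacing characters -> space
        else:
            out.append(ch)
    return re.sub(r" +", " ", "".join(out)).strip()

print(canonicalize("  \u201cPlain\u2014Text\u201d  S\u00e9arch!  "))  # plain-text search
```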

    This archive was generated by hypermail 2.1.5 : Fri Jun 27 2003 - 11:31:12 EDT