Re: Unicode and Levenshtein?

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Jan 06 2005 - 18:10:23 CST


    The simplest would be to use the NFD form of the strings, and perform the
    same calculations as you would otherwise, character by character.
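    That NFD-then-compare approach can be sketched in Python (the function
    names are illustrative; `unicodedata` in the standard library provides
    NFD normalization):

```python
import unicodedata

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def nfd_distance(s, t):
    # Normalize both strings to NFD, then compare code point by code point.
    return levenshtein(unicodedata.normalize("NFD", s),
                       unicodedata.normalize("NFD", t))
```

    Note that under this scheme "cafe" vs "café" costs 1, since NFD turns
    "é" into "e" plus a combining acute accent, and the combining mark is
    one inserted code point.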

    The alternative would be to work with the Unicode Collation Algorithm data.
    That will probably give you better results. For example, it would count
    "abc" vs "Abc" as a much smaller difference than versus "zbc", and a smaller
    difference than versus "äbc". However, it would take some work to adapt the
    algorithm, since the UCA data provides levels of differences between
    characters that you would have to weight appropriately in the algorithm. For
    example, you'd have to determine how to weight "ab" vs "bä", probably
    counting that as the weight of a transposition plus the weight of an
    accentless match.
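    One way to approximate that kind of level-sensitive weighting, without
    pulling in full UCA tables, is a substitution cost that charges less for
    case-only or accent-only mismatches. The weights below (0.25 and 0.5)
    are illustrative assumptions, not actual UCA level values:

```python
import unicodedata

def base_letter(c):
    # Strip accents: the first code point of the NFD decomposition.
    return unicodedata.normalize("NFD", c)[0]

def char_cost(a, b):
    # Illustrative weights, not real UCA data:
    #   0.0  identical            ("a" vs "a")
    #   0.25 case-only difference ("a" vs "A")
    #   0.5  accent-only difference ("a" vs "ä")
    #   1.0  different base letters ("a" vs "z")
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.25
    if base_letter(a.lower()) == base_letter(b.lower()):
        return 0.5
    return 1.0

def weighted_levenshtein(s, t):
    # Standard DP recurrence, with the graded substitution cost above.
    prev = [float(j) for j in range(len(t) + 1)]
    for i, ca in enumerate(s, 1):
        cur = [float(i)]
        for j, cb in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,
                           cur[j - 1] + 1.0,
                           prev[j - 1] + char_cost(ca, cb)))
        prev = cur
    return prev[-1]
```

    This reproduces the ordering in the example above: "abc" vs "Abc" costs
    0.25, vs "äbc" costs 0.5, and vs "zbc" costs 1.0.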

    (If you are interested, you can use ICU to get both UCA data and NFD
    transforms, in C, C++, and Java. For all the major search engines except
    Google, "ICU Unicode" gets you to ICU as the first choice. Very odd that
    Google doesn't -- maybe they don't like us ;-)

    Mark

    ----- Original Message -----
    From: "Theodore H. Smith" <delete@elfdata.com>
    To: "ecartis" <unicode@unicode.org>
    Sent: Thursday, January 06, 2005 08:54
    Subject: Unicode and Levenshtein?

    >
    > How would I do levenshtein operations on a decomposed Unicode string?
    > (http://www.merriampark.com/ld.htm)
    >
    > Here is the problem:
    > 1) Levenshtein uses a preallocated 2d matrix.
    > 2) Unicode chars can be decomposed, taking a large number of
    > code-points per char.
    >
    > OK. I'm guessing it would be done like this. First, both strings must
    > be split into arrays of codepoints. So instead of 1 string, we now have
    > a 2d array of codepoints! Most of the arrays will contain only 1
    > code-point, as Unicode tends to go. But some will contain multiple
    > codepoints.
    >
    > So, then instead of reading the string 1 byte at a time, we read it one
    > codepoint-array at a time! If the contents of one code-point-array
    > don't match the contents of the other codepoint-array, then the
    > "character" is seen to be different.
    >
    > That's how it should be done, right?
    >
    > In this case, the same solution will work for both UTF8, and UTF32!
    >
    > Thus, I can keep my byte-wise code and still be Unicode compliant :oD
    >
    >
    > --
    > Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
    > Industrial strength string processing code, made easy.
    > (If you believe that's an oxymoron, see for yourself.)
    >
    >
    >
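    The splitting step Theodore describes — normalize to NFD, then group each
    base code point with its trailing combining marks — can be sketched in
    Python like this (a simplification: full grapheme cluster segmentation
    per UAX #29 also handles Hangul jamo, ZWJ sequences, and more):

```python
import unicodedata

def clusters(s):
    """Split a string into [base + combining marks] groups after NFD."""
    out = []
    for c in unicodedata.normalize("NFD", s):
        if unicodedata.combining(c) and out:
            out[-1] += c   # attach combining mark to the previous base
        else:
            out.append(c)  # a base character starts a new cluster
    return out
```

    The resulting list of cluster strings can then be fed to any
    sequence-based Levenshtein implementation, so that "ä" vs "a" counts as
    a single substitution rather than an insertion of a combining mark.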



    This archive was generated by hypermail 2.1.5 : Thu Jan 06 2005 - 18:15:30 CST