Re: Unicode and Levenshtein?

From: Theodore H. Smith (delete@elfdata.com)
Date: Thu Jan 06 2005 - 21:13:49 CST

    Thanks.

    I'm not very familiar with the details of each normalisation form,
    beyond knowing that some of the forms are composed and others are
    decomposed.
    I'm also not very familiar with UCA.

    I'll have to do my reading on these two things before I can fully
    understand your answer ;o)
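
    In the meantime, here's my first stab at the simpler route: normalise
    both strings to NFD and run the usual dynamic-programming Levenshtein
    over the code points. This is only a sketch of how I read Mark's
    suggestion below, in Python (whose unicodedata module stands in for
    ICU's NFD transform), and I haven't tested it:

        import unicodedata

        def levenshtein_nfd(a, b):
            # Decompose first, so "a" + COMBINING DIAERESIS and the
            # precomposed U+00E4 compare the same way.
            a = unicodedata.normalize("NFD", a)
            b = unicodedata.normalize("NFD", b)
            # Standard two-row dynamic programming over code points.
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cost = 0 if ca == cb else 1
                    cur.append(min(prev[j] + 1,          # deletion
                                   cur[j - 1] + 1,       # insertion
                                   prev[j - 1] + cost))  # substitution
                prev = cur
            return prev[-1]

    Note that this counts each combining mark as its own edit, since
    every mark is a separate code point in NFD.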

    On 7 Jan 2005, at 00:10, Mark Davis wrote:

    > The simplest would be to use the NFD form of the string, and perform
    > the same calculations as you would otherwise, character by character.
    >
    > The alternative would be to work with the Unicode Collation Algorithm
    > data. That will probably give you better results. For example, it
    > would count "abc" vs "Abc" as a much smaller difference than versus
    > "zbc", and a smaller difference than versus "äbc". However, it would
    > take some work to adapt the algorithm, since the UCA data provides
    > levels of differences between characters that you would have to
    > weight appropriately in the algorithm. For example, you'd have to
    > determine how to weight "ab" vs "bä", probably counting that as the
    > weight of a transposition plus the weight of an accentless match.
    >
    > If you are interested, you can use ICU to get both UCA data and NFD
    > transforms, in C, C++, and Java. For all the major search engines
    > except Google, "ICU Unicode" gets you to ICU as the first choice.
    > Very odd that Google doesn't -- maybe they don't like us ;-)
    >
    > Mark
    >
    > ----- Original Message -----
    > From: "Theodore H. Smith" <delete@elfdata.com>
    > To: "ecartis" <unicode@unicode.org>
    > Sent: Thursday, January 06, 2005 08:54
    > Subject: Unicode and Levenshtein?
    >
    >
    >>
    >> How would I do Levenshtein operations on a decomposed Unicode
    >> string? (http://www.merriampark.com/ld.htm)
    >>
    >> Here is the problem:
    >> 1) Levenshtein uses a preallocated 2D matrix.
    >> 2) Unicode characters can be decomposed, taking several code points
    >> per character.
    >>
    >> OK. I'm guessing it would be done like this. First, both strings
    >> must be split into arrays of code points. So instead of one string,
    >> we now have an array of code-point arrays! Most of the arrays will
    >> contain only one code point, as Unicode text tends to go, but some
    >> will contain multiple code points.
    >>
    >> So, instead of reading the string one byte at a time, we read it
    >> one code-point array at a time! If the contents of one code-point
    >> array don't match the contents of the other, then the "character"
    >> is seen to be different.
    >>
    >> That's how it should be done, right?
    >>
    >> In this case, the same solution will work for both UTF-8 and UTF-32!
    >>
    >> Thus, I can keep my byte-wise code and still be Unicode-compliant :oD
    >>
    >>
    >> --
    >> Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
    >> Industrial strength string processing code, made easy.
    >> (If you believe that's an oxymoron, see for yourself.)
    >>
    >>
    >>
    >
    >
    >
    >
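
    To make the question in my quoted message concrete, here's roughly
    what I had in mind: split the decomposed string into one array of
    code points per "character" (a base code point plus any combining
    marks that follow it), then run Levenshtein over those arrays. A
    Python sketch, again untested; real code should probably follow the
    full grapheme cluster rules in UAX #29 rather than this shortcut:

        import unicodedata

        def clusters(s):
            # One entry per base code point, with trailing combining
            # marks attached. A crude stand-in for grapheme clusters.
            out = []
            for cp in unicodedata.normalize("NFD", s):
                if unicodedata.combining(cp) and out:
                    out[-1] += cp
                else:
                    out.append(cp)
            return out

        def levenshtein_clusters(a, b):
            xs, ys = clusters(a), clusters(b)
            prev = list(range(len(ys) + 1))
            for i, x in enumerate(xs, 1):
                cur = [i]
                for j, y in enumerate(ys, 1):
                    cost = 0 if x == y else 1
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                                   prev[j - 1] + cost))
                prev = cur
            return prev[-1]

    Here "ä" vs "a" costs one edit however many code points the marks
    take, which is what I meant by reading one code-point array at a
    time.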

    --
        Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
        Industrial strength string processing code, made easy.
        (If you believe that's an oxymoron, see for yourself.)
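
    P.S. Once I've read up on the UCA, I imagine the weighted variant
    Mark describes would take a shape like the sketch below. The weights
    are placeholders I invented, not real UCA level weights; a real
    implementation would pull them from collation elements via ICU:

        import unicodedata

        def strip_marks(s):
            # Drop combining marks to compare "accentless" letters.
            return "".join(c for c in unicodedata.normalize("NFD", s)
                           if not unicodedata.combining(c))

        def sub_cost(x, y):
            if x == y:
                return 0.0
            if x.casefold() == y.casefold():
                return 0.25   # case-only difference: "a" vs "A"
            if strip_marks(x).casefold() == strip_marks(y).casefold():
                return 0.5    # accent-only difference: "a" vs "ä"
            return 1.0        # different letters: "a" vs "z"

        def weighted_levenshtein(a, b):
            xs = unicodedata.normalize("NFC", a)
            ys = unicodedata.normalize("NFC", b)
            prev = [float(j) for j in range(len(ys) + 1)]
            for i, x in enumerate(xs, 1):
                cur = [float(i)]
                for j, y in enumerate(ys, 1):
                    cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0,
                                   prev[j - 1] + sub_cost(x, y)))
                prev = cur
            return prev[-1]

    With these made-up weights, "abc" vs "Abc" scores 0.25, vs "äbc"
    scores 0.5, and vs "zbc" scores 1.0, matching the ordering Mark
    gives.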
    

