Unicode and Levenshtein?

From: Theodore H. Smith (delete@elfdata.com)
Date: Thu Jan 06 2005 - 10:54:27 CST


    How would I do Levenshtein operations on a decomposed Unicode string?
    (http://www.merriampark.com/ld.htm)

    Here is the problem:
    1) Levenshtein uses a preallocated 2D matrix.
    2) Unicode characters can be decomposed, taking several code points
    per character.

    OK. I'm guessing it would be done like this. First, each string must
    be split into arrays of code points, one array per character. So
    instead of one string, we now have a 2D array of code points! Most of
    the arrays will contain only one code point, as Unicode text tends to
    go. But some will contain several.
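
    A minimal sketch of that splitting step in Python. The name
    split_clusters is hypothetical, and the grouping rule (a base code
    point plus any combining marks that follow it) is a simplifying
    assumption; it is not the full grapheme-cluster segmentation that
    Unicode defines.

```python
import unicodedata

def split_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: a "character" is a base code point plus trailing code points
    whose canonical combining class is non-zero. This is a simplification.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the cluster already in progress.
            clusters[-1].append(ch)
        else:
            # Anything else starts a new cluster.
            clusters.append([ch])
    return [tuple(cluster) for cluster in clusters]

# Decompose first (NFD), then split into code-point arrays.
decomposed = unicodedata.normalize("NFD", "caf\u00e9")
print(split_clusters(decomposed))
# [('c',), ('a',), ('f',), ('e', '\u0301')]
```

    Most clusters hold a single code point; only the accented letter
    produces a two-element array.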

    So then, instead of reading the string one byte at a time, we read it
    one code-point array at a time! If the contents of one code-point
    array don't match the contents of the other, then the "character" is
    seen to be different.
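
    A sketch of how that comparison might look, assuming the strings have
    already been split into tuples of code points. The function name
    levenshtein is illustrative; the body is the textbook dynamic
    programming matrix, with the only change being that each cell compares
    whole elements rather than single bytes.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, comparing whole elements."""
    m, n = len(a), len(b)
    # Preallocated (m + 1) x (n + 1) matrix, one row/column per element.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Elements may be code-point tuples; equality compares them whole.
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# One decomposed "character" (e + combining acute) against a plain "a".
by_cluster = levenshtein([("e", "\u0301")], [("a",)])
by_code_point = levenshtein("e\u0301", "a")
print(by_cluster, by_code_point)  # 1 2
```

    Treating the accented letter as a single element gives a distance of
    1, while counting individual code points gives 2. The matrix is sized
    by the number of code-point arrays, not by bytes or code points.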

    That's how it should be done, right?

    In this case, the same solution will work for both UTF8, and UTF32!

    Thus, I can keep my byte-wise code and still be Unicode compliant :oD
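
    A small sketch of why the encoding should not matter once the
    comparison is per code-point array. The helper names cluster_bytes and
    same_cluster are hypothetical. It assumes both strings use the same
    normalization form and that the array boundaries have already been
    found, which itself requires decoding the bytes into code points.

```python
def cluster_bytes(cluster, encoding):
    """Encode one cluster (a tuple of code points) as a single byte string."""
    return "".join(cluster).encode(encoding)

def same_cluster(x, y, encoding):
    """Byte-wise equality of two clusters under the given encoding."""
    return cluster_bytes(x, encoding) == cluster_bytes(y, encoding)

e_acute = ("e", "\u0301")  # decomposed: base letter + combining acute
e_plain = ("e",)

# "utf-32-le" is used so that no byte-order mark is prepended.
for enc in ("utf-8", "utf-32-le"):
    print(enc, same_cluster(e_acute, e_acute, enc), same_cluster(e_acute, e_plain, enc))
```

    The byte strings differ in length between the two encodings, but the
    equal/unequal answer for each pair of clusters is the same, and that
    answer is all the distance matrix needs.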

    --
        Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
        Industrial strength string processing code, made easy.
        (If you believe that's an oxymoron, see for yourself.)
    

