From: Theodore H. Smith (firstname.lastname@example.org)
Date: Thu Jan 06 2005 - 10:54:27 CST
How would I do Levenshtein operations on a decomposed Unicode string?
Here is the problem:
1) Levenshtein uses a preallocated 2D matrix.
2) Unicode chars can be decomposed, taking several code points per
char.
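For reference, the algorithm being adapted looks roughly like this: a
minimal C++ sketch (language and names are mine, not from the post) of
the standard matrix-based Levenshtein distance over raw bytes.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Classic Levenshtein over raw bytes, with a full
    // (n+1) x (m+1) matrix. This is the byte-wise baseline.
    size_t levenshtein_bytes(const std::string& a, const std::string& b) {
        const size_t n = a.size(), m = b.size();
        std::vector<std::vector<size_t>> d(n + 1, std::vector<size_t>(m + 1));
        for (size_t i = 0; i <= n; ++i) d[i][0] = i;   // deletions only
        for (size_t j = 0; j <= m; ++j) d[0][j] = j;   // insertions only
        for (size_t i = 1; i <= n; ++i) {
            for (size_t j = 1; j <= m; ++j) {
                const size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                    d[i][j - 1] + 1,          // insertion
                                    d[i - 1][j - 1] + cost}); // substitution
            }
        }
        return d[n][m];
    }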
OK, I'm guessing it would be done like this. First, both strings must
be split into arrays of code points. So instead of one string, we now
have a 2D array of code points! Most of the arrays will contain only
one code point, as Unicode tends to go, but some will contain multiple
code points.
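A rough sketch of that splitting step, assuming the string is already
decoded to code points (UTF-32). The is_combining check here is a
crude placeholder of my own; a real implementation would consult the
Unicode Character Database (canonical combining class, or categories
Mn/Mc/Me), and full user-perceived-character segmentation is defined
by UAX #29.

    #include <vector>

    // Placeholder: treats only U+0300..U+036F (combining diacritical
    // marks) as combining. Real code would use Unicode property data.
    bool is_combining(char32_t cp) {
        return cp >= 0x0300 && cp <= 0x036F;
    }

    // Split a decoded string into one code-point array per perceived
    // character: a base code point plus any combining marks after it.
    std::vector<std::vector<char32_t>>
    split_into_chars(const std::vector<char32_t>& s) {
        std::vector<std::vector<char32_t>> chars;
        for (char32_t cp : s) {
            if (chars.empty() || !is_combining(cp))
                chars.emplace_back();    // start a new "character"
            chars.back().push_back(cp);  // append to the current one
        }
        return chars;
    }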
So then, instead of reading the string one byte at a time, we read it
one code-point array at a time! If the contents of one code-point
array don't match the contents of the other, then the "character" is
seen to be different.
That's how it should be done, right?
In this case, the same solution will work for both UTF-8 and UTF-32!
Thus, I can keep my byte-wise code and still be Unicode compliant :oD
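To illustrate the UTF-8 case: a minimal decoder (no handling of
malformed input) that turns a UTF-8 byte string into the code-point
sequence the splitting step above consumes. With a front end like
this, the same comparison logic serves UTF-8 and UTF-32 alike.

    #include <string>
    #include <vector>

    // Minimal UTF-8 decoder: lead byte determines sequence length,
    // continuation bytes each contribute 6 payload bits.
    std::vector<char32_t> decode_utf8(const std::string& s) {
        std::vector<char32_t> out;
        for (size_t i = 0; i < s.size();) {
            const unsigned char b = static_cast<unsigned char>(s[i]);
            size_t len = 1;
            char32_t cp = b;
            if      (b >= 0xF0) { len = 4; cp = b & 0x07; }
            else if (b >= 0xE0) { len = 3; cp = b & 0x0F; }
            else if (b >= 0xC0) { len = 2; cp = b & 0x1F; }
            for (size_t k = 1; k < len && i + k < s.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }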
--
Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
Industrial strength string processing code, made easy.
(If you believe that's an oxymoron, see for yourself.)