From: Theodore H. Smith (firstname.lastname@example.org)
Date: Thu Jan 06 2005 - 10:54:27 CST
How would I do Levenshtein operations on a decomposed Unicode string?
Here is the problem:
1) Levenshtein uses a preallocated 2D matrix.
2) Unicode chars can be decomposed, taking several code points per
char.
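For reference, the algorithm being adapted looks roughly like this: a
minimal C++ sketch (language and names are mine, not from the post) of
the standard matrix-based Levenshtein distance over raw bytes.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Classic Levenshtein over raw bytes, with a full
    // (n+1) x (m+1) matrix. This is the byte-wise baseline.
    size_t levenshtein_bytes(const std::string& a, const std::string& b) {
        const size_t n = a.size(), m = b.size();
        std::vector<std::vector<size_t>> d(n + 1, std::vector<size_t>(m + 1));
        for (size_t i = 0; i <= n; ++i) d[i][0] = i;   // deletions only
        for (size_t j = 0; j <= m; ++j) d[0][j] = j;   // insertions only
        for (size_t i = 1; i <= n; ++i) {
            for (size_t j = 1; j <= m; ++j) {
                const size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                    d[i][j - 1] + 1,          // insertion
                                    d[i - 1][j - 1] + cost}); // substitution
            }
        }
        return d[n][m];
    }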
OK, I'm guessing it would be done like this. First, both strings must
be split into arrays of code points. So instead of one string, we now
have a 2D array of code points! Most of the arrays will contain only
one code point, as Unicode tends to go, but some will contain multiple
code points.
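A rough sketch of that splitting step, assuming the string is already
decoded to code points (UTF-32). The is_combining check here is a
crude placeholder of my own; a real implementation would consult the
Unicode Character Database (canonical combining class, or categories
Mn/Mc/Me), and full user-perceived-character segmentation is defined
by UAX #29.

    #include <vector>

    // Placeholder: treats only U+0300..U+036F (combining diacritical
    // marks) as combining. Real code would use Unicode property data.
    bool is_combining(char32_t cp) {
        return cp >= 0x0300 && cp <= 0x036F;
    }

    // Split a decoded string into one code-point array per perceived
    // character: a base code point plus any combining marks after it.
    std::vector<std::vector<char32_t>>
    split_into_chars(const std::vector<char32_t>& s) {
        std::vector<std::vector<char32_t>> chars;
        for (char32_t cp : s) {
            if (chars.empty() || !is_combining(cp))
                chars.emplace_back();    // start a new "character"
            chars.back().push_back(cp);  // append to the current one
        }
        return chars;
    }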
So then, instead of reading the string one byte at a time, we read it
one code-point array at a time! If the contents of one code-point
array don't match the contents of the other, then the "character" is
seen to be different.
That's how it should be done, right?
In this case, the same solution will work for both UTF-8 and UTF-32!
Thus, I can keep my byte-wise code and still be Unicode compliant :oD
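To illustrate the UTF-8 case: a minimal decoder (no handling of
malformed input) that turns a UTF-8 byte string into the code-point
sequence the splitting step above consumes. With a front end like
this, the same comparison logic serves UTF-8 and UTF-32 alike.

    #include <string>
    #include <vector>

    // Minimal UTF-8 decoder: lead byte determines sequence length,
    // continuation bytes each contribute 6 payload bits.
    std::vector<char32_t> decode_utf8(const std::string& s) {
        std::vector<char32_t> out;
        for (size_t i = 0; i < s.size();) {
            const unsigned char b = static_cast<unsigned char>(s[i]);
            size_t len = 1;
            char32_t cp = b;
            if      (b >= 0xF0) { len = 4; cp = b & 0x07; }
            else if (b >= 0xE0) { len = 3; cp = b & 0x0F; }
            else if (b >= 0xC0) { len = 2; cp = b & 0x1F; }
            for (size_t k = 1; k < len && i + k < s.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }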
--
Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
Industrial strength string processing code, made easy.
(If you believe that's an oxymoron, see for yourself.)