From: Philippe Verdy (email@example.com)
Date: Thu Jan 06 2005 - 12:16:08 CST
May be you should consider decompose first the Unicode string to a
normalized form, or even better, then into UCA collation keys. Then you will
need to adapt the LD algorithm so that it will compute a distance where
collation levels are weighted.
The difficulty will be to determine how you match characters at each
collation level. You could first compute the distance at the primery level,
by the number of insert/delete/replace operations occuring at the primary
collation level. But for the secondary following levels, weighting the
operations will require fractional numbers, so that primary differences will
get a higher LD distance than secondary differences.
Or you may simply consider only the primary level, taking into account only
the priamry collation weight of characters, ignoring characters that are
given a 0 collation weight (that are ignorable for collation, such as some
Unicode/ISO/IEC-10646 formatting controls, like BiDi overrides).
With Unicode texts, you must be very careful with encoded sequences that are
considered canonically equivalent, and whose LD distance should be 0. Or to
strings that are considered compatibility equivalent (and that may be given
a non-0 LD distance lower than the distance that separates two characters
that are not compatibility equivalent and thus not canonically equivalent
----- Original Message -----
From: "Theodore H. Smith" <firstname.lastname@example.org>
To: "ecartis" <email@example.com>
Sent: Thursday, January 06, 2005 5:54 PM
Subject: Unicode and Levenshtein?
> How would I do levenshtein operations on a decomposed Unicode string?
> Here is the problem:
> 1) Levenshtein uses a preallocated 2d matrix.
> 2) Unicode chars can be decomposed, taking a large number of code-points
> per char.
> OK. I'm guessing it would be done like this. First, both strings must be
> split into arrays of codepoints. So instead of 1 string, we now have a 2d
> array of codepoints! Most of the arrays will contain only 1 code-point, as
> Unicode tends to go. But some will contain multiple codepoints.
> So, then instead of reading the string 1 byte at a time, we read it one
> codepoint-array at a time! If the contents of one code-point-array don't
> match the contents of the other codepoint-array, then the "character" is
> seen to be different.
> Thats how it should be done, right?
> In this case, the same solution will work for both UTF8, and UTF32!
> Thus, I can keep my byte-wise code and still be Unicode compliant :oD
> Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
> Industrial strength string processing code, made easy.
> (If you believe that's an oxymoron, see for yourself.)
This archive was generated by hypermail 2.1.5 : Thu Jan 06 2005 - 12:57:13 CST