Re: Unicode and Levenshtein?

From: Philippe Verdy
Date: Thu Jan 06 2005 - 12:16:08 CST

  • Next message: Christopher Fynn: "ISO 10646 & GB18030 repetoire [was: Re: ISO 10646 compliance and EU law]"

Maybe you should consider first decomposing the Unicode string into a
normalized form, or better still, into UCA collation keys. Then you will
need to adapt the LD algorithm so that it computes a distance in which
collation levels are weighted.
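    To make the first step concrete — this is not from the original post, just a
    minimal Python sketch using the standard unicodedata module: normalize both
    strings canonically, then run a plain Levenshtein over the resulting code
    points, so canonically equivalent inputs come out at distance 0.

```python
import unicodedata

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
        prev = cur
    return prev[n]

def normalized_distance(s, t, form="NFD"):
    """Edit distance after canonical decomposition, so canonically
    equivalent strings compare as identical."""
    return levenshtein(unicodedata.normalize(form, s),
                       unicodedata.normalize(form, t))
```

    With this, a precomposed "café" and a decomposed "cafe" + combining acute
    are at distance 0, as the paragraph above requires.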

The difficulty will be determining how you match characters at each
collation level. You could first compute the distance at the primary level,
counting the insert/delete/replace operations that occur at the primary
collation level. But for the secondary and following levels, weighting the
operations will require fractional costs, so that primary differences
yield a higher LD distance than secondary differences.
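    One way to express those fractional weights — again a sketch, not from the
    post, and using a toy two-level "key" (base letter / combining marks)
    rather than a real UCA implementation, with assumed weights of 1.0 for a
    primary difference and 0.25 for a secondary one:

```python
import unicodedata

def keys(ch):
    """Toy two-level 'collation key': primary = base letters,
    secondary = combining marks. Illustrative only, not real UCA."""
    d = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in d if not unicodedata.combining(c))
    marks = "".join(c for c in d if unicodedata.combining(c))
    return base, marks

def sub_cost(a, b):
    pa, sa = keys(a)
    pb, sb = keys(b)
    if pa != pb:
        return 1.0   # primary difference: full edit cost
    if sa != sb:
        return 0.25  # secondary (accent) difference: fractional cost
    return 0.0

def weighted_levenshtein(s, t):
    """Levenshtein with level-weighted substitution costs."""
    m, n = len(s), len(t)
    prev = [float(j) for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [float(i)] + [0.0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1.0,
                         cur[j - 1] + 1.0,
                         prev[j - 1] + sub_cost(s[i - 1], t[j - 1]))
        prev = cur
    return prev[n]
```

    Here "cafe" vs. "café" costs only 0.25 (an accent), while "cafe" vs.
    "cafo" costs a full 1.0 (a different base letter).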

Or you may simply consider only the primary level, taking into account only
the primary collation weight of characters and ignoring characters that are
given a zero collation weight (those that are ignorable for collation, such
as some Unicode/ISO/IEC 10646 formatting controls, like BiDi overrides).
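    A rough approximation of that filtering step, as a sketch (real primary
    weights come from the DUCET table; here combining marks and format
    controls, general category Cf, stand in for the zero-weight characters):

```python
import unicodedata

def primary_only(s):
    """Approximate 'primary weights only': decompose, then drop
    combining marks and format controls (category Cf, which includes
    the BiDi overrides mentioned above)."""
    d = unicodedata.normalize("NFD", s)
    return "".join(c for c in d
                   if not unicodedata.combining(c)
                   and unicodedata.category(c) != "Cf")
```

    Any ordinary edit-distance routine can then be run over the filtered
    strings, so BiDi controls and accents contribute nothing to the distance.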

With Unicode text, you must be very careful with encoded sequences that are
considered canonically equivalent, whose LD distance should be 0, and with
strings that are considered compatibility equivalent (these may be given a
non-zero LD distance, but one lower than the distance separating two
characters that are neither compatibility equivalent nor canonically
equivalent).
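    The two kinds of equivalence can be seen directly with Python's
    unicodedata module (a sketch to illustrate the distinction, not part of
    the original message):

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as one precomposed code point
decomposed = "cafe\u0301"  # 'e' followed by combining acute accent

# Canonically equivalent: identical under any canonical normal form,
# so their LD distance should be 0.
assert unicodedata.normalize("NFD", composed) == \
       unicodedata.normalize("NFD", decomposed)

ligature = "\ufb01"  # 'fi' ligature, U+FB01

# Compatibility equivalent to "fi", but NOT canonically equivalent:
# only the compatibility forms (NFKD/NFKC) fold it away.
assert unicodedata.normalize("NFD", ligature) != "fi"
assert unicodedata.normalize("NFKD", ligature) == "fi"
```

    So a canonical normalization pass makes the first pair identical, while
    the ligature case is where a small non-zero fractional distance could
    apply.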

    ----- Original Message -----
    From: "Theodore H. Smith" <>
    To: "ecartis" <>
    Sent: Thursday, January 06, 2005 5:54 PM
    Subject: Unicode and Levenshtein?

    > How would I do levenshtein operations on a decomposed Unicode string?
    > Here is the problem:
    > 1) Levenshtein uses a preallocated 2d matrix.
    > 2) Unicode chars can be decomposed, taking a large number of code-points
    > per char.
    > OK. I'm guessing it would be done like this. First, both strings must be
    > split into arrays of codepoints. So instead of 1 string, we now have a 2d
    > array of codepoints! Most of the arrays will contain only 1 code-point, as
    > Unicode tends to go. But some will contain multiple codepoints.
    > So, then instead of reading the string 1 byte at a time, we read it one
    > codepoint-array at a time! If the contents of one code-point-array don't
    > match the contents of the other codepoint-array, then the "character" is
    > seen to be different.
    > That's how it should be done, right?
    > In this case, the same solution will work for both UTF8, and UTF32!
    > Thus, I can keep my byte-wise code and still be Unicode compliant :oD
    > --
    > Theodore H. Smith - Software Developer -
    > Industrial strength string processing code, made easy.
    > (If you believe that's an oxymoron, see for yourself.)
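    The approach Theodore sketches above — grouping a base code point with its
    trailing combining marks and comparing the groups — could look like this
    in Python (a hypothetical sketch; the cluster rule here is the simple
    base-plus-marks one from his message, not full grapheme segmentation):

```python
import unicodedata

def clusters(s):
    """Split a string into the 'arrays of code points' described above:
    each base code point grouped with its trailing combining marks."""
    out = []
    for c in unicodedata.normalize("NFD", s):
        if unicodedata.combining(c) and out:
            out[-1] += c   # attach combining mark to the previous base
        else:
            out.append(c)
    return out

def cluster_distance(s, t):
    """Levenshtein over clusters instead of bytes or code points."""
    a, b = clusters(s), clusters(t)
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]
```

    As the message predicts, two canonically equivalent spellings of "café"
    compare equal cluster-by-cluster, while dropping the accent costs one
    edit.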

    This archive was generated by hypermail 2.1.5 : Thu Jan 06 2005 - 12:57:13 CST