Re: Unicode and Levenshtein?

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Jan 06 2005 - 18:10:23 CST


    The simplest would be to use the NFD form of the strings, and perform the
    same calculations as you would otherwise, character by character.
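    That NFD-then-compare approach can be sketched in Python (the function
    names are illustrative; `unicodedata` in the standard library provides
    NFD normalization):

```python
import unicodedata

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def nfd_distance(s, t):
    # Normalize both strings to NFD, then compare code point by code point.
    return levenshtein(unicodedata.normalize("NFD", s),
                       unicodedata.normalize("NFD", t))
```

    Note that under this scheme "cafe" vs "café" costs 1, since NFD turns
    "é" into "e" plus a combining acute accent, and the combining mark is
    one inserted code point.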

    The alternative would be to work with the Unicode Collation Algorithm data.
    That will probably give you better results. For example, it would count
    "abc" vs "Abc" as a much smaller difference than versus "zbc", and a smaller
    difference than versus "äbc". However, it would take some work to adapt the
    algorithm, since the UCA data provides levels of differences between
    characters that you would have to weight appropriately in the algorithm. For
    example, you'd have to determine how to weight "ab" vs "bä", probably
    counting that as the weight of a transposition plus the weight of an
    accentless match.
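    One way to approximate that kind of level-sensitive weighting, without
    pulling in full UCA tables, is a substitution cost that charges less for
    case-only or accent-only mismatches. The weights below (0.25 and 0.5)
    are illustrative assumptions, not actual UCA level values:

```python
import unicodedata

def base_letter(c):
    # Strip accents: the first code point of the NFD decomposition.
    return unicodedata.normalize("NFD", c)[0]

def char_cost(a, b):
    # Illustrative weights, not real UCA data:
    #   0.0  identical            ("a" vs "a")
    #   0.25 case-only difference ("a" vs "A")
    #   0.5  accent-only difference ("a" vs "ä")
    #   1.0  different base letters ("a" vs "z")
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.25
    if base_letter(a.lower()) == base_letter(b.lower()):
        return 0.5
    return 1.0

def weighted_levenshtein(s, t):
    # Standard DP recurrence, with the graded substitution cost above.
    prev = [float(j) for j in range(len(t) + 1)]
    for i, ca in enumerate(s, 1):
        cur = [float(i)]
        for j, cb in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,
                           cur[j - 1] + 1.0,
                           prev[j - 1] + char_cost(ca, cb)))
        prev = cur
    return prev[-1]
```

    This reproduces the ordering in the example above: "abc" vs "Abc" costs
    0.25, vs "äbc" costs 0.5, and vs "zbc" costs 1.0.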

    (If you are interested, you can use ICU to get both UCA data and NFD
    transforms, in C, C++, and Java. For all the major search engines except
    Google, "ICU Unicode" gets you to ICU as the first choice. Very odd that
    Google doesn't -- maybe they don't like us ;-)

    Mark

    ----- Original Message -----
    From: "Theodore H. Smith" <delete@elfdata.com>
    To: "ecartis" <unicode@unicode.org>
    Sent: Thursday, January 06, 2005 08:54
    Subject: Unicode and Levenshtein?

    >
    > How would I do levenshtein operations on a decomposed Unicode string?
    > (http://www.merriampark.com/ld.htm)
    >
    > Here is the problem:
    > 1) Levenshtein uses a preallocated 2d matrix.
    > 2) Unicode chars can be decomposed, taking a large number of
    > code-points per char.
    >
    > OK. I'm guessing it would be done like this. First, both strings must
    > be split into arrays of codepoints. So instead of 1 string, we now have
    > a 2d array of codepoints! Most of the arrays will contain only 1
    > code-point, as Unicode tends to go. But some will contain multiple
    > codepoints.
    >
    > So, then instead of reading the string 1 byte at a time, we read it one
    > codepoint-array at a time! If the contents of one code-point-array
    > don't match the contents of the other codepoint-array, then the
    > "character" is seen to be different.
    >
    > That's how it should be done, right?
    >
    > In this case, the same solution will work for both UTF8, and UTF32!
    >
    > Thus, I can keep my byte-wise code and still be Unicode compliant :oD
    >
    >
    > --
    > Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
    > Industrial strength string processing code, made easy.
    > (If you believe that's an oxymoron, see for yourself.)
    >
    >
    >
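    The splitting step Theodore describes — normalize to NFD, then group each
    base code point with its trailing combining marks — can be sketched in
    Python like this (a simplification: full grapheme cluster segmentation
    per UAX #29 also handles Hangul jamo, ZWJ sequences, and more):

```python
import unicodedata

def clusters(s):
    """Split a string into [base + combining marks] groups after NFD."""
    out = []
    for c in unicodedata.normalize("NFD", s):
        if unicodedata.combining(c) and out:
            out[-1] += c   # attach combining mark to the previous base
        else:
            out.append(c)  # a base character starts a new cluster
    return out
```

    The resulting list of cluster strings can then be fed to any
    sequence-based Levenshtein implementation, so that "ä" vs "a" counts as
    a single substitution rather than an insertion of a combining mark.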



    This archive was generated by hypermail 2.1.5 : Thu Jan 06 2005 - 18:15:30 CST