Re: Unicode Collation performance

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Apr 28 2006 - 18:01:47 CST


    > so that averages to 73,000 per second.

    For comparison, I just wrote a quick-and-dirty test case with ICU4J, which
    uses UCA 4.1.0. It was run on a Pentium M Processor 760 (2.00GHz, 533MHz
    FSB). According to the test run, it does a string comparison between
    each pair of adjacent strings (123,086 comparisons) at 1,309,436
    comparisons per second, and generates 984,704 sort keys per second.
    That's in Java; the C version is much faster.
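
    For reference, a minimal sketch of this kind of timing loop, using the
    JDK's java.text.Collator rather than ICU4J (so the absolute rates will
    not match the figures above; the harness names are my own):

```java
import java.text.Collator;
import java.util.Locale;

// Sketch of a comparisons-per-second measurement: compare each pair of
// adjacent strings and divide the comparison count by the elapsed time.
// Uses the JDK Collator, so rates will differ from the ICU4J numbers.
public class CompareBench {
    static double comparisonsPerSecond(String[] lines) {
        Collator c = Collator.getInstance(Locale.US);
        long start = System.nanoTime();
        for (int i = 1; i < lines.length; i++) {
            c.compare(lines[i - 1], lines[i]);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (lines.length - 1) / seconds;
    }

    public static void main(String[] args) {
        String[] sample = { "Daniels", "Davidson", "Davies", "Davis", "Smith" };
        System.out.printf("%.0f comparisons/second%n", comparisonsPerSecond(sample));
    }
}
```

    A realistic run would load the conformance-test file and use a much
    larger string count so the timer resolution stops mattering.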

    However, the Unicode conformance test is not the data that the code was
    optimized for. Our design center was:

    A. Produce the minimum length (in bytes) for sort keys. We chose this metric
    because, presumably, if you are going to use sort keys at all, you are going to
    use each one many times, and typically store them on disk as well. Minimizing
    the storage reduces the impact on memory, caches, and disk access, so we are
    willing to take more time to generate shorter keys.
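
    To illustrate the trade-off in A, here is a hedged sketch using the JDK's
    java.text.Collator (ICU4J's com.ibm.icu.text.Collator has the same shape;
    the key lengths it prints are the JDK's, not the ICU values quoted later):

```java
import java.text.Collator;
import java.util.Locale;

// Sketch: generate a sort key once, then reuse the stored bytes for every
// later comparison. The key's byte array is what you would put on disk, so
// shorter keys mean less memory, cache, and disk traffic.
public class SortKeySketch {
    static byte[] sortKey(String s) {
        Collator c = Collator.getInstance(Locale.US);
        c.setStrength(Collator.TERTIARY);
        return c.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("\"a\"    -> " + sortKey("a").length + " bytes");
        System.out.println("\"fish\" -> " + sortKey("fish").length + " bytes");
    }
}
```

    Once the keys are stored, ordering is a plain byte-wise memcmp, which is
    why spending extra time to shrink them pays off.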

    B. For string comparison, optimize for sorting (common algorithms like
    quicksort) and binary search. Both of these processes end up comparing strings
    that are relatively "close" most of the time, since you progressively "zero in"
    on the area where the string belongs. That is, you normally compare "Davis" to
    "Daniels", "Davies", or "Davidson" much more often than to "Smith" or
    "Blackbridge". Because of this, we have special fast paths to optimize for
    close strings.
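
    The access pattern in B can be sketched as follows (again with the JDK
    Collator standing in for ICU4J; the names are illustrative):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Sketch: quicksort-style sorting and binary search with a Collator mostly
// compare "close" strings, since both zero in on the neighborhood where a
// string belongs -- the case the fast paths are built for.
public class CloseCompare {
    static int find(String[] names, String target) {
        Collator c = Collator.getInstance(Locale.US);
        Arrays.sort(names, c);                     // adjacent, similar strings dominate
        return Arrays.binarySearch(names, target, c);
    }

    public static void main(String[] args) {
        String[] names = { "Smith", "Davis", "Daniels", "Davies", "Davidson", "Blackbridge" };
        System.out.println("\"Davis\" found at index " + find(names, "Davis"));
    }
}
```

    By the last probes of the binary search, every comparison is against a
    "Da..." neighbor, so a fast path that skips a shared prefix wins.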

    C. Optimize for 'real data', not random data. That is, for each of a number of
    different languages, use a data set consisting of common, realistic text for
    that language, such as personal names.

    There are some charts comparing ICU performance at:

    http://icu.sourceforge.net/charts/collation_icu4j_sun.html

    http://icu.sourceforge.net/charts/collation_icu4c_glibc.html

    Here is a quick dump of data for Java:

    File: CollationTest_NON_IGNORABLE.txt
    Punctuation: Normal
    Level: 3

    Total strings: 123,087
    Total code points: 254,602

    COMPARISON
    Total seconds: 0.094
    Comparisons per second: 1,309,436.170212766

    SORTKEY GENERATION
    Total seconds: 0.125
    Sortkey bytes per code point: 4.8919882797
    Sortkeys generated per second: 984,704

    Sample Sort Keys
                     [6] 06 01 05 01 05 00
            a [6] 28 01 05 01 05 00
            A [6] 28 01 05 01 8F 00
            å [7] 28 01 86 99 01 06 00
            fish [9] 32 38 4C 36 01 08 01 08 00

    File: CollationTest_SHIFTED.txt
    Punctuation: Shifted
    Level: 4

    Total strings: 123,087
    Total code points: 254,602

    COMPARISON
    Total seconds: 0.11
    Comparisons per second: 1,118,972.7272727273

    SORTKEY GENERATION
    Total seconds: 0.125
    Sortkey bytes per code point: 5.5544182685
    Sortkeys generated per second: 984,704

    Sample Sort Keys
                     [5] 01 01 01 06 00
            a [8] 28 01 05 01 05 01 23 00
            A [8] 28 01 05 01 8F 01 23 00
            å [9] 28 01 86 99 01 06 01 24 00
            fish [11] 32 38 4C 36 01 08 01 08 01 26 00
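
    The sample keys above came from ICU; a comparable dump can be sketched
    with the JDK's CollationKey, though the byte values and lengths will
    differ from ICU's, since the JDK uses its own weight scheme:

```java
import java.text.Collator;
import java.util.Locale;

// Sketch: print each string's collation key as "[length] hex bytes", in the
// spirit of the sample dump above. JDK key bytes differ from ICU's.
public class KeyDump {
    static String dump(String s) {
        byte[] k = Collator.getInstance(Locale.US).getCollationKey(s).toByteArray();
        StringBuilder sb = new StringBuilder("[" + k.length + "]");
        for (byte b : k) sb.append(String.format(" %02X", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "a", "A", "å", "fish" }) {
            System.out.println(s + " " + dump(s));
        }
    }
}
```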

    Mark

    Mike wrote:
    > Hello again,
    >
    > Now that I have working code, I'm curious if my
    > performance is any good. I ran both conformance
    > tests and calculated the time spent creating the
    > sort keys and comparing them. For the SHIFTED
    > test, the time required was 1.83 seconds, and the
    > time for the NON_IGNORABLE test was 1.54 seconds.
    > There are roughly 123,000 strings in each file,
    > so that averages to 73,000 per second.
    >
    > The tests were run on a 3.2 GHz Pentium D (Windows
    > XP). The UCA version is 4.1.0.
    >
    > So, am I in serious need of optimization?
    >
    > Thanks for any insight.
    >
    > Mike
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Fri Apr 28 2006 - 18:05:01 CST