Re: Unicode Collation performance

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Apr 28 2006 - 18:01:47 CST


    > so that averages to 73,000 per second.

    For comparison, I just wrote a quick-and-dirty test case with ICU4J, which
    uses UCA 4.1.0. It was run on a Pentium M Processor 760 (2.00GHz, 533MHz
    FSB). According to the test run, it does a string comparison between
    each pair of adjacent strings (123,086 comparisons) at 1,309,436
    comparisons per second, and generates 984,704 sort keys per second.
    That's in Java; the C version is much faster.
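
    For reference, a minimal sketch of this kind of timing loop, using the
    JDK's java.text.Collator rather than ICU4J (so the absolute rates will
    not match the figures above; the harness names are my own):

```java
import java.text.Collator;
import java.util.Locale;

// Sketch of a comparisons-per-second measurement: compare each pair of
// adjacent strings and divide the comparison count by the elapsed time.
// Uses the JDK Collator, so rates will differ from the ICU4J numbers.
public class CompareBench {
    static double comparisonsPerSecond(String[] lines) {
        Collator c = Collator.getInstance(Locale.US);
        long start = System.nanoTime();
        for (int i = 1; i < lines.length; i++) {
            c.compare(lines[i - 1], lines[i]);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (lines.length - 1) / seconds;
    }

    public static void main(String[] args) {
        String[] sample = { "Daniels", "Davidson", "Davies", "Davis", "Smith" };
        System.out.printf("%.0f comparisons/second%n", comparisonsPerSecond(sample));
    }
}
```

    A realistic run would load the conformance-test file and use a much
    larger string count so the timer resolution stops mattering.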

    However, the Unicode conformance test is not the data that the code was
    optimized for. Our design center was:

    A. Produce the minimum length (in bytes) for sort keys. We chose this metric
    because, presumably, if you are going to use sort keys at all, you are going to
    use each one many times, and typically store them on disk as well. Minimizing
    the storage reduces the impact on memory, caches, and disk access, so we are
    willing to take more time to generate shorter keys.
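
    To illustrate the trade-off in A, here is a hedged sketch using the JDK's
    java.text.Collator (ICU4J's com.ibm.icu.text.Collator has the same shape;
    the key lengths it prints are the JDK's, not the ICU values quoted later):

```java
import java.text.Collator;
import java.util.Locale;

// Sketch: generate a sort key once, then reuse the stored bytes for every
// later comparison. The key's byte array is what you would put on disk, so
// shorter keys mean less memory, cache, and disk traffic.
public class SortKeySketch {
    static byte[] sortKey(String s) {
        Collator c = Collator.getInstance(Locale.US);
        c.setStrength(Collator.TERTIARY);
        return c.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("\"a\"    -> " + sortKey("a").length + " bytes");
        System.out.println("\"fish\" -> " + sortKey("fish").length + " bytes");
    }
}
```

    Once the keys are stored, ordering is a plain byte-wise memcmp, which is
    why spending extra time to shrink them pays off.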

    B. For string comparison, optimize for sorting (common algorithms like
    quicksort) and binary search. Both of these processes end up comparing strings
    that are relatively "close" most of the time, since you progressively "zero in"
    on the area where the string belongs. That is, you normally compare "Davis" to
    "Daniels", "Davies", or "Davidson" much more often than to "Smith" or
    "Blackbridge". Because of this, we have special fast paths to optimize for
    close strings.
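
    The access pattern in B can be sketched as follows (again with the JDK
    Collator standing in for ICU4J; the names are illustrative):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Sketch: quicksort-style sorting and binary search with a Collator mostly
// compare "close" strings, since both zero in on the neighborhood where a
// string belongs -- the case the fast paths are built for.
public class CloseCompare {
    static int find(String[] names, String target) {
        Collator c = Collator.getInstance(Locale.US);
        Arrays.sort(names, c);                     // adjacent, similar strings dominate
        return Arrays.binarySearch(names, target, c);
    }

    public static void main(String[] args) {
        String[] names = { "Smith", "Davis", "Daniels", "Davies", "Davidson", "Blackbridge" };
        System.out.println("\"Davis\" found at index " + find(names, "Davis"));
    }
}
```

    By the last probes of the binary search, every comparison is against a
    "Da..." neighbor, so a fast path that skips a shared prefix wins.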

    C. Optimize for 'real data', not random data. That is, for each of a number of
    different languages, use a data set consisting of common, realistic text for
    that language, such as personal names.

    There are some charts comparing ICU performance at:

    http://icu.sourceforge.net/charts/collation_icu4j_sun.html

    http://icu.sourceforge.net/charts/collation_icu4c_glibc.html

    Here is a quick dump of data for Java:

    File: CollationTest_NON_IGNORABLE.txt
    Punctuation: Normal
    Level: 3

    Total strings: 123,087
    Total code points: 254,602

    COMPARISON
    Total seconds: 0.094
    Comparisons per second: 1,309,436.170212766

    SORTKEY GENERATION
    Total seconds: 0.125
    Sortkey bytes per code point: 4.8919882797
    Sortkeys generated per second: 984,704

    Sample Sort Keys
                     [6] 06 01 05 01 05 00
            a [6] 28 01 05 01 05 00
            A [6] 28 01 05 01 8F 00
            å [7] 28 01 86 99 01 06 00
            fish [9] 32 38 4C 36 01 08 01 08 00

    File: CollationTest_SHIFTED.txt
    Punctuation: Shifted
    Level: 4

    Total strings: 123,087
    Total code points: 254,602

    COMPARISON
    Total seconds: 0.11
    Comparisons per second: 1,118,972.7272727273

    SORTKEY GENERATION
    Total seconds: 0.125
    Sortkey bytes per code point: 5.5544182685
    Sortkeys generated per second: 984,704

    Sample Sort Keys
                     [5] 01 01 01 06 00
            a [8] 28 01 05 01 05 01 23 00
            A [8] 28 01 05 01 8F 01 23 00
            å [9] 28 01 86 99 01 06 01 24 00
            fish [11] 32 38 4C 36 01 08 01 08 01 26 00
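
    The sample keys above came from ICU; a comparable dump can be sketched
    with the JDK's CollationKey, though the byte values and lengths will
    differ from ICU's, since the JDK uses its own weight scheme:

```java
import java.text.Collator;
import java.util.Locale;

// Sketch: print each string's collation key as "[length] hex bytes", in the
// spirit of the sample dump above. JDK key bytes differ from ICU's.
public class KeyDump {
    static String dump(String s) {
        byte[] k = Collator.getInstance(Locale.US).getCollationKey(s).toByteArray();
        StringBuilder sb = new StringBuilder("[" + k.length + "]");
        for (byte b : k) sb.append(String.format(" %02X", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "a", "A", "å", "fish" }) {
            System.out.println(s + " " + dump(s));
        }
    }
}
```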

    Mark

    Mike wrote:
    > Hello again,
    >
    > Now that I have working code, I'm curious if my
    > performance is any good. I ran both conformance
    > tests and calculated the time spent creating the
    > sort keys and comparing them. For the SHIFTED
    > test, the time required was 1.83 seconds, and the
    > time for the NON_IGNORABLE test was 1.54 seconds.
    > There are roughly 123,000 strings in each file,
    > so that averages to 73,000 per second.
    >
    > The tests were run on a 3.2 GHz Pentium D (Windows
    > XP). The UCA version is 4.1.0.
    >
    > So, am I in serious need of optimization?
    >
    > Thanks for any insight.
    >
    > Mike
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Fri Apr 28 2006 - 18:05:01 CST