Re: Normalization Implementation Tricks

From: Markus Scherer (markus.icu@gmail.com)
Date: Thu Feb 12 2009 - 13:35:04 CST

    On Thu, Feb 12, 2009 at 10:52 AM, Mark Davis <mark.edward.davis@gmail.com> wrote:

    > For ICU, the trie also contains information on the trailing characters -
    > whether they can combine with any previous character. Only if that is the
    > case is a linear search done in the list of items that is specific to the
    > starter.

    Correct.
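
    In code, the gating step amounts to something like the following (a
    minimal sketch; the trie accessor and the flag bit are assumptions, not
    ICU's actual data layout):

        #include <stdint.h>

        #define COMBINES_BACK 0x0001  /* assumed flag bit in the trie value */

        /* assumed accessor: 16-bit normalization properties per code point */
        extern uint16_t trieGet(uint32_t c);

        /* One trie lookup decides whether the starter-specific composite
           list needs to be searched at all. */
        static int canCombineBack(uint32_t trail) {
            return (trieGet(trail) & COMBINES_BACK) != 0;
        }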

    > The list is in ascending order of trailing code points, so this favors
    > starters with few composites, and composites with low code points (such as
    > Latin/Greek/Cyrillic, in that order).

    This is what I told Mark, without looking at the code :-} Now I have
    looked, and it turns out that the list is more or less sorted by trailing
    code point, not by composite code point, so this probably does not help.

    However, ICU applies another important "trick" for the overall algorithm:
    the main NFC/NFKC loop really performs a normalization quick check. If the
    text is already normalized, it is just copied as-is. Only if the quick
    check says "no" or "maybe" does the code take a small substring, decompose
    it, and recompose it.
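
    At the public API level the same pattern can be approximated like this (a
    minimal sketch using ICU4C's unorm.h; inside ICU the quick check is
    interleaved with the main loop, and capacity checks are omitted here):

        #include "unicode/unorm.h"
        #include "unicode/ustring.h"

        /* Quick-check-first: copy already-normalized text as-is, and only
         * decompose/recompose when the quick check says "no" or "maybe". */
        static void nfcWithQuickCheck(const UChar *src, int32_t len,
                                      UChar *dest, int32_t destCapacity,
                                      UErrorCode *pErrorCode) {
            if (unorm_quickCheck(src, len, UNORM_NFC, pErrorCode) == UNORM_YES) {
                u_memcpy(dest, src, len);  /* already NFC: plain copy */
            } else {
                unorm_normalize(src, len, UNORM_NFC, 0,
                                dest, destCapacity, pErrorCode);
            }
        }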

    > The trie lookup is quite fast. Whether or not some other mechanism would
    > have better performance than the linear search would need some careful
    > testing. That is, when comparing the performance of any of the mechanisms
    > having to do with Unicode, the expected frequency of characters (and in
    > context) plays a huge role in how performance should be measured and tuned
    > for. It may be better to take an O(n) algorithm over an O(1) algorithm
    > depending on the constants involved.
    >
    > To take a simple example, for general-purpose software it is worth
    > decreasing performance for non-BMP characters by 10x if you can increase
    > performance with BMP characters by 1%, because the non-BMP characters are so
    > extremely rare.
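
    To make that arithmetic concrete: if a fraction p of all characters lies
    outside the BMP, the expected per-character cost after such a trade is
    about 0.99*(1-p) + 10*p, normalizing the original cost to 1. That stays
    below 1 whenever p < 0.01/9.01, i.e. up to roughly one non-BMP character
    in 900, which virtually all general-purpose text satisfies.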

    I just added some temporary printf's into the ICU normalization data
    generator (gennorm). The data is interesting:

       - Most ASCII letters are starters, as expected.
   - There are 11 ASCII letters with 10 or more composites each (that is,
       where the linear search may need to look at 10 or more 16-bit entries):
          - u, U: 19
          - e, E: 17
          - a, A, o, O: 16
          - I: 15
          - i: 14
          - y: 10
       - For non-ASCII starters, the maximum number of composites is 8, but most
       of them have no more than 4.

    The linear search loop is extremely tight.
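
    For illustration, the search amounts to something like the following (a
    sketch; the (trail, composite) pair layout is an assumption, not ICU's
    actual data format):

        #include <stdint.h>

        /* Per-starter linear search over assumed (trail, composite) pairs
         * of 16-bit units, sorted by ascending trail code point, so the
         * loop can stop early on a miss. */
        static int32_t searchComposite(const uint16_t *pairs, int32_t count,
                                       uint16_t trail) {
            int32_t i;
            for (i = 0; i < count; ++i) {
                uint16_t t = pairs[2 * i];
                if (t == trail) {
                    return pairs[2 * i + 1];  /* composite code point */
                }
                if (t > trail) {
                    break;  /* sorted: no match past this point */
                }
            }
            return -1;  /* no composition for this pair */
        }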

    I am inclined not to spend time on more optimal composition lookup
    algorithms until a performance measurement running the entire NFC function
    (not just the composition lookup itself) on real-world texts shows that
    there is a problem.

    Looking at the data, if I did want to optimize something, I would probably
    add a simple, straight table (rectangular matrix) for the 2*26 ASCII letters
    and the 53 back-combining characters. This would be 52*53 16-bit composites,
    or 5512 bytes. For non-ASCII starters, I would probably keep the linear
    lookup.
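
    A sketch of what that could look like (the enumeration of back-combining
    characters and the index helper are assumptions):

        #include <stdint.h>

        enum { NUM_LETTERS = 52, NUM_BACK_COMBINING = 53 };

        /* 52x53 matrix built by gennorm; 0 means "no composite". */
        static const uint16_t asciiComposites[NUM_LETTERS][NUM_BACK_COMBINING];

        /* assumed helper: column for a back-combining character, or -1 */
        extern int backCombiningIndex(uint16_t trail);

        static uint16_t lookupAsciiComposite(uint16_t starter, uint16_t trail) {
            int row, col;
            if (starter >= 'A' && starter <= 'Z') {
                row = starter - 'A';
            } else if (starter >= 'a' && starter <= 'z') {
                row = 26 + starter - 'a';
            } else {
                return 0;  /* non-ASCII starter: keep the linear lookup */
            }
            col = backCombiningIndex(trail);
            return col >= 0 ? asciiComposites[row][col] : 0;
        }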

    markus


