From: Asmus Freytag
Date: Mon Feb 23 2009 - 01:25:29 CST

    On 2/22/2009 6:13 PM, Doug Ewell wrote:
    > Mark Davis wrote:
    >> an illustrative sample simulating documents would be
    >> simulating content:
    >> 999,800 characters (82% being ASCII, then Cyrillic, Han, Arab, other
    >> Latin, ...) not needing normalization, and
    >> 200 characters needing normalization,
    > If you did happen to run into some data that started out in NFD --
    > say, generated on a Mac -- you'd have a lot more than 0.02% of content
    > characters needing normalization.
    I think it would be worthwhile to collect what I would call "reasonable
    worst case" examples.

    For an example to be reasonable, it would have to have data that are
    typical for a certain language, with character distributions typical for
    larger corpora in that language. It would also have to rest on a
    reasonable assumption as to origin. Something like "NFD data created on
    the Mac" would qualify, but for some languages there may be other data
    formats that require reordering and so make for a worse case.

    For "worst case" one would then pick one of the languages and one of the
    data formats that give the most scope to actually exercise the
    normalization part of the implementation algorithm. NFD and unnormalized
    data might stress the implementation differently.

    With such sample cases it would be possible to estimate "reasonable
    worst case behavior" for various implementation strategies.

    French data in NFD might require simple combination for about 10% of the
    characters (very rough guess), but probably no reordering. Some South
    Asian data in keyboard order might need reorderings, but for what
    percentage of characters I can't estimate.
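
    Rough guesses like these could be replaced by measurements on real
    corpora. What follows is only a minimal sketch, assuming Python's
    standard unicodedata module; the function name and the returned keys
    are hypothetical, chosen for illustration. It counts combining marks
    (non-zero canonical combining class) and adjacent marks that are out
    of canonical order, as a crude proxy for how much composition and
    reordering a sample would need.

```python
import unicodedata


def normalization_stats(text: str) -> dict:
    """Crude proxy for the normalization work hidden in `text`.

    Hypothetical helper: counts combining marks and adjacent marks that
    are out of canonical order. It does not count starters that compose
    with a following starter (for example Hangul jamo).
    """
    marks = 0
    out_of_order = 0
    prev_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)  # canonical combining class, 0 for starters
        if ccc != 0:
            marks += 1
            # A mark with a lower class after one with a higher class
            # has to be reordered during normalization.
            if prev_ccc > ccc:
                out_of_order += 1
        prev_ccc = ccc
    total = len(text)
    return {
        "total": total,
        "combining_marks": marks,
        "out_of_order": out_of_order,
        "mark_share": marks / total if total else 0.0,
        "already_nfc": unicodedata.normalize("NFC", text) == text,
    }


if __name__ == "__main__":
    # French-like sample stored in NFD, as in the "created on a Mac" scenario.
    sample = unicodedata.normalize("NFD", "caf\u00e9 cr\u00e8me")
    print(normalization_stats(sample))
```

    Run over a representative corpus per language and data format, the
    mark share and the out-of-order count would give a first idea of
    which combinations deserve to be called "reasonable worst cases".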

    The point of such an exercise would be to make sure that implementations
    are fast enough when faced with data that, for one reason or another,
    happens to resemble one of these "reasonable worst cases".
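
    One way to try this out is sketched below, again assuming Python's
    standard unicodedata and time modules. The helper names, the
    nfd_share parameter and the sample string are all hypothetical; the
    timings depend entirely on the implementation under test, so no
    numbers are claimed here.

```python
import time
import unicodedata


def make_sample(base: str, nfd_share: float, repeats: int = 1000) -> str:
    """Repeat `base`, with about `nfd_share` of the copies in NFD, the rest in NFC."""
    nfc = unicodedata.normalize("NFC", base)
    nfd = unicodedata.normalize("NFD", base)
    n_nfd = round(repeats * nfd_share)
    return nfd * n_nfd + nfc * (repeats - n_nfd)


def time_nfc(text: str, runs: int = 5) -> float:
    """Best-of-`runs` wall-clock seconds for one NFC pass over `text`."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        unicodedata.normalize("NFC", text)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    base = "Le caf\u00e9 cr\u00e8me \u00e9tait tr\u00e8s sucr\u00e9. "
    # From fully normalized input up to the all-NFD "reasonable worst case".
    for share in (0.0, 0.0002, 0.1, 1.0):
        sample = make_sample(base, share, repeats=20000)
        print(f"NFD share {share:>7.2%}: {time_nfc(sample):.6f} s")
```

    Comparing the timings across shares shows how far an implementation
    degrades between mostly normalized input and an all-NFD sample.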


