Re: NFC FAQ

From: Mark Davis (mark@macchiato.com)
Date: Sun Feb 22 2009 - 18:24:44 CST

    The implementations I tested do take fast paths where possible. Given
    the data:

       - ~99.98% of web HTML page *content* characters are definitely NFC.
          - *Content* means after discarding markup and doing entity
          resolution.
       - ~99.9999% of web HTML page *markup* characters are definitely NFC.
          - Because so much of markup is plain ASCII.

    an illustrative sample simulating documents would be:

       - simulating content:
          - 999,800 characters (82% being ASCII, then Cyrillic, Han, Arabic, other
          Latin, ...) not needing normalization, and
          - 200 characters needing normalization;
       - simulating markup:
          - 999,999 characters (99.5% being ASCII, ...) not needing normalization,
          and
          - 1 character needing normalization.
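
    A minimal Python sketch of such a sample, for anyone who wants to try it.
    The helper names (build_sample, nfc_with_fast_path) are made up for
    illustration, the filler is plain ASCII rather than the real script mix
    above, and U+212B ANGSTROM SIGN stands in for "a character needing
    normalization". It assumes Python 3.8+ for unicodedata.is_normalized.

```python
import unicodedata

CLEAN = "a"        # ASCII filler (assumption): always already NFC
DIRTY = "\u212b"   # ANGSTROM SIGN (assumption): NFC maps it to U+00C5


def build_sample(total: int, non_nfc: int) -> str:
    """Return `total` characters, `non_nfc` of which need NFC normalization."""
    if non_nfc == 0:
        return CLEAN * total
    chars = [CLEAN] * total
    step = total // non_nfc          # spread the non-NFC characters evenly
    for i in range(non_nfc):
        chars[i * step] = DIRTY
    return "".join(chars)


def nfc_with_fast_path(text: str) -> str:
    """Skip normalization entirely when the text is already NFC."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


content = build_sample(1_000_000, 200)   # ~99.98% already NFC
markup = build_sample(1_000_000, 1)      # ~99.9999% already NFC
```

    Timing nfc_with_fast_path on content and markup gives a rough feel for
    the mostly-fast-path case; the numbers will differ by implementation.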

    However, since the main issue that the FAQ is aimed at is the normalization
    of identifiers (like XML Name), the two choices are probably as good as any.
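
    For comparison, here is a rough sketch that times the two test strings
    together with the fully non-fast-pathable string suggested in the quoted
    message below. The labels and the time_nfc helper are hypothetical, and
    the results say something only about the Python implementation being
    run, not about the figures in the FAQ.

```python
import timeit
import unicodedata

SAMPLES = {
    "already NFC": "n\u00f6rmalization",
    "one combining mark": "No\u0308rmalization",
    "all combining marks": "o\u0308" * 9,
}


def time_nfc(text: str, repeats: int = 10_000) -> float:
    """Return the seconds taken to NFC-normalize `text` `repeats` times."""
    return timeit.timeit(
        lambda: unicodedata.normalize("NFC", text), number=repeats
    )


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label:>20}: {time_nfc(text):.4f} s")
```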

    Mark

    On Sun, Feb 22, 2009 at 15:21, Michael D. Adams <mdmkolbe@gmail.com> wrote:

    > First, thank you for putting this up. As an (amateur) implementor
    > this gives me a better feel for what numbers I need to target.
    >
    > However, it would be nice if you could pick samples to test that might
    > give a better feel for the performance parameters of normalization.
    > The "nörmalization" test is good as it shows the performance of the
    > fast-path. But the "No\u0308rmalization" test doesn't really give a
    > good feel for performance as the last eleven characters may or may not
    > have been fast-pathed. Perhaps a few more points varying from
    > completely unfast-pathable (e.g.
    > "o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308") to
    > somewhat fast-pathable might be more helpful.
    >
    > Michael D. Adams
    > mdmkolbe@gmail.com
    >
    > On Thu, Feb 19, 2009 at 1:08 PM, Mark Davis <mark@macchiato.com> wrote:
    > > In response to questions from some people in the W3C, I put together an
    > > FAQ on NFC normalization, at http://www.macchiato.com/unicode/nfc-faq
    > >
    > > I have some figures on performance and footprint in there as examples; if
    > > anyone else has figures from other implementations, I'd appreciate them.
    > >
    > > Mark
    > >
    >


