Re: NFC FAQ

From: Mark Davis (mark@macchiato.com)
Date: Mon Feb 23 2009 - 00:46:18 CST

    Mark

    On Sun, Feb 22, 2009 at 22:06, Michael D. Adams <mdmkolbe@gmail.com> wrote:

    > I think this depends on who your audience is and what your goals are.
    >
    > If the point is to show that the average case cost of normalization is
    > relatively low then statistical samples representative of the actual
    > data out there are alright. (As you have done.) However the
    > measurements you have posted only show that "when the data can be
    > fast-pathed the implementations are fast".

    And that is bad, why? The fact that code can be fast-pathed is typically a key
    tool for achieving performance goals.

    >
    >
    > However, as an implementer I want to know not just the average case
    > but a complete performance model. This means measuring on a number of
    > different sorts of input so that one can start to characterize the
    > performance given a variety of inputs.

    And no one is stopping you from doing so.

    > Measuring inputs that are
    > almost entirely fast-path provides no insight into this.

    Measuring inputs that match reality is vital for producing high-performance
    code. Let's take an example. In ICU we routinely optimize for BMP
    characters. While that may make things somewhat slower for someone doing
    extensive work in cuneiform, the code is faster for the vast majority of
    cases. So that is the right way to code it.
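
The same "optimize for the common case" idea, applied to normalization itself, can be sketched in Python. This is only an illustration using the standard `unicodedata` module; it is not ICU's code, and the helper name `to_nfc` is hypothetical. It assumes Python 3.8 or later for `unicodedata.is_normalized`.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC, doing the full pass only when it seems necessary."""
    # ASCII-only text is always NFC, so this check alone covers most markup.
    if text.isascii():
        return text
    # is_normalized() can often answer without building a normalized copy.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path: decompose, reorder combining marks, recompose.
    return unicodedata.normalize("NFC", text)
```

With this shape, text that is already NFC is returned unchanged, and only input such as "No\u0308rmalization" reaches the slow path.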

    > Even if you really do just want to show that normalization is cheap,
    > then I might still measure a worst-case(*) text for the sake of
    > scientific honesty (you already have a best case and average case
    > text). Either the results are still fast in which case the argument
    > that normalization is fast is even stronger, or the results are slow
    > and can be used to underscore the importance of the "99% are NFC"
    > data. In addition not showing the results for the worst-case(*) makes
    > things look suspicious so showing the non-fast-path results even if
    > they aren't very good gives the page more credibility.
    >

    I'm sharing some results that are, I believe, of interest to people
    concerned with normalization. A word with a single unnormalized sequence in
    it is not unrepresentative, when you look at representative samples by
    language; if anything, it overstates how much non-NFC text is there.
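
For a sense of scale, the "simulating content" proportions quoted further down in this thread (999,800 characters not needing normalization, 200 needing it) could be turned into a test string roughly as follows. This is a sketch under stated assumptions: the filler is plain ASCII rather than the quoted script mix, each combining diaeresis is counted as one character needing normalization, and the name `build_sample` is hypothetical.

```python
import unicodedata

TOTAL = 1_000_000   # code points in the simulated document
NEEDING = 200       # code points that NFC has to change (from the quoted figures)


def build_sample(total: int = TOTAL, needing: int = NEEDING) -> str:
    """Spread `needing` non-NFC sequences evenly through ASCII filler."""
    chunk_len = total // needing
    # Each chunk ends in "o" + U+0308, which NFC composes to U+00F6.
    chunk = "a" * (chunk_len - 2) + "o\u0308"
    return chunk * needing


sample = build_sample()
normalized = unicodedata.normalize("NFC", sample)
# Every composed pair shrinks the string by one code point.
print(len(sample), len(normalized))
```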

    I find being accused of dishonesty somewhat annoying. You are free to make
    your own measurements.
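
For anyone who does want to make such measurements, a minimal timing harness might look like the sketch below. The first three inputs are the strings mentioned in this thread; the "long combining run" case is an artificial stress input added here as an assumption, and `time_nfc` is a hypothetical helper. It measures Python's `unicodedata`, not ICU, so the numbers say nothing about the implementations in the FAQ.

```python
import timeit
import unicodedata

# Inputs from the thread, plus one artificial stress case (an assumption).
CASES = {
    "already NFC": "n\u00f6rmalization",
    "one non-NFC sequence": "No\u0308rmalization",
    "no fast path": "o\u0308" * 9,
    # Alternating marks of combining class 230 and 220 force canonical reordering.
    "long combining run": "a" + "\u0301\u0323" * 500,
}


def time_nfc(text: str, repeats: int = 200) -> float:
    """Return the average number of seconds per NFC normalization of text."""
    total = timeit.timeit(lambda: unicodedata.normalize("NFC", text), number=repeats)
    return total / repeats


if __name__ == "__main__":
    for label, text in CASES.items():
        micros = time_nfc(text) * 1e6
        print(f"{label:22s} {len(text):5d} code points  {micros:10.2f} us")
```

Dividing each time by the input length gives a per-code-point cost, which is easier to compare across inputs of different sizes.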

    > Michael D. Adams
    >
    > (*) Ok, maybe not true worst case where there is a string of thousands
    > of combining characters to be sorted. Those aren't just rare, they
    > never happen unless someone is messing with you on purpose (they are
    > impossible in non-artificial text). (Though on second thought those
    > results might still be interesting to show that NFC time won't blow up
    > when hackers start sending oddly formed text at you.)
    >
    > On Sun, Feb 22, 2009 at 7:24 PM, Mark Davis <mark@macchiato.com> wrote:
    > > The implementations I tested do revert to fast-paths where possible. Given
    > > the data:
    > >
    > > ~99.98% of web HTML page content characters are definitely NFC.
    > >
    > > Content means after discarding markup, and doing entity resolution.
    > >
    > > ~99.9999% of web HTML page markup characters are definitely NFC.
    > >
    > > Because so much of markup is plain ASCII.
    > >
    > > an illustrative sample simulating documents would be
    > >
    > > simulating content:
    > >
    > > 999,800 characters (82% being ASCII, then Cyrillic, Han, Arab, other Latin,
    > > ...) not needing normalization, and
    > > 200 characters needing normalization, and
    > >
    > > simulating markup:
    > >
    > > 999,999 characters (99.5% being ASCII, ...) not needing normalization, and
    > > 1 character needing normalization.
    > >
    > > However, since the main issue that the FAQ is aimed at is the normalization
    > > of identifiers (like XML Name), the two choices are probably as good as any.
    > >
    > > Mark
    > >
    > >
    > > On Sun, Feb 22, 2009 at 15:21, Michael D. Adams <mdmkolbe@gmail.com> wrote:
    > >>
    > >> First, thank you for putting this up. As an (amateur) implementor
    > >> this gives me a better feel for what numbers I need to target.
    > >>
    > >> However, it would be nice if you could pick samples to test that might
    > >> give a better feel for the performance parameters of normalization.
    > >> The "nörmalization" test is good as it shows the performance of the
    > >> fast-path. But the "No\u0308rmalization" test doesn't really give a
    > >> good feel for performance as the last eleven characters may or may not
    > >> have been fast-pathed. Perhaps a few more points varying from
    > >> completely unfast-pathable (e.g.
    > >> "o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308o\u0308") to
    > >> somewhat fast-pathable might be more helpful.
    > >>
    > >> Michael D. Adams
    > >> mdmkolbe@gmail.com
    > >>
    > >> On Thu, Feb 19, 2009 at 1:08 PM, Mark Davis <mark@macchiato.com> wrote:
    > >> > In response to questions from some people in the W3C, I put together an FAQ
    > >> > on NFC normalization, at http://www.macchiato.com/unicode/nfc-faq
    > >> >
    > >> > I have some figures on performance and footprint in there as examples; if
    > >> > anyone else has figures from other implementations, I'd appreciate them.
    > >> >
    > >> > Mark
    > >> >
    > >
    > >
    >


