Re: NFC FAQ

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Feb 23 2009 - 13:04:37 CST


    > I think the point that David is making is that your numbers only show
    > "optimized performance for the overwhelming majority" and show nothing
    > about "acceptable performance for everything". Since your two sample
    > texts don't test out the badly performing areas of "everything", using
    > only the data presented on your page the reader can not conclude the
    > latter.

    One thing folks concerned about this could do is run benchmarks
    with various implementations over the well-known data set available
    in:

    http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

    That contains content that is *deliberately* very, very far
    from typical web content, which Mark has determined empirically
    to be ~99.98% in NFC for HTML pages on the web as a whole.
    NormalizationTest.txt contains lots of bizarre edge cases and
    lots of non-NFC data, specifically to ensure that implementations
    of Unicode Normalization catch the corner cases.
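
    As a rough sketch of what such a benchmark might look like, here
    is one way to do it in Python with the standard unicodedata
    module. The helper names (parse_test_line, time_nfc) and the
    repeat count are my own illustrative choices, and the sample
    line is just one line in the file's format (semicolon-separated
    columns of hex code points); a real run would feed in every data
    line of the downloaded file.

```python
import time
import unicodedata


def parse_test_line(line):
    """Return the five columns of a data line as strings.

    Comment lines, blank lines and '@Part' headers yield None.
    """
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = data.split(";")[:5]
    # Each field is a space-separated list of hex code points.
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]


def time_nfc(strings, repeats=100):
    """Return the seconds spent normalizing every string to NFC, `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        for s in strings:
            unicodedata.normalize("NFC", s)
    return time.perf_counter() - start


# Hypothetical usage on a single line in the file's format.
sample = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
columns = parse_test_line(sample)
elapsed = time_nfc(columns)
print(f"normalized {len(columns)} strings x100 in {elapsed:.6f}s")
```

    Timing the same loop over a corpus of ordinary web text would
    then give the comparison between the common case and the
    deliberately nasty one. Note that unicodedata follows whichever
    Unicode version shipped with the interpreter, so the test file
    should match that version.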

    As for Asmus' call to "pick one of the languages and one of the
    data formats that give the most scope to actually exercise the
    normalization part of the implementation algorithm", my one
    suggestion would be to try focusing on Vietnamese.
    Vietnamese now has a significant (and growing) web presence,
    much of it in UTF-8 (cf. http://www.sgtt.com/vn/), and
    Vietnamese is one of the few major languages widely implemented
    that makes significant use of multiple combining marks with
    a single base character. Furthermore, while opinions vary,
    the preferred representation of Vietnamese is often taken as
    using precomposed characters for all of the basic vowels,
    but then combining marks for the tones -- in that format,
    Vietnamese data would be neither in NFC nor in NFD. So it
    may be possible to turn up significant data corpora for Vietnamese
    which are not in a Unicode normalization form, although the
    impetus for most web data to be in NFC anyway might mean that
    the Vietnamese websites are already skewed that way, despite
    any a priori preferences for text representation.
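
    To make the "neither in NFC nor in NFD" case concrete, here is a
    small Python sketch, again using the standard unicodedata module.
    The classify helper is a name I made up for illustration. The
    example word is "Việt" spelled with a precomposed vowel
    (U+00EA, e with circumflex) followed by a combining tone mark
    (U+0323, dot below).

```python
import unicodedata


def classify(text):
    """Report which of NFC / NFD the text is already in."""
    is_nfc = unicodedata.normalize("NFC", text) == text
    is_nfd = unicodedata.normalize("NFD", text) == text
    if is_nfc and is_nfd:
        return "both"
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "neither"


# Precomposed vowel (U+00EA) plus combining tone mark (U+0323).
mixed = "Vi\u00ea\u0323t"
nfc = unicodedata.normalize("NFC", mixed)  # fully precomposed: U+1EC7
nfd = unicodedata.normalize("NFD", mixed)  # e + U+0323 + U+0302

for label, s in (("mixed", mixed), ("NFC", nfc), ("NFD", nfd)):
    code_points = " ".join(f"U+{ord(c):04X}" for c in s)
    print(f"{label:6} {classify(s):8} {code_points}")
```

    Running something like classify over the text of a Vietnamese
    corpus would be one quick way to see how much of it actually
    falls outside both forms.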

    --Ken


